Re: FW: Solr and LCF security at query time

2010-04-28 Thread Peter Sturge
Hi Karl,

Apologies for the delayed reply. I've been away on business, and in the
middle of a product release, so it's been a busy time...

In response to your earlier questions:

The 'AND/OR' filter query will ultimately map down to Lucene Boolean clauses,
although the point at which this is done is slightly different.

I think I am correct in my understanding that with filter queries, the
results are filtered 'post-Lucene', but are separately (Solr) cached, so you
get a hit on the first search, but then benefit from cached hits on
subsequent searches. The lower-level 'MUST NOT/SHOULD' etc. clauses are
applied at the Lucene query directly, so don't have separate Solr caching.
I've not benchmarked the two, so one or other might be slower/faster for
various search scenarios.

In any case, I believe either technique can be employed in either 1834 or
1872.
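
To make the two options concrete, here is a minimal Python sketch of the two
request shapes. `q` and `fq` are standard Solr request parameters, and `+`
marks a required (MUST) clause in Lucene query syntax. The helper names and
the token value are hypothetical, and this is not the code from SOLR-1834 or
SOLR-1872.

```python
def as_filter_query(user_query, restriction):
    """Option A: leave q untouched and send the restriction as a separate fq."""
    return [("q", user_query), ("fq", restriction)]


def as_required_clause(user_query, restriction):
    """Option B: fold the restriction into q as a required (MUST) clause."""
    return [("q", "+(" + user_query + ") +(" + restriction + ")")]


# Hypothetical token value; the field name is the one proposed in this thread.
restriction = '__ALLOW_TOKEN__document:"token-a"'

option_a = as_filter_query("title:report", restriction)
option_b = as_required_clause("title:report", restriction)
```

Either list of pairs can be passed to `urllib.parse.urlencode` to produce the
request query string. With option A the restriction can be cached by Solr
independently of the user's query; with option B it becomes part of the main
query.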


With regard to schema extension, I believe we need to be very careful here, as
requiring index-time storage of access control data will pose a problem for
any use cases where the access control needs to change (maybe often, maybe
only occasionally). I'm trying to think of a use case where this wouldn't at
least potentially be the case, and I can't think of one, but perhaps I'm not
truly understanding what exactly is stored in the __ALLOW_TOKEN__ and
__DENY_TOKEN__ fields, and how/where subsequent ACL changes would fit in
(e.g. let's say someone has left my organization, do I have to update
documents to remove his/her access?).

Also, would such indexed tokens be entirely 'document-context-free'? I.e.
would the same type/format of tokens be used for data from different sources
(e.g. NTFS files, network streams, NFS, web pages, etc.)? Will the tokens be
compatible with multiple and/or changing authorities (e.g. AD, Documentum,
LDAP, custom, etc.)?

I like the idea of an LCF plugin to hold the ACL data. I admit, I've not had
enough time to look into how this might look at the moment, but it sounds
like it could be a good way to hold generic (authority-agnostic) ACL data,
and [hopefully] not have to tie it to document data at index-time.

I hope this makes sense, but if I've misunderstood the proposed mechanism,
please correct me. Would the __ALLOW_TOKEN__ et al. fields store, for
example, SID information?


Thanks,
Peter




RE: FW: Solr and LCF security at query time

2010-04-28 Thread karl.wright

 With regard to schema extension, I believe we need to be very careful here, as
 requiring index-time storage of access control data will pose a problem for any
 use cases where the access control needs to change (maybe often, maybe only
 occasionally). I'm trying to think of a use case where this wouldn't at least
 potentially be the case, and I can't think of one, but perhaps I'm not truly
 understanding what exactly is stored in the __ALLOW_TOKEN__ and __DENY_TOKEN__
 fields, and how/where subsequent ACL changes would fit in (e.g. let's say
 someone has left my organization, do I have to update documents to remove
 his/her access?).


Usually the way this works is that the user's account is locked out so they 
can't log in.  The authority service picks up this change, and it therefore 
takes place immediately.

Bear in mind that this particular model has been employed by MetaCarta for more 
than five years in the field with clients such as pretty near all the major oil 
companies, many U.S. government agencies, the U.S. military, etc.  In that time 
we have not heard even one complaint about the security model.

Karl




Re: FW: Solr and LCF security at query time

2010-04-28 Thread Peter Sturge
Hi Karl,

Yes, I don't doubt that using an external mechanism such as AD lockout will
work for those and other environments. I guess it comes down to the
difference between bespoke consultancy-type solutions and general-purpose
product solutions, whose requirements are often very different. For a
general Access Control solution integrated into Solr, assumptions about the
presence/type of such external controls can't, and shouldn't, be made. If
they are/must be made, one of the core reasons for adding the new
functionality is missing.

As a starting point, for a general purpose access control system, at least
the following questions need to be addressed:
   * What happens when access control needs to change?
   * What happens if access control needs to change often (e.g. more than
several times a day)?
   * Can the access control cope with multiple data source types, without
the need for custom code, including data with no attached acl information?
   * If I change my access control, how is 'offline' data affected? (e.g.
backed-up data)
   * Will the access control satisfy regulatory compliance specs on its own,
or is an external mechanism required?
  (currently, Solr requires an external mechanism, but so does the
proposed solution)

As you might have guessed, I've been down this road before, and the
productization of security control has many facets, and these, as a general
rule, need to be addressed differently in products than in site-specific
deployments - mainly because products can't assume the environment(s) they
will run in (e.g. Active Directory).

The good thing is, there is a good alternative - that is: to store access
control information separately from indexed data and separately from an
authority. To me, that's where the beauty of an LCF plugin architecture
lives. Then, the task is to provide the integration tools (and it sounds
like LCF is very well suited for this) to deliver the 'bridge' between
content and authorization. (as you quite rightly said, authentication is a
separate, albeit related, subject)

Thanks,
Peter





Re: FW: Solr and LCF security at query time

2010-04-28 Thread Peter Sturge
Hi Karl,

I wasn't trying to put paid to your design proposal, really the opposite -
to highlight requirements that have been found to be necessary for
customers/users, and to hopefully get the best functionality for the
product. If you feel I've put you out on any of the issues raised, then I
apologize for that, it was certainly not my intention.

Peter


RE: FW: Solr and LCF security at query time

2010-04-27 Thread karl.wright
Ok, not hearing back from Peter, I've done some Solr research and written some 
code that might work.  The approach I've taken is most similar to SOLR 1834, 
other than the LCF-centric logic.  Hopefully there will be a chance to try this 
out in a full end-to-end way  on the weekend, after which I will submit it to 
the Solr team (where I think it most naturally would be built and delivered).

What it's going to need is either a static or dynamic schema addition to define 
__ALLOW_TOKEN__document, __DENY_TOKEN__document, __ALLOW_TOKEN__share, and 
__DENY_TOKEN__share fields.  These should be string, multivalued fields (I 
think).  It would be great if these could be made a default part of Solr; 
similarly, it would be good if the new search component was predelivered with 
Solr and mentioned (even if commented out) in the example solrconfig.xml file.  
The only other thing that needs to be done to hook up the search component is 
to include a configuration parameter describing the base URL of the LCF 
authority service.  Plus, as I said earlier, we still don't have a canned 
solution for authentication yet - although I feel that will be straightforward.

Comments welcome...
Karl



From: Wright Karl (Nokia-S/Cambridge)
Sent: Tuesday, April 27, 2010 8:20 AM
To: connectors-dev@incubator.apache.org; d...@lucene.apache.org
Cc: connectors-u...@incubator.apache.org; lucene-...@apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I finally had a moment to review the SOLR 1872 and SOLR 1834 contributions in 
detail, and have a couple of SOLR-related questions.

Both contributions rely on a SearchComponent to work their magic.  However, it 
also appears that each modifies the user query in a different way.  1834 uses 
MUST, MUST_NOT, and SHOULD filter items, while 1872 uses standard AND and OR 
filterquery clauses.  Both of them are constructed using Solr FilterQuery 
objects.  Here are my questions:

(1) I am not conversant enough with Solr yet to know the difference between the 
different kinds of clause structure.  Do you know if there is a difference?  
For example, is there any possibility that AND/OR clauses can permit documents 
to be seen that should not be seen?  (MUST and MUST_NOT sound a lot more 
definite...)

(2) Are Solr FilterQuery objects applied to constructing the query that will be 
sent to Lucene?  Or are they applied by Solr after-the-fact to the resultset?  
Or, is it a combination of the two, depending on the details of your actual 
filter clause?

I also haven't heard much from you in the last week or so - have you thought 
further about what you intend to do, and can you let me know whether you are 
still interested in developing an LCF plugin for Solr?

Thanks,
Karl

-----Original Message-----
From: ext Peter Sturge [mailto:peter.stu...@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: d...@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-u...@incubator.apache.org; 
lucene-...@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, karl.wri...@nokia.com wrote:

 Hi Peter,

 The authority connectors don't perform authentication at this time.
 In fact, LCF has nothing to do with authentication at all - just
 authorization.  The reason for this is that it is almost never the case
 that somebody wants to provide multiple credentials in order to be able to
 see their results.  Most enterprises who have multiple repositories
 authenticate against AD and then map AD user names to repository user
 names in order to access those repositories.  If you noted my earlier posts
 from this morning, you may have noted that I'm looking at recommending JAAS
 plus Sun's Krb5 login module for handling the authenticate-against-AD
 case, which would cover some 95%+ of the real world authentication needed
 out there.


I did read your earlier post regarding this, and I totally agree with you - 
this is best handled 'upstream'. In fact, I use a JAAS plugin in other places 
in the product (not Solr) for authentication.



 Yes, the idea is to store SIDs in solr at index time.  I don't know
 enough about solr to know what kinds of issues this might entail, but
 Lucene certainly has a model of metadata that's pretty flexible, so I
 don't think this would be difficult at all.  Eric Hatcher also seemed
 to confirm my suspicions that this would not be a problem.


It's certainly not a problem to store this data in Solr. The problem is more 
that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with
documents in the index. Here are just a few:
  * What happens if/when you want to add explicit user access to some [group 
of] documents ? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index

RE: FW: Solr and LCF security at query time

2010-04-22 Thread karl.wright
Looking around for no-Apache, Java-only solutions to the AD authentication
problem, it seems to me that what we mainly have available is JAAS plus the
following JAAS login module:

com.sun.security.auth.module.Krb5LoginModule

... which should permit AD authentication to take place, if properly
configured.
So, we *could* stipulate that the search component receive credentials, 
somehow, upon being called, and then authenticate each time.  (There's a ticket 
cache involved, so this is not as insane as it sounds).

But this architecture option makes me twitchy because I am unclear how exactly 
this would help Tomcat interact with the browser to manage signon for a web 
application.  So it might be better to push the authentication itself upstream 
into a module meant to be plugged into Tomcat, and have Solr just receive and 
deal with the resulting ticket, and/or an authenticated, domain-qualified user 
name.  The task of the LCF Solr search component or filter would then be to do 
the following:

(1) Get hold of the ticket/authenticated user name, which will probably come in
as some attribute to the search that's presented to Solr.  (Someone still needs
to specify what this attribute is called.)
(2) Invoke a configured LCF authority service with that user name, via HTTP,
and get back a list of access tokens for the user.
(3) Form the search expression with the user's access tokens (if it's a search
component), or filter the results using those access tokens (if it's a filter),
remembering that every document that's participating in security should have
__ACCESS_TOKEN__document and __DENY_TOKEN__document metadata.
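
The three steps above can be sketched roughly as follows, for the search
component variant. The authority lookup is passed in as a plain callable so
the sketch runs without a network. The function names, the sample user, and
the sample tokens are hypothetical, and the exact matching rule (at least one
allow token must match, and no deny token may match) is an assumption rather
than something settled in this thread.

```python
ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def quote_token(token):
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def any_of(field, tokens):
    """Build field:("t1" OR "t2") in Lucene query syntax."""
    return field + ":(" + " OR ".join(quote_token(t) for t in tokens) + ")"


def security_filter(user, lookup_tokens):
    """Steps (1)-(3): take the authenticated user, fetch tokens, build a filter."""
    tokens = lookup_tokens(user)  # step (2): ask the authority service
    if not tokens:
        # Assumption: a user with no tokens must not see secured documents,
        # so refuse to build a filter rather than emit an empty clause.
        raise ValueError("no access tokens for user: " + user)
    # '+' is a required (MUST) clause, '-' is a prohibited (MUST_NOT) clause.
    return "+" + any_of(ALLOW_FIELD, tokens) + " -" + any_of(DENY_FIELD, tokens)


example = security_filter("user@example.com", lambda user: ["T1", "T2"])
```

The resulting string is a candidate for either a filter query or an extra
clause on the main query; which of the two is preferable is the open question
discussed in this thread.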

I've also been pondering which we should build: a search component or a
filter?  I think there are advantages to both, so I think we should build both,
and let people use what they need.

I think the technical aspects of building the Solr component are well
understood by this group, so the only remaining open issue is how to build a
JAAS-based AD authentication module for Tomcat that would do what we need.
I'll be doing more research as time permits...

Karl


From: Wright Karl (Nokia-S/Cambridge)
Sent: Wednesday, April 21, 2010 8:02 PM
To: connectors-u...@incubator.apache.org; lucene-...@apache.org; 
connectors-dev@incubator.apache.org
Subject: RE: FW: Solr and LCF security at query time

Hi Peter,

I just committed the promised changes to the LCF Solr output connector.

ACL metadata will now be posted to the Solr Http interface along with the 
document as the two following fields:

__ACCESS_TOKEN__document
__DENY_TOKEN__document

There will, of course, potentially be multiple values for each of these two 
fields.

Hope this helps,
Karl


From: ext Peter Sturge [mailto:peter.stu...@googlemail.com]
Sent: Tuesday, April 20, 2010 6:51 PM
To: connectors-u...@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the info. I'll have a look at the link and try to take in as much 
sugar as my insulin levels will handle...
It sounds like the necessary interface(s) are already in LCF - just a matter of 
implementing them in the Solr 1872 plugin.
I'll need to digest the LCF stuff to get to grips with it..please bear with me 
while I do that...

When you say:
   The LCF solr output connection doesn't yet do this, but it is trivial for me 
to make that happen.
Do you mean a mechanism by which solr.war can get url et al info from its 
parent container (Tomcat, Jetty etc.), or have I misinterpreted this?


Thanks,
Peter




On Tue, Apr 20, 2010 at 11:05 PM, karl.wri...@nokia.com wrote:
Hi Peter,

I'm the principal committer for LCF, but I don't know as much about Solr as I 
ought to, so it sounds like a potentially productive collaboration.

LCF does exactly what you are looking for - the only issue at all is that you 
need to fetch a URL from a webapp to get what you are looking for.  The plugs 
are all inside LCF for different kinds of repositories.  Here's a link that 
might help with drinking the LCF koolaid, as it were: 
https://cwiki.apache.org/confluence/display/CONNECTORS/Lucene+Connectors+Framework+concepts

The url would be something like this (on a locally installed tomcat-based LCF 
instance):

http://localhost:8080/lcf-authority-service/useracls?username=someusern...@somedomain.com

... and this fetch returns something like:

TOKEN:xxx
TOKEN:yyy
TOKEN:zzz


... which represent the amalgamated tokens for all of the defined authorities, 
and by some strange coincidence ( ;-) ) are compatible with certain pieces of 
metadata that have been passed into Solr with each document - one set of Allow 
tokens, and a second set of Deny tokens.  The LCF solr output connection 
doesn't yet do this, but it is trivial for me to make that happen.
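
As a rough illustration of consuming that response, here is a minimal Python
sketch. The base URL and the `TOKEN:` line format are taken from the example
above; the function names are hypothetical, and ignoring any line that does
not start with `TOKEN:` is an assumption, since the full response format is
not described here.

```python
from urllib.parse import urlencode

# Base URL from the example above (a locally installed, Tomcat-based instance).
BASE_URL = "http://localhost:8080/lcf-authority-service/useracls"


def useracls_url(username, base_url=BASE_URL):
    """Build the authority service URL for one user name."""
    return base_url + "?" + urlencode({"username": username})


def parse_tokens(body):
    """Extract token values from lines of the form 'TOKEN:xxx'."""
    tokens = []
    for line in body.splitlines():
        line = line.strip()
        if line.startswith("TOKEN:"):
            tokens.append(line[len("TOKEN:"):])
    return tokens


sample = "TOKEN:xxx\nTOKEN:yyy\nTOKEN:zzz\n"
parsed = parse_tokens(sample)
```

Fetching the URL itself (for example with `urllib.request.urlopen`) is left
out so that the sketch runs without a live LCF instance.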

Does this sound plausible to you?

Karl




RE: FW: Solr and LCF security at query time

2010-04-22 Thread karl.wright
Hi Peter,

I've attached a diagram that is not in the wiki as of yet, and I'll try to 
answer your questions.


Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a 
particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (does AD need
its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in 
order to cater for group membership allows and denies.


The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings 
that represent a contract between an LCF authority connection and the LCF 
repository connection that picks up the documents (from wherever).  These 
tokens thus have no real meaning outside of LCF.  You must regard them as 
opaque.

The contract, however, states that if you use the LCF authority service to 
obtain tokens for an authenticated user, you will get back a set that is 
CONSISTENT with the tokens that were attached to the documents LCF sent to Solr 
for indexing in the first place.  So, you don't have to worry about it, and 
that's kind of the idea.  So you imagine the following flow:

(1) Use LCF to fetch documents and send them to Solr
(2) When searching, use the LCF authority service to get the desired user's 
access tokens
(3) Either filter the results, or modify the query, to be sure the access 
tokens all match up properly
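
For the result-filtering variant of step (3), a minimal Python sketch might
look like the following. The field names are the ones the LCF Solr output
connector posts; the function names and sample data are hypothetical. The
rule used here (at least one allow token matches and no deny token matches,
with documents carrying no allow tokens hidden) is an assumption for
illustration, since the tokens themselves are opaque.

```python
ALLOW_FIELD = "__ACCESS_TOKEN__document"
DENY_FIELD = "__DENY_TOKEN__document"


def is_visible(doc, user_tokens):
    """True if the user holds an allow token and holds no deny token."""
    user_tokens = set(user_tokens)
    allowed = set(doc.get(ALLOW_FIELD, [])) & user_tokens
    denied = set(doc.get(DENY_FIELD, [])) & user_tokens
    return bool(allowed) and not denied


def filter_results(docs, user_tokens):
    """Keep only the documents the user is permitted to see."""
    return [doc for doc in docs if is_visible(doc, user_tokens)]


docs = [
    {"id": "1", ALLOW_FIELD: ["A"], DENY_FIELD: []},
    {"id": "2", ALLOW_FIELD: ["A", "B"], DENY_FIELD: ["C"]},
    {"id": "3", ALLOW_FIELD: ["B"], DENY_FIELD: []},
]
visible_ids = [doc["id"] for doc in filter_results(docs, ["A", "C"])]
```

Note that the code never inspects what a token means; it only compares the
strings returned by the authority service with the strings stored on each
document.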

For the AD authority, the LCF access tokens consist, in part, of the user's 
SIDs.  For other authorities, the access tokens are wildly different.  You 
really don't want to know what's in them, since that's the job of the LCF 
authority to determine. ;-)

LCF is not, by the way, joined at the hip with AD.  However, in practice, most 
enterprises in the world use some form of AD single signon for their web 
applications, and even if they're using some repository with its own idea of 
security, there's a mapping between the AD users and the repository's users.  
Doing that mapping is also the job of the LCF authority for that repository.

Hope this helps.  Also, I'm not expecting time miracles here, so don't sweat 
the schedule.


Karl



From: ext Peter Sturge [peter.stu...@googlemail.com]
Sent: Thursday, April 22, 2010 4:27 AM
To: d...@lucene.apache.org
Cc: connectors-u...@incubator.apache.org; lucene-...@apache.org; 
connectors-dev@incubator.apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

Thanks for the quick turnaround.
I'm in the middle of a product release for us, so I fear I won't be as quick as 
you... :-)

I couldn't find a simple flow diagram or similar for LCF with regard to security
(probably looking in the wrong place).
Perhaps you could help on these questions...?

In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries, 
which are then used as filter queries in a user's search.

Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a 
particular user in the underlying acl store (e.g. Active Directory)?
How does AD and/or LCF handle storing such data in its schema? (does AD need
its schema extended?)
Presumably, any such AD fields would need to be queried for effective rights in 
order to cater for group membership allows and denies.

I guess I'm just trying to understand the architectural flow/storage/retrieval 
of data in the various parts of the system, but I admit, I need to do more 
research on this.
After our product release, when I get a few more spare cycles, I can look at it 
in more detail.

Many thanks!
Peter




Re: FW: Solr and LCF security at query time

2010-04-22 Thread Peter Sturge
Hi Karl,

Thanks very much for the diagram -
Sorry about all the questions, but this raises a few new ones...

What is the relationship between stored data (documents) and authorities'
access/deny attributes? (do you have any examples of what an access_token
value might contain?)

One of the key requirements I've worked to adhere to in SOLR-1872 is to
ensure there are no security or other dependencies of indexed data with any
external repository - most notably the file system.
There are many reasons for wanting this, but one of the main ones is that
Solr-stored data is not always based on file data (or accessible file data).
In fact, in my particular case, almost none of the indexed data comes from
files.

This is one reason why SOLR-1872 uses filter queries for its access/deny
tokens - so that all the required information for access control completely
resides within the Solr index itself.
Is the LCF architecture's ACL 'mapping' between Solr fields (queries) and
users, between some external 'repository' (files) and users, or between
arbitrary data (e.g. either of these) and users?

I hope that makes sense...

Thanks!
Peter




On Thu, Apr 22, 2010 at 10:25 AM, karl.wri...@nokia.com wrote:

 Hi Peter,

 I've attached a diagram that is not in the wiki as of yet, and I'll try to
 answer your questions.

 
 Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
 particular user in the underlying acl store (e.g. Active Directory)?
 How does AD and/or LCF handle storing such data in its schema? (does AD
 need its schema extended?)
 Presumably, any such AD fields would need to be queried for effective
 rights in order to cater for group membership allows and denies.
 

 The ACCESS_TOKEN and DENY_TOKEN values are, in one sense, arbitrary strings
 that represent a contract between an LCF authority connection and the LCF
 repository connection that picks up the documents (from wherever).  These
 tokens thus have no real meaning outside of LCF.  You must regard them as
 opaque.

 The contract, however, states that if you use the LCF authority service to
 obtain tokens for an authenticated user, you will get back a set that is
 CONSISTENT with the tokens that were attached to the documents LCF sent to
 Solr for indexing in the first place.  So, you don't have to worry about it,
 and that's kind of the idea.  So you imagine the following flow:

 (1) Use LCF to fetch documents and send them to Solr
 (2) When searching, use the LCF authority service to get the desired user's
 access tokens
 (3) Either filter the results, or modify the query, to be sure the access
 tokens all match up properly
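A minimal sketch of steps (2) and (3), assuming the tokens indexed with each document live in two multi-valued Solr fields. The field names `allow_token` and `deny_token`, the function names, and the sample token values are placeholders for illustration only; they are not LCF's or SOLR-1872's actual API. The tokens are treated as opaque strings and simply quoted.

```python
def escape_term(token: str) -> str:
    """Quote an opaque token so the query parser sees one literal term."""
    return '"' + token.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_security_filters(user_tokens,
                           allow_field="allow_token",   # hypothetical field name
                           deny_field="deny_token"):    # hypothetical field name
    """Turn a user's access tokens into filter-query strings.

    A document passes if it carries at least one of the user's tokens in
    the allow field and none of them in the deny field.
    """
    if not user_tokens:
        # No tokens means no access; the caller should not run the search.
        raise ValueError("user has no access tokens")
    terms = " OR ".join(escape_term(t) for t in sorted(user_tokens))
    return [
        f"{allow_field}:({terms})",
        f"-{deny_field}:({terms})",
    ]


# Tokens as they might come back from an authority service (made-up values).
filters = build_security_filters({"S-1-5-21-1", "S-1-5-21-2"})
for fq in filters:
    print(fq)
```

Each returned string would be passed as a separate `fq` parameter alongside the user's own query. Documents carrying no tokens at all are excluded by this sketch; whether unsecured documents should be visible is a separate policy decision.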

 For the AD authority, the LCF access tokens consist, in part, of the user's
 SIDs.  For other authorities, the access tokens are wildly different.  You
 really don't want to know what's in them, since that's the job of the LCF
 authority to determine. ;-)

 LCF is not, by the way, joined at the hip with AD.  However, in practice,
 most enterprises in the world use some form of AD single signon for their
 web applications, and even if they're using some repository with its own
 idea of security, there's a mapping between the AD users and the
 repository's users.  Doing that mapping is also the job of the LCF authority
 for that repository.

 Hope this helps.  Also, I'm not expecting time miracles here, so don't
 sweat the schedule.


 Karl


 
 From: ext Peter Sturge [peter.stu...@googlemail.com]
 Sent: Thursday, April 22, 2010 4:27 AM
 To: d...@lucene.apache.org
 Cc: connectors-u...@incubator.apache.org; lucene-...@apache.org;
 connectors-dev@incubator.apache.org
 Subject: Re: FW: Solr and LCF security at query time

 Hi Karl,

 Thanks for the quick turnaround.
 I'm in the middle of a product release for us, so I fear I won't be as
 quick as you... :-)

 I couldn't find a simple flow diagram or similar for LCF with regards
 security (probably looking in the wrong place).
 Perhaps you could help on these questions...?

 In SOLR-1872, the allows and denies are stored (in acl.xml) as sub-queries,
 which are then used as filter queries in a user's search.

 Are the ACCESS_TOKEN and DENY_TOKEN values whatever have been stored for a
 particular user in the underlying acl store (e.g. Active Directory)?
 How does AD and/or LCF handle storing such data in its schema? (does AD
 need its schema extended?)
 Presumably, any such AD fields would need to be queried for effective
 rights in order to cater for group membership allows and denies.

 I guess I'm just trying to understand the architectural
 flow/storage/retrieval of data in the various parts of the system, but I
 admit, I need to do more research on this.
 After our product release, when I get a few more spare cycles, I can look
 at it in more detail.

 Many thanks!
 Peter



 On Thu, Apr 22, 2010 at 1:02 AM, karl.wri...@nokia.com wrote:
 Hi Peter,

 I just committed the promised changes to the LCF Solr output connector

RE: FW: Solr and LCF security at query time

2010-04-22 Thread karl.wright
Hi Peter,


  * What happens if/when you want to add explicit user access to some [group 
of] documents? (i.e. not via a group)


In LCF, you change the permissions on the appropriate resource, and then you 
run your LCF job again to update those permissions.  Since LCF is an 
incremental crawler, it is smart enough to only reindex those documents whose 
permissions have changed, which makes it a fairly fast operation on most 
repositories.  Also, in my experience at MetaCarta, this is a relatively 
infrequent kind of situation, and most enterprises can tolerate a reasonable 
delay in getting document permissions updated in an index.
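To illustrate the incremental idea only (this is not LCF's actual implementation), here is a small sketch that compares the ACLs recorded at the previous crawl with the ACLs seen now and returns just the documents that need reindexing. The data layout, a dict of document id to an (allow, deny) pair of token sets, is an assumption made for the example.

```python
def documents_to_reindex(previous_acls, current_acls):
    """Return ids of documents that are new or whose ACL changed since the last crawl."""
    changed = []
    for doc_id, acl in current_acls.items():
        # .get() yields None for documents not seen before, so new ones count as changed.
        if previous_acls.get(doc_id) != acl:
            changed.append(doc_id)
    return sorted(changed)


# Hypothetical crawl state: doc id -> (allow tokens, deny tokens)
last_crawl = {
    "doc1": (frozenset({"group:eng"}), frozenset()),
    "doc2": (frozenset({"group:hr"}), frozenset()),
}
this_crawl = {
    "doc1": (frozenset({"group:eng"}), frozenset()),               # unchanged
    "doc2": (frozenset({"group:hr", "user:alice"}), frozenset()),  # explicit user added
    "doc3": (frozenset({"group:eng"}), frozenset()),               # new document
}

print(documents_to_reindex(last_crawl, this_crawl))
```

Only the changed and new documents are sent for indexing again, which is why the cost tracks the number of permission changes and not the size of the index.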

However, if this is a concern in your environment, your main alternative is to 
go directly to the repository on every document as you filter a resultset.  
That's slow in most situations, though perhaps not for a local acl.xml file.  
Performance might be improved with caching, but only if you knew that the same 
results would be returned for multiple queries.


  * What happens if you need to revoke or change a user's or group's access?


I presume you mean a user/group's access to specific documents - which has the 
same answer as above.  If you actually mean the more typical case, where a user 
is locked out, or loses/gains group access, that of course happens at authority 
time, so it is instantaneous.
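The distinction can be shown with a toy model, assuming document tokens are fixed at index time while the user's tokens are looked up at query time. The index contents, token strings, and the `authority` dict are all made up for illustration; a real authority would be a service call.

```python
# Tokens stored with each document at index time (hypothetical values).
INDEX = {
    "doc1": {"allow": {"group:eng"}, "deny": set()},
    "doc2": {"allow": {"group:eng", "group:hr"}, "deny": {"user:bob"}},
}


def visible_documents(index, user_tokens):
    """Documents with at least one matching allow token and no matching deny token."""
    return sorted(
        doc_id
        for doc_id, acl in index.items()
        if acl["allow"] & user_tokens and not (acl["deny"] & user_tokens)
    )


# Stand-in for the authority lookup performed on every search.
authority = {"bob": {"user:bob", "group:eng"}}

before = visible_documents(INDEX, authority["bob"])

# Bob is removed from the group; nothing in the index is touched.
authority["bob"].discard("group:eng")

after = visible_documents(INDEX, authority["bob"])
print(before, after)
```

The change in Bob's group membership takes effect on his next search because only the query-time token set changed. A change to the tokens on `doc1` itself would instead need that document to be reindexed.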


  * It's difficult to move/replicate the index to another domain


Sure.  If this is something you intend on doing a lot, this is not a solution 
that will work for you.  I don't think we ever had any of MetaCarta's clients 
even *think* of doing this, however. ;-)  Probably because lots of other stuff 
breaks as well.


  * For AD, SIDs are generally not meant to be stored long term outside of AD, 
as they can be changed (this doesn't happen often, but it can happen after an 
AD rebuild, domain type upgrade, data recovery etc.)


Any infrequent operation is not much of a concern to me, since LCF keeps track 
of any changes and will pick them up on the next crawl (and do the minimum 
possible to update the index, as well).

Thanks,
Karl


-Original Message-
From: ext Peter Sturge [mailto:peter.stu...@googlemail.com]
Sent: Thursday, April 22, 2010 12:23 PM
To: d...@lucene.apache.org
Cc: connectors-dev@incubator.apache.org; connectors-u...@incubator.apache.org; 
lucene-...@apache.org
Subject: Re: FW: Solr and LCF security at query time

Hi Karl,

See inline...

On Thu, Apr 22, 2010 at 4:57 PM, karl.wri...@nokia.com wrote:

 Hi Peter,

 The authority connectors don't perform authentication at this time.
 In fact, LCF has nothing to do with authentication at all - just 
 authorization.
  The reason for this is that it is almost never the case that
 somebody wants to provide multiple credentials in order to be able to see their 
 results.
  Most enterprises who have multiple repositories authenticate against
 AD and then map AD user names to repository user names in order to
 access those repositories.  If you noted my earlier posts from this
 morning, you may have noted that I'm looking at recommending JAAS plus
 Sun's Krb5 login module for handling the authenticate-against-AD
 case, which would cover some 95%+ of the real world authentication needed out 
 there.


I did read your earlier post regarding this, and I totally agree with you - 
this is best handled 'upstream'. In fact, I use a JAAS plugin in other places 
in the product (not Solr) for authentication.



 Yes, the idea is to store SIDs in solr at index time.  I don't know
 enough about solr to know what kinds of issues this might entail, but
 Lucene certainly has a model of metadata that's pretty flexible, so I
 don't think this would be difficult at all.  Erik Hatcher also seemed
 to confirm my suspicions that this would not be a problem.


It's certainly not a problem to store this data in Solr. The problem is more 
that you don't really *want* to store this data at index time.
There are lots of reasons for not wanting to 'hard-code' SID data with 
documents in the index. Here's just a few:
  * What happens if/when you want to add explicit user access to some [group 
of] documents? (i.e. not via a group)
  * What happens if you need to revoke or change a user's or group's access?
  * It's difficult to move/replicate the index to another domain
  * For AD, SIDs are generally not meant to be stored long term outside of AD, 
as they can be changed (this doesn't happen often, but it can happen after an 
AD rebuild, domain type upgrade, data recovery etc.)

These and other scenarios mean re-indexing the stored data. When the index is 
huge, this is non-trivial (time-wise). There are not uncommon scenarios where 
user/group access control can change multiple times in one day.

There might be a way of storing acl data in a payload or similar, but I'm not 
sure how that would work across millions of [arbitrarily grouped] documents 
(I'm not familiar enough with payloads to know