Hello Jack,

Thanks for all the links; my comments are below. I'll try to rephrase all my
questions as one:
How to implement a DAC (Discretionary Access Control) similar to Windows OS
using SOLR?

What we have: a hierarchical filesystem, users and groups, and permissions
applied at the file/folder level.
What we need: full-text search with access restricted based on ACLs.
How do we deal with a change in permissions on a big folder?
How do we check whether a user can delete a folder? (That requires write
access to all of its files and sub-folders.)
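The hierarchical-delete check can be sketched as a recursive walk over the folder tree; the tree shape and the "writeAccess" lists here are hypothetical stand-ins for whatever the real repository stores:

```python
# Sketch: a user may delete a folder only if they have write access
# to the folder itself and to every file/sub-folder beneath it.
# Node structure and field names are illustrative assumptions.

def can_delete(node, user_id):
    """Return True if user_id has write access to node and all descendants."""
    if user_id not in node["writeAccess"]:
        return False
    return all(can_delete(child, user_id) for child in node.get("children", []))

folder = {
    "writeAccess": [4, 5],
    "children": [
        {"writeAccess": [4], "children": []},
        {"writeAccess": [4, 9], "children": []},
    ],
}

print(can_delete(folder, 4))  # user 4 is granted write on every node
print(can_delete(folder, 5))  # user 5 is missing from a child's ACL
```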


> Role-based security is an easier nut to crack
yep, but we need DAC :(

> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
The documentation doesn't explain what happens when content has to be
reindexed, although the last chapter, "Document-based Authorization", shows
the same approach: the user list is stored at the document level.

> Karl Wright of ManifoldCF had a Solr patch for document access control at
one point:
> SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
security at search time
> https://issues.apache.org/jira/browse/SOLR-1895
It states: "LCF SearchComponent which filters returned results based on
access tokens provided by LCF's authority service".
That means filtering is applied to the results only.
Issue: faceting doesn't work correctly (i.e. the counts are wrong), because
the filter hasn't been applied yet.
Even worse, you have to scroll through the result set until you find
records the user can access (what if the user has access to only 10 of
100,000 files?).
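A toy illustration of the faceting problem (documents and field names invented): counts computed over the raw result set overcount, because inaccessible documents are still tallied before the post-filter runs.

```python
# Toy illustration: facet counts computed before an ACL post-filter
# overcount documents the user cannot actually read.
from collections import Counter

docs = [
    {"id": 1, "type": "pdf", "readAccess": {4, 5}},
    {"id": 2, "type": "pdf", "readAccess": {9}},
    {"id": 3, "type": "doc", "readAccess": {4}},
]

user = 4

# Counts over the raw result set (what a post-filtering component sees):
raw_facets = Counter(d["type"] for d in docs)

# Correct counts: restrict to readable documents *before* faceting:
visible = [d for d in docs if user in d["readAccess"]]
true_facets = Counter(d["type"] for d in visible)

print(raw_facets)   # "pdf" counted twice, though user 4 can read only one
print(true_facets)
```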

> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
Page 9 says "docs and access tokens".
"Separate bins for "allow" tokens, "deny" tokens for "file" "
It's similar to the approach we use: each record in SOLR has two multivalued
fields, "readAccess" and "writeAccess", each containing userIds.
This lets us, for example, quickly delete a batch of items the user has
access to (or check a hierarchical delete).
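With the ACL stored in a multivalued field like this, restricting a search is a single filter query; a minimal sketch of the request parameters (field name as described above):

```python
# Sketch: build Solr request parameters that restrict results to
# documents whose multivalued ACL field contains the user's id.
# The field name "readAccess" follows the scheme described above.

def acl_search_params(query, user_id, field="readAccess"):
    """Return Solr query parameters with an ACL filter query."""
    return {
        "q": query,
        # Applied as a filter query, so facet counts stay correct:
        "fq": f"{field}:{user_id}",
    }

params = acl_search_params("foo", 999)
print(params)  # {'q': 'foo', 'fq': 'readAccess:999'}
```

The catch, as discussed below, is keeping that field up to date when a big folder's ACL changes.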

> http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
"It works by adding security tokens from the source repositories as
metadata on the indexed documents"
Again, the permission info is stored within the record itself, so changing
access on a big folder means reindexing.

> https://issues.apache.org/jira/browse/SOLR-1913
Thanks for the link; I need to think about whether I can find a way to use it.

> But the bottom line is that clearly updating all documents in the index
is a non-starter.
I have scratched my head and been monitoring SOLR features for a long time,
trying to find something I can use. Today I watched Yonik Seeley's talk:
http://vimeopro.com/user11514798/apache-lucene-eurocon-2012/video/55387447
and discovered pseudo-joins. Nice! This seems like a perfect solution: I can
keep two indexes, one with the full text and another with objId and userIds;
the second one should be fast to update, I hope.
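The two-index idea, expressed as request parameters (core and field names — FullTextIndex, ACLIndex, Id, userId — follow the example in my original mail below and are assumptions about the actual setup):

```python
# Sketch: search the full-text core, filtering via a pseudo-join
# against a small, fast-to-update ACL core. Core and field names
# (FullTextIndex, ACLIndex, Id, userId) are assumed per the example
# in the original message.

def join_acl_params(query, user_id):
    """Return Solr parameters joining the ACL core onto the full-text core."""
    return {
        "q": query,
        # Keep only documents whose matching ACL record lists this user:
        "fq": f"{{!join fromIndex=ACLIndex from=Id to=Id}}userId:{user_id}",
    }

params = join_acl_params("foo", 999)
print(params["fq"])  # {!join fromIndex=ACLIndex from=Id to=Id}userId:999
```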

But the question is: what about performance?

Regards




On Sun, Jul 14, 2013 at 7:05 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Take a look at LucidWorks Search and its access control:
> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
>
> Role-based security is an easier nut to crack.
>
> Karl Wright of ManifoldCF had a Solr patch for document access control at
> one point:
> SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
> security at search time
> https://issues.apache.org/jira/browse/SOLR-1895
>
> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
>
> For some other thoughts:
> http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
>
> I'm not sure if external file fields will be of any value in this
> situation.
>
> There is also a proposal for bitwise operations:
> SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
> Bitwise Operations on Integer Fields
> https://issues.apache.org/jira/browse/SOLR-1913
>
> But the bottom line is that clearly updating all documents in the index is
> a non-starter.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Oleg Burlaca
> Sent: Sunday, July 14, 2013 11:02 AM
> To: solr-user@lucene.apache.org
> Subject: ACL implementation: Pseudo-join performance & Atomic Updates
>
>
> Hello all,
>
> Situation:
> We have a collection of files in SOLR with ACL applied: each file has a
> multi-valued field that contains the list of userID's that can read it:
>
> here is sample data:
> Id | content  | userId
> 1  | text text | 4,5,6,2
> 2  | text text | 4,5,9
> 3  | text text | 4,2
>
> Problem:
> when ACL is changed for a big folder, we compute the ACL for all child
> items and reindex in SOLR using atomic updates (updating only 'userIds'
> column), but because it deletes/reindexes the record, the performance is
> very poor.
>
> Question:
> I suppose the delete/reindex approach will not change soon (it's probably
> due to the current SOLR architecture)?
>
> Possible solution: assuming atomic updates will be super fast on an index
> without fulltext, keep a separate ACLIndex and FullTextIndex and use
> Pseudo-Joins:
>
> Example: searching 'foo' as user '999'
> /solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id
> to=Id}userId:999
>
> Question: what about performance here? what if the index is 100,000
> records?
> notice that the worst situation is when everyone has access to all the
> files, it means the first filter will be the full index.
>
> Would be happy to get any links that deal with the issue of Pseudo-join
> performance for large datasets (i.e. initial filtered set of IDs).
>
> Regards,
> Oleg
>
> P.S. we found that having the list of all users that have access for each
> record is better overall, because there are much more read requests (people
> accessing the library) then write requests (a new user is added/removed).
>
