Hello Erick,

> Join performance is most sensitive to the number of values
> in the field being joined on. So if you have lots and lots of
> distinct values in the corpus, join performance will be affected.
Yep, we have a list of unique Ids that we get by first searching for
records where loggedInUser IS IN (userIDs).
This corpus is stored in memory, I suppose (not a problem), and then the
bottleneck is matching this huge set against the core I'm searching?

Somewhere in the mailing list archive people were talking about an
"external list of Solr unique IDs",
but I didn't find whether there is a solution.
Back in 2010 Yonik posted a comment:
http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd


> bq: I suppose the delete/reindex approach will not change soon
> There is ongoing work (search the JIRA for "Stacked Segments")
Ah, ok, I had a feeling it affects the architecture. Now the only hope is
Pseudo-Joins ))

> One way to deal with this is to implement a "post filter", sometimes
called
> a "no cache" filter.
thanks, I'll have a look, but as you describe it, it's not the best option.
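To make sure I understand the semantics, here is a toy sketch (plain
Python, not Solr's actual PostFilter API; the documents, filters, and
costs are all made up) of "only called for documents that made it through
the cheaper filters", including the cutoff after N documents:

```python
# Toy model of post-filter evaluation order: cheap (cached) filters run
# over the whole set first; the expensive "no cache" filter only sees
# the survivors. All names and data here are illustrative.

def run_filters(docs, cheap_filters, post_filter, max_hits=None):
    """Apply cheap filters to every doc, then the post filter only to
    the survivors; optionally stop after max_hits accepted docs."""
    survivors = [d for d in docs if all(f(d) for f in cheap_filters)]
    results = []
    for d in survivors:
        if max_hits is not None and len(results) >= max_hits:
            break  # the "too many documents" cutoff from your suggestion
        if post_filter(d):
            results.append(d)
    return results

# Illustrative data: docs with an ACL list and a main-query match flag.
docs = [
    {"id": 1, "matches_query": True,  "acl": {4, 5, 6, 2}},
    {"id": 2, "matches_query": True,  "acl": {4, 5, 9}},
    {"id": 3, "matches_query": False, "acl": {4, 2}},
]

cheap = [lambda d: d["matches_query"]]   # main query / cached filters
acl_check = lambda d: 9 in d["acl"]      # expensive per-doc ACL lookup

print(run_filters(docs, cheap, acl_check))  # only doc 2 survives
```

If that model is right, the ACL check is paid only per surviving
document, which is exactly what we'd want.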

Does the
"too many documents, man. Please refine your query. Partial results below"
approach mean faceting will not work correctly?

... I have in mind a hybrid approach, comments welcome:
Most of the time users are not searching but browsing content, so our
"virtual filesystem" stored in SOLR will use only the index with the Id of
the file and the list of users that have access to it, i.e. not touching
the fulltext index at all.
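For the browsing case, the request would only ever touch the permissions
index; something like this sketch (core and field names such as ACLIndex,
userIds, and parentFolder are my guesses for our schema, not a tested
setup):

```python
from urllib.parse import urlencode

# Hypothetical core/field names: "ACLIndex" core, "userIds" ACL field,
# "parentFolder" for filesystem browsing. Adjust to the real schema.
def browse_query(user_id, folder_id, rows=50):
    """Build a Solr select URL that lists files in a folder the user
    can read, without touching the fulltext index at all."""
    params = {
        "q": "parentFolder:%s" % folder_id,
        "fq": "userIds:%s" % user_id,  # ACL filter, cheap and cacheable
        "fl": "id",
        "rows": rows,
    }
    return "/solr/ACLIndex/select?" + urlencode(params)

print(browse_query(999, "folder42"))
```

Since the same user browses many folders in a session, the fq on userIds
should stay warm in the filter cache.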

Files may have metadata (EXIF info for images, for example) that we'd like
to filter by and calculate facets on.
Metadata will be stored in both indexes.

In case of a fulltext query:
1. search the FT index (the fulltext index), getting only the number of
search results; call it Rf
2. search the DAC index (the index with permissions), getting the number
of search results; call it Rd

Let maxR be the maximum size of the corpus for the pseudo-join.
*That was actually my question: what is a reasonable number? 10, 100, 1000?*

If (Rf < maxR) or (Rd < maxR), then use the smaller corpus to join onto
the other one.
This happens when (only a few documents contain the search query) OR (the
user has access to a small number of files).

If neither of these holds, we can fall back to
"too many documents, man. Please refine your query. Partial results below",
but searching the FT index first, because we want relevant results first.
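The routing logic above, as a sketch (pure Python; Rf and Rd would come
from two cheap rows=0 count queries, and maxR is the threshold I'm asking
about):

```python
def choose_strategy(rf, rd, max_r):
    """Decide how to answer a fulltext query given the two hit counts.

    rf    -- number of hits in the fulltext (FT) index
    rd    -- number of docs the user can read (DAC index)
    max_r -- largest corpus we're willing to feed the pseudo-join
    """
    if rf < max_r or rd < max_r:
        # Join from the smaller side onto the other index.
        if rf <= rd:
            return "join FT results onto DAC index"
        return "join DAC results onto FT index"
    # Neither side is small enough: degrade gracefully, FT first,
    # so the most relevant documents appear in the partial results.
    return "partial results: search FT first, cut off at max_r"

print(choose_strategy(rf=50, rd=100_000, max_r=1000))
# -> join FT results onto DAC index
```

The two count queries cost one extra round-trip each, which I assume is
negligible next to an unbounded join.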

What do you think?

Regards,
Oleg




On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Join performance is most sensitive to the number of values
> in the field being joined on. So if you have lots and lots of
> distinct values in the corpus, join performance will be affected.
>
> bq: I suppose the delete/reindex approach will not change soon
>
> There is ongoing work (search the JIRA for "Stacked Segments")
> on actually doing something about this, but it's been "under consideration"
> for at least 3 years so your guess is as good as mine.
>
> bq: notice that the worst situation is when everyone has access to all the
> files, it means the first filter will be the full index.
>
> One way to deal with this is to implement a "post filter", sometimes called
> a "no cache" filter. The distinction here is that
> 1> it is not cached (duh!)
> 2> it is only called for documents that have made it through all the
>      other "lower cost" filters (and the main query of course).
> 3> "lower cost" means the filter is either a standard, cached filters
>     and any "no cache" filters with a cost (explicitly stated in the query)
>     lower than this one's.
>
> Critically, and unlike "normal" filter queries, the result set is NOT
> calculated for all documents ahead of time....
>
> You _still_ have to deal with the sysadmin doing a *:* query as you
> are well aware. But one can mitigate that by having the post-filter
> fail all documents after some arbitrary N, and display a message in the
> app like "too many documents, man. Please refine your query. Partial
> results below". Of course this may not be acceptable, but....
>
> HTH
> Erick
>
> On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
> <j...@basetechnology.com> wrote:
> > Take a look at LucidWorks Search and its access control:
> >
> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
> >
> > Role-based security is an easier nut to crack.
> >
> > Karl Wright of ManifoldCF had a Solr patch for document access control at
> > one point:
> > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing ManifoldCF
> > security at search time
> > https://issues.apache.org/jira/browse/SOLR-1895
> >
> >
> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
> >
> > For some other thoughts:
> > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> >
> > I'm not sure if external file fields will be of any value in this
> situation.
> >
> > There is also a proposal for bitwise operations:
> > SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based on
> > Bitwise Operations on Integer Fields
> > https://issues.apache.org/jira/browse/SOLR-1913
> >
> > But the bottom line is that clearly updating all documents in the index
> is a
> > non-starter.
> >
> > -- Jack Krupansky
> >
> > -----Original Message----- From: Oleg Burlaca
> > Sent: Sunday, July 14, 2013 11:02 AM
> > To: solr-user@lucene.apache.org
> > Subject: ACL implementation: Pseudo-join performance & Atomic Updates
> >
> >
> > Hello all,
> >
> > Situation:
> > We have a collection of files in SOLR with ACL applied: each file has a
> > multi-valued field that contains the list of userID's that can read it:
> >
> > here is sample data:
> > Id | content  | userId
> > 1  | text text | 4,5,6,2
> > 2  | text text | 4,5,9
> > 3  | text text | 4,2
> >
> > Problem:
> > when ACL is changed for a big folder, we compute the ACL for all child
> > items and reindex in SOLR using atomic updates (updating only 'userIds'
> > column), but because it deletes/reindexes the record, the performance is
> > very poor.
> >
> > Question:
> > I suppose the delete/reindex approach will not change soon (probably it's
> > due to actual SOLR architecture), ?
> >
> > Possible solution: assuming atomic updates will be super fast on an index
> > without fulltext, keep a separate ACLIndex and FullTextIndex and use
> > Pseudo-Joins:
> >
> > Example: searching 'foo' as user '999'
> > /solr/FullTextIndex/select/?q=foo&fq{!join fromIndex=ACLIndex from=Id
> to=Id
> > }userId:999
> >
> > Question: what about performance here? what if the index is 100,000
> > records?
> > notice that the worst situation is when everyone has access to all the
> > files, it means the first filter will be the full index.
> >
> > Would be happy to get any links that deal with the issue of Pseudo-join
> > performance for large datasets (i.e. initial filtered set of IDs).
> >
> > Regards,
> > Oleg
> >
> > P.S. we found that having the list of all users that have access for each
> > record is better overall, because there are much more read requests
> (people
> > accessing the library) then write requests (a new user is added/removed).
>
