Re: ACL implementation: Pseudo-join performance & Atomic Updates

Alexandre Rafalovitch Tue, 16 Jul 2013 05:25:11 -0700

Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?


Regards,
   Alex.

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Roman:
>
> Did this ever make it into a JIRA? Somehow I missed it if it did, and this
> would be pretty cool....
>
> Erick
>
> On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
> > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca <oburl...@gmail.com>
> wrote:
> >
> >> Hello Erick,
> >>
> >> > Join performance is most sensitive to the number of values
> >> > in the field being joined on. So if you have lots and lots of
> >> > distinct values in the corpus, join performance will be affected.
> >> Yep, we have a list of unique Id's that we get by first searching for
> >> records
> >> where loggedInUser IS IN (userIDs)
> >> This corpus is stored in memory I suppose? (not a problem) and then the
> >> bottleneck is to match this huge set with the core where I'm searching?
> >>
> >> Somewhere in the mailing list archive people were talking about an
> >> "external list of Solr unique IDs",
> >> but I didn't find whether there is a solution.
> >> Back in 2010 Yonik posted a comment:
> >> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
> >>
> >
> > sorry, I haven't read the previous thread in its entirety, but a few weeks
> > back Yonik's proposal got implemented, it seems ;)
> >
> >
> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
> >
> > You could use this to send a very large bitset filter (which can be
> > translated into any integers, if you can come up with a mapping function).
> >
> > roman
> >
> >
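A rough sketch of the "mapping function" idea above, assuming each document can be mapped to a small non-negative integer. The helper names (ids_to_bitset, bitset_to_ids) are hypothetical, and the base64 step is only one possible way to ship the bits; the wire format the linked filter actually expects is not described in this thread.

```python
import base64

def ids_to_bitset(ids):
    """Pack non-negative integers into bytes, one bit per allowed ID."""
    if not ids:
        return b""
    bits = bytearray(max(ids) // 8 + 1)
    for i in ids:
        # Bit (i % 8) of byte (i // 8) marks ID i as allowed.
        bits[i // 8] |= 1 << (i % 8)
    return bytes(bits)

def bitset_to_ids(bits):
    """Inverse mapping: list the IDs whose bit is set, in ascending order."""
    return [
        byte_index * 8 + bit
        for byte_index, byte in enumerate(bits)
        for bit in range(8)
        if (byte >> bit) & 1
    ]

# Assumption: base64 is used here only to make the bits safe to put in a request.
allowed = {1, 9, 17}
encoded = base64.b64encode(ids_to_bitset(allowed)).decode("ascii")
decoded = bitset_to_ids(base64.b64decode(encoded))
print(encoded, decoded)
```

The size of the bitset depends on the largest ID, not on how many IDs are allowed, so this only pays off when the integer mapping is dense.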
> >>
> >> > bq: I suppose the delete/reindex approach will not change soon
> >> > There is ongoing work (search the JIRA for "Stacked Segments")
> >> Ah, ok, I was feeling it affects the architecture, ok, now the only
> hope is
> >> Pseudo-Joins ))
> >>
> >> > One way to deal with this is to implement a "post filter", sometimes
> >> called
> >> > a "no cache" filter.
> >> thanks, will have a look, but as you describe it, it's not the best
> option.
> >>
> >> The approach
> >> "too many documents, man. Please refine your query. Partial results
> below"
> >> means faceting will not work correctly?
> >>
> >> ... I have in mind a hybrid approach, comments welcome:
> >> Most of the time users are not searching, but browsing content, so our
> >> "virtual filesystem" stored in SOLR will use only the index with the Id
> of
> >> the file and the list of users that have access to it. i.e. not touching
> >> the fulltext index at all.
> >>
> >> Files may have metadata (EXIF info for images for ex) that we'd like to
> >> filter by, calculate facets.
> >> Meta will be stored in both indexes.
> >>
> >> In case of a fulltext query:
> >> 1. search FT index (the fulltext index), get only the number of search
> >> results, let it be Rf
> >> 2. search DAC index (the index with permissions), get number of search
> >> results, let it be Rd
> >>
> >> let maxR be the maximum size of the corpus for the pseudo-join.
> >> *That was actually my question: what is a reasonable number? 10, 100,
> 1000
> >> ?
> >> *
> >>
> >> if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join onto
> the
> >> second one.
> >> this happens when (only a few documents contain the search query) OR
> >> (the user has access to a small number of files).
> >>
> >> In case none of these happens, we can use the
> >> "too many documents, man. Please refine your query. Partial results
> below"
> >> but first searching the FT index, because we want relevant results
> first.
> >>
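The decision rule in the hybrid approach above could be sketched like this. The function name and the returned labels are made up for illustration, and max_r is left as a parameter because the thread does not settle on a value.

```python
def choose_join_plan(rf, rd, max_r):
    """Pick a strategy from the two hit counts.

    rf    -- number of hits in the FT (fulltext) index
    rd    -- number of hits in the DAC (permissions) index
    max_r -- largest set we are willing to use as the join source
    """
    if rf < max_r or rd < max_r:
        # At least one side is small enough: join from the smaller set
        # onto the other index.
        return "join_from_ft" if rf <= rd else "join_from_dac"
    # Both sets are too large: search the FT index first so the most
    # relevant hits come back, and report partial results.
    return "partial_results_ft_first"

print(choose_join_plan(50, 100000, 1000))
print(choose_join_plan(100000, 20, 1000))
print(choose_join_plan(5000, 8000, 1000))
```

Both counts have to be fetched before the real query runs, so this plan costs two extra count-only requests per search.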
> >> What do you think?
> >>
> >> Regards,
> >> Oleg
> >>
> >>
> >>
> >>
> >> On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <
> erickerick...@gmail.com
> >> >wrote:
> >>
> >> > Join performance is most sensitive to the number of values
> >> > in the field being joined on. So if you have lots and lots of
> >> > distinct values in the corpus, join performance will be affected.
> >> >
> >> > bq: I suppose the delete/reindex approach will not change soon
> >> >
> >> > There is ongoing work (search the JIRA for "Stacked Segments")
> >> > on actually doing something about this, but it's been "under
> >> consideration"
> >> > for at least 3 years so your guess is as good as mine.
> >> >
> >> > bq: notice that the worst situation is when everyone has access to all
> >> the
> >> > files, it means the first filter will be the full index.
> >> >
> >> > One way to deal with this is to implement a "post filter", sometimes
> >> called
> >> > a "no cache" filter. The distinction here is that
> >> > 1> it is not cached (duh!)
> >> > 2> it is only called for documents that have made it through all the
> >> >      other "lower cost" filters (and the main query of course).
> >> > 3> "lower cost" means either standard, cached filters
> >> >     or any "no cache" filters with a cost (explicitly stated in the
> >> >     query) lower than this one's.
> >> >
> >> > Critically, and unlike "normal" filter queries, the result set is NOT
> >> > calculated for all documents ahead of time....
> >> >
> >> > You _still_ have to deal with the sysadmin doing a *:* query as you
> >> > are well aware. But one can mitigate that by having the post-filter
> >> > fail all documents after some arbitrary N, and display a message in
> the
> >> > app like "too many documents, man. Please refine your query. Partial
> >> > results below". Of course this may not be acceptable, but....
> >> >
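To make the post-filter behaviour above concrete, here is a plain-Python simulation. It is not Solr's plugin API; post_filter, cheap_filters, acl_check and max_checks are hypothetical names. It only shows the ordering: cheap filters run first, the expensive ACL check sees only the survivors, and after N checks everything else fails.

```python
def post_filter(candidates, cheap_filters, acl_check, max_checks):
    """Return (accepted_docs, truncated).

    candidates    -- documents matching the main query, in result order
    cheap_filters -- predicates standing in for the lower-cost filters
    acl_check     -- the expensive per-document check (the post filter)
    max_checks    -- the arbitrary N after which all documents fail
    """
    accepted = []
    checks = 0
    for doc in candidates:
        # The post filter is only consulted for documents that passed
        # every cheaper filter.
        if not all(f(doc) for f in cheap_filters):
            continue
        if checks >= max_checks:
            # Budget exhausted: stop and flag the result as partial.
            return accepted, True
        checks += 1
        if acl_check(doc):
            accepted.append(doc)
    return accepted, False

docs = [
    {"id": i,
     "type": "pdf" if i % 2 == 0 else "txt",
     "users": {4} if i % 3 == 0 else {5}}
    for i in range(10)
]
hits, truncated = post_filter(
    docs,
    cheap_filters=[lambda d: d["type"] == "pdf"],
    acl_check=lambda d: 4 in d["users"],
    max_checks=3,
)
print([d["id"] for d in hits], truncated)
```

When truncated is true the hit list is incomplete, so any counts or facets computed from it cover only the documents that were checked.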
> >> > HTH
> >> > Erick
> >> >
> >> > On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
> >> > <j...@basetechnology.com> wrote:
> >> > > Take a look at LucidWorks Search and its access control:
> >> > >
> >> >
> >>
> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
> >> > >
> >> > > Role-based security is an easier nut to crack.
> >> > >
> >> > > Karl Wright of ManifoldCF had a Solr patch for document access
> control
> >> at
> >> > > one point:
> >> > > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing
> ManifoldCF
> >> > > security at search time
> >> > > https://issues.apache.org/jira/browse/SOLR-1895
> >> > >
> >> > >
> >> >
> >>
> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
> >> > >
> >> > > For some other thoughts:
> >> > > http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> >> > >
> >> > > I'm not sure if external file fields will be of any value in this
> >> > situation.
> >> > >
> >> > > There is also a proposal for bitwise operations:
> >> > > SOLR-1913 - QParserPlugin plugin for Search Results Filtering Based
> on
> >> > > Bitwise Operations on Integer Fields
> >> > > https://issues.apache.org/jira/browse/SOLR-1913
> >> > >
> >> > > But the bottom line is that clearly updating all documents in the
> index
> >> > is a
> >> > > non-starter.
> >> > >
> >> > > -- Jack Krupansky
> >> > >
> >> > > -----Original Message----- From: Oleg Burlaca
> >> > > Sent: Sunday, July 14, 2013 11:02 AM
> >> > > To: solr-user@lucene.apache.org
> >> > > Subject: ACL implementation: Pseudo-join performance & Atomic
> Updates
> >> > >
> >> > >
> >> > > Hello all,
> >> > >
> >> > > Situation:
> >> > > We have a collection of files in SOLR with ACL applied: each file
> has a
> >> > > multi-valued field that contains the list of userID's that can read
> it:
> >> > >
> >> > > here is sample data:
> >> > > Id | content  | userId
> >> > > 1  | text text | 4,5,6,2
> >> > > 2  | text text | 4,5,9
> >> > > 3  | text text | 4,2
> >> > >
> >> > > Problem:
> >> > > when ACL is changed for a big folder, we compute the ACL for all
> child
> >> > > items and reindex in SOLR using atomic updates (updating only
> 'userIds'
> >> > > column), but because it deletes/reindexes the record, the
> performance
> >> is
> >> > > very poor.
> >> > >
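For reference, an atomic update that replaces only the ACL field looks roughly like the payload built below, using Solr's "set" modifier. The field names Id and userId come from the sample data above; treating Id as the unique key is an assumption, and acl_update_payload is a hypothetical helper. As noted in the problem statement, Solr still deletes and reindexes the whole record behind such an update.

```python
import json

def acl_update_payload(doc_acls):
    """Build a JSON body for a batch of atomic ACL updates.

    doc_acls -- dict mapping a document Id to the user IDs allowed to read it
    """
    return json.dumps([
        # "set" replaces the whole multi-valued field with the new list.
        {"Id": doc_id, "userId": {"set": sorted(users)}}
        for doc_id, users in doc_acls.items()
    ])

# Row 3 of the sample data: users 4 and 2 may read document 3.
payload = acl_update_payload({"3": {4, 2}})
print(payload)
```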
> >> > > Question:
> >> > > I suppose the delete/reindex approach will not change soon (probably
> >> > > because of the current SOLR architecture)?
> >> > >
> >> > > Possible solution: assuming atomic updates will be super fast on an
> >> index
> >> > > without fulltext, keep a separate ACLIndex and FullTextIndex and use
> >> > > Pseudo-Joins:
> >> > >
> >> > > Example: searching 'foo' as user '999'
> >> > > /solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999
> >> > >
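A small sketch of building the join request above from client code, so the braces and spaces in the local params get URL-encoded properly. The core and field names are the ones from the example; the base URL and the function name build_acl_join_url are assumptions.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_acl_join_url(base_url, text_query, user_id):
    """Build a select URL that filters FullTextIndex hits through ACLIndex."""
    # The join collects Id values from ACLIndex documents matching userId
    # and keeps only FullTextIndex documents with one of those Ids.
    acl_filter = "{!join fromIndex=ACLIndex from=Id to=Id}userId:%d" % int(user_id)
    params = {"q": text_query, "fq": acl_filter}
    return base_url.rstrip("/") + "/FullTextIndex/select?" + urlencode(params)

# Assumption: a local Solr on the default port.
url = build_acl_join_url("http://localhost:8983/solr", "foo", 999)
print(url)
# Decode the query string again to show what the server would receive.
print(parse_qs(urlsplit(url).query)["fq"][0])
```

Casting user_id with int() keeps arbitrary text out of the filter query, which matters when the ID comes from the logged-in session.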
> >> > > Question: what about performance here? what if the index is 100,000
> >> > > records?
> >> > > notice that the worst situation is when everyone has access to all
> the
> >> > > files, it means the first filter will be the full index.
> >> > >
> >> > > Would be happy to get any links that deal with the issue of
> Pseudo-join
> >> > > performance for large datasets (i.e. initial filtered set of IDs).
> >> > >
> >> > > Regards,
> >> > > Oleg
> >> > >
> >> > > P.S. we found that having the list of all users that have access for
> >> > > each record is better overall, because there are many more read requests
> >> > > (people accessing the library) than write requests (a new user is
> >> > > added/removed).
> >> >
> >>
>
