Hello Roman and all,

> sorry, haven't the previous thread in its entirety, but few weeks back that
> Yonik's proposal got implemented, it seems ;)
> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter

In that post I see a reference to your plugin BitSetQParserPlugin, right?
https://github.com/romanchyla/montysolr/blob/master/contrib/adsabs/src/java/org/apache/solr/search/BitSetQParserPlugin.java

I understood it as follows:
1. query the core and get ALL search results:
   search results == (id1, id2, id7 .. id28263)   // a long array of unique IDs
2. generate a bitset from this array of IDs
3. search the core using a bitset filter

Correct?
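
For my own understanding, here is roughly how I picture steps 2 and 3 on the
client side. This is only a sketch: the "{!bitset}" parameter name, the base64
encoding and the assumption that IDs map to small integers are my guesses, the
real request format is whatever BitSetQParserPlugin actually expects.

import java.util.BitSet;
import javax.xml.bind.DatatypeConverter;

public class BitsetFilterSketch {
    // Build a filter-query value from the unique IDs returned by query1,
    // so it can be sent as fq to query2.
    public static String toBitsetFilter(int[] idsFromQuery1) {
        BitSet bits = new BitSet();
        for (int id : idsFromQuery1) {
            bits.set(id);                    // assumes IDs fit as small ints
        }
        byte[] raw = bits.toByteArray();     // Java 7+
        return "{!bitset}" + DatatypeConverter.printBase64Binary(raw);
    }
}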

I was thinking that pseudo-joins could help with exactly this situation
(I actually haven't tried pseudo-joins yet, still watching the mailing list),
i.e. make the first step efficient and at the same time run the second query
without sending a lot of data to the client and then receiving it back.

I have a feeling that this situation (a list of unique IDs from query1
participating in a filter of query2) happens frequently, and it would be very
useful if SOLR had an optimized approach to handle it.
In other words, it would turn the pseudo-join into a real JOIN, like in the SQL
world.
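
To make it concrete, this is the kind of request I have in mind, reusing the
join syntax from earlier in this thread (core and field names are just the
examples from my original mail):

/solr/FullTextIndex/select/?q=foo&fq={!join fromIndex=ACLIndex from=Id to=Id}userId:999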

I think I'll just test the performance of pseudo-joins with large datasets
(I was waiting to find the perfect solution).
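
Something like the snippet below is probably enough for a first measurement
(SolrJ, Solr 4.x; the URL, core and field names are just examples, and
getQTime() only reports the server-side time in ms):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JoinTimingTest {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/FullTextIndex");
        SolrQuery q = new SolrQuery("foo");
        q.addFilterQuery("{!join fromIndex=ACLIndex from=Id to=Id}userId:999");
        QueryResponse rsp = solr.query(q);
        System.out.println("hits=" + rsp.getResults().getNumFound()
                + "  QTime=" + rsp.getQTime() + " ms");
        solr.shutdown();
    }
}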

Thanks for all the ideas/links, now I have a better view of the situation.

Regards.




On Wed, Jul 17, 2013 at 3:34 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Roman:
>
> I think that SOLR-1913 is completely different. It's
> about having a field in a document and being able
> to do bitwise operations on it. So say I have a
> field in a Solr doc with the value 6 in it. I can then
> form a query like
> {!bitwise field=myfield op=AND source=2}
> and it would match.
>
> You're talking about a much different operation as I
> understand it.
>
> In which case, go ahead and open up a JIRA, there's
> no harm in it.
>
> Best
> Erick
>
> On Tue, Jul 16, 2013 at 1:32 PM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
> > Erick,
> >
> > I wasn't sure this issue is important, so I wanted first solicit some
> > feedback. You and Otis expressed interest, and I could create the JIRA -
> > however, as Alexandre, points out, the SOLR-1913 seems similar (actually,
> > closer to the Otis request to have the elasticsearch named filter) but
> the
> > SOLR-1913 was created in 2010 and is not integrated yet, so I am
> wondering
> > whether this new feature (somewhat overlapping, but still different from
> > SOLR-1913) is something people would really want and the effort on the
> JIRA
> > is well spent. What's your view?
> >
> > Thanks,
> >
> >   roman
> >
> >
> >
> >
> > On Tue, Jul 16, 2013 at 8:23 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> >
> >> Is that this one: https://issues.apache.org/jira/browse/SOLR-1913 ?
> >>
> >> Regards,
> >>    Alex.
> >>
> >> Personal website: http://www.outerthoughts.com/
> >> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> >> - Time is the quality of nature that keeps events from happening all at
> >> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
> >>
> >>
> >> On Tue, Jul 16, 2013 at 8:01 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >>
> >> > Roman:
> >> >
> >> > Did this ever make into a JIRA? Somehow I missed it if it did, and
> this
> >> > would
> >> > be pretty cool....
> >> >
> >> > Erick
> >> >
> >> > On Mon, Jul 15, 2013 at 6:52 PM, Roman Chyla <roman.ch...@gmail.com>
> >> > wrote:
> >> > > On Sun, Jul 14, 2013 at 1:45 PM, Oleg Burlaca <oburl...@gmail.com>
> >> > wrote:
> >> > >
> >> > >> Hello Erick,
> >> > >>
> >> > >> > Join performance is most sensitive to the number of values
> >> > >> > in the field being joined on. So if you have lots and lots of
> >> > >> > distinct values in the corpus, join performance will be affected.
> >> > >> Yep, we have a list of unique Id's that we get by first searching
> for
> >> > >> records
> >> > >> where loggedInUser IS IN (userIDs)
> >> > >> This corpus is stored in memory I suppose? (not a problem) and then
> >> the
> >> > >> bottleneck is to match this huge set with the core where I'm
> >> searching?
> >> > >>
> >> > >> Somewhere in maillist archive people were talking about "external
> list
> >> > of
> >> > >> Solr unique IDs"
> >> > >> but didn't find if there is a solution.
> >> > >> Back in 2010 Yonik posted a comment:
> >> > >> http://find.searchhub.org/document/363a4952446b3cd#363a4952446b3cd
> >> > >>
> >> > >
> >> > > sorry, haven't the previous thread in its entirety, but few weeks
> back
> >> > that
> >> > > Yonik's proposal got implemented, it seems ;)
> >> > >
> >> > >
> >> >
> >>
> http://search-lucene.com/m/Fa3Dg14mqoj/bitset&subj=Re+Solr+large+boolean+filter
> >> > >
> >> > > You could use this to send very large bitset filter (which can be
> >> > > translated into any integers, if you can come up with a mapping
> >> > function).
> >> > >
> >> > > roman
> >> > >
> >> > >
> >> > >>
> >> > >> > bq: I suppose the delete/reindex approach will not change soon
> >> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
> >> > >> Ah, ok, I was feeling it affects the architecture, ok, now the only
> >> > hope is
> >> > >> Pseudo-Joins ))
> >> > >>
> >> > >> > One way to deal with this is to implement a "post filter",
> sometimes
> >> > >> called
> >> > >> > a "no cache" filter.
> >> > >> thanks, will have a look, but as you describe it, it's not the best
> >> > option.
> >> > >>
> >> > >> The approach
> >> > >> "too many documents, man. Please refine your query. Partial results
> >> > below"
> >> > >> means faceting will not work correctly?
> >> > >>
> >> > >> ... I have in mind a hybrid approach, comments welcome:
> >> > >> Most of the time users are not searching, but browsing content, so
> our
> >> > >> "virtual filesystem" stored in SOLR will use only the index with
> the
> >> Id
> >> > of
> >> > >> the file and the list of users that have access to it. i.e. not
> >> touching
> >> > >> the fulltext index at all.
> >> > >>
> >> > >> Files may have metadata (EXIF info for images for ex) that we'd
> like
> >> to
> >> > >> filter by, calculate facets.
> >> > >> Meta will be stored in both indexes.
> >> > >>
> >> > >> In case of a fulltext query:
> >> > >> 1. search FT index (the fulltext index), get only the number of
> search
> >> > >> results, let it be Rf
> >> > >> 2. search DAC index (the index with permissions), get number of
> search
> >> > >> results, let it be Rd
> >> > >>
> >> > >> let maxR be the maximum size of the corpus for the pseudo-join.
> >> > >> *That was actually my question: what is a reasonable number? 10,
> 100,
> >> > 1000
> >> > >> ?
> >> > >> *
> >> > >>
> >> > >> if (Rf < maxR) or (Rd < maxR) then use the smaller corpus to join
> onto
> >> > the
> >> > >> second one.
> >> > >> this happens when (only a few documents contains the search query)
> OR
> >> > (user
> >> > >> has access to a small number of files).
> >> > >>
> >> > >> In case none of these happens, we can use the
> >> > >> "too many documents, man. Please refine your query. Partial results
> >> > below"
> >> > >> but first searching the FT index, because we want relevant results
> >> > first.
> >> > >>
> >> > >> What do you think?
> >> > >>
> >> > >> Regards,
> >> > >> Oleg
> >> > >>
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Sun, Jul 14, 2013 at 7:42 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> > >>
> >> > >> > Join performance is most sensitive to the number of values
> >> > >> > in the field being joined on. So if you have lots and lots of
> >> > >> > distinct values in the corpus, join performance will be affected.
> >> > >> >
> >> > >> > bq: I suppose the delete/reindex approach will not change soon
> >> > >> >
> >> > >> > There is ongoing work (search the JIRA for "Stacked Segments")
> >> > >> > on actually doing something about this, but it's been "under
> >> > >> consideration"
> >> > >> > for at least 3 years so your guess is as good as mine.
> >> > >> >
> >> > >> > bq: notice that the worst situation is when everyone has access
> to
> >> all
> >> > >> the
> >> > >> > files, it means the first filter will be the full index.
> >> > >> >
> >> > >> > One way to deal with this is to implement a "post filter",
> sometimes
> >> > >> called
> >> > >> > a "no cache" filter. The distinction here is that
> >> > >> > 1> it is not cached (duh!)
> >> > >> > 2> it is only called for documents that have made it through all
> the
> >> > >> >      other "lower cost" filters (and the main query of course).
> >> > >> > 3> "lower cost" means the filter is either a standard, cached
> >> filters
> >> > >> >     and any "no cache" filters with a cost (explicitly stated in
> the
> >> > >> query)
> >> > >> >     lower than this one's.
> >> > >> >
> >> > >> > Critically, and unlike "normal" filter queries, the result set is
> >> NOT
> >> > >> > calculated for all documents ahead of time....
> >> > >> >
> >> > >> > You _still_ have to deal with the sysadmin doing a *:* query as
> you
> >> > >> > are well aware. But one can mitigate that by having the
> post-filter
> >> > >> > fail all documents after some arbitrary N, and display a message
> in
> >> > the
> >> > >> > app like "too many documents, man. Please refine your query.
> Partial
> >> > >> > results below". Of course this may not be acceptable, but....
> >> > >> >
> >> > >> > HTH
> >> > >> > Erick
> >> > >> >
> >> > >> > On Sun, Jul 14, 2013 at 12:05 PM, Jack Krupansky
> >> > >> > <j...@basetechnology.com> wrote:
> >> > >> > > Take a look at LucidWorks Search and its access control:
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> http://docs.lucidworks.com/display/help/Search+Filters+for+Access+Control
> >> > >> > >
> >> > >> > > Role-based security is an easier nut to crack.
> >> > >> > >
> >> > >> > > Karl Wright of ManifoldCF had a Solr patch for document access
> >> > control
> >> > >> at
> >> > >> > > one point:
> >> > >> > > SOLR-1895 - ManifoldCF SearchComponent plugin for enforcing
> >> > ManifoldCF
> >> > >> > > security at search time
> >> > >> > > https://issues.apache.org/jira/browse/SOLR-1895
> >> > >> > >
> >> > >> > >
> >> > >> >
> >> > >>
> >> >
> >>
> http://www.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
> >> > >> > >
> >> > >> > > For some other thoughts:
> >> > >> > >
> http://wiki.apache.org/solr/SolrSecurity#Document_Level_Security
> >> > >> > >
> >> > >> > > I'm not sure if external file fields will be of any value in
> this
> >> > >> > situation.
> >> > >> > >
> >> > >> > > There is also a proposal for bitwise operations:
> >> > >> > > SOLR-1913 - QParserPlugin plugin for Search Results Filtering
> >> Based
> >> > on
> >> > >> > > Bitwise Operations on Integer Fields
> >> > >> > > https://issues.apache.org/jira/browse/SOLR-1913
> >> > >> > >
> >> > >> > > But the bottom line is that clearly updating all documents in
> the
> >> > index
> >> > >> > is a
> >> > >> > > non-starter.
> >> > >> > >
> >> > >> > > -- Jack Krupansky
> >> > >> > >
> >> > >> > > -----Original Message----- From: Oleg Burlaca
> >> > >> > > Sent: Sunday, July 14, 2013 11:02 AM
> >> > >> > > To: solr-user@lucene.apache.org
> >> > >> > > Subject: ACL implementation: Pseudo-join performance & Atomic
> >> > Updates
> >> > >> > >
> >> > >> > >
> >> > >> > > Hello all,
> >> > >> > >
> >> > >> > > Situation:
> >> > >> > > We have a collection of files in SOLR with ACL applied: each
> file
> >> > has a
> >> > >> > > multi-valued field that contains the list of userID's that can
> >> read
> >> > it:
> >> > >> > >
> >> > >> > > here is sample data:
> >> > >> > > Id | content  | userId
> >> > >> > > 1  | text text | 4,5,6,2
> >> > >> > > 2  | text text | 4,5,9
> >> > >> > > 3  | text text | 4,2
> >> > >> > >
> >> > >> > > Problem:
> >> > >> > > when ACL is changed for a big folder, we compute the ACL for
> all
> >> > child
> >> > >> > > items and reindex in SOLR using atomic updates (updating only
> >> > 'userIds'
> >> > >> > > column), but because it deletes/reindexes the record, the
> >> > performance
> >> > >> is
> >> > >> > > very poor.
> >> > >> > >
> >> > >> > > Question:
> >> > >> > > I suppose the delete/reindex approach will not change soon
> >> (probably
> >> > >> it's
> >> > >> > > due to actual SOLR architecture), ?
> >> > >> > >
> >> > >> > > Possible solution: assuming atomic updates will be super fast
> on
> >> an
> >> > >> index
> >> > >> > > without fulltext, keep a separate ACLIndex and FullTextIndex
> and
> >> use
> >> > >> > > Pseudo-Joins:
> >> > >> > >
> >> > >> > > Example: searching 'foo' as user '999'
> >> > >> > > /solr/FullTextIndex/select/?q=foo&fq{!join fromIndex=ACLIndex
> >> > from=Id
> >> > >> > to=Id
> >> > >> > > }userId:999
> >> > >> > >
> >> > >> > > Question: what about performance here? what if the index is
> >> 100,000
> >> > >> > > records?
> >> > >> > > notice that the worst situation is when everyone has access to
> all
> >> > the
> >> > >> > > files, it means the first filter will be the full index.
> >> > >> > >
> >> > >> > > Would be happy to get any links that deal with the issue of
> >> > Pseudo-join
> >> > >> > > performance for large datasets (i.e. initial filtered set of
> IDs).
> >> > >> > >
> >> > >> > > Regards,
> >> > >> > > Oleg
> >> > >> > >
> >> > >> > > P.S. we found that having the list of all users that have
> access
> >> for
> >> > >> each
> >> > >> > > record is better overall, because there are much more read
> >> requests
> >> > >> > (people
> >> > >> > > accessing the library) then write requests (a new user is
> >> > >> added/removed).
> >> > >> >
> >> > >>
> >> >
> >>
>
