On 25 Aug 2010, at 11:50, Ard Schrijvers wrote:

> On Wed, Aug 25, 2010 at 12:23 PM, Ian Boston <[email protected]> wrote:
>> Ard,
>> Thank you for the guided tour, most informative.
> 
> You're welcome.
> 
>> We have complex ACLs based on the standard Jackrabbit 2 ACLs with some
>> additions, including external lookup. These change rapidly, so counting by
>> iteration looks like the only way at the moment. We have found, though, that
>> where there are > 10 pages of results no one pages that far, so social
>> engineering is one solution (e.g. "> 1000 items"): we just count up to that
>> number...
> 
> 
> Do you also have something like time-based ACLs (like 'now', which
> changes every millisecond), or are your ACLs 'static'? If the latter,
> you can follow a quite different approach. Whether it is feasible (and
> how much time you want to put into it) again depends on the number of
> unique ACL rule sets for JCR sessions and on how large your data set
> is, but:
> 
> 1) You extend the existing SearchIndex
> 2) When a search is done, you compute from the JCR session's ACLs some
> kind of 'token' that identifies the ACL rule set for that session (users
> with the same rule set get the same token)


I can see that the approach will work well where the set of auth tokens is
small. In our case, I think we would need 1 bit per group in the system,
although we could compute a hash from the result to accommodate sparseness. We
know from previous production deployments of Sakai that for 100K users there
can be 40K groups, which, IIUC, means too many auth-token Lucene bitsets to
generate and cache.
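
To make the hash idea concrete, this is roughly what I have in mind (an
illustrative sketch only: the class name, SHA-1, and hashing the sorted
principal names are all just one possible choice, not anything that exists
today):

import java.io.UnsupportedEncodingException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeSet;

// Illustrative only: derive a compact, stable token for a session's ACL
// rule set by hashing its sorted principal (user + group) names, so that
// 40K groups don't need 40K bits per token. Sessions with identical
// principal sets get identical tokens.
public final class AclTokens {

    public static String tokenFor(Iterable<String> principalNames)
            throws NoSuchAlgorithmException, UnsupportedEncodingException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        // Sort first so the token does not depend on iteration order.
        TreeSet<String> sorted = new TreeSet<String>();
        for (String name : principalNames) {
            sorted.add(name);
        }
        for (String name : sorted) {
            md.update(name.getBytes("UTF-8"));
            md.update((byte) 0); // separator, so "ab"+"c" != "a"+"bc"
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}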

The other problem is that, although the IndexReader has a static set of
documents, the ACLs are not static, so each ACL modification will cause the
bitset derived from that ACL to become invalid. If the root of a subtree
changes, all bitsets for the subtree become invalid. Our repositories are
write-heavy: most if not all of the 100K users can update content and, where
they manage small groups, also the ACLs, which means a significant amount of
ACL modification traffic.

Ours is not the typical ECM use case.

I will think about it some more, since I don't really know exactly what the
real number of unique auth tokens is, or the frequency of ACL updates.
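
In the meantime, here is how I read step 3 below, as a minimal sketch
(everything here is hypothetical: AuthBitSetCache and AccessChecker are
made-up names standing in for whatever resolves our ACLs, and it assumes the
Lucene 2.x-era IndexReader API that Jackrabbit 2 builds on):

import java.io.IOException;
import java.util.BitSet;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.lucene.index.IndexReader;

// Hypothetical sketch of step 3: one cached 'authorized' bitset per
// (index reader, auth token) pair; one instance of this cache would
// hang off each ReadOnlyIndexReader.
public class AuthBitSetCache {

    public interface AccessChecker {
        boolean canRead(IndexReader reader, int doc) throws IOException;
    }

    // token -> authorized docs for one ReadOnlyIndexReader. The reader's
    // document set is static, so an entry stays valid until an ACL changes.
    private final ConcurrentHashMap<String, BitSet> cache =
            new ConcurrentHashMap<String, BitSet>();

    public BitSet authBits(IndexReader reader, String token,
                           AccessChecker checker) throws IOException {
        BitSet bits = cache.get(token);
        if (bits != null) {
            return bits; // every later session with this token is free
        }
        // First session with a new token pays the one-off cost of
        // authorizing every document in the (static) reader.
        bits = new BitSet(reader.maxDoc());
        for (int doc = 0; doc < reader.maxDoc(); doc++) {
            if (!reader.isDeleted(doc) && checker.canRead(reader, doc)) {
                bits.set(doc);
            }
        }
        cache.putIfAbsent(token, bits);
        return bits;
    }

    // Any ACL modification invalidates the cached bitsets; with our ACL
    // churn this would fire often, which is the concern above.
    public void invalidateAll() {
        cache.clear();
    }
}

Step 5 then falls out of this: intersecting a query's hit bitset with the
cached auth bitset and taking cardinality() gives the authorized total
without a per-hit access check. It is the invalidateAll() path that worries
me, given our ACL modification traffic.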

Thanks
Ian


> 3) For each ReadOnlyIndexReader, which contains an in-memory deleted
> bitset, you add an 'authorized bitset'. This means that every time a
> search comes in with a *new* unique token, you have to authorize
> every Lucene document once to get the auth bitset for that token: this
> shouldn't be too hard. After this, you associate the cached auth bitset
> with the token. Now every other user having the same token also has an
> in-memory cached bitset.
> 4) Your searches are done on your 'extended SearchIndex', which
> consists of a set of Lucene ReadOnlyIndexReaders, which in turn have
> an extra filter for the authorization: thus, Lucene returns
> you authorized hits.
> 5) Add some API call or something that exposes
> QueryResultImpl#getTotalSize(): this initially returns you the
> Lucene hit count, but, as you already made it 'authorized', it returns
> you the correct hit count instantly, without having to check access for
> every hit. I actually also still have this one open for our repo [1]
> 
> Note that if new documents are added to the repository, all existing
> auth bitsets for all existing ReadOnlyIndexReaders are still valid!
> Only a new index reader is added. For this new one, you'll still need
> to create the auth bitset when a search comes in. But this is
> always a small index containing few nodes.
> 
> Regards Ard
> 
> ps it won't be simple to implement it all :)
> 
> [1] https://issues.onehippo.com/browse/HREPTWO-4430
> 
>> 
>> Ian
>> On 25 Aug 2010, at 10:19, Ard Schrijvers wrote:
>> 
>>> Hello Ian et al,
