Re: Query totals - approximations.

Simon Gaeremynck Wed, 07 Jul 2010 01:44:57 -0700

Hi Ard,

My apologies for waiting so long to respond.


See my responses inline.

On 1 Jul 2010, at 21:45, Ard Schrijvers wrote:

> hello Simon,
> 
> On Thu, Jul 1, 2010 at 12:16 PM, Simon Gaeremynck <[email protected]> wrote:
>> First off, I know the question has been asked many times before whether
>> it is possible to get an accurate count from query results.
>> I know Jackrabbit only loads the next result when it really has to, which
>> is fine since it gives a great performance boost.
>> And I also know you can "trick/force" Jackrabbit to return a total by
>> adding a sort in there, but that's not really what we want.
>> 
>> So we thought we might take a Google approach where we say
>>  "Displaying first 10 results of approximately 1400000."
> 
> Note that Google has, apart from the amount of data, obviously quite
> an easy job: there is no authorization involved. If you are using
> Gmail, check out the number of results they show you there: 'first 20
> of hundreds' or 'first 20 of thousands'. Also note that Gmail is an
> entirely domain-specific solution, where it should be easier to show
> actual hit counts.
> 
> Obviously, I do not want to talk about Google, but want to give you an
> idea about the complexity: authorized exact counting when you have a
> fine-grained access manager can only be shown correctly if you
> authorize every Lucene hit. Extrapolating like you do below is imo
> really not a very good solution, see below:

Could you elaborate on "authorizing every Lucene hit"?
AFAIK that is what Jackrabbit does?

> 
>> 
>> Some more info about this:
>> Now, to do this we thought we could get the hit count from Lucene, get
>> the first 10 nodes, keep a record of how many Lucene Documents we had to
>> iterate over to get those first 10, and then do a very rudimentary
>> approximation of how many nodes the user would be able to see for this
>> query.
>> 
>> ie:
>> 1.  Lucene returns a total hitcount of 1.523.145
>> 2.  We fetch the first 10 Nodes, which results in 452 Documents that needed
>>     to be processed but could not be used because the user doesn't have READ
>>     access.
>> 3.  Based on these 2 numbers we approximate that the user can see 3370 Nodes.
>> 4.  We round this number off to 3300 just to indicate that it's unlikely we
>>     guessed right.
>> 5.  The UI displays a message along the lines of:
>>        "
>>          Displaying page 1 of approximately 330
>>          Showing 10 results per page.
>>        "
> 
> imo, you assume that access is evenly scattered over the repository. I
> think this is not a realistic assumption. It might be in your case,
> but it is not very general. Imo, you certainly cannot extrapolate it
> like this.

Yes, I know, and it's far from perfect.
It is, however, a start: at least we would be able to give the user some
indication (however poor it is).
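
To make the extrapolation in the quoted steps concrete, here is a minimal sketch of the arithmetic, assuming access really is spread evenly over the hits (which, as noted, it may well not be). The class and method names are made up for illustration and are not part of Jackrabbit or Lucene. With the numbers from the example (1.523.145 hits, 10 readable, 452 skipped) this formula gives about 32968 rather than 3370; the 3370 in the quoted steps corresponds to 1.523.145 / 452.

```java
// Hypothetical helper, not a Jackrabbit or Lucene API.
final class ApproxTotal {

    // Uniform-access extrapolation: the fraction of readable documents seen
    // so far (visible / processed) is applied to the total Lucene hit count.
    public static long estimateVisible(long totalHits, long visible, long skipped) {
        long processed = visible + skipped;
        if (processed == 0) {
            return 0; // nothing processed yet, so there is nothing to extrapolate from
        }
        return totalHits * visible / processed; // integer division, rounds down
    }

    // Rounds down to a multiple of step, to signal that the number is a guess.
    public static long roundDown(long value, long step) {
        return (value / step) * step;
    }

    // Number of pages needed to show the given number of results.
    public static long pages(long results, long pageSize) {
        return (results + pageSize - 1) / pageSize;
    }
}
```

Calling `ApproxTotal.estimateVisible(1523145L, 10, 452)` and then `roundDown(..., 100)` and `pages(..., 10)` yields the numbers for the "page 1 of approximately N" message.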


> 
>> 
>> 
>> 
>> Now I had a look at how Jackrabbit executes queries and there seem to be
>> 3 ways it gets the QueryHits (in JackrabbitIndexSearcher.evaluate):
>> - It is a JackrabbitQuery -- let the Query implementation deal with it.
>> - It is not a JackrabbitQuery and there is no sort required -- use
>>   LuceneQueryHits
>> - It is not a JackrabbitQuery and there is a sort required -- use
>>   SortedLuceneQueryHits
>> 
>> So far I've only been able to get the Lucene hit count from the
>> SortedLuceneQueryHits because it uses a TopFieldDocCollector and it's very
>> simple to get it from there ^-^.
>> All the other ones use the same concept as the Node/Row iterators and only
>> load the next one when asked. (Note: I'm an absolute Lucene novice)
>> Maybe this question should be asked on the Lucene list rather than here,
>> but is there a way to grab the hit count from a query? (be it Jackrabbit
>> or Lucene)
> 
> Getting the total hit count from Lucene is really easy, but this is not
> where the pain is. It is about authorization. Fine-grained
> authorization is not manageable to index. This is quite a general
> issue between searching and authorization. Caching it is also quite
> hard, as Lucene does not have stable ids. At Hippo we have an
> access manager which acts on properties of documents. I was able to
> write the access rules as Lucene queries, and used some extra indexing.
> This way, instant authorized counting was achieved, which is
> especially nice for faceted navigation, which is exposed over JCR as
> virtual nodes. But this all is quite some work, and most likely not
> feasible for you. However, I do understand your issue.
> 
> So, without wanting only to discourage you, what kind of access
> manager do you have? Is it based on properties?
> 

We use the default access manager in Jackrabbit + some extensions of our own.
These extensions include dynamic ACEs, e.g. a date-based ACE:
    if currentTime < timeOnAceNode then the user has jcr:read=none
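
As a rough illustration only, the date-based rule could look like the sketch below. The class name and the way the timestamp is passed in are assumptions for the example, not our actual implementation. The point is that the outcome depends on the clock at query time, so it is hard to precompute.

```java
import java.time.Instant;

// Hypothetical model of a date-based ACE, not the real extension code.
final class DateBasedAce {

    // Stand-in for the timestamp stored on the ACE node.
    private final Instant timeOnAceNode;

    DateBasedAce(Instant timeOnAceNode) {
        this.timeOnAceNode = timeOnAceNode;
    }

    // jcr:read is denied while currentTime < timeOnAceNode, allowed afterwards.
    boolean canRead(Instant currentTime) {
        return !currentTime.isBefore(timeOnAceNode);
    }
}
```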

We do not know the full extent of these Dynamic rules as they are
hooked up to Drools and we allow admins/managers to write their 
own custom rules.

>> 
>> Having an approximation of a result total really is a blocker for us.
>> Is the above idea doable or is it utter madness?
> 
> As we did not yet hook into the Jackrabbit search count part, one of our
> customers also faced this problem. In the end he agreed on the
> following:
> 
> Showing 10 of more than 200 hits
> 
> We would limit (and thus authorize) the search to 200. When you get to
> 200, you can increase the limit to, say, 1000.

Can you elaborate on this?
We currently limit the number of nodes a person can retrieve through searching.
Are you doing the same?
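
If I read the suggestion correctly, it amounts to something like the sketch below: authorize hits one by one, but stop as soon as the count passes the limit. This is only my interpretation; `CappedCount`, `countUpTo` and the `canRead` predicate are hypothetical names, and the real check would go through the access manager.

```java
import java.util.Iterator;
import java.util.function.Predicate;

// Hypothetical sketch of capped, authorized counting.
final class CappedCount {

    // Counts readable hits and stops once the count exceeds limit.
    // A return value of limit + 1 means "more than limit".
    static <T> int countUpTo(Iterator<T> hits, Predicate<T> canRead, int limit) {
        int count = 0;
        while (hits.hasNext() && count <= limit) {
            if (canRead.test(hits.next())) {
                count++;
            }
        }
        return count;
    }

    // Builds the message shown to the user for one page of results.
    static String label(int pageSize, int count, int limit) {
        if (count > limit) {
            return "Showing " + pageSize + " of more than " + limit + " hits";
        }
        return "Showing " + Math.min(pageSize, count) + " of " + count + " hits";
    }
}
```

The cost of the count is then bounded by how many hits have to be checked to find limit + 1 readable ones, rather than by the full Lucene hit count.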

> 
> Hope this helps a little,
> 
> Ard
> 
>> 
>> My apologies for this very long email.
>> 
>> 
>> Regards,
>> Simon
