Hi Ard,

My apologies for waiting so long to respond.
See my responses inline.

On 1 Jul 2010, at 21:45, Ard Schrijvers wrote:

> hello Simon,
>
> On Thu, Jul 1, 2010 at 12:16 PM, Simon Gaeremynck <[email protected]>
> wrote:
>> First off I know the question has been asked many times before whether
>> it is possible to get an accurate count from query results.
>> I know Jackrabbit only loads the next result when it really has to,
>> which is fine since it gives a great performance boost.
>> And I also know you can "trick/force" Jackrabbit to return a total by
>> adding a sort in there but that's not really what we want.
>>
>> So we thought we might take a Google approach where we say
>> "Displaying first 10 results of approximately 1400000."
>
> Note that Google has, apart from the amount of data, obviously quite
> an easy job: there is no authorization involved. If you are using
> Gmail, check out the number of results they show you there: 'first 20
> of hundreds', or 'first 20 of thousands'. Also note that Gmail is an
> entirely domain-specific solution, where it should be easier to show
> actual hit counts.
>
> Obviously, I do not want to talk about Google, but want to give you an
> idea about the complexity: authorized exact counting when you have a
> fine-grained access manager can only be shown correctly if you
> authorize every Lucene hit. Extrapolating like you do below is imo
> really not a very good solution, see below:

Could you elaborate on "authorizing every Lucene hit"?
As far as I know, that is what Jackrabbit already does?

>
>>
>> Some more info about this:
>> Now, to do this we thought we could get the hit count from Lucene, get
>> the first 10 nodes,
>> keep a record of how many Lucene Documents we had to iterate over to get
>> those first 10
>> and then do a very rudimentary approximation of how many nodes the user
>> would be able to see for this query.
>>
>> ie:
>> 1. Lucene returns a total hitcount of 1.523.145
>> 2. We fetch the first 10 Nodes which results in 452 Documents that needed
>> to be processed but could not be used because the user doesn't have READ
>> access.
>> 3. Based on these 2 numbers we approximate that the user can see 3370
>> Nodes.
>> 4. We round this number off to 3300 just to indicate that it's unlikely
>> we guessed right.
>> 5. The UI displays a message along the lines of:
>> "
>> Displaying page 1 of approximately 330
>> Showing 10 results per page.
>> "
>
> imo, you assume that access is evenly scattered over the repository. I
> think this is not a realistic assumption. It might be in your case,
> but it is not very general. Imo, you certainly cannot extrapolate it
> like this.

Yes, I know, and it's far from perfect.
It is a start, however: at least we would be able to give the user some
indication (however poor it is).

>
>>
>>
>>
>> Now I had a look at how Jackrabbit executes queries and there seem to be
>> 3 ways it gets the QueryHits (in JackrabbitIndexSearcher.evaluate)
>> - Check if it is a JackrabbitQuery and let the Query implementation deal
>> with it.
>> - It is not a JackrabbitQuery and there is no sort required -- use
>> LuceneQueryHits
>> - It is not a JackrabbitQuery and there is a sort required -- use
>> SortedLuceneQueryHits
>>
>> So far I've only been able to get the Lucene hit count from the
>> SortedLuceneQueryHits because it uses a TopFieldDocCollector and it's
>> very simple to get it from there ^-^.
>> All the other ones use the same concept as the Node/Row-Iterators and
>> only load the next one when asked. (Note: I'm an absolute Lucene novice)
>> Maybe this question should be asked on the Lucene list rather than here,
>> but is there a way to grab the hitcount from a query? (be it Jackrabbit
>> or Lucene)
>
> Getting the total hit count from Lucene is really easy, but this is not
> where the pain is. It is about authorization. Fine-grained
> authorization is not manageable to index. This is quite a general
> issue between searching and authorization. Caching it is also quite
> hard, as Lucene does not have stable ids. At Hippo we have an
> access manager which acts on properties of documents. I was able to
> write the access rules as Lucene queries, and used some extra indexing.
> This way, instant authorized counting was achieved, which is
> especially nice for faceted navigation, which is exposed over JCR as
> virtual nodes. But this all is quite some work, and most likely not
> feasible for you. However, I do understand your issue.
>
> So, without only trying to discourage you, what kind of access
> manager do you have? Is it based on properties?
>

We use the default access manager in Jackrabbit plus some extensions of
our own. These extensions include dynamic ACEs, e.g. a date-based ACE:

    if currentTime < timeOnAceNode then user has jcr:read=none

We do not know the full extent of these dynamic rules, as they are hooked
up to Drools and we allow admins/managers to write their own custom rules.

>>
>> Having an approximation of a result total really is a blocker for us.
>> Is the above idea doable or is it utter madness?
>
> As we did not yet hook into the Jackrabbit search count part, some
> customer also faced this problem. He in the end agreed on the
> following:
>
> Showing 10 of more than 200 hits
>
> we would limit (and thus authorize) the search to 200. When you go to
> 200, you can increase the limit to, say, 1000

Can you elaborate on this?
We currently limit the number of nodes a person can retrieve through
searching. Are you doing the same then?

>
> Hope this helps a little,
>
> Ard
>
>>
>> My apologies for this very long email.
>>
>>
>> Regards,
>> Simon
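
For reference, here is a minimal sketch of the extrapolation described in steps 1 to 5 of the quoted mail. It is plain Python used purely for illustration and calls no Jackrabbit or Lucene API. The function names, and the choice to count the 10 accepted nodes as part of the processed documents, are assumptions and not something stated in the thread. With the quoted numbers this ratio gives about 32,968 visible nodes, whereas the figure of 3370 in step 3 corresponds to 1,523,145 / 452.

```python
def estimate_visible_total(total_hits, accepted, rejected):
    """Extrapolate how many hits the user can read.

    total_hits: unauthorized hit count reported by the index
    accepted:   hits the user was allowed to read so far (e.g. one page)
    rejected:   hits skipped so far because READ access was denied

    Assumes access is spread evenly over the result set, which the
    thread itself flags as unrealistic in general.
    """
    processed = accepted + rejected
    if processed == 0:
        return 0
    # Scale the observed readable fraction up to the full hit count.
    return total_hits * accepted // processed


def round_down(value, step=100):
    """Coarsen the estimate so it does not look exact (step 4)."""
    return value // step * step


if __name__ == "__main__":
    estimate = estimate_visible_total(1_523_145, accepted=10, rejected=452)
    shown = round_down(estimate)
    pages = shown // 10  # 10 results per page, as in step 5
    print(f"Displaying page 1 of approximately {pages}")
```

The estimate can only be as good as the sample: after a single page of 10 results it swings widely, so refining it as the user pages further (passing the cumulative accepted and rejected counts) is probably the least one would want.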
