On Wed, Dec 10, 2014 at 10:17 AM, Ard Schrijvers <[email protected]> wrote:
> On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <[email protected]> wrote:
>> On 09/12/2014 17:10, Michael Marth wrote:
>>> ...
>>>
>>> The problematic use cases for counting facets that I have in mind are
>>> those where a query returns millions of results. This is problematic
>>> when one wants to retrieve the exact size of the result set (taking
>>> ACLs into account, obviously). When facets are to be retrieved this
>>> will be an even harder problem (meaning when the exact number is to be
>>> calculated per facet).
>>> As an illustration consider a digital asset management application
>>> that displays mime type as facets. A query could return 1 million
>>> images and, say, 10 videos.
>>>
>>> Is there a way we could support such scenarios (while still counting
>>> results per facet) and have a performant implementation?
>>>
>> We can opt for ACL-checking/parsing at most X (let's say 1000) nodes. If
>> we're done within that limit, then we can output the actual number. If,
>> after 1000 nodes checked, we still have some left, we can leave the
>> number either empty or show something like "many", "+", or any other
>> fancy marker if we want.
>>
>> In the end it is the same approach taken by Amazon (as Tommaso already
>> pointed out) or, for example, Google. If you run a search, their facets
>> (Searches related to...) never come with result counts.
>
> I don't think Amazon and Google have customers that can demand that they
> show correct facet counts...our customers typically do :). My take on
> this would be to have a configurable option between
>
> 1) exact and possibly slow counts
> 2) unauthorized, possibly incorrect, fast counts
>
> Obviously, the second just uses the faceted navigation counts from the
> backing search implementation (with node by node access manager
Here of course I meant to write: '**without** node by node access manager check'

> check), whether it is the internal Lucene index, Solr or
> Elasticsearch. If you opt for the second option, then, depending on
> your authorization model, you can get fast exact authorized counts as
> well: when the authorization model can be translated into a search
> query / filter that is AND-ed with every normal search. For ES this is
> briefly described at [1]. Most likely the filter is internally cached,
> so even for very large authorization queries (like we have at Hippo
> because of our fine-grained ACL model) it will still perform well.
> Obviously it depends quite heavily on your authorization model whether
> it can be translated to a query. If it relies on an external
> authorization check or has many hierarchical constraints, it will be
> very hard. If you choose to have it based on, say, nodetype, nodename,
> node properties and jcr:path (a pseudo property), it can be easily
> translated to a query. Note that for the jcr:path hierarchical ACL
> (e.g. read everything below /foo) it is not possible to write a Lucene
> query easily unless you index path information as well....which makes
> moves of large subtrees slow, because the entire subtree needs to be
> re-indexed. A different authorization model might be based on groups,
> where every node also gets indexed with the groups (the token of each
> group) that can read that node. Although I never looked much into the
> code, I suspect [2] does something like this.
>
> So, instead of second guessing which trade-off (slow queries, wrong
> counts, etc.) might be acceptable for which customers/users, I'd try to
> keep the options open, have a default of correct (slow) counts, and
> make it easy to flip to 'counts from the indexes without accessmanager
> authorization', where, depending on the authorization model, the
> latter can return correct results.
>
> For those who are interested, I will be listening to [3] this
> afternoon (5 pm GMT).
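
To make the group-based model a bit more concrete, here is a minimal
sketch of what "AND-ing an authorization filter with every search" could
look like. It is plain Python over a toy in-memory inverted index, not
Lucene, Solr, ES or ManifoldCF code; the class TinyIndex, the field name
"readGroup" and the method names are all made up for illustration.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Toy inverted index; every name here is hypothetical."""

    def __init__(self):
        # (field, value) -> set of document ids containing that term
        self.postings = defaultdict(set)
        # document id -> mime type, used as the facet value
        self.mime = {}

    def add(self, doc_id, mime, text, read_groups):
        self.mime[doc_id] = mime
        for word in text.lower().split():
            self.postings[("text", word)].add(doc_id)
        # Index the token of every group that may read this document.
        for group in read_groups:
            self.postings[("readGroup", group)].add(doc_id)

    def facet_counts(self, word, user_groups):
        # The normal search: documents matching the query term.
        matches = self.postings.get(("text", word.lower()), set())
        # The authorization filter: documents readable by any of the
        # user's groups (an OR over the group tokens).
        readable = set()
        for group in user_groups:
            readable |= self.postings.get(("readGroup", group), set())
        # AND the two sets, then count per facet value. No per-document
        # access check is needed after this point.
        return dict(Counter(self.mime[d] for d in matches & readable))


index = TinyIndex()
index.add(1, "image/png", "sunset beach", {"everyone"})
index.add(2, "image/png", "sunset city", {"editors"})
index.add(3, "video/mp4", "sunset timelapse", {"editors", "everyone"})
print(index.facet_counts("sunset", {"everyone"}))
```

The counts are exact for the user only as long as the indexed group
tokens fully describe who can read a document; a model with external
checks or hierarchical constraints would not fit this shape.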
>
> Regards Ard
>
> [1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
> [2] http://manifoldcf.apache.org/en_US/index.html
> [3] http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/
>
>>
>> D.
>>
>
> --
> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> Boston - 1 Broadway, Cambridge, MA 02142
>
> US +1 877 414 4776 (toll free)
> Europe +31(0)20 522 4466
> www.onehippo.com
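
For comparison, Davide's capped approach quoted above (ACL-check at most
X nodes, then stop giving exact numbers) could be sketched roughly as
follows. This is only an illustration in plain Python: the function name
capped_facet_counts, the can_read callback and the "N+" rendering are
assumptions of mine, not anything that exists in Oak.

```python
from collections import Counter


def capped_facet_counts(hits, can_read, limit=1000):
    """Count facet values, access-checking at most `limit` hits.

    hits:     iterable of (path, facet_value) pairs from the index
    can_read: hypothetical per-node access check, path -> bool
    Returns facet_value -> label; labels are exact ("12") when every
    hit was checked, and lower bounds ("12+") when the cap was reached.
    """
    counts = Counter()
    checked = 0
    truncated = False
    for path, facet in hits:
        if checked >= limit:
            # There are more hits than we are willing to check.
            truncated = True
            break
        checked += 1
        if can_read(path):
            counts[facet] += 1
    suffix = "+" if truncated else ""
    return {facet: f"{n}{suffix}" for facet, n in counts.items()}


hits = [(f"/img{i}", "image/png") for i in range(5)] + [("/v0", "video/mp4")]
print(capped_facet_counts(hits, lambda path: True, limit=3))
```

Note the weakness in the "1 million images and 10 videos" case: if the
videos happen to sort after the cap, that facet does not show up at all,
which is one more reason to keep the exact (slow) mode available.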
