On Wed, Dec 10, 2014 at 10:17 AM, Ard Schrijvers
<[email protected]> wrote:
> On Wed, Dec 10, 2014 at 9:32 AM, Davide Giannella <[email protected]> wrote:
>> On 09/12/2014 17:10, Michael Marth wrote:
>>> ...
>>>
>>> The problematic use cases I have in mind for counting facets are those 
>>> where a query returns millions of results. This is problematic when one 
>>> wants to retrieve the exact size of the result set (taking ACLs into 
>>> account, obviously). When facets are to be retrieved this will be an even 
>>> harder problem (meaning when the exact number is to be calculated per 
>>> facet).
>>> As an illustration consider a digital asset management application that 
>>> displays mime type as facets. A query could return 1 million images and, 
>>> say, 10 videos.
>>>
>>> Is there a way we could support such scenarios (while still counting 
>>> results per facet) and have a performant implementation?
>>>
>> We can opt for ACL-Checking/Parsing at most X (let's say 1000) nodes. If
>> we're done within it, then we can output the actual number. In case
>> after 1000 nodes checked we still have some left we can leave the number
>> either empty or with something like "many", "+", or any other fancy way
>> if we want.
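
To make the capped approach concrete, here is a minimal sketch in Python. It is not Oak code: `capped_facet_counts`, `format_count`, the `can_read` callback and the `max_checked` parameter are all hypothetical names, and the hits are assumed to arrive as (node id, facet value) pairs. Once the cap is reached with hits still left, the counts are only lower bounds and get rendered with a "+".

```python
from collections import Counter


def capped_facet_counts(hits, can_read, max_checked=1000):
    """Count readable hits per facet value, ACL-checking at most max_checked hits.

    hits     -- iterable of (node_id, facet_value) pairs (assumed shape)
    can_read -- hypothetical callback standing in for the access manager
    Returns (counts, exact); exact is False when unchecked hits remain.
    """
    counts = Counter()
    checked = 0
    for node_id, facet_value in hits:
        if checked >= max_checked:
            # At least one hit is left unchecked: counts are lower bounds only.
            return counts, False
        checked += 1
        if can_read(node_id):
            counts[facet_value] += 1
    return counts, True


def format_count(count, exact):
    """Render an exact number, or a lower bound marked with '+'."""
    return str(count) if exact else f"{count}+"
```

With a cap of 1000 and a result set of 1 million images, the image facet would show something like "1000+" at most, while a small result set still gets its exact numbers.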
>>
>> In the end it is the same approach taken by Amazon (as Tommaso already
>> pointed out) or, for example, Google. If you run a search, their facets
>> (Searches related to...) never come with result counts.
>
> I don't think Amazon and Google have customers that can demand them to
> show correct facet counts...our customers typically do :). My take on
> this would be to have a configurable option between
>
> 1) exact and possibly slow counts
> 2) unauthorized, possibly incorrect, fast counts
>
> Obviously, the second just uses the faceted navigation counts from the
> backing search implementation (with node by node access manager

Here of course I meant to write: '**without** node by node access manager check'

> check), whether it is the internal Lucene index, Solr or
> Elasticsearch. If you opt for the second option then, depending on your
> authorization model, you can get fast, exact, authorized counts as well,
> namely when the authorization model can be translated into a search query /
> filter that is AND-ed with every normal search. For ES this is briefly
> described at [1]. Most likely the filter is internally cached so even
> for very large authorization queries (like we have at Hippo because of
> our fine-grained ACL model) it will still perform well. Obviously it depends
> quite heavily on your authorization model whether it can be translated
> to a query. If it relies on an external authorization check or has
> many hierarchical constraints, it will be very hard. If you choose to
> base it on, say, nodetype, nodename, node properties and
> jcr:path (a pseudo property) it can easily be translated to a
> query. Note that for a jcr:path hierarchical ACL (e.g. read everything
> below /foo) it is not easy to write a Lucene query unless
> you index path information as well. As a result, moves of
> large subtrees are slow because the entire subtree needs to be
> re-indexed. A different authorization model might be based on groups,
> where every node is also indexed with the tokens of the groups
> that can read it. Although I never looked much into the code, I
> suspect [2] does something like this.
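
As a rough illustration of the group-based model, the sketch below uses a tiny in-memory list as a stand-in for the search index. The `readers` field, `authorization_filter` and `filtered_facet_counts` are made-up names for this example, not the API of Lucene, Solr, Elasticsearch or ManifoldCF. The point is only that the authorization becomes a filter AND-ed with the normal query, so the facet counts are authorized without any node-by-node access manager check.

```python
from collections import Counter

# Hypothetical index: every document is stored together with the tokens of
# the groups that may read it, next to the facet field.
INDEX = [
    {"path": "/foo/a.png", "mimeType": "image/png", "readers": {"everyone"}},
    {"path": "/foo/b.png", "mimeType": "image/png", "readers": {"editors"}},
    {"path": "/bar/c.mp4", "mimeType": "video/mp4", "readers": {"editors", "admins"}},
    {"path": "/bar/d.mp4", "mimeType": "video/mp4", "readers": {"admins"}},
]


def authorization_filter(user_groups):
    """Build a filter matching documents readable by any of the user's groups."""
    groups = frozenset(user_groups)
    return lambda doc: not groups.isdisjoint(doc["readers"])


def filtered_facet_counts(index, query, auth_filter, field):
    """Facet counts over documents matching the query AND the authorization filter."""
    return Counter(doc[field] for doc in index if query(doc) and auth_filter(doc))


# Usage: all documents, as seen by a user in the groups 'everyone' and 'editors'.
match_all = lambda doc: True
counts = filtered_facet_counts(
    INDEX, match_all, authorization_filter({"everyone", "editors"}), "mimeType"
)
```

The downside is the same as with indexed path information: whenever the readers of a node change, that node has to be re-indexed.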
>
> So, instead of second-guessing what might be acceptable (slow
> queries, wrong counts, etc.) for which customers/users, I'd try to keep
> the options open, have a default of correct (slow) counts, and make it
> easy to flip to 'counts from the indexes without access manager
> authorization', where, depending on the authorization model, the latter
> can return correct results.
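
The switch itself could be as small as the following sketch. Again this is only an assumption about how such an option might look: `FacetCountMode`, `count_facets` and the `can_read` callback are hypothetical, and the default is the exact (slow) mode as proposed above.

```python
from collections import Counter
from enum import Enum


class FacetCountMode(Enum):
    EXACT = "exact"  # check every hit with the access manager: slow, correct
    INDEX = "index"  # take counts from the index as-is: fast, possibly unauthorized


def count_facets(hits, can_read, mode=FacetCountMode.EXACT):
    """Count facet values over (node_id, facet_value) pairs in the chosen mode."""
    if mode is FacetCountMode.EXACT:
        return Counter(facet for node_id, facet in hits if can_read(node_id))
    # INDEX mode: no per-node check, so counts may include unreadable nodes
    # unless the index query already carried an authorization filter.
    return Counter(facet for _node_id, facet in hits)
```

Both modes agree whenever every hit is readable by the user, which is exactly the case when the authorization has already been applied as a filter in the index query.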
>
> For those who are interested, I will be listening to [3] this
> afternoon (5 pm GMT).
>
> Regards Ard
>
> [1] 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-aliases.html#filtered
> [2] http://manifoldcf.apache.org/en_US/index.html
> [3] 
> http://www.elasticsearch.com/webinars/shield-securing-your-data-in-elasticsearch/
>
>
>>
>> D.
>>
>>
>>
>>
>
>
>
> --
> Amsterdam - Oosteinde 11, 1017 WT Amsterdam
> Boston - 1 Broadway, Cambridge, MA 02142
>
> US +1 877 414 4776 (toll free)
> Europe +31(0)20 522 4466
> www.onehippo.com



