[ 
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053847#comment-13053847
 ] 

Shai Erera commented on LUCENE-3079:
------------------------------------

I would like to contribute IBM's faceted search package (I've been wanting to 
do that for quite a while). The package supports the following 
features/capabilities (at a high level):

* Taxonomy index -- manages trees of 'categories'. You can view an example of 
a taxonomy at, e.g., the Open Directory Project.
** It's a Lucene index managed alongside the content index.
** Builds the taxonomy on-the-fly (i.e. as categories are discovered).
** In general it maps a category hierarchy to ordinals (integers). For example, 
the category /date/2011/06/24 will create the following entry in the taxonomy 
index:
*** /date, ordinal=1
*** /date/2011, ordinal=2
*** /date/2011/06, ordinal=3
*** /date/2011/06/24, ordinal=4
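The on-the-fly ordinal assignment above can be sketched in plain Java. This is an illustrative model with assumed names (TaxonomySketch etc.), not the actual Lucene taxonomy API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model of a taxonomy: every component of a hierarchical
// category path gets a stable ordinal, assigned the first time it is seen.
public class TaxonomySketch {
    private final Map<String, Integer> ordinals = new LinkedHashMap<>();

    // Adds the path and all of its parent components to the taxonomy
    // (if not already there) and returns the ordinal of the full path.
    // Paths are expected to start with '/', e.g. "/date/2011/06/24".
    public int addCategory(String path) {
        String[] parts = path.substring(1).split("/");
        StringBuilder prefix = new StringBuilder();
        int ord = -1;
        for (String part : parts) {
            prefix.append('/').append(part);
            // ordinals start at 1 and grow as categories are discovered
            ord = ordinals.computeIfAbsent(prefix.toString(),
                    p -> ordinals.size() + 1);
        }
        return ord;
    }

    // Returns the ordinal of a previously added path, or -1 if unknown.
    public int getOrdinal(String path) {
        return ordinals.getOrDefault(path, -1);
    }
}
```

Adding /date/2011/06/24 to an empty taxonomy produces exactly the four entries listed above, with ordinals 1 through 4.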

* FacetsDocumentBuilder receives a list of categories that are associated 
w/ the document (can be of several dimensions) and:
** Fetches the ordinals of the category components from the taxonomy index 
(adding them to it on-the-fly).
** Indexes them in a (compressed) payload for the document (so for the above 
category example, 4 payloads will be indexed for the document).
** FDB can be used to augment a Document with other fields for indexing (it 
adds its own Field objects).

* FacetsCollector receives a handle to the taxonomy and a list of facet 
'roots' to count, and returns the top-K categories for each requested facet:
** The root can denote any node in the category tree (e.g., 'count all facets 
under /date/2011')
** Top-K can be returned for the top K immediate children of the root, or for 
the top K anywhere in the sub-tree of the root.

* Counting algorithm (at a high-level):
** Fetch the payload for every matching document.
** Increment by 1 the count of every ordinal that was encountered (even for 
facets that were not requested by the user)
** After all ordinals are counted, compute the top-K on the ones the user 
requested
** Label the result facets
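The counting steps above can be sketched as follows. This is an illustrative sketch with assumed names (FacetCountSketch etc.), not the package's actual code, and it takes the payloads as already-decoded int arrays:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the counting algorithm: count EVERY ordinal seen
// in the matching documents' payloads, then compute the top-K afterwards.
public class FacetCountSketch {

    // Each int[] is the decoded payload (category ordinals) of one
    // matching document; counts[ord] accumulates across all of them.
    public static int[] countOrdinals(List<int[]> matchingDocPayloads,
                                      int taxonomySize) {
        int[] counts = new int[taxonomySize + 1]; // ordinals start at 1
        for (int[] payload : matchingDocPayloads) {
            for (int ord : payload) {
                counts[ord]++; // counted even if the user never asked for it
            }
        }
        return counts;
    }

    // Top-K among candidate ordinals (e.g. the immediate children of a
    // requested root), highest count first.
    public static List<Integer> topK(int[] counts, int[] candidates, int k) {
        List<Integer> sorted = new ArrayList<>();
        for (int c : candidates) sorted.add(c);
        sorted.sort((a, b) -> counts[b] - counts[a]);
        return sorted.subList(0, Math.min(k, sorted.size()));
    }
}
```

Note that counting everything first and selecting afterwards is what makes the facets-partitioning feature below necessary for huge taxonomies: the count array is allocated over all ordinals.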

* Miscellaneous features:
** *Sampling* algorithm allows for more efficient facets counting/accumulation, 
while still returning exact counts for the top-K facets.
** *Complements* algorithm allows for more efficient facets 
counting/accumulation, when the number of results is > 50% of the docs in the 
index (we keep a total count of facets, count facets on the docs that did not 
match the query and subtract).
*** Complements can be used to count facets that do not appear in any of the 
matching documents (of this result set). This does not exist in the package 
though ... yet.
** *Facets partitioning* -- if the taxonomy is huge (i.e. millions of 
categories), it is better to partition them at indexing time, so that search 
time is faster and consumes less memory. Note that this is required because of 
the approach of counting all (allocating a count array) and then keeping only 
the results of interest.
** *Category enhancements* allow storing 'metadata' with categories in the 
index, so that more than simple counting can be implemented:
*** *Weighted facets* (built on top of enhancements) allow associating a 
weight w/ each category and using smarter counting techniques at runtime. For 
example, if facets are generated by an analytics component, the confidence 
level can be set as the category's weight. If tags are indexed as facets (e.g. 
for generating a tag cloud for the result set), the number of times the 
document was tagged with the tag can be set as the tag's weight.
** The fact that facets are indexed in the payloads of documents allows 
managing very large taxonomies and indexes without blowing up the RAM at 
runtime (but incurs some search performance overhead). However, the payloads 
can be loaded into RAM (as in FieldCache), in which case runtime becomes much 
faster.
*** How facets are stored is abstracted by a CountingList API, so we can 
definitely explore other means of storing the facet ordinals. In fact, the 
CountingList API allows us to read the ordinals from disk or RAM w/o affecting 
the rest of the algorithm at all.
** I did not want to dive too deep into the API here, but the runtime API is 
very extensible and allows one to use FacetsCollector for the simple cases, or 
a lower-level API to get more control over the process. You can look at: 
FacetRequest, FacetSearchParams, FacetResult, FacetResultNode, FacetsCollector, 
FacetsAccumulator, FacetsAggregator for a more extensive set of APIs to use.
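The complements algorithm above boils down to simple subtraction. A minimal sketch, with assumed names (ComplementsSketch etc.) rather than the package's real API:

```java
// Illustrative sketch of the complements trick: keep precomputed total
// counts per ordinal, count facets only on the documents that did NOT
// match the query (the smaller set when the query matches > 50% of the
// index), and subtract to recover the counts for the matching set.
public class ComplementsSketch {

    // matching[ord] = total[ord] - nonMatching[ord], for every ordinal.
    public static int[] complementCounts(int[] totalCounts,
                                         int[] nonMatchingCounts) {
        int[] matching = new int[totalCounts.length];
        for (int i = 0; i < totalCounts.length; i++) {
            matching[i] = totalCounts[i] - nonMatchingCounts[i];
        }
        return matching;
    }
}
```

The win is purely in how many payloads get read: counting 10% of the docs and subtracting beats counting the 90% that matched.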

* The package comes with example code which shows how to use the different 
features I've mentioned. There are also unit tests for ensuring the example 
code works :).

* The package comes with a very extensive test suite and has been in use by 
many of our products for a long time, so I can state that it's very stable.

* Some rough performance numbers:
** Collection of 1M documents, a few hierarchical facets per document (we call 
it the 'Amazon' case)
** Query matches 49.8% of the docs (~500K)
** No sampling / complements were used.
** Facets read from disk
** Takes 0.4 seconds to execute the query and count facets.

It will take me a few days to prepare a patch (lots of code) - will upload it 
as soon as it's ready.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
