[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056673#comment-13056673 ]

Shai Erera commented on LUCENE-3079:
------------------------------------

bq. resulting in 1.4M unique paths

AND

bq. LUCENE-3079 requires heap relative to the taxonomy size at indexing time

Ok, now I see what's happening. The taxonomy index maintains a category cache, 
which maps a category to its ordinal. It's maintained at both indexing and 
search time. At indexing time, it's used to quickly determine whether an 
incoming category already exists, and if so to return its ordinal. At search 
time, it's used to quickly label ordinals, as well as to determine hierarchy 
information.
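
To make the search-time side concrete, here's a minimal sketch of labeling an 
ordinal and walking up the hierarchy via parent ordinals. The directory path 
and category are placeholders, and the package names are assumed from the 
current patch:

{code}
import java.io.File;

import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.lucene.LuceneTaxonomyReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TaxonomyLookupSketch {
  public static void main(String[] args) throws Exception {
    Directory taxoDir = FSDirectory.open(new File("/path/to/taxonomy")); // placeholder
    TaxonomyReader taxoReader = new LuceneTaxonomyReader(taxoDir);

    // Look up the ordinal of a category (placeholder path).
    int ordinal = taxoReader.getOrdinal(new CategoryPath("author", "Shai Erera"));
    if (ordinal != TaxonomyReader.INVALID_ORDINAL) {
      // Label the ordinal and walk up to the root via parent ordinals.
      for (int p = ordinal; p != TaxonomyReader.ROOT_ORDINAL; p = taxoReader.getParent(p)) {
        System.out.println(taxoReader.getPath(p));
      }
    }

    taxoReader.close();
  }
}
{code}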

By default, LuceneTaxonomyWriter uses Cl2oTaxonomyWriterCache, which maintains 
a mapping of *all* categories to their ordinals, although the mapping is kept 
compact. There is another cache, LruTaxonomyWriterCache, which, as its javadocs 
state, is a "good choice for huge taxonomies".
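
For example, here's a minimal sketch of opening the writer with the LRU cache 
instead of the default. The constructor signature is assumed from the current 
patch, and the 200K cache size is just an illustration:

{code}
import org.apache.lucene.facet.taxonomy.CategoryPath;
import org.apache.lucene.facet.taxonomy.lucene.LuceneTaxonomyWriter;
import org.apache.lucene.facet.taxonomy.writercache.lru.LruTaxonomyWriterCache;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LruCacheSketch {
  public static void main(String[] args) throws Exception {
    Directory taxoDir = new RAMDirectory();

    // Cap the category->ordinal cache at ~200K entries instead of caching
    // every category, trading some cache misses for bounded heap.
    LuceneTaxonomyWriter taxoWriter = new LuceneTaxonomyWriter(
        taxoDir, OpenMode.CREATE, new LruTaxonomyWriterCache(200000));

    // addCategory returns the category's ordinal, adding it if it's new.
    int ordinal = taxoWriter.addCategory(new CategoryPath("date", "2011", "06"));
    System.out.println("ordinal=" + ordinal);

    taxoWriter.close();
  }
}
{code}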

The taxonomy you create is HUGE by all standards (and I'm not even speaking of 
the 22M case :)). The largest 'normal' taxonomy we've seen was on the order of 
a few hundred thousand nodes (where people's names and social tags were 
maintained), while the largest 'abnormal' taxonomy we've seen contained ~5M 
nodes, and that is considered a very extreme case for taxonomies.

Just to be clear, I'm not trying to make excuses. Your perf test is still great 
in that it tests an extreme case. But I'd expect an extreme case to require 
some changes to the defaults, and using LruTaxonomyWriterCache is one of them 
(with a cacheSize of, say, 100-200K). Another thing I've mentioned this package 
can do is partitioning: because at runtime we allocate an array the size of the 
taxonomy, in your case (1.4M nodes) we'll create an array of ~6MB for every 
query (1.4M ints at 4 bytes each). If you use partitions instead, and split the 
categories into partitions of, say, 10-100K categories each, you'd allocate 
less RAM per query, but might incur some search performance overhead.
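
Here's a sketch of what capping the partition size could look like. I'm 
assuming the fixedPartitionSize() hook in DefaultFacetIndexingParams from the 
current patch, so the exact override may differ in the committed version:

{code}
import org.apache.lucene.facet.index.params.DefaultFacetIndexingParams;

// Index (and later search) with a fixed partition size, so per-query count
// arrays are allocated per 100K-category partition rather than sized to the
// whole 1.4M-node taxonomy. NOTE: fixedPartitionSize() is assumed from the
// current patch; the hook may change before commit.
public class PartitionedIndexingParams extends DefaultFacetIndexingParams {
  @Override
  protected int fixedPartitionSize() {
    return 100000; // categories per partition
  }
}
{code}

The same params instance would then need to be used consistently at both 
indexing and search time, since the partition layout is baked into the index.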

Out of curiosity, with 1.4M unique paths and 1M docs, how many categories are 
assigned to each document, and how many documents are associated with each 
category? When we ran our test, we used a Zipf distribution for categories. If 
in this test we end up associating only a couple of documents per category, 
then this is not a very realistic scenario. And while the package can handle 
it, by not using the defaults, perhaps we should define a scenario that makes 
sense (a common one, that is) and run with it?
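
For reference, here's a small self-contained sketch of drawing 
Zipf-distributed category assignments for such a benchmark (the skew exponent 
and sizes are arbitrary illustration values):

{code}
import java.util.Random;

// Samples category ids with Zipf-distributed frequencies via inverse
// transform sampling: a few head categories get most documents, while the
// long tail gets only a couple each.
public class ZipfCategorySampler {
  private final double[] cdf;
  private final Random random = new Random(42);

  public ZipfCategorySampler(int numCategories, double skew) {
    cdf = new double[numCategories];
    double sum = 0;
    for (int i = 0; i < numCategories; i++) {
      sum += 1.0 / Math.pow(i + 1, skew);
      cdf[i] = sum;
    }
    for (int i = 0; i < numCategories; i++) {
      cdf[i] /= sum; // normalize into a cumulative distribution
    }
  }

  public int nextCategory() {
    double u = random.nextDouble();
    int lo = 0, hi = cdf.length - 1;
    while (lo < hi) { // binary search for the first cdf[i] >= u
      int mid = (lo + hi) >>> 1;
      if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    return lo;
  }

  public static void main(String[] args) {
    ZipfCategorySampler sampler = new ZipfCategorySampler(1400000, 1.0);
    for (int doc = 0; doc < 5; doc++) {
      System.out.println("doc " + doc + " -> category " + sampler.nextCategory());
    }
  }
}
{code}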

I don't think there can be one "right" faceted search solution, but rather a 
collection of tools that match different scenarios. And if it turns out that 
for one case one implementation is better than another, then our job will be 
to create a faceted search layer that lets the user choose what's best for 
their scenario, leaving the rest of their app code unmodified.

What do you think? Perhaps we should open a separate issue, let's call it 
"facet benchmarking", where we define some scenarios, work on extending the 
benchmark package (we have done some preliminary work there) and then compare 
a few approaches?

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, 
> LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has a multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge Solr has because 
> it "owns" the index that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (i.e. lose functionality or performance), we can
> cut over later.
