[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056673#comment-13056673 ]
Shai Erera commented on LUCENE-3079:
------------------------------------

bq. resulting in 1.4M unique paths

AND

bq. LUCENE-3079 requires heap relative to the taxonomy size at indexing time

Ok, now I see what's happening. The taxonomy index maintains a category cache, which maps a category to its ordinal. It's maintained at both indexing and search time. At indexing time it's used to quickly determine whether an incoming category already exists and, if so, to return its ordinal. At search time it's used to quickly label ordinals, as well as to determine hierarchy information.

By default, LuceneTaxonomyWriter uses Cl2oTaxonomyWriterCache, which maintains a mapping of *all* categories to their ordinals, although the mapping is kept compact. There is another cache, LruTaxonomyWriterCache, which, as its javadocs state, is a "good choice for huge taxonomies". The taxonomy you create is HUGE by all standards (and I'm not even speaking of the 22M case :)). The largest 'normal' taxonomy we've seen was on the order of a few 100K nodes (where people's names and social tags were maintained), while the largest 'abnormal' taxonomy we've seen contained ~5M nodes, and that is considered a very extreme case for taxonomies.

Just to be clear, I'm not trying to make excuses. Your perf test is still great in that it tests an extreme case. But I'd expect an extreme case to require some modifications to the defaults, and using LruTWC is one of them (with a cacheSize of, let's say, 100-200K). Another thing I've mentioned this package can do is partitions -- because at runtime we allocate an array the size of the taxonomy, in your case (1.4M nodes) we'll create an array that is ~6MB for every query. If you use partitions, and partition the categories into 10K/100K-category buckets, you'd allocate less RAM, but might incur some search performance overhead. (A sketch of both settings appears at the end of this message.)

Out of curiosity, with 1.4M unique paths and 1M docs, how many categories are assigned to each document, and how many documents are associated with each category? When we ran our tests, we used a Zipf distribution for categories. If in this test we end up associating only a couple of documents per category, then this is not a very realistic scenario. And while the package can handle it by not using the defaults, perhaps we should define a scenario that makes sense (a common one, that is) and run with it?

I don't think there can be one "right" faceted search solution, but rather a collection of tools that match different scenarios. And if it turns out that for one case one implementation is better than another, then our job will be to create a faceted search layer which allows the user to choose what's best for him and leave the rest of his app code unmodified. What do you think? Perhaps we should open a separate issue, let's call it "facet benchmarking", where we define some scenarios, work on extending the benchmark package (we have done some preliminary work there), and then compare a few approaches?

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (e.g. Bobo Browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons
> faceting works so well in Solr: Solr has total control over the
> index, knows exactly when the index has changed to rebuild caches, has a
> strict schema so it can make sense of field types and
> pick faceting algos accordingly, has a multi-phase distributed search
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements
> that can be made to take even more advantage of knowledge Solr has because
> it "owns" the index, that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring. It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs", and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first. We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (i.e. lose functionality or performance), we can
> later cut over.
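For illustration, here is a minimal sketch of the two tunables discussed above for very large taxonomies: opening the taxonomy writer with LruTaxonomyWriterCache instead of the default Cl2oTaxonomyWriterCache, and limiting the per-query counting array via the partition size. This is a sketch against the 3.x facet module as I remember it: the package names, the LuceneTaxonomyWriter constructor that accepts a TaxonomyWriterCache, and especially the getPartitionSize() override are assumptions and may differ in the version you're running.

{noformat}
import java.io.File;

import org.apache.lucene.facet.index.params.DefaultFacetIndexingParams;
import org.apache.lucene.facet.taxonomy.lucene.LuceneTaxonomyWriter;
import org.apache.lucene.facet.taxonomy.writercache.lru.LruTaxonomyWriterCache;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class HugeTaxonomySetup {

  public static void main(String[] args) throws Exception {
    Directory taxoDir = FSDirectory.open(new File("/path/to/taxo"));

    // Bound the writer cache to ~200K categories instead of caching the
    // entire 1.4M-node taxonomy in RAM (the Cl2o default). Cache misses
    // are resolved against the taxonomy index, trading some indexing
    // speed for a fixed heap footprint.
    LuceneTaxonomyWriter taxoWriter = new LuceneTaxonomyWriter(
        taxoDir, OpenMode.CREATE_OR_APPEND, new LruTaxonomyWriterCache(200000));

    // ... add categories while indexing documents ...

    taxoWriter.close();
  }

  // Hypothetical partitioning setup (the exact hook in FacetIndexingParams
  // may differ): instead of one counting array sized to the whole taxonomy
  // (~6MB per query for 1.4M ordinals), split the ordinals into 100K-sized
  // partitions. The same params must be used at indexing and search time.
  static class PartitionedIndexingParams extends DefaultFacetIndexingParams {
    @Override
    public int getPartitionSize() {
      return 100000;
    }
  }
}
{noformat}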