[
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056687#comment-13056687
]
Shai Erera commented on LUCENE-3079:
------------------------------------
About the TaxonomyWriterCache: looking at its API now, I think an
FST-based TWC might be a good fit here. FSTs are known for fast lookups and
low RAM consumption, and TWC maps a CategoryPath (or String) to an
integer, which sounds like a typical use for an FST. So both the
keep-all-in-RAM TWC and the LRU TWC could use an FST and consume less memory.
I'll open an issue for that.
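To make the path->ordinal mapping concrete, here is a minimal sketch. It is
not an FST and not the facet module's code: it is a plain trie keyed on path
components, used only as a stand-in to show the lookup shape. The class name
PathTrieCache and the "/"-separated String paths are assumptions made for
illustration, and a HashMap-per-node trie does not demonstrate the memory
savings an FST is expected to give.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an FST-backed cache: a trie keyed on path components. */
final class PathTrieCache {
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        int ordinal = -1; // -1 means "no ordinal assigned to this path"
    }

    private final Node root = new Node();

    /** Stores the ordinal for a path such as "/a/b/c". */
    void put(String path, int ordinal) {
        Node node = root;
        for (String part : split(path)) {
            node = node.children.computeIfAbsent(part, k -> new Node());
        }
        node.ordinal = ordinal;
    }

    /** Returns the ordinal for the path, or -1 if the path is unknown. */
    int get(String path) {
        Node node = root;
        for (String part : split(path)) {
            node = node.children.get(part);
            if (node == null) {
                return -1;
            }
        }
        return node.ordinal;
    }

    /** Splits "/a/b/c" into ["a", "b", "c"]; "/" yields an empty array (the root). */
    private static String[] split(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        return trimmed.isEmpty() ? new String[0] : trimmed.split("/");
    }
}
```

Paths that share a prefix ("/a/b" and "/a/c") share the nodes for that prefix,
which is the property that makes a category taxonomy a plausible fit for a
more compact automaton-style structure.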
One small correction - TWC is used only at indexing time, mapping from
category->ordinal. For labeling ordinals you use TaxonomyReader, which maintains
its own int->String cache (LRU). Can FST aid in that case as well? I assume it
will consume less space than an Integer->String hash map.
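For reference, the kind of Integer->String LRU map being compared against can
be sketched with java.util.LinkedHashMap in access order. This is only an
illustration of the data structure, not the actual TaxonomyReader cache; the
class name LruLabelCache and the size limit are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical ordinal->label LRU cache; not the actual TaxonomyReader code. */
final class LruLabelCache extends LinkedHashMap<Integer, String> {
    private final int maxSize;

    LruLabelCache(int maxSize) {
        // accessOrder=true: entries are ordered from least to most recently used
        super(16, 0.75f, true);
        this.maxSize = maxSize;
    }

    /** Called by LinkedHashMap after each insertion; evicts the eldest entry when over capacity. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
        return size() > maxSize;
    }
}
```

Each entry here costs a boxed Integer key, a String, and a map node, which is
the per-entry overhead a more compact structure would be trying to avoid.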
-------------
Back to performance -- Toke, did you verify that you get the same top-5
categories from both implementations? Also, can you try running the test asking
for the top-5 categories of a node below the root? I.e., if the paths have the
form /a/b/c/d and you request counts for "/a", I'm interested in how this
performs when you request counts for "/a/b" instead.
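To spell out what counting under "/a" versus "/a/b" means, here is a small
self-contained sketch. It assumes one "/"-separated category path per document
and counts, for each immediate child of the requested node, the paths at or
below that child. The class TopChildren and its topK method are hypothetical
names; this is neither implementation under test, just the expected semantics.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: top-k immediate children of a node, by number of paths below each. */
final class TopChildren {
    static List<Map.Entry<String, Integer>> topK(List<String> paths, String parent, int k) {
        String prefix = parent.endsWith("/") ? parent : parent + "/";
        Map<String, Integer> counts = new HashMap<>();
        for (String path : paths) {
            // Skip paths outside the requested subtree, and the bare prefix itself.
            if (!path.startsWith(prefix) || path.length() == prefix.length()) {
                continue;
            }
            // The immediate child is the path truncated one component below the parent.
            int end = path.indexOf('/', prefix.length());
            String child = end < 0 ? path : path.substring(0, end);
            counts.merge(child, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // Highest count first; ties broken by path so the result is deterministic.
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.comparingByKey()));
        return new ArrayList<>(sorted.subList(0, Math.min(k, sorted.size())));
    }
}
```

With the paths /a/b/c/d, /a/b/c/e, /a/b/f/g and /a/x/y/z, asking for "/a"
gives /a/b=3 and /a/x=1, while asking for "/a/b" gives /a/b/c=2 and /a/b/f=1.
Both implementations should agree on results like these before their timings
are compared.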
> Faceting module
> ---------------
>
> Key: LUCENE-3079
> URL: https://issues.apache.org/jira/browse/LUCENE-3079
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Assignee: Shai Erera
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch,
> LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons
> faceting works so well in Solr: Solr has total control over the
> index, knows exactly when the index has changed to rebuild caches, has a
> strict schema so it can make sense of field types and
> pick faceting algos accordingly, has multi-phase distributed search
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements
> that can be made to take even more advantage of knowledge solr has because
> it "owns" the index that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring. It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first. We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (i.e., lose functionality or performance), we can
> cut over later.