[
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056972#comment-13056972
]
Shai Erera commented on LUCENE-3079:
------------------------------------
bq. I fully support the idea of a facet benchmarking issue - perhaps with an
associated wiki page?
Yes. And on the Wiki page you should clearly describe the scenario that is
tested, along with results for 'default config' + 'optimized config'. That way,
a user coming to the page can pick the scenario that best matches their app and
configure the facets package (whatever we end up with) accordingly.
bq. Seen from the other side, there were an average of 1M*3.5/1.4M ~= 2.5
documents/tag
That's indeed an extreme case. We've seen it when an analytics module extracted
facets automatically from documents (such as places, people etc.), and in that
case the taxonomy was very 'flat' and wide.
bq. I am a bit confused about your protest on depth=5
I did not protest :). Actually, the common scenario is to count the immediate
children and fetch the top-K. And I thought that that's what you do in
LUCENE-2369. But counting all the way down is a valid scenario - and shows
another reason why we should have a benchmark page with clear descriptions.
bq. Thinking about this, I now have a better understanding of the duplication
of data by indexing all levels of the paths. This speeds up shallow counting
tremendously.
It actually speeds up counting overall. If you think about it, when we
encounter category ordinals, we just increment the count by 1 in the respective
location in the count array. No need to ask whether this is an ordinal the user
asked to count at all. Later, when we compute the top-K, we know which root
ordinal the user requested to count, and its children, so it's just a matter of
putting everything into a heap and returning the top-K.
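A minimal sketch of the two steps described above, in plain Java rather than the actual facet module code. The class name, method names, the 'parents' array and the tiny sample taxonomy are all hypothetical and only illustrate the idea: every level of each path is indexed as an ordinal, counting is an unconditional increment, and the requested root is only consulted when selecting the top-K children.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative only: not the API of the Lucene facet module.
class FacetCountSketch {

    // Step 1: count every ordinal found in the matching documents.
    // No check is made on whether the user asked for that category.
    static int[] countAll(int[][] docOrdinals, int numCategories) {
        int[] counts = new int[numCategories];
        for (int[] doc : docOrdinals) {
            for (int ord : doc) {
                counts[ord]++;
            }
        }
        return counts;
    }

    // Step 2: among the immediate children of 'root', keep the k highest
    // counts using a min-heap of size at most k. Returns ordinals ordered
    // from highest to lowest count.
    // Assumption: parents[ord] holds the parent ordinal of 'ord' (-1 for the root).
    static List<Integer> topKChildren(int[] counts, int[] parents, int root, int k) {
        PriorityQueue<Integer> heap =
            new PriorityQueue<>(Comparator.comparingInt((Integer o) -> counts[o]));
        for (int ord = 0; ord < parents.length; ord++) {
            if (parents[ord] == root && counts[ord] > 0) {
                heap.offer(ord);
                if (heap.size() > k) {
                    heap.poll(); // drop the current smallest count
                }
            }
        }
        List<Integer> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(0, heap.poll()); // smallest comes out first, so prepend
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical taxonomy: 0=root, 1=Author, 2=Author/Shai,
        // 3=Author/Mike, 4=Date, 5=Date/2011
        int[] parents = { -1, 0, 1, 1, 0, 4 };
        // Each document lists all levels of its category paths.
        int[][] docs = { {1, 2, 4, 5}, {1, 3, 4, 5}, {1, 2} };
        int[] counts = countAll(docs, parents.length);
        System.out.println(topKChildren(counts, parents, 1, 1));
    }
}
```

Because the parent ordinals (1 and 4 here) are counted along with the leaves, asking for the top children of the root needs no extra aggregation pass; that is the trade-off bought by indexing all levels of the paths.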
bq. All this confusion supports the need for a coordinated effort to get some
test cases with clear goals and realistic data.
Indeed. And we shouldn't pursue only 'realistic data', but edge cases too. As
long as everything is clearly documented, it should be easy to interpret
results.
I think that setting up facet benchmarking is more important than working on
improving any implementation. Mostly because it will allow measuring how much
the improvements really improve things. I'll open an issue for that.
> Faceting module
> ---------------
>
> Key: LUCENE-3079
> URL: https://issues.apache.org/jira/browse/LUCENE-3079
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Michael McCandless
> Assignee: Shai Erera
> Fix For: 3.4, 4.0
>
> Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch,
> LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons
> faceting works so well in Solr: Solr has total control over the
> index, knows exactly when the index has changed to rebuild caches, has a
> strict schema so it can make sense of field types and
> pick faceting algos accordingly, has multi-phase distributed search
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements
> that can be made to take even more advantage of knowledge solr has because
> it "owns" the index that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring. It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first. We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> cut over later.