[ 
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056785#comment-13056785
 ] 

Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

I must admit that my choice of tests was aimed more at probing edge cases
than at simulating real taxonomies. Nevertheless, it is good to hear that
LUCENE-3079 can be tweaked to handle them. I fully support the idea of a facet
benchmarking issue, perhaps with an associated wiki page?

As for the 1M test case, the number of tags per document was random, with the
average being the stated 3.5 tags/document. Seen from the other side, there
was an average of 1M*3.5/1.4M ~= 2.5 documents/tag.
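
To spell out that arithmetic, here is a minimal sketch. The class and method
names are made up for illustration; only the three input numbers come from the
test described above.

```java
public class TagStats {
    // Average documents per tag = total tag assignments / number of unique tags.
    static double docsPerTag(long documents, double avgTagsPerDoc, long uniqueTags) {
        return documents * avgTagsPerDoc / uniqueTags;
    }

    public static void main(String[] args) {
        // 1M documents, 3.5 tags/document on average, 1.4M unique tags.
        System.out.println(docsPerTag(1_000_000L, 3.5, 1_400_000L));
    }
}
```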

I did verify the facet results, and they fit my expectations. I will try
requesting from further down the tree later - hopefully tomorrow. I am a bit 
confused about your protest on depth=5, but I suspect that we have different 
ideas of what is relevant when issuing a hierarchical request. The API states 
that specifying a depth of 5 will count all sub-tags until depth 5. I used the 
number 5 to effectively count all the way to the bottom (whoops! It should be 6 
for the second case. That might explain why LUCENE-3079 was faster than 
LUCENE-2369 in that one as LUCENE-2369 counted to the bottom). The reason for 
the complete counting was that I implicitly found this to be the "correct"
behavior, as I was envisioning a taxonomy of species or something similar, with
the wish to get the number of unique elements at the finest level.
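
To illustrate why depth 5 versus 6 matters, here is a rough sketch of
depth-limited counting over a plain in-memory tree. It is not the API of
either patch; the Node class, the uniform fanout, and the reading of "depth"
as "levels below the requested tag" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class DepthCount {
    // Hypothetical tag node; the real taxonomy structures differ.
    static final class Node {
        final List<Node> children = new ArrayList<>();
    }

    // Build a uniform tree with `levels` levels below the root.
    static Node build(int levels, int fanout) {
        Node root = new Node();
        if (levels > 0) {
            for (int i = 0; i < fanout; i++) {
                root.children.add(build(levels - 1, fanout));
            }
        }
        return root;
    }

    // Count the sub-tags below `node`, descending at most `maxDepth` levels.
    static int countSubTags(Node node, int maxDepth) {
        if (maxDepth == 0) {
            return 0;
        }
        int n = 0;
        for (Node child : node.children) {
            n += 1 + countSubTags(child, maxDepth - 1);
        }
        return n;
    }

    public static void main(String[] args) {
        Node root = build(6, 2); // six levels below the root, two children each
        System.out.println(countSubTags(root, 5)); // stops one level short
        System.out.println(countSubTags(root, 6)); // reaches the bottom
    }
}
```

With six levels in the tree, a depth of 5 never visits the finest level, which
is the kind of mismatch I suspect in the second test case.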

Thinking about this, I now have a better understanding of the duplication of 
data by indexing all levels of the paths. This speeds up shallow counting 
tremendously.
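
As I understand the idea, it looks roughly like the sketch below. This is only
my illustration, not code from the patch: the '/' delimiter, the class name
and the in-memory map standing in for the index are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PathExpansion {
    // Expand "a/b/c" into ["a", "a/b", "a/b/c"], one term per level.
    static List<String> expand(String path) {
        List<String> out = new ArrayList<>();
        int idx = path.indexOf('/');
        while (idx >= 0) {
            out.add(path.substring(0, idx));
            idx = path.indexOf('/', idx + 1);
        }
        out.add(path);
        return out;
    }

    // Count documents per term; each document is given as its list of full paths.
    // A document counts once per term, even if several of its paths share a prefix.
    static Map<String, Integer> count(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> doc : docs) {
            Set<String> terms = new HashSet<>();
            for (String path : doc) {
                terms.addAll(expand(path));
            }
            for (String term : terms) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Since every level is stored as its own term, the count for a shallow tag such
as "a" is a direct lookup and does not require summing over its descendants.
The price is the extra terms per document.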

All this confusion supports the need for a coordinated effort to get some test
cases with clear goals and realistic data.

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, 
> LUCENE-3079.patch, LUCENE-3079.patch, TestPerformanceHack.java
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.
