[jira] [Commented] (LUCENE-3079) Faceting module

Shai Erera (JIRA) Tue, 28 Jun 2011 07:59:44 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056553#comment-13056553
 ]


Shai Erera commented on LUCENE-3079:
------------------------------------

You write LUCENE-3097, which is about "post group faceting", while this issue 
is LUCENE-3079. I assume you meant the latter, but want to confirm :). You also 
write LUCENE-2309, which is about decoupling IW from Analyzers. Are you perhaps 
referring to a Solr issue, or a different Lucene issue? If so, can you please 
let me know which one?

This is a great test, and it matches more or less the test we've been running. 
Is it in 'benchmark' form? Can you post it on this issue so I can try the same?

What do you mean by "top 5 facets/tags"? If I were to speak of dimensions, 
where a dimension is something like "tags", "authors" or "date", do you mean 
you requested counts for 5 dimensions, or that you indexed just one dimension 
(i.e. one "root") and requested the top-5 results for it? I assume it's the 
latter, but again, I'm confirming my understanding.
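
To make the second reading concrete, here is a minimal sketch in plain Java of 
"top-K under a single dimension". It does not use the faceting module's API; 
the class name TopKFacets, the method topK and the input shape (one list of 
"tags" values per matching document) are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopKFacets {
    /**
     * Hypothetical helper: counts the values of ONE dimension (e.g. "tags")
     * over the matching documents and returns the k most frequent values.
     * Ties are broken alphabetically so the result is deterministic.
     */
    public static List<Map.Entry<String, Integer>> topK(
            List<List<String>> valuesPerMatchingDoc, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> values : valuesPerMatchingDoc) {
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(k, entries.size())));
    }
}
```

Under the first reading (5 dimensions), the same counting would be repeated 
once per dimension rather than truncating a single dimension's counts to 5.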

So assuming I understood correctly the terminology and test setup, you execute 
one query which matches 50% of the documents and ask to count the top-5 facets 
under a single "root"/"dimension", and record the time as 'first facet 
request'. And then you execute it 4-5 additional times, and record 'best of 5 
requests'. Do I understand it correctly?
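
In code, my understanding of the measurement would look roughly like the sketch 
below. It is only an assumption about the test setup, not the actual test: the 
names FacetTiming, Result and measure are made up, the facet request is an 
opaque Runnable, and I assume 'best of 5' is taken over the additional runs 
only, excluding the first one.

```java
public class FacetTiming {
    /** First-run time and best repeated-run time, both in nanoseconds. */
    public static final class Result {
        public final long firstNanos;
        public final long bestNanos;

        Result(long firstNanos, long bestNanos) {
            this.firstNanos = firstNanos;
            this.bestNanos = bestNanos;
        }
    }

    /**
     * Runs the request once and records it as the 'first facet request',
     * then runs it 'repeats' more times and records the fastest of those.
     */
    public static Result measure(Runnable facetRequest, int repeats) {
        if (repeats < 1) {
            throw new IllegalArgumentException("repeats must be >= 1");
        }
        long start = System.nanoTime();
        facetRequest.run();
        long first = System.nanoTime() - start;

        long best = Long.MAX_VALUE;
        for (int i = 0; i < repeats; i++) {
            start = System.nanoTime();
            facetRequest.run();
            best = Math.min(best, System.nanoTime() - start);
        }
        return new Result(first, best);
    }
}
```

With a harness like this, the first number includes any cache warm-up cost, 
while the second reflects steady-state search time.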

One difference between the two approaches, assuming you're referring to a 
faceting approach that uses the FieldCache, is that by default the faceting 
approach here reads everything from disk. So it would be interesting to run it 
w/ the facets-in-memory feature.

I don't know how to interpret the memory usage -- on the last test it consumed 
50% less than the other approach, on the first it consumed nearly the same, and 
on the second test it consumed 150% more. This is odd. Do you trust this 
measurement?

The 'first facet request' result is not surprising, because it takes time to 
warm up the FieldCache (assuming that's what you use).

I am interested in the memory observed for indexing because that too seems to 
fluctuate. E.g., in the second test the difference is nearly x20, which is 
weird.

Also, the difference in indexing time is interesting, as it too is not very 
consistent. And I find the x2 factor suspicious - I would like to understand it 
better. Since trunk reportedly improves indexing speed by a large factor 
(nearly 200%), I think it would be wise to hold off on this comparison until I 
bring the patch up to date w/ trunk.

I like it that you test the default behavior. I think it's very important that 
we have the greatest out-of-the-box experience. Since one approach reads from 
disk and the other from memory, I would first like to test the in-memory facets 
with this approach, so we can at least compare like with like. I know that 
trunk plays some role here (definitely at indexing time), so we can focus on 
search time for now.

This is great stuff, Toke!

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, 
> LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
