[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056517#comment-13056517 ]
Toke Eskildsen commented on LUCENE-3079:
----------------------------------------

Some preliminary performance testing: I hacked together a test where 1M documents were created with an average of 3.5 paths, down to a depth of 4 (resulting in 1.4M unique paths). A search that hit every other document was issued and the top 5 facets/tags were requested. Hopefully this is somewhat similar to your test.

|| || LUCENE-3097 || LUCENE-2309 ||
| Index build time | 52 s | 23 s |
| Memory required for indexing | 192 MB | 48 MB |
| First facet request | 432 ms | 12,000 ms |
| Best of 5 requests | 228 ms | 159 ms |
| Memory usage after faceting (after gc()) | 21 MB | 22 MB |

Upping the ante to 5M documents, 6.8 paths/doc, max depth 6 (22M unique paths) resulted in

|| || LUCENE-3097 || LUCENE-2309 ||
| Index build time | 752 s | 238 s |
| Memory required for indexing | 2500 MB | 128 MB |
| First facet request | 3370 ms | 147,000 ms |
| Best of 5 requests | 2400 ms | 2673 ms |
| Memory usage after faceting | 435 MB | 294 MB |

Scaling down to 100K documents, 1.6 paths/doc, max depth 4 (63K unique paths) resulted in

|| || LUCENE-3097 || LUCENE-2309 ||
| Index build time | 5317 ms | 2563 ms |
| Memory required for indexing | 48 MB | 32 MB |
| First facet request | 245 ms | 1425 ms |
| Best of 5 requests | 15 ms | 8 ms |
| Memory usage after faceting | 1 MB | 2 MB |

Some observations: It seems clear that some trade-offs are very different for the two solutions. LUCENE-3097 has brilliant startup time and slows analyzing time a bit through the whole indexing process. LUCENE-2309 is dog slow at startup but does not impact indexing. They seem similar with regard to search-time speed and memory usage.

Now, LUCENE-2309 patches some semi-random Lucene-4, so this is not a fair test. Likewise, the tests were just quick hacks; the disk cache was not flushed, the laptop was used for browsing etc. When LUCENE-3097 patches trunk, a proper comparison can be made.
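For readers who want to reproduce the shape of this test, the setup described above can be sketched roughly as follows. This is not the code behind the numbers in the tables, and it uses neither patch: it is a plain-Java stand-in in which every name (FacetTestSketch, generateDocs, countFacets, topFacets, usedHeapAfterGc) is hypothetical. It also assumes things the comment does not state: paths per document and path depth are drawn uniformly, tag names come from a fixed fan-out per level, and a hit on a path also counts towards each of its ancestors.

```java
import java.util.*;

/** Hypothetical stand-in for the quick test described above; not taken from either patch. */
class FacetTestSketch {

    /** Creates docCount documents, each with 1..maxPaths paths of depth 1..maxDepth (uniform, an assumption). */
    static List<List<String>> generateDocs(int docCount, int maxPaths, int maxDepth, int fanout, long seed) {
        Random random = new Random(seed);
        List<List<String>> docs = new ArrayList<>(docCount);
        for (int d = 0; d < docCount; d++) {
            int pathCount = 1 + random.nextInt(maxPaths);
            List<String> paths = new ArrayList<>(pathCount);
            for (int p = 0; p < pathCount; p++) {
                int depth = 1 + random.nextInt(maxDepth);
                StringBuilder path = new StringBuilder();
                for (int level = 0; level < depth; level++) {
                    if (level > 0) path.append('/');
                    path.append('t').append(random.nextInt(fanout));
                }
                paths.add(path.toString());
            }
            docs.add(paths);
        }
        return docs;
    }

    /** Counts each path and its ancestors once per document, visiting every other document. */
    static Map<String, Integer> countFacets(List<List<String>> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (int d = 0; d < docs.size(); d += 2) { // mimics "a search that hit every other document"
            Set<String> seen = new HashSet<>();
            for (String path : docs.get(d)) {
                for (int end = path.indexOf('/'); end != -1; end = path.indexOf('/', end + 1)) {
                    seen.add(path.substring(0, end)); // ancestor of the path
                }
                seen.add(path);
            }
            for (String tag : seen) counts.merge(tag, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the n most frequent tags, highest count first, ties broken by tag name. */
    static List<Map.Entry<String, Integer>> topFacets(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    /** Heap in use after asking for a collection; System.gc() is only a hint to the JVM. */
    static long usedHeapAfterGc() {
        System.gc();
        Runtime runtime = Runtime.getRuntime();
        return runtime.totalMemory() - runtime.freeMemory();
    }

    public static void main(String[] args) {
        // Uniform 1..6 paths per document gives an average of 3.5; fan-out 20 is an arbitrary choice.
        List<List<String>> docs = generateDocs(100_000, 6, 4, 20, 42L);
        long best = Long.MAX_VALUE;
        for (int run = 0; run < 6; run++) {
            long start = System.nanoTime();
            List<Map.Entry<String, Integer>> top = topFacets(countFacets(docs), 5);
            long ms = (System.nanoTime() - start) / 1_000_000;
            if (run == 0) {
                System.out.println("First facet request: " + ms + " ms, top 5: " + top);
            } else {
                best = Math.min(best, ms);
            }
        }
        System.out.println("Best of 5 requests: " + best + " ms");
        System.out.println("Heap in use after gc(): " + usedHeapAfterGc() / (1024 * 1024) + " MB");
    }
}
```

The sketch keeps all documents in memory and never touches an index, so its timings say nothing about either patch; it only illustrates what "first request", "best of 5" and "after gc()" refer to in the tables.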
I am a bit worried about the observed memory usage for index build. It seems that LUCENE-3097 uses a lot of heap there? I created the documents one at a time just before adding them to the index, so the memory usage is for the writers, and a quick profile told me that it was mainly used for int-arrays. Does that sound right?

> Faceting module
> ---------------
>
> Key: LUCENE-3079
> URL: https://issues.apache.org/jira/browse/LUCENE-3079
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Michael McCandless
> Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch,
> LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big the reasons
> faceting works so well in Solr: Solr has total control over the
> index, knows exactly when the index has changed to rebuild caches, has a
> strict schema so it can make sense of field types and
> pick faceting algos accordingly, has multi-phase distributed search
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements
> that can be made to take even more advantage of knowledge solr has because
> it "owns" the index that we no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring. It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first. We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira