[ https://issues.apache.org/jira/browse/LUCENE-3079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056517#comment-13056517 ]

Toke Eskildsen edited comment on LUCENE-3079 at 6/28/11 3:17 PM:
-----------------------------------------------------------------

Some preliminary performance testing: I hacked together a test where 1M 
documents were created with an average of 3.5 paths each, down to a depth of 4 
(resulting in 1.4M unique paths). A search that hit every other document was 
issued and the top 5 facets/tags were requested. Hopefully this is somewhat 
similar to your test.
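For reference, the corpus shape can be sketched roughly along these lines (a hypothetical generator, scaled down to 10K documents, not the actual test code; the vocabulary size, Gaussian spread and all names are my assumptions):

```python
import random

def generate_paths(num_docs, avg_paths, max_depth, vocab=50, seed=0):
    """Generate random hierarchical facet paths for num_docs documents.

    Each document gets a number of paths averaging avg_paths; each path
    has 1..max_depth segments drawn from a small segment vocabulary.
    Returns the per-document path lists and the set of unique paths.
    """
    rng = random.Random(seed)
    docs, unique = [], set()
    for _ in range(num_docs):
        n = max(1, int(rng.gauss(avg_paths, 1.0)))  # ~avg_paths paths/doc
        paths = []
        for _ in range(n):
            depth = rng.randint(1, max_depth)
            path = "/".join(f"cat{rng.randrange(vocab)}" for _ in range(depth))
            paths.append(path)
            unique.add(path)
        docs.append(paths)
    return docs, unique

# Scaled-down run; the real test used 1M documents, avg 3.5 paths, depth 4.
docs, unique = generate_paths(num_docs=10_000, avg_paths=3.5, max_depth=4)
```

The unique-path count depends heavily on the vocabulary size per level, which is why the same generator at 5M documents and depth 6 can blow up to tens of millions of unique paths.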

|| Metric || LUCENE-3079 || LUCENE-2369 ||
| Index build time |  52 s |  23 s |
| Memory required for indexing | 192 MB | 48 MB |
| First facet request | 432 ms | 12,000 ms |
| Best of 5 requests | 228 ms | 159 ms |
| Memory usage after faceting (after gc()) | 21 MB | 22 MB |

Upping the ante to 5M documents, 6.8 paths/doc, max depth 6 (22M unique paths) 
resulted in

|| Metric || LUCENE-3079 || LUCENE-2369 ||
| Index build time |  752 s |  238 s |
| Memory required for indexing | 2500 MB | 128 MB |
| First facet request | 3370 ms | 147,000 ms |
| Best of 5 requests | 2400 ms | 2673 ms |
| Memory usage after faceting | 435 MB | 294 MB |

Scaling down to 100K documents, 1.6 paths/doc, max depth 4 (63K unique paths) 
resulted in

|| Metric || LUCENE-3079 || LUCENE-2369 ||
| Index build time |  5317 ms |  2563 ms |
| Memory required for indexing | 48 MB | 32 MB |
| First facet request | 245 ms | 1425 ms |
| Best of 5 requests | 15 ms | 8 ms |
| Memory usage after faceting | 1 MB | 2 MB |

Some observations: it seems clear that the trade-offs are very different for 
the two solutions. LUCENE-3079 has brilliant startup time but slows analysis 
a bit throughout the whole indexing process. LUCENE-2369 is dog slow at 
startup but does not impact indexing. They seem similar with regard to 
search-time speed and memory usage.

Now, LUCENE-2369 patches a semi-random Lucene 4 snapshot, so this is not a fair 
test. Likewise, the tests were just quick hacks: the disk cache was not flushed, 
the laptop was used for browsing, etc. When LUCENE-3079 patches trunk, a proper 
comparison can be made.

I am a bit worried about the observed memory usage during index build. It seems 
that LUCENE-3079 uses a lot of heap there? I created the documents one at a 
time, just before adding them to the index, so the memory usage comes from the 
writers, and a quick profile told me that it was mainly used for int arrays. 
Does that sound right?
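As a back-of-the-envelope check on the int-array guess (my own rough estimate; the number of arrays and their sizing are assumptions, not profiled facts):

```python
# Rough estimate, assuming the indexing side keeps a handful of int[]
# arrays, each sized by the number of unique paths (ordinals), 4 bytes/int.
def int_array_mb(num_entries, num_arrays=4, bytes_per_int=4):
    """Approximate MB consumed by num_arrays int arrays of num_entries each."""
    return num_entries * num_arrays * bytes_per_int / (1024 * 1024)

# 22M unique paths from the 5M-document run:
mb = int_array_mb(22_000_000)
```

A few such arrays over 22M ordinals already account for hundreds of MB, so int arrays dominating a multi-GB indexing heap does not seem implausible once growth-by-doubling and other per-path structures are added.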

> Faceting module
> ---------------
>
>                 Key: LUCENE-3079
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3079
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/facet
>            Reporter: Michael McCandless
>            Assignee: Shai Erera
>             Fix For: 3.4, 4.0
>
>         Attachments: LUCENE-3079-dev-tools.patch, LUCENE-3079.patch, 
> LUCENE-3079.patch, LUCENE-3079.patch
>
>
> Faceting is a hugely important feature, available in Solr today but
> not [easily] usable by Lucene-only apps.
> We should fix this, by creating a shared faceting module.
> Ideally, we factor out Solr's faceting impl, and maybe poach/merge
> from other impls (eg Bobo browse).
> Hoss describes some important challenges we'll face in doing this
> (http://markmail.org/message/5w35c2fr4zkiwsz6), copied here:
> {noformat}
> To look at "faceting" as a concrete example, there are big reasons 
> faceting works so well in Solr: Solr has total control over the 
> index, knows exactly when the index has changed to rebuild caches, has a 
> strict schema so it can make sense of field types and 
> pick faceting algos accordingly, has multi-phase distributed search 
> approach to get exact counts efficiently across multiple shards, etc...
> (and there are still a lot of additional enhancements and improvements 
> that can be made to take even more advantage of knowledge solr has because 
> it "owns" the index that no one has had time to tackle)
> {noformat}
> This is a great list of the things we face in refactoring.  It's also
> important because, if Solr needed to be so deeply intertwined with
> caching, schema, etc., other apps that want to facet will have the
> same "needs" and so we really have to address them in creating the
> shared module.
> I think we should get a basic faceting module started, but should not
> cut Solr over at first.  We should iterate on the module, fold in
> improvements, etc., and then, once we can fully verify that cutting
> over doesn't hurt Solr (ie lose functionality or performance) we can
> later cutover.
