[ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231040#comment-14231040 ]
Robert Muir commented on LUCENE-5914:
-------------------------------------

I opened LUCENE-6085 for the SI.attributes, which should help with cleanup.

I ran some benchmarks on various datasets to get an idea of where this is at, and the results are disappointing. For geonames, the new format increases the size of the stored fields by 50%; for Apache HTTP Server logs, it doubles the size. Indexing time is significantly slower for every dataset I test as well: there must be bugs in the LZ4 + shared dictionary code?

||impl||size||index time||force merge time||
|trunk|372,845,278|101,745|15,976|
|patch (BEST_SPEED)|780,861,727|141,699|60,114|
|patch (BEST_COMPRESSION)|265,063,340|132,238|53,561|

To confirm it's a bug and not just the cost of additional I/O (due to less compression with shared dictionaries), I set the deflate level to 0 and indexed with the BEST_COMPRESSION layout to really jack up the size. Sure, it created a 1.8 GB stored fields file, but in 126,093 ms with 44,377 ms of merging. That is faster than both of the options in the patch...

Anyway, this leads to more questions:
* Do we really need a completely separate LZ4 impl for shared dictionary support? It's tough to understand, e.g., why it reimplements the hash table differently and so on.
* Do we really need to share code between different stored fields impls that have different use cases and goals? I think the patch currently overshares here, and the additional abstractions make it hard to work with.
* Along with the sharing approach above: we can still reuse code between formats. For example, the document<->byte stuff could be shared static methods. I would just avoid subclassing and interfaces because I get lost in the patch too easily. And we need to be careful that any shared code is simple and clear, because we have to assume the formats will evolve over time.
* We shouldn't wrap the deflate case with the zlib header/footer. That saves a little bit.
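The last bullet, dropping the zlib wrapper around the deflate output, can be checked with `java.util.zip.Deflater`'s `nowrap` flag. A minimal sketch (class name and sample payload are illustrative, not from the patch):

```java
import java.util.zip.Deflater;

// Sketch of the zlib header/footer saving: with nowrap=true, Deflater emits
// a raw DEFLATE stream without the 2-byte zlib header and 4-byte Adler-32
// trailer, so each compressed block is 6 bytes smaller.
public class NowrapDemo {

  /** Compressed size of a fixed sample payload at BEST_COMPRESSION. */
  static int deflatedLength(boolean nowrap) {
    byte[] data = "the quick brown fox jumps over the lazy dog".getBytes();
    byte[] out = new byte[256];
    Deflater d = new Deflater(Deflater.BEST_COMPRESSION, nowrap);
    d.setInput(data);
    d.finish();
    int len = d.deflate(out);
    d.end();
    return len;
  }

  public static void main(String[] args) {
    // Same level, same input: the compressed body is identical, so the
    // difference is exactly the wrapper overhead.
    int saved = deflatedLength(false) - deflatedLength(true);
    System.out.println("bytes saved per stream: " + saved);
  }
}
```

Six bytes per compressed block is small, but it is pure overhead when the format stores its own checksums anyway.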
About the oversharing issue: I really think the separate formats should just be separate formats; it will make life easier. It's more than just a difference in compression algorithm, and we shouldn't try to structure things so that one can simply be swapped in; I think that's not the right tradeoff. For example, with high compression it's more important to lay the data out so that bulk merge doesn't cause re-compression, even if that causes "temporary" waste along segment boundaries. This is important because compression here gets very costly, and for, e.g., the "archiving" case, bulk merge should be potent since there shouldn't be many deletions: we shouldn't bear the cost of re-compressing over and over. This gets much, much worse if you try to use something "better" than gzip, too. On the other hand, with low compression we should ensure merging is still fast even in the presence of deletions: the shared dictionary approach is one way; another is to just have at least the getMergeInstance() remember the current block and have a "seek within block" optimization, which is probably simpler and better than what trunk does today.

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch,
> LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have heard from
> about the same number of users complaining that compression was too aggressive
> as complaining that it was too light.
> I think it is due to the fact that we have users who are doing very
> different things with Lucene.
> For example, if you have a small index that fits
> in the filesystem cache (or close to it), then you might never pay for actual
> disk seeks, and in such a case the fact that the current stored fields format
> needs to over-decompress data can noticeably slow down cheap queries.
> On the other hand, it is more and more common to use Lucene for things like
> log analytics, and in that case you have huge amounts of data for which you
> don't care much about stored fields performance. However, it is very
> frustrating to notice that the data you store takes several times less
> space when you gzip it than it does in your index, even though Lucene claims
> to compress stored fields.
> For that reason, I think it would be nice to have some kind of option that
> would allow trading speed for compression in the default codec.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
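The "remember the current block / seek within block" idea from the low-compression discussion above can be sketched roughly as follows. All names here are hypothetical and the decompression is stubbed out; this is not Lucene's actual StoredFieldsReader API, just the caching shape a merge instance could take:

```java
// Hypothetical sketch of a merge-friendly reader that caches the most
// recently decompressed block. During a merge with few deletions, docIDs
// arrive mostly in order, so decompression happens once per block instead
// of once per document lookup.
public class CachingBlockReader {
  private final byte[][] compressedBlocks; // one compressed blob per block
  private final int docsPerBlock;

  private int cachedBlock = -1; // index of the block currently decompressed
  private byte[] cachedDocs;    // decompressed contents of that block

  public CachingBlockReader(byte[][] compressedBlocks, int docsPerBlock) {
    this.compressedBlocks = compressedBlocks;
    this.docsPerBlock = docsPerBlock;
  }

  /** Decompression stub; a real impl would run LZ4/DEFLATE here. */
  private byte[] decompress(byte[] block) {
    return block.clone();
  }

  /** One byte stands in for a whole stored document, for illustration. */
  public byte document(int docID) {
    int block = docID / docsPerBlock;
    if (block != cachedBlock) {
      // Only decompress when crossing a block boundary.
      cachedDocs = decompress(compressedBlocks[block]);
      cachedBlock = block;
    }
    return cachedDocs[docID % docsPerBlock]; // "seek within block"
  }
}
```

The point of the design is that the cache state lives on the merge instance, so normal search-time readers stay stateless while merges skip redundant decompression.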