[ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231040#comment-14231040 ]

Robert Muir commented on LUCENE-5914:
-------------------------------------

I opened LUCENE-6085 for the SI.attributes, which should help with cleanup.

I ran some benchmarks on various datasets to get an idea of where this is at, 
and the results are disappointing. For geonames, the new format increases the 
size of the stored fields by 50%; for Apache HTTP Server logs, it doubles the 
size. Indexing time is also significantly slower for every dataset I test: 
there must be bugs in the lz4 + shared dictionary code?

||impl||size (bytes)||index time (ms)||force merge time (ms)||
|trunk|372,845,278|101,745|15,976|
|patch(BEST_SPEED)|780,861,727|141,699|60,114|
|patch(BEST_COMPRESSION)|265,063,340|132,238|53,561|

To confirm it's a bug and not just the cost of additional I/O (due to less 
compression with shared dictionaries), I set the deflate level to 0 and indexed 
with the BEST_COMPRESSION layout to really jack up the size. Sure, it created a 
1.8GB stored fields file, but in 126,093ms with 44,377ms of merging. That is 
faster than both of the options in the patch...
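
For reference, the level-0 sanity check maps to plain java.util.zip. A minimal 
sketch of that kind of experiment (the input and buffer sizes here are made up 
for illustration; this is not the patch's code path):

{code:java}
import java.util.Random;
import java.util.zip.Deflater;

public class DeflateLevelCheck {
  public static void main(String[] args) {
    // Stand-in for a chunk of stored field bytes.
    byte[] input = new byte[64 * 1024];
    new Random(42).nextBytes(input);

    // Level 0 emits "stored" blocks: maximum size, minimum CPU.
    Deflater deflater = new Deflater(Deflater.NO_COMPRESSION);
    deflater.setInput(input);
    deflater.finish();

    // Stored blocks add a few bytes of per-block framing, so size the
    // output buffer slightly larger than the input.
    byte[] out = new byte[input.length + 1024];
    int written = deflater.deflate(out);
    deflater.end();

    System.out.println("in=" + input.length + " out=" + written);
  }
}
{code}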

Anyway, this leads to more questions:
* Do we really need a completely separate lz4 impl for the shared dictionaries 
support? It's tough to understand, e.g., why it reimplements the hashtable 
differently and so on.
* Do we really need to share code between different stored fields impls that 
have different use cases and goals? I think the patch currently overshares 
here, and the additional abstractions make it hard to work with.
* Along with the sharing approach above: we can still reuse code between 
formats. For example, the document<->byte stuff could be shared static methods. 
I would just avoid subclassing and interfaces because I get lost in the patch 
too easily. And we need to be careful that any shared code is simple and clear, 
because we have to assume the formats will evolve over time.
* We shouldn't wrap the deflate case with the zlib header/footer. This saves a 
little bit (see the sketch after this list).
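
On the zlib wrapper point: java.util.zip already exposes raw deflate through 
the nowrap constructor flag, which drops the 2-byte zlib header and the 4-byte 
Adler-32 trailer. A toy comparison (illustrative input only, not the patch's 
code):

{code:java}
import java.util.zip.Deflater;

public class WrapperOverhead {
  static int deflatedSize(byte[] input, boolean nowrap) {
    // nowrap=true produces a raw deflate stream with no zlib framing.
    Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap);
    d.setInput(input);
    d.finish();
    byte[] out = new byte[input.length + 64];
    int n = d.deflate(out);
    d.end();
    return n;
  }

  public static void main(String[] args) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100; i++) {
      sb.append("some stored field bytes, repeated ");
    }
    byte[] input = sb.toString().getBytes();

    int wrapped = deflatedSize(input, false); // zlib header + Adler-32 trailer
    int raw = deflatedSize(input, true);      // raw deflate stream
    System.out.println("wrapped=" + wrapped + " raw=" + raw
        + " saved=" + (wrapped - raw)); // expect 6 bytes per stream
  }
}
{code}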

About the oversharing issue: I really think the separate formats should just 
be separate formats; it will make life easier. It's more than just a difference 
in compression algorithm, and we shouldn't try to structure things so that one 
can just be swapped in. I think that's not the right tradeoff.

For example, with high compression it's more important to lay the data out so 
that bulk merge doesn't cause re-compression, even if that causes 'temporary' 
waste along segment boundaries. This matters because compression here gets very 
costly, and for e.g. the "archiving" case bulk merge should be potent, as there 
shouldn't be many deletions: we shouldn't bear the cost of re-compressing over 
and over. This gets much, much worse if you try to use something "better" than 
gzip, too.

On the other hand, with low compression we should ensure merging is still fast 
even in the presence of deletions. The shared dictionary approach is one way; 
another is to at least have the getMergeInstance() remember the current block 
and add a "seek within block" optimization, which is probably simpler and 
better than what trunk does today.
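
A hypothetical sketch of that second option (again, the names are illustrative, 
not trunk's API): the merge instance caches the last decompressed block, so 
fetching consecutive documents from the same block costs one decompression 
instead of one per document.

{code:java}
// Hypothetical getMergeInstance()-style reader with a one-block cache.
final class CachingBlockReader {
  interface BlockStore {
    long blockIdFor(int docID);                // which block holds this doc
    byte[] decompressBlock(long blockId);      // decompress a whole block
    byte[] sliceDoc(byte[] block, int docID);  // cut one doc out of it
  }

  private final BlockStore store;
  private long cachedBlockId = -1;
  private byte[] cachedBlock; // decompressed bytes of the current block

  CachingBlockReader(BlockStore store) {
    this.store = store;
  }

  byte[] document(int docID) {
    long blockId = store.blockIdFor(docID);
    if (blockId != cachedBlockId) {
      // Miss: decompress the whole block once, then serve from it.
      cachedBlock = store.decompressBlock(blockId);
      cachedBlockId = blockId;
    }
    // "Seek within block": consecutive docs in merge order hit the cache.
    return store.sliceDoc(cachedBlock, docID);
  }
}
{code}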


> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, 
> LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have gotten 
> about the same number of users complaining that compression is too aggressive 
> as complaining that it is too light.
> I think it is because users do very different things with Lucene. For 
> example, if you have a small index that fits in the filesystem cache (or 
> close to it), then you might never pay for actual disk seeks, and in such a 
> case the fact that the current stored fields format needs to over-decompress 
> data can noticeably slow down cheap queries.
> On the other hand, it is more and more common to use Lucene for things like 
> log analytics, where you have huge amounts of data and don't care much about 
> stored fields performance. However, it is very frustrating to notice that 
> your data takes several times less space when you gzip it than it does in 
> your index, even though Lucene claims to compress stored fields.
> For that reason, I think it would be nice to have options that allow trading 
> speed for compression in the default codec.


