[ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117986#comment-14117986 ]

Eks Dev commented on LUCENE-5914:
---------------------------------

bq. Do you have pointers to emails/irc logs describing such issues?

I do not know what the gold-standard Lucene usage is, but I can describe at 
least one use case; maybe it helps. I am not proposing anything here, just 
sharing experience. 

Think about the (typical Lucene?) usage with structured data (e.g. indexing a 
relational database, like a product catalog or such) with many smallish fields, 
and then retrieving 2k such documents to post-process them: classify them, 
cluster them, or whatnot (e.g. with Mahout and co.). 

- Default compression with CHUNK_SIZE means decompressing roughly 
2k * CHUNK_SIZE/2 bytes on average in order to retrieve 2k documents. 
- Reducing chunk_size helps a lot, but there is a sweet spot: reduce it too 
much and you will not see enough compression, your index no longer fits into 
the cache, and you get hurt on IO. (A codec sketch follows this list.) 
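
Just to make that knob concrete, here is a minimal sketch of the kind of codec 
used for the chunk_size tuning, assuming the 4.x FilterCodec and 
CompressingStoredFieldsFormat APIs; the codec name, the 4 KB value and the 
Lucene49Codec delegate are only illustrative, not a recommendation: 

{code:java}
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene49.Lucene49Codec;

/** Sketch: default codec with a smaller stored-fields chunk, so a random
 *  retrieval decompresses less unrelated data per document. */
public final class SmallChunkCodec extends FilterCodec {

  // The 4.1 default is 1 << 14 (16 KB); 4 KB is only an illustrative value,
  // the sweet spot depends on document size and on the IO cache budget.
  private static final int CHUNK_SIZE = 1 << 12;

  private final StoredFieldsFormat storedFields =
      new CompressingStoredFieldsFormat("SmallChunkStoredFields",
                                        CompressionMode.FAST, CHUNK_SIZE);

  public SmallChunkCodec() {
    super("SmallChunkCodec", new Lucene49Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}
{code}

The codec gets plugged in with IndexWriterConfig.setCodec(new SmallChunkCodec()) 
and has to be registered through the Codec SPI so that readers can resolve it 
by name. 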

Ideally we would be able to use a biggish chunk_size during compression to 
improve the compression ratio, and then decompress only a single document 
(independent of chunk_size), just like you proposed here (if I understood it 
correctly?).

Usually such data is highly compressible (imagine all these low-cardinality 
fields, like the color of something...), and even some basic compression works 
magic.

What did we do?
- Reduced chunk_size.
- As a bonus to improve compression, added plain static dictionary compression 
for a few fields in the update chain (we store analyzed fields); see the 
deflate sketch after this list.
- When applicable, we periodically pre-sort the collection before indexing (on 
low-cardinality fields first)... this old DB-admin secret weapon helps a lot.
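
For the static dictionary part, the idea is simply the preset-dictionary 
feature of deflate: prime the compressor with frequent byte sequences collected 
offline, so that even tiny field values reference the dictionary instead of 
carrying their own redundancy. A toy, self-contained sketch with java.util.zip 
(the field values and dictionary contents are made up; the real thing sits in 
our update chain, not here): 

{code:java}
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class StaticDictionaryDemo {

  // Preset dictionary built offline from a sample of a low-cardinality field.
  static final byte[] DICT =
      "color=red color=green color=blue size=small size=medium size=large "
          .getBytes(StandardCharsets.UTF_8);

  static byte[] compress(byte[] value) {
    Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
    deflater.setDictionary(DICT);          // prime the history buffer
    deflater.setInput(value);
    deflater.finish();
    byte[] buf = new byte[value.length + 64];
    int len = deflater.deflate(buf);
    deflater.end();
    return Arrays.copyOf(buf, len);
  }

  static byte[] decompress(byte[] packed, int originalLength)
      throws DataFormatException {
    Inflater inflater = new Inflater();
    inflater.setInput(packed);
    byte[] out = new byte[originalLength];
    int len = inflater.inflate(out);
    if (len == 0 && inflater.needsDictionary()) {  // zlib asks for the dictionary
      inflater.setDictionary(DICT);
      len = inflater.inflate(out);
    }
    inflater.end();
    return Arrays.copyOf(out, len);
  }

  public static void main(String[] args) throws Exception {
    byte[] value = "color=blue size=large".getBytes(StandardCharsets.UTF_8);
    byte[] packed = compress(value);
    System.out.println(value.length + " -> " + packed.length + " bytes");
    System.out.println(
        new String(decompress(packed, value.length), StandardCharsets.UTF_8));
  }
}
{code}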

Conclusion: compression is great, and anything that helps tweak this balance 
(CPU effort vs. IO effort) smoothly across the indexing and retrieval phases 
makes Lucene's use-case coverage broader (e.g. "I want to afford more CPU 
during indexing, and less CPU during retrieval", with a static coder being the 
extreme case of this...).

I am not sure I have figured out exactly if and how this patch is going to 
help in such cases (how do we achieve reasonable compression if we do 
per-document compression for small documents? Reusing dictionaries from 
previous chunks? Static dictionaries?...). 

In any case, thanks for doing the heavy lifting here! I think you have already 
done a really great job with compression in Lucene. 

PS: Ages ago, before Lucene, when memory was really expensive, we had our own 
serialization (not in Lucene) that simply used one static Huffman coder per 
field (with byte or word symbols), with the code table populated offline. That 
was a great, simple option, as it enabled reasonable compression for "slowly 
changing collections" and really fast random access. 
 

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 4.11
>
>         Attachments: LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have had about 
> the same number of users complaining that compression is too aggressive as 
> complaining that it is too light.
> I think it is due to the fact that we have users who are doing very 
> different things with Lucene. For example, if you have a small index that fits 
> in the filesystem cache (or close to it), then you might never pay for actual 
> disk seeks, and in such a case the fact that the current stored fields format 
> needs to over-decompress data can noticeably slow search down on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like 
> log analytics, and in that case you have huge amounts of data for which you 
> don't care much about stored fields performance. However it is very 
> frustrating to notice that the data that you store takes several times less 
> space when you gzip it compared to your index although Lucene claims to 
> compress stored fields.
> For that reason, I think it would be nice to have some kind of option that 
> would allow trading speed for compression in the default codec.



