[ https://issues.apache.org/jira/browse/LUCENE-5914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Adrien Grand updated LUCENE-5914:
---------------------------------

    Attachment: LUCENE-5914.patch

Here is a new patch that iterates on Robert's:
 - improved compression for numerics:
   - floats and doubles representing small integers take 1 byte
   - other positive floats and doubles take 4 / 8 bytes
   - other (negative) floats and doubles take 5 / 9 bytes
   - doubles that are actually casted floats take 5 bytes
   - longs are compressed if they represent a timestamp (2 bits are used to encode the fact that the number is a multiple of a second, hour, or day, or is uncompressed)
 - cleaned up the checkFooter calls in the reader
 - slightly better encoding of the offsets with the BEST_SPEED option by using monotonic encoding: this allows just slurping a sequence of bytes and then decoding a single value, instead of having to decode lengths and sum them up in order to compute offsets (the BEST_COMPRESSION option still does the latter, however)
 - fixed some javadoc errors

> More options for stored fields compression
> ------------------------------------------
>
>                 Key: LUCENE-5914
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5914
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Adrien Grand
>            Assignee: Adrien Grand
>             Fix For: 5.0
>
>         Attachments: LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch, LUCENE-5914.patch
>
>
> Since we added codec-level compression in Lucene 4.1, I think I have had about as many users complain that compression is too aggressive as complain that it is too light.
> I think this is because users are doing very different things with Lucene. For example, if you have a small index that fits in the filesystem cache (or close to it), then you might never pay for actual disk seeks, and in that case the fact that the current stored fields format needs to over-decompress data can noticeably slow search down on cheap queries.
> On the other hand, it is more and more common to use Lucene for things like log analytics, where you have huge amounts of data and don't care much about stored fields performance. However, it is very frustrating to notice that the data you store takes several times less space when you gzip it than it does in your index, even though Lucene claims to compress stored fields.
> For that reason, I think it would be nice to have some options in the default codec that allow trading speed for compression.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
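The numeric byte-cost and timestamp-tag rules listed in the patch comment can be sketched roughly as follows. This is an illustrative model of the sizes described above, not Lucene's actual codec code; the class, method names, and the exact "small integer" range are invented for illustration:

```java
// Hypothetical sketch of the numeric compression rules described in the
// patch comment. Not the actual Lucene stored-fields writer; names and the
// small-integer cutoff (0..127) are assumptions.
public class NumericCompressionSketch {

  static final long SECOND = 1000L;           // millisecond timestamps
  static final long HOUR = 60 * 60 * SECOND;
  static final long DAY = 24 * HOUR;

  // 2-bit tag for a long: is it a multiple of a day, an hour, a second,
  // or does it have to be stored uncompressed?
  static int longTag(long v) {
    if (v % DAY == 0) return 3;
    if (v % HOUR == 0) return 2;
    if (v % SECOND == 0) return 1;
    return 0; // uncompressed
  }

  // Bytes needed for a double under the scheme described in the comment:
  // small integers -> 1, casted floats -> 5, other positive -> 8, negative -> 9.
  static int doubleCost(double d) {
    if (d >= 0 && d <= 127 && d == Math.floor(d) && !Double.isInfinite(d)) {
      return 1;                       // small integer
    }
    if ((double) (float) d == d) {
      return 5;                       // a float cast to double: tag + 4 bytes
    }
    return d >= 0 ? 8 : 9;            // raw double, +1 tag byte if negative
  }

  // Same idea for floats: small integers -> 1, positive -> 4, negative -> 5.
  static int floatCost(float f) {
    if (f >= 0 && f <= 127 && f == Math.floor(f) && !Float.isInfinite(f)) {
      return 1;
    }
    return f >= 0 ? 4 : 5;
  }
}
```

For example, a stored timestamp rounded to midnight (a multiple of a day) would get tag 3 and compress well, while an arbitrary millisecond value falls back to the uncompressed representation.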