You guys got it, of course. :) I liked the sound of being able to detect how to pack things at run time and switch between multiple approaches over time.... or at least that's how I interpreted the announcement.
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm

On Mon, Sep 16, 2013 at 4:29 AM, Adrien Grand <[email protected]> wrote:
> Thanks for pointing this out, Otis!
>
> I think the columnar nature of Parquet makes it more similar to doc
> values than to stored fields, and indeed, if you look at the parquet
> file-format specification [1], it is very similar to what we have for
> doc values [2]. In both cases, we have
> - dictionary encoding (PLAIN_DICTIONARY in parquet, TABLE_COMPRESSED
> in Lucene45DVF),
> - bit-packing (BIT_PACKED(/RLE) in parquet, DELTA_COMPRESSED in Lucene45DVF).
>
> Parquet also uses run-length encoding (RLE) which is unfortunately not
> doable for doc values since they need to support random access.
> Parquet's RLE compression is actually closer to what we have for
> postings lists (a postings list of X values is encoded as X/128 blocks
> of 128 packed values and X%128 RLE-encoded (VInt) values). On the
> other hand, doc values have GCD_COMPRESSED (which efficiently
> compresses any sequence of longs where all values can be expressed as
> a * x + b) which is typically useful for storing dates that don't have
> millisecond precision.
>
> About stored fields, it would indeed be possible to store all values
> of a given field in a column-stride fashion per chunk. However, I
> think parquet doesn't optimize for the same thing as stored fields:
> parquet needs to run computations on the values of a few fields of
> many documents (like doc values) while with stored fields, we usually
> need to get all values of a single document. This makes columnar
> storage a bit inconvenient for stored fields, although I think we
> could try it on our chunks of stored documents given that it may
> improve the compression ratio.
>
> I only have a very superficial understanding of parquet, so if you see
> that I said something wrong about parquet, please tell me!
>
> [1] https://github.com/parquet/parquet-format
> [2] https://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/lucene45/Lucene45DocValuesConsumer.java
>
> --
> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
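
For anyone following the thread who wants to see why the "a * x + b" trick mentioned above helps with dates, here is a minimal sketch in Python. It is not Lucene's actual code; the function names (gcd_compress, gcd_decompress, bits_per_value) are made up for illustration, and the only idea taken from the message is that each value is rewritten as base + divisor * quotient so that only the small quotients need to be bit-packed.

```python
from functools import reduce
from math import gcd


def gcd_compress(values):
    """Rewrite each value v as base + divisor * q and return (base, divisor, qs).

    Hypothetical helper for illustration only, not a Lucene API.
    """
    base = min(values)
    # GCD of all offsets from the minimum; gcd(0, n) == n, so 0 is a safe seed.
    divisor = reduce(gcd, (v - base for v in values), 0)
    if divisor == 0:
        # All values are identical; any divisor works, pick 1 to avoid dividing by 0.
        divisor = 1
    quotients = [(v - base) // divisor for v in values]
    return base, divisor, quotients


def gcd_decompress(base, divisor, quotients):
    """Invert gcd_compress."""
    return [base + divisor * q for q in quotients]


def bits_per_value(numbers):
    """Bits needed to bit-pack the largest (non-negative) number in the list."""
    return max(n.bit_length() for n in numbers)


# Timestamps in milliseconds that only have day precision.
DAY_MS = 86_400_000
values = [DAY_MS * day for day in (15964, 15965, 15967)]

base, divisor, quotients = gcd_compress(values)
assert gcd_decompress(base, divisor, quotients) == values
print(divisor == DAY_MS, quotients)
print(bits_per_value(values), "bits raw vs", bits_per_value(quotients), "bits after GCD")
```

With day-precision timestamps the divisor comes out as the number of milliseconds in a day, so the per-value cost drops from the width of a full timestamp to the width of a small day offset, while random access stays cheap because decoding one value is a single multiply and add.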
