OK, the automatic handling and encoding options improve a lot in Parquet 2.0 (manual override is not an option).

I'm using parquet-mr/parquet-avro to create Parquet 2 files
(ParquetProperties.WriterVersion.PARQUET_2_0). Drill seems to read them
just fine, but I wonder if there are any gotchas.
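For reference, a minimal sketch of my writer setup, assuming the
parquet-avro builder API; the schema, output path, and codec here are
illustrative only:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.GenericRecordBuilder;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.column.ParquetProperties;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class Parquet2WriterSketch {
    public static void main(String[] args) throws Exception {
        // Toy schema with a single timestamp-style long field.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"occurred_at\",\"type\":\"long\"}]}");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/tmp/events.parquet"))
                .withSchema(schema)
                .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                .withDictionaryEncoding(true) // flip to false to compare file sizes
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            GenericRecord rec = new GenericRecordBuilder(schema)
                .set("occurred_at", System.currentTimeMillis())
                .build();
            writer.write(rec);
        }
    }
}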
Regards,
-Stefán

On Thu, Feb 4, 2016 at 4:51 PM, Stefán Baxter <[email protected]> wrote:

> Hi again,
>
> I did a little test: ~5 million fairly wide records take 791 MB in
> Parquet without dictionary encoding and 550 MB with dictionary encoding
> enabled (the non-dictionary-encoded file is a whopping ~44% bigger).
> The plain, non-dictionary-encoded file returns results for identical
> queries in ~20% less time than the one that uses dictionary encoding.
>
> Regards,
> -Stefán
>
> On Thu, Feb 4, 2016 at 3:48 PM, Stefán Baxter <[email protected]> wrote:
>
>> Hi Jason,
>>
>> Thank you for the explanation.
>>
>> I have several *low* cardinality fields that contain semi-long values,
>> and they are, I think, perfect candidates for dictionary encoding.
>>
>> I assumed that the choice to use dictionary encoding was a bit smarter
>> than this and would rely on signals such as a String-typed column where
>> x% repeated values make a clear case for its use.
>>
>> If you can outline what needs to be done, and where, then I will gladly
>> take a stab at it :).
>>
>> Several questions along those lines:
>>
>>    - Does the Parquet library that Drill uses allow for programmatic
>>    selection of encodings?
>>    - What metadata regarding the column content is available when the
>>    choice is made?
>>    - Where in the Parquet part of Drill is this logic?
>>    - Is there any ongoing effort in parquet-mr to make the automatic
>>    handling smarter?
>>    - Are all Parquet encoding options being used by Drill?
>>       - Like the delta encoding of longs, where the delta between
>>       subsequent numbers is stored (as I understand it).
>>
>> Thanks again.
>>
>> Regards,
>> -Stefan
>>
>> On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <[email protected]> wrote:
>>
>>> Hi Stefan,
>>>
>>> There is a reason that dictionary encoding is disabled by default. The
>>> parquet-mr library we leverage for writing Parquet files currently
>>> writes nearly all columns as dictionary encoded, for all types, when
>>> dictionary encoding is enabled. This includes columns with integers,
>>> doubles, dates, and timestamps.
>>>
>>> Do you have some data in the dataset that you believe is well suited
>>> for dictionary encoding? I think there are good uses for it, such as
>>> data coming from systems that support enumerations, which might be
>>> represented as strings when exported from a database for use with Big
>>> Data tools like Drill. Unfortunately we do not currently provide a
>>> mechanism for requesting dictionary encoding on only some columns, and
>>> we don't do anything like buffer values to determine whether a given
>>> column is well suited for dictionary encoding before starting to write
>>> them.
>>>
>>> In many cases it obviously is not a good choice, and so we actually
>>> take a performance hit re-materializing the data out of the dictionary
>>> upon read.
>>>
>>> If you would be interested in trying to contribute such an enhancement,
>>> I would be willing to help you get started with it.
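A rough sketch of the kind of enhancement described above: buffer a
sample of the first values written to a column and enable dictionary
encoding only when the ratio of distinct values is low. This is entirely
hypothetical; nothing like it exists in parquet-mr or Drill today, and
the class name and thresholds are invented for illustration. (On the
Drill side, the writer-wide switch is, if I recall the name correctly,
the store.parquet.enable_dictionary_encoding session option.)

import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical helper: decide per column whether dictionary encoding is
 * worthwhile by sampling the first values written to it.
 */
public class DictionaryHeuristic {

    private static final int SAMPLE_SIZE = 10_000;
    private static final double MAX_DISTINCT_RATIO = 0.1; // <= 10% distinct => use dictionary

    public static boolean shouldUseDictionary(List<?> sampledValues) {
        int n = Math.min(sampledValues.size(), SAMPLE_SIZE);
        Set<Object> distinct = new HashSet<>();
        for (int i = 0; i < n; i++) {
            distinct.add(sampledValues.get(i));
        }
        // A small distinct set relative to the sample suggests the
        // dictionary will stay small and pay for itself on disk.
        return n > 0 && (double) distinct.size() / n <= MAX_DISTINCT_RATIO;
    }
}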
>>> - Jason
>>>
>>> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <[email protected]> wrote:
>>>
>>> > Hi,
>>> >
>>> > I'm converting Avro to Parquet and I'm getting this log entry back
>>> > for a timestamp field:
>>> >
>>> > Written 1,008,842B for [occurred_at] INT64: 591,435 values, 2,169,557B raw,
>>> > 1,008,606B comp, 5 pages, encodings: [BIT_PACKED, PLAIN, PLAIN_DICTIONARY,
>>> > RLE], dic { 123,832 entries, 990,656B raw, 123,832B comp }
>>> >
>>> > Can someone please tell me if this is the expected encoding for a
>>> > timestamp field?
>>> >
>>> > I'm a bit surprised that it seems to be dictionary based. (Yes, I have
>>> > enabled dictionary encoding for Parquet files.)
>>> >
>>> > Regards,
>>> > -Stefán
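A note on the mixed encodings in that log line: seeing both PLAIN and
PLAIN_DICTIONARY for a single column typically means parquet-mr started
out building a dictionary and then fell back to plain encoding once the
dictionary outgrew its size threshold, which is plausible here given the
123,832 dictionary entries (~990 KB raw). The encodings a writer actually
produced can be read back from the file footer; a minimal sketch using
the parquet-hadoop API (readFooter is deprecated in newer releases but
still works):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class ShowEncodings {
    public static void main(String[] args) throws Exception {
        // Reads only the footer; no row data is touched.
        ParquetMetadata footer =
            ParquetFileReader.readFooter(new Configuration(), new Path(args[0]));
        for (BlockMetaData block : footer.getBlocks()) {
            for (ColumnChunkMetaData col : block.getColumns()) {
                // e.g. [occurred_at] -> [BIT_PACKED, PLAIN, PLAIN_DICTIONARY, RLE]
                System.out.println(col.getPath() + " -> " + col.getEncodings());
            }
        }
    }
}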
