thnx, will do

On Thu, Feb 4, 2016 at 11:49 PM, Jason Altekruse <altekruseja...@gmail.com>
wrote:
> We haven't turned on the 2.0 encodings in Drill's Parquet writer, so they
> have not been thoroughly tested. That being said, we do use the standard
> parquet-mr interfaces for reading Parquet files in our complex Parquet
> reader. We are currently depending on 1.8.1 in Drill, so it should be
> compatible.
>
> I think it would be safest to run with `store.parquet.use_new_reader` set
> to true if you are going to be working with Parquet 2.0 files right now.
>
> - Jason
>
> On Thu, Feb 4, 2016 at 3:40 PM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > OK, the automatic handling and encoding options improve a lot in Parquet
> > 2.0. (Manual override is not an option.)
> >
> > I'm using parquet-mr/parquet-avro to create Parquet 2 files
> > (ParquetProperties.WriterVersion.PARQUET_2_0).
> >
> > Drill seems to read them just fine, but I wonder if there are any
> > gotchas.
> >
> > Regards,
> > -Stefán
> >
> > On Thu, Feb 4, 2016 at 4:51 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > Hi again,
> > >
> > > I did a little test: ~5 million fairly wide records take 791 MB in
> > > Parquet without dictionary encoding and 550 MB with dictionary
> > > encoding enabled (the non-dictionary-encoded file is a whopping 45%
> > > bigger). The plain, non-dictionary-encoded file returns results for
> > > identical queries in ~20% less time than the one that uses dictionary
> > > encoding.
> > >
> > > Regards,
> > > -Stefán
> > >
> > > On Thu, Feb 4, 2016 at 3:48 PM, Stefán Baxter <ste...@activitystream.com>
> > > wrote:
> > >
> > >> Hi Jason,
> > >>
> > >> Thank you for the explanation.
> > >>
> > >> I have several *low* cardinality fields that contain semi-long values
> > >> and they are, I think, a perfect candidate for dictionary encoding.
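[The reader behavior Jason suggests above, and the writer-side dictionary toggle discussed below, are both ordinary Drill session options. A sketch, with option names as they appeared in Drill 1.x; verify against your version's `sys.options`:]

```sql
-- Use the newer (vectorized) Parquet reader for Parquet 2.0 files:
ALTER SESSION SET `store.parquet.use_new_reader` = true;
-- Writer-side dictionary encoding is likewise a session toggle:
ALTER SESSION SET `store.parquet.enable_dictionary_encoding` = true;
```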
> > >>
> > >> I assumed that the choice to use dictionary encoding was a bit
> > >> smarter than this and would rely on String-typed columns where x%
> > >> repeated values were a clear signal for its use.
> > >>
> > >> If you can outline what needs to be done and where, then I will
> > >> gladly take a stab at it :).
> > >>
> > >> Several questions along those lines:
> > >>
> > >>    - Does the Parquet library that Drill uses allow for programmatic
> > >>    selection?
> > >>    - What metadata, regarding the column content, is available when
> > >>    the choice is made?
> > >>    - Where in the Parquet part of Drill is this logic?
> > >>    - Is there no ongoing effort in parquet-mr to make the automatic
> > >>    handling smarter?
> > >>    - Are all Parquet encoding options being used by Drill?
> > >>       - Like the encoding of longs where the delta between
> > >>       semi-subsequent numbers is stored. (As I understand it.)
> > >>
> > >> Thanks again.
> > >>
> > >> Regards,
> > >> -Stefan
> > >>
> > >> On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <altekruseja...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Stefan,
> > >>>
> > >>> There is a reason that dictionary encoding is disabled by default.
> > >>> The parquet-mr library we leverage for writing Parquet files
> > >>> currently has the behavior of writing nearly all columns as
> > >>> dictionary encoded for all types when dictionary encoding is
> > >>> enabled. This includes columns with integers, doubles, dates and
> > >>> timestamps.
> > >>>
> > >>> Do you have some data that you believe is well suited for dictionary
> > >>> encoding in the dataset? I think there are good uses for it, such as
> > >>> data coming from systems that support enumerations, that might be
> > >>> represented as strings when exported from a database for use with
> > >>> Big Data tools like Drill.
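[The two ideas raised in the question list above, a smarter per-column dictionary decision and delta encoding of longs, can be sketched conceptually. This is not Drill's or parquet-mr's actual code; it is a hypothetical illustration of the kind of heuristic being proposed: sample a column's values and only dictionary-encode when they repeat enough to pay off.]

```python
from itertools import accumulate

def should_dictionary_encode(sample, max_unique_ratio=0.5):
    """Hypothetical heuristic: dictionary-encode a column only when the
    sampled values repeat often enough (unique/total below a threshold)."""
    if not sample:
        return False
    return len(set(sample)) / len(sample) <= max_unique_ratio

def delta_encode(values):
    """Sketch of the idea behind Parquet's delta encoding for longs:
    store the first value, then successive differences, which stay small
    when values (e.g. timestamps) are sorted or clustered."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

# Low-cardinality "enumeration" column: a clear dictionary candidate.
status = ["active", "closed", "pending"] * 2000
# High-cardinality ID column: a dictionary would only add overhead.
ids = [f"user-{i}" for i in range(6000)]
print(should_dictionary_encode(status))  # True
print(should_dictionary_encode(ids))     # False

timestamps = [1_454_601_600 + i * 30 for i in range(5)]
print(delta_encode(timestamps))          # [1454601600, 30, 30, 30, 30]
# delta_encode round-trips via a running sum:
assert list(accumulate(delta_encode(timestamps))) == timestamps
```

[A real implementation would buffer only a bounded sample and account for value widths, but the decision rule is the same shape.]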
> > >>> Unfortunately, we do not currently provide a mechanism for
> > >>> requesting dictionary encoding on only some columns, and we don't
> > >>> do anything like buffering values to determine whether a given
> > >>> column is well suited for dictionary encoding before starting to
> > >>> write them.
> > >>>
> > >>> In many cases it obviously is not a good choice, and so we actually
> > >>> take a performance hit re-materializing the data out of the
> > >>> dictionary upon read.
> > >>>
> > >>> If you would be interested in trying to contribute such an
> > >>> enhancement, I would be willing to help you get started with it.
> > >>>
> > >>> - Jason
> > >>>
> > >>> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <ste...@activitystream.com>
> > >>> wrote:
> > >>>
> > >>> > Hi,
> > >>> >
> > >>> > I'm converting Avro to Parquet and I'm getting this log entry
> > >>> > back for a timestamp field:
> > >>> >
> > >>> > Written 1,008,842B for [occurred_at] INT64: 591,435 values,
> > >>> > 2,169,557B raw, 1,008,606B comp, 5 pages, encodings: [BIT_PACKED,
> > >>> > PLAIN, PLAIN_DICTIONARY, RLE], dic { 123,832 entries, 990,656B
> > >>> > raw, 123,832B comp}
> > >>> >
> > >>> > Can someone please tell me whether this is the expected encoding
> > >>> > for a timestamp field?
> > >>> >
> > >>> > I'm a bit surprised that it seems to be dictionary based. (Yes, I
> > >>> > have enabled dictionary encoding for Parquet files.)
> > >>> >
> > >>> > Regards,
> > >>> > -Stefán
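[For context on why the writer still chose a dictionary for this timestamp column: with 123,832 distinct values out of 591,435, a dictionary plus bit-packed indices is smaller than plain INT64 encoding. A back-of-the-envelope estimate using only the figures from the log line quoted above; the fixed-width bit-packing model here is a simplification of what parquet-mr actually does:]

```python
import math

# Figures taken from the writer log line for [occurred_at].
n_values = 591_435   # INT64 values written
n_unique = 123_832   # dictionary entries reported

plain_bytes = n_values * 8                   # PLAIN: 8 bytes per INT64
dict_bytes = n_unique * 8                    # matches the reported 990,656B raw dictionary
index_bits = math.ceil(math.log2(n_unique))  # bits needed per dictionary index (17)
index_bytes = n_values * index_bits // 8

print(plain_bytes)                # 4731480
print(dict_bytes + index_bytes)   # ~2.25 MB: roughly half of plain
```

[The estimate lands close to the 2,169,557B "raw" figure in the log, which is why PLAIN_DICTIONARY shows up even for timestamps when dictionary encoding is on.]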