Hi again,

I did a little test: ~5 million fairly wide records take 791 MB in Parquet without dictionary encoding and 550 MB with dictionary encoding enabled (the non-dictionary-encoded file is a whopping ~44% bigger).

However, the plain, non-dictionary-encoded file returns results for identical queries in ~20% less time than the one that uses dictionary encoding.
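
The "x% repeated values" idea from the quoted mail below could be sketched roughly as follows. This is only an illustration in Python, not how parquet-mr or Drill decide today: the function name `should_dictionary_encode`, the 10,000-value sample size and the 10% distinct-value threshold are all hypothetical choices made for the example.

```python
from itertools import islice
from typing import Hashable, Iterable


def should_dictionary_encode(
    values: Iterable[Hashable],
    sample_size: int = 10_000,        # assumption: how many values to buffer
    max_distinct_ratio: float = 0.1,  # assumption: "x%" cut-off for distinct values
) -> bool:
    """Guess from a buffered sample whether a column suits dictionary encoding.

    Hypothetical heuristic: buffer the first `sample_size` values and accept
    the column only if the share of distinct values in that sample is small.
    """
    sample = list(islice(values, sample_size))
    if not sample:
        # Nothing to judge by, so fall back to plain encoding.
        return False
    distinct_ratio = len(set(sample)) / len(sample)
    return distinct_ratio <= max_distinct_ratio


# Low-cardinality strings: 3 distinct values among 3,000.
status = ["active", "inactive", "pending"] * 1000
# Timestamp-like column where every value is unique.
timestamps = list(range(3000))

print(should_dictionary_encode(status))      # True
print(should_dictionary_encode(timestamps))  # False
```

For comparison, the occurred_at column in the log further down has 123,832 dictionary entries for 591,435 values, a distinct ratio of about 21%, so a representative sample would fail a 10% threshold and the column would be written plain.
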

Regards,
-Stefán

On Thu, Feb 4, 2016 at 3:48 PM, Stefán Baxter <ste...@activitystream.com> wrote:

> Hi Jason,
>
> Thank you for the explanation.
>
> I have several *low* cardinality fields that contain semi-long values and
> they are, I think, a perfect candidate for dictionary encoding.
>
> I assumed that the choice to use dictionary encoding was a bit smarter
> than this and would rely on string-type columns where x% repeated values
> were a clear signal for its use.
>
> If you can outline what needs to be done and where then I will gladly
> take a stab at it :).
>
> Several questions along those lines:
>
> - Does the Parquet library that Drill uses allow for programmatic
>   selection?
> - What metadata, regarding the column content, is available when the
>   choice is made?
> - Where in the Parquet part of Drill is this logic?
> - Is there no ongoing effort in parquet-mr to make the automatic
>   handling smarter?
> - Are all Parquet encoding options being used by Drill?
>   - Like the encoding of longs where the delta between semi-subsequent
>     numbers is stored. (As I understand it)
>
> Thanks again.
>
> Regards,
> -Stefan
>
> On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <altekruseja...@gmail.com>
> wrote:
>
>> Hi Stefan,
>>
>> There is a reason that dictionary is disabled by default. The parquet-mr
>> library we leverage for writing parquet files currently has the behavior
>> to write nearly all columns as dictionary encoded for all types when
>> dictionary encoding is enabled. This includes columns with integers,
>> doubles, dates and timestamps.
>>
>> Do you have some data that you believe is well suited for dictionary
>> encoding in the dataset? I think there are good uses for it, such as data
>> coming from systems that support enumerations, that might be represented
>> as strings when exported from a database for use with Big Data tools like
>> Drill. Unfortunately we do not currently provide a mechanism for
>> requesting dictionary encoding on only some columns, and we don't do
>> anything like buffer values to determine if a given column is well-suited
>> for dictionary encoding before starting to write them.
>>
>> In many cases it obviously is not a good choice, and so we actually take a
>> performance hit re-materializing the data out of the dictionary upon read.
>>
>> If you would be interested in trying to contribute such an enhancement I
>> would be willing to help you get started with it.
>>
>> - Jason
>>
>> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <ste...@activitystream.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I'm converting Avro to Parquet and I'm getting this log entry back for a
>> > timestamp field:
>> >
>> > Written 1,008,842B for [occurred_at] INT64: 591,435 values, 2,169,557B raw,
>> > 1,008,606B comp, 5 pages, encodings: [BIT_PACKED, PLAIN, PLAIN_DICTIONARY,
>> > RLE], dic { 123,832 entries, 990,656B raw, 123,832B comp}
>> >
>> > Can someone please tell me if this is the expected encoding for a
>> > timestamp field?
>> >
>> > I'm a bit surprised that it seems to be dictionary based. (Yes, I have
>> > enabled dictionary encoding for Parquet files).
>> >
>> > Regards,
>> > -Stefán