Hi Jason,

Thank you for the explanation.

I have several *low* cardinality fields that contain semi-long values, and
they are, I think, perfect candidates for dictionary encoding.

I assumed that the choice to use dictionary encoding was a bit smarter than
this and would, for example, rely on string-typed columns where x% repeated
values are a clear signal for its use. Something like the rough sketch below
is what I had in mind.
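
To illustrate, a minimal sketch of the kind of heuristic I imagined: buffer a
sample of a column's values and only pick dictionary encoding when the share
of distinct values is low. This is hypothetical code, not anything from
parquet-mr; the class name, sample size and threshold are all made up:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class DictionaryHeuristic {

        // Made-up thresholds, purely to illustrate the idea.
        private static final int SAMPLE_SIZE = 10_000;
        private static final double MAX_DISTINCT_RATIO = 0.10;

        // Buffer the first SAMPLE_SIZE values of a column and only choose
        // dictionary encoding when the ratio of distinct values is low.
        static boolean shouldUseDictionary(List<String> sampledValues) {
            int sampled = Math.min(sampledValues.size(), SAMPLE_SIZE);
            if (sampled == 0) {
                return false;
            }
            Set<String> distinct = new HashSet<>();
            for (int i = 0; i < sampled; i++) {
                distinct.add(sampledValues.get(i));
            }
            // Low cardinality relative to the sample => dictionary is a good bet.
            return (double) distinct.size() / sampled <= MAX_DISTINCT_RATIO;
        }
    }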

If you can outline what needs to be done, and where, then I will gladly take
a stab at it :).

Several questions along those lines:

   - Does the Parquet library that Drill uses allow for programmatic
   selection?
   - What metadata, regarding the column content, is available when the
   choice is made?
   - Where in the Parquet part of Drill is this logic?
   - Is there no ongoing effort in parquet-mr to make the automatic
   handling smarter?
   - Are all Parquet encoding options being used by Drill? For example, the
   delta encoding of longs, where only the delta between subsequent numbers
   is stored (as I understand it). A rough writer sketch follows below.
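
For context, this is roughly the per-column control I am after, sketched
against the AvroParquetWriter builder (we write Avro into Parquet). To be
clear: the per-column withDictionaryEncoding(column, flag) overload and the
"event_type" column name are assumptions on my part, not verified parquet-mr
API; only the boolean variant and withWriterVersion are things I have
actually found in the builder:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.column.ParquetProperties;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class WriterSketch {
        public static ParquetWriter<GenericRecord> open(Schema schema, Path out)
                throws java.io.IOException {
            return AvroParquetWriter.<GenericRecord>builder(out)
                .withSchema(schema)
                // Plain encoding by default ...
                .withDictionaryEncoding(false)
                // ... but dictionary for the one low-cardinality column.
                // ("event_type" is a made-up name; this per-column overload
                // is the part I am hoping exists or could be added.)
                .withDictionaryEncoding("event_type", true)
                // Writer version 2 is, as I understand it, what turns on the
                // delta encodings (e.g. DELTA_BINARY_PACKED for longs).
                .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
                .build();
        }
    }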

Thanks again.

Regards,
 -Stefan

On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

> Hi Stefan,
>
> There is a reason that dictionary encoding is disabled by default. The
> parquet-mr library we leverage for writing Parquet files currently writes
> nearly all columns as dictionary encoded, for all types, when dictionary
> encoding is enabled. This includes columns with integers, doubles, dates
> and timestamps.
>
> Do you have some data that you believe is well suited for dictionary
> encoding in the dataset? I think there are good uses for it, such as data
> coming from systems that support enumerations, that might be represented as
> strings when exported from a database for use with Big Data tools like
> Drill. Unfortunately we do not currently provide a mechanism for requesting
> dictionary encoding on only some columns, and we don't do anything like
> buffer values to determine if a given column is well-suited for dictionary
> encoding before starting to write them.
>
> In many cases it obviously is not a good choice, and so we actually take a
> performance hit re-materializing the data out of the dictionary upon read.
>
> If you would be interested in trying to contribute such an enhancement I
> would be willing to help you get started with it.
>
> - Jason
>
> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > I'm converting Avro to Parquet and I'm getting this log entry back for a
> > timestamp field:
> >
> > Written 1,008,842B for [occurred_at] INT64: 591,435 values, 2,169,557B raw,
> > 1,008,606B comp, 5 pages, encodings: [BIT_PACKED, PLAIN, PLAIN_DICTIONARY,
> > RLE], dic { 123,832 entries, 990,656B raw, 123,832B comp}
> >
> > Can someone please tell me if this is the expected encoding for a
> > timestamp field?
> >
> > I'm a bit surprised that it seems to be dictionary based. (Yes, I have
> > enabled dictionary encoding for Parquet files).
> >
> > Regards,
> >  -Stefán
> >
>
