Hi again,

I did a little test: ~5 million fairly wide records take 791 MB in
Parquet without dictionary encoding and 550 MB with dictionary encoding
enabled (the non-dictionary-encoded file is a whopping ~44% bigger).
The plain, non-dictionary-encoded file returns results for identical
queries in ~20% less time than the one that uses dictionary encoding.
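
For anyone who wants to reproduce this, a rough sketch of the comparison
in Drill SQL (the paths and table names are made up; the session option
is Drill's store.parquet.enable_dictionary_encoding):

    ALTER SESSION SET `store.parquet.enable_dictionary_encoding` = false;
    CREATE TABLE dfs.tmp.`events_plain` AS SELECT * FROM dfs.`/data/events.avro`;

    ALTER SESSION SET `store.parquet.enable_dictionary_encoding` = true;
    CREATE TABLE dfs.tmp.`events_dict` AS SELECT * FROM dfs.`/data/events.avro`;

Then compare the directory sizes on disk and time the same SELECT
against both tables.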

Regards,
 -Stefán



On Thu, Feb 4, 2016 at 3:48 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi Jason,
>
> Thank you for the explanation.
>
> I have several *low* cardinality fields that contain semi-long values and
> they are, I think, perfect candidates for dictionary encoding.
>
> I assumed that the choice to use dictionary encoding was a bit smarter
> than this and would rely on signals such as string-typed columns where x%
> repeated values are a clear indicator for its use.
>
> If you can outline what needs to be done and where, then I will gladly
> take a stab at it :).
>
> Several questions along those lines:
>
>    - Does the Parquet library that Drill uses allow for programmatic
>    (per-column) selection?
>    - What metadata, regarding the column content, is available when the
>    choice is made?
>    - Where in the Parquet part of Drill is this logic?
>    - Is there no ongoing effort in parquet-mr to make the automatic
>    handling smarter?
>    - Are all Parquet encoding options being used by Drill? For example,
>    the encoding of longs where the delta between subsequent numbers is
>    stored, as I understand it (see the sketch just below this list).
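>
> (The long-delta encoding mentioned above is DELTA_BINARY_PACKED in the
> Parquet format spec. A toy Java sketch of the core idea, ignoring the
> block/miniblock bit-packing the real encoding adds; the values are made
> up:)
>
>     public class DeltaSketch {
>         public static void main(String[] args) {
>             // Store a base value plus the differences between
>             // consecutive values; for roughly monotonic data such as
>             // timestamps the deltas are small and cheap to bit-pack.
>             long[] values = {1454594400000L, 1454594400123L, 1454594400987L};
>             long[] deltas = new long[values.length - 1];
>             for (int i = 1; i < values.length; i++) {
>                 deltas[i - 1] = values[i] - values[i - 1];
>             }
>             // Stored form: base = 1454594400000, deltas = [123, 864]
>             System.out.println(values[0] + " " + java.util.Arrays.toString(deltas));
>         }
>     }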
>
> Thanks again.
>
> Regards,
>  -Stefan
>
>
>
>
> On Thu, Feb 4, 2016 at 3:36 PM, Jason Altekruse <altekruseja...@gmail.com>
> wrote:
>
>> Hi Stefan,
>>
>> There is a reason that dictionary encoding is disabled by default. The
>> parquet-mr library we leverage for writing parquet files currently
>> writes nearly all columns as dictionary encoded, for all types, when
>> dictionary encoding is enabled. This includes columns with integers,
>> doubles, dates and timestamps.
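>>
>> (To make that concrete: in parquet-mr the knob is a single boolean on
>> the writer, not a per-column setting. A hedged sketch using the
>> AvroParquetWriter builder; the avroSchema variable and the output path
>> are placeholders:)
>>
>>     import org.apache.avro.generic.GenericRecord;
>>     import org.apache.hadoop.fs.Path;
>>     import org.apache.parquet.avro.AvroParquetWriter;
>>     import org.apache.parquet.hadoop.ParquetWriter;
>>     import org.apache.parquet.hadoop.metadata.CompressionCodecName;
>>
>>     // One flag turns dictionary encoding on or off for every column;
>>     // nothing here lets you request it for only some columns.
>>     ParquetWriter<GenericRecord> writer =
>>         AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/out.parquet"))
>>             .withSchema(avroSchema)              // your Avro schema
>>             .withCompressionCodec(CompressionCodecName.SNAPPY)
>>             .withDictionaryEncoding(true)        // all or nothing
>>             .build();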
>>
>> Do you have some data that you believe is well suited for dictionary
>> encoding in the dataset? I think there are good uses for it, such as data
>> coming from systems that support enumerations, that might be represented
>> as
>> strings when exported from a database for use with Big Data tools like
>> Drill. Unfortunately we do not currently provide a mechanism for
>> requesting
>> dictionary encoding on only some columns, and we don't do anything like
>> buffer values to determine if a given column is well-suited for dictionary
>> encoding before starting to write them.
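>>
>> (If someone picks this up, one possible shape for that missing
>> heuristic: buffer the first N values of a column and only choose
>> dictionary encoding when the distinct ratio is low. This is a sketch of
>> the idea only, not existing Drill or parquet-mr code, and the 10%
>> threshold is invented:)
>>
>>     import java.util.HashSet;
>>     import java.util.List;
>>
>>     // Decide per column, from a buffered sample, whether dictionary
>>     // encoding is likely to pay off before any pages are written.
>>     static boolean shouldDictionaryEncode(List<?> sample) {
>>         int distinct = new HashSet<>(sample).size();
>>         // Invented threshold: encode only if fewer than 10% of the
>>         // sampled values are distinct.
>>         return distinct < sample.size() * 0.10;
>>     }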
>>
>> In many cases it obviously is not a good choice, and so we actually take a
>> performance hit re-materializing the data out of the dictionary upon read.
>>
>> If you would be interested in trying to contribute such an enhancement I
>> would be willing to help you get started with it.
>>
>> - Jason
>>
>> On Wed, Feb 3, 2016 at 5:15 AM, Stefán Baxter <ste...@activitystream.com>
>> wrote:
>>
>> > Hi,
>> >
>> > I'm converting Avro to Parquet and I'm getting this log entry back
>> > for a timestamp field:
>> >
>> > Written 1,008,842B for [occurred_at] INT64: 591,435 values,
>> > 2,169,557B raw, 1,008,606B comp, 5 pages, encodings: [BIT_PACKED,
>> > PLAIN, PLAIN_DICTIONARY, RLE], dic { 123,832 entries, 990,656B raw,
>> > 123,832B comp}
>> >
>> > Can someone please tell me if this is the expected encoding for a
>> > timestamp field?
>> >
>> > I'm a bit surprised that it seems to be dictionary-based. (Yes, I have
>> > enabled dictionary encoding for Parquet files).
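>> >
>> > I enabled it with Drill's session option:
>> >
>> >     ALTER SESSION SET `store.parquet.enable_dictionary_encoding` = true;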
>> >
>> > Regards,
>> >  -Stefán
>> >
>>
>
>
