Drill Parquet Partitioning Method

John Omernik Mon, 03 Apr 2017 11:08:58 -0700

So as a user of Drill for a while now, I have gotten used to the idea of
partitions just being values, instead of key=value like other tools (Hive,
Impala, others).


From a user/analyst perspective, the dir0, dir1, dirN methodology provides
quite a bit of flexibility, but for it to be intuitive, we have to know
what each field is... thus there has to be some transfer of knowledge about
what the directory-name values mean.

With other methods, the key is right there in the directory name.

Now, I am really getting into the nitty gritty here. I know we could do
things like create a view that gives dir0 a meaningful name.

For example:

Drill Method:
mytable
- 2017-04-01
- 2017-04-02
- 2017-04-03

Vs.

Hive method:
mytable
- day=2017-04-01
- day=2017-04-02
- day=2017-04-03


However, creating a view takes extra admin effort, and Hive, Spark, etc.
all know the key=value method.

Drill stands on its own here.  So, my thought is this: dir0 is nice, it
provides flexibility.  But why not have Drill be able to infer key=value,
and when writing partitions (although I don't think Drill does this yet)
write using the alias specified?

The more important part is the reading, as the writing doesn't really work
yet. (We don't Insert into mytable Partition by day like we do in Hive; if
we want to write a partition, we create table mytable/partition, and thus
could easily put the key=value in there as needed.)

So, the reading.  A. This must not break anything existing.  Thus, dir0
must always work. B. Can we use a select option to enable/disable? (Would
we even need this?)

Basically, if there is an = in the partition directory name, split on the
=, make the right side the value and the left side the alias.
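
To make the split rule concrete, here is a minimal Python sketch. It is not
Drill code; the function name parse_partition_dirs is hypothetical. It
assumes dirN keeps the raw directory name so existing queries behave as
before, and that the alias is simply added as an extra column.

```python
def parse_partition_dirs(dir_path):
    """Map each directory level of a relative path to column values.

    Assumption: dirN always keeps the raw directory name (backward
    compatible); a key=value name additionally exposes key -> value.
    """
    columns = {}
    for level, name in enumerate(dir_path.strip("/").split("/")):
        columns[f"dir{level}"] = name
        # Split on the first "=" only, so values may contain "=".
        key, sep, value = name.partition("=")
        if sep and key:
            columns[key] = value
    return columns


print(parse_partition_dirs("day=2017-04-01"))
print(parse_partition_dirs("2017-04-01"))
```

A directory without an = yields only the dirN column, which matches the
"dir0 must always work" requirement above.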

The hard parts:

The planner would have to be aware of this, so when a scan of the directory
occurs, the field name as an alias could be valid...

If I did "select * from mytable where day = '2017-04-01'" but that field
didn't exist, it could error out. That said, we should know that the
directories have key=value format when we scan for files... it's not like
that is impossible (especially since we don't know what fields are in the
Parquet files unless we do metadata).
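
As a rough illustration of the "we should know when we scan" idea, the
sketch below (again Python, not Drill internals) takes the directory paths
found during a scan and reports which levels consistently use one key. The
name infer_partition_keys and the rule "every directory at a level must
share the same key" are my assumptions.

```python
def infer_partition_keys(dir_paths):
    """Return {level: key} for levels where every directory is key=value
    with the same key. Levels with plain or mixed names are left out."""
    keys_by_level = {}
    for path in dir_paths:
        for level, name in enumerate(path.strip("/").split("/")):
            key, sep, _ = name.partition("=")
            # None marks a directory name that is not in key=value form.
            found = key if sep and key else None
            keys_by_level.setdefault(level, set()).add(found)
    return {
        level: next(iter(keys))
        for level, keys in keys_by_level.items()
        if len(keys) == 1 and None not in keys
    }


print(infer_partition_keys(["day=2017-04-01", "day=2017-04-02"]))
print(infer_partition_keys(["2017-04-01", "2017-04-02"]))
```

With a result like this, a planner could decide whether "day" is a valid
alias before it errors out on an unknown field.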

This would also be something we should include in metadata... If we do
Key=value then boom, write to the metadata cache, and speed up planning!

So why do I think we need this?

It would sure make data created by other sources easier/quicker to read. We
wouldn't be string parsing directory names at query time, and it would just
be another avenue to make Drill a natural fit in the ecosystem...


I would be interested in community thoughts here. If there is interest, I
will make a Jira.


John
