Re: Drill Parquet Partitioning Method

Jesse Yates Mon, 03 Apr 2017 15:24:06 -0700

I filed a JIRA for this a while back:
https://issues.apache.org/jira/browse/DRILL-4615

And even more recently finished a patch on our internal branch (pegged at
1.6 right now). Since 1.6, things have shifted a bit in the abstraction of
managing directory partitions, so the patch wouldn't be directly
applicable. Where I got caught up was in the directory depth issue - there
is/was a bunch of expectation in the Easy reader, and fs partition rule,
around directory names being tied to known depth in the directory tree. It
makes it easy to do dir0, dir1, but hard for any other case :-/

The workaround I used was having a now config for the workspace, a
DirectoryStrategy, that uses at least a known directory name for each level
of the hierarchy and a parsing strategy for the value. So you would have a
directory list of, for your example, ['day'], or for a more complicated
structure like:
  /mytable
     /day=1
       /hour=2

a directory list like ['day', 'hour']. Not the best, but OK if you don't
have changing directory structures for a workspace.

Happy to post the patch if someone is interested, but it would take a bit
of work (for which I don't have time right now) to finagle it into master.

--Jesse

On Mon, Apr 3, 2017 at 1:56 PM Jinfeng Ni <j...@apache.org> wrote:

That's a good idea.

Let me clarify one thing first. Drill has two kinds of partitions:
auto partition, or directory-based partition.

The first one is a result of using drill's CTAS partition by statement
[1].  Both partition column name and column value are written and
encoded in the output file. When querying such data, you can directly
specify those partition columns in a WHERE/HAVING clause, and Drill
will do partition pruning.

The second one, directory-based partition, is to handle the cases
where the input files are generated by other tools. It could be
created by hive's or impala's partition by clause, or it could
generated by some other ecosystem. For hive generated files, you are
right the directory name has the pattern of "key=value", for files
generated by other tools, we may see "value" only as the directory
name. For the hive generated files, what we recommend today to user do
is to created a view ( i.e. substring(dir0, 4) as dateCol), to extract
the value from the directory name.

Your suggestion makes sense to me. We probably may consider adding an
option to the file system format plugin, to 'auto' parse the
"key=value" pattern, and replace "dirN" with "key".   That option
would work in a similar way as the options (skipFirstLine,
extractHeader, etc) ) we added for text file [2]. I agree with you
that it would make Drill more seamlessly work with other ecosystems.

If we do that, I'm not sure if Drill could error out for "select *
from mytable where day = '2017-04-01' ", if there is no "day" field in
the directory names. The thing is day could come from either
directory, or from data files. That probably need long discussion
about how Drill should handle a possible non-exist column.

1. https://drill.apache.org/docs/partition-by-clause/
2. https://drill.apache.org/docs/text-files-csv-tsv-psv/

On Mon, Apr 3, 2017 at 11:08 AM, John Omernik <j...@omernik.com> wrote:
> So as a user of Drill now for a while, I have gotten used to the idea of
> partitions just being values, instead of key=value like other things
(hive,
> impala, others).
>
> From a user/analyst perspective, the dir0, dir1, dirN methodology provides
> quite a bit of flexibility, but to be intuitive, we have to know what that
> field is... thus there has to be some transfer of knowledge on what that
> value for directory names are.
>
> With other methods, the key is right there in the directory name.
>
> Now, I am really getting to the nitty gritty here, I know we could do
> things like create a view to name the dir0 to be something.
>
> For example:
>
> Drill Method:
> mytable
> - 2017-04-01
> - 2017-04-02
> - 2017-04-03
>
> Vs.
>
> Hive method:
> mytable
> - day=2017-04-01
> - day=2017-04-02
> - day=2017-04-03
>
>
> However, it takes extra admin effort, and hive, spark, etc all know the
> key=value method.
>
> Drill stands on its own here.  So, my thought is this, dir0 is nice, it
> provides flexibility.  But why not have drill be able to infer key=value,
> and when writing partitions (although I don't think Drill does this yet)
> write using alias specified?
>
> The more important part is the reading as the writing doesn't really work
> yet. (We don't Insert into mytable Partition by day like we do in hive, if
> we want to write a partition, we create table mytable/partition thus could
> easily put the key value in there as needed)
>
> So the reading.  A. This could not break anything existing.  Thus, dir0
> must always work. B. Can we use a select option to enable/disable? (Would
> we even need this?).
>
> Basically, if there is a = in the partition name, split by =, make the
> value to be the right side, alias be left side.
>
> The hard parts:
>
> The planner would have to be aware of this, so when a scan of the
directory
> occurs, the field name as an alias could be valid...
>
> If I did "select * from mytable where day = '2017-04-01' but that field
> didn't exist, it could error out, that said, we should know that the
> directories have Key=Value format when we scan for files... it's not like
> that is impossible (especially since we don't know what fields are in the
> parquet files unless we do metadata).
>
> This would also be something we should include in metadata... If we do
> Key=value then boom, write to the metadata cache, and speed up planning!
>
> So why do I think we need this?
>
> It would sure make data created by other sources easier/quicker to read.
We
> wouldn't be string parsing directory names at query time, and it would
just
> be another avenue to make Drill a natural fit in the ecosystem...
>
>
> I would be interested in community thoughts here, if there is interest I
> will make a Jira
>
>
> John

-- 
Jesse Yates
Founder/CEO Fineo.io
Book a meeting: https://calendly.com/jyates

Re: Drill Parquet Partitioning Method

Reply via email to