So, as a user of Drill for a while now, I have gotten used to the idea of partitions just being values, instead of key=value like in other tools (Hive, Impala, others).
From a user/analyst perspective, the dir0, dir1, dirN methodology provides quite a bit of flexibility, but for it to be intuitive, we have to know what each field is... thus there has to be some transfer of knowledge about what the directory-name values mean. With other methods, the key is right there in the directory name. Now, I am really getting into the nitty-gritty here; I know we could do things like create a view to give dir0 a meaningful name.

For example:

Drill method:

mytable
 - 2017-04-01
 - 2017-04-02
 - 2017-04-03

vs. Hive method:

mytable
 - day=2017-04-01
 - day=2017-04-02
 - day=2017-04-03

However, the view approach takes extra admin effort, and Hive, Spark, etc. all know the key=value method. Drill stands on its own here.

So, my thought is this: dir0 is nice, it provides flexibility. But why not have Drill be able to infer key=value, and, when writing partitions (although I don't think Drill does this yet), write using the alias specified? The more important part is the reading, as the writing doesn't really work yet. (We don't "insert into mytable partition by day" like we do in Hive; if we want to write a partition, we create table mytable/partition, and thus could easily put the key=value in there as needed.)

So, the reading.

A. This could not break anything existing. Thus, dir0 must always work.
B. Can we use a select option to enable/disable? (Would we even need this?)

Basically, if there is an = in the partition name, split on the =, make the right side the value and the left side the alias.

The hard parts: the planner would have to be aware of this, so that when a scan of the directory occurs, the field name as an alias could be valid... If I did "select * from mytable where day = '2017-04-01'" but that field didn't exist, it could error out. That said, we should know that the directories have key=value format when we scan for files... it's not like that is impossible (especially since we don't know what fields are in the Parquet files unless we do metadata).
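
To make the "split on =" reading rule concrete, here is a minimal sketch in Python. It is for illustration only and is not Drill code (Drill itself is Java). The function name partition_columns is hypothetical, and having dirN keep the raw directory name is my assumption about how to satisfy point A (existing dir0 queries keep working).

```python
def partition_columns(rel_dir):
    """Map a table-relative directory path to partition columns.

    Assumption: dirN always holds the raw directory name, so existing
    queries on dir0, dir1, ... behave as before. A level that looks like
    key=value additionally exposes `key` as an alias for `value`.
    """
    columns = {}
    levels = [p for p in rel_dir.strip("/").split("/") if p]
    for depth, name in enumerate(levels):
        # Positional column, unchanged from today's behaviour.
        columns[f"dir{depth}"] = name
        # Split on the first '=' only; values may themselves contain '='.
        key, sep, value = name.partition("=")
        if sep and key:
            # Do not let an alias overwrite a column that already exists.
            columns.setdefault(key, value)
    return columns


if __name__ == "__main__":
    print(partition_columns("2017-04-01"))
    print(partition_columns("day=2017-04-01"))
```

With this rule, a plain directory such as 2017-04-01 yields only dir0, while day=2017-04-01 yields both dir0 and a day column, so a filter on day would only be valid when the directories actually follow the key=value format.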
The key=value information would also be something we should include in metadata... If the directories are key=value, then boom, write that to the metadata cache and speed up planning!

So why do I think we need this? It would sure make data created by other sources easier/quicker to read. We wouldn't be string-parsing directory names at query time, and it would just be another avenue to make Drill a natural fit in the ecosystem...

I would be interested in community thoughts here; if there is interest, I will make a Jira.

John