If the partitioning column has small cardinality yet you still end up with 50k small files, you likely have many parallel writer minor fragments (threads). By default, each writer minor fragment works independently, so with cardinality C and N writer minor fragments you can end up with up to C*N small files. For example, C = 50 distinct partition values written by N = 1,000 minor fragments can produce up to 50,000 files.
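For reference, the file explosion typically comes from a partitioned CTAS like the following sketch (workspace, table, and column names are hypothetical):

```sql
-- Each of the N writer fragments emits its own file per distinct
-- (col1, col2) value it sees, so up to C * N files can result.
CREATE TABLE dfs.tmp.`events_partitioned`
PARTITION BY (col1, col2)
AS SELECT col1, col2, payload
FROM dfs.`/data/events`;
```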
There are two possible solutions:

1) Consider setting the following option to true. This adds network communication/CPU cost, but it reduces the number of files to C:

alter session set `store.partition.hash_distribute` = true; -- default is false

2) Reduce the number of parallel writer minor fragments by tuning other parameters before you run the CTAS partition statement.

For partition pruning, Drill works at the row-group level, not at the page level.

On Fri, Sep 1, 2017 at 9:02 AM, Padma Penumarthy <ppenumar...@mapr.com> wrote:

> Have you tried building the metadata cache file using the "refresh table metadata" command? That will help reduce the planning time. Is most of the time spent in planning or in execution?
>
> Pruning is done at the row-group level, i.e. at the file level (we create one file per row group). We do not support pruning at the page level.
> I am thinking that if it created 50K files, your cardinality is high. You might want to consider putting a directory hierarchy in place; for example, you can create a directory for each unique value of column 1 and a file for each unique value of column 2 underneath. If partitioning is done correctly then, depending on the filters, we should not read more row groups than needed.
>
> Thanks,
> Padma
>
> On Sep 1, 2017, at 6:54 AM, Damien Profeta <damien.prof...@amadeus.com> wrote:
>
> Hello,
>
> I have a dataset that I always query on 2 columns that don't have high cardinality. To benefit from pruning, I tried to partition the files on these keys, but I ended up with 50k different small files (30 MB each), and queries on them spend most of their time in the planning phase: decoding the metadata file, resolving the absolute paths, etc.
>
> By looking at the parquet file structure, I saw that there are statistics at both the page level and the chunk level. So I tried to generate parquet files where a page is dedicated to one value of the 2 partition columns.
> By using the statistics, Drill could drop the page/chunk. But it seems Drill makes no use of the statistics in the parquet file, because whatever query I run, I see no change in the number of pages loaded.
>
> Do you confirm my conclusion? What would be the best way to organize the data so that Drill doesn't read the data that can easily be pruned?
>
> Thanks
> Damien
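Putting the suggestions from this thread together, a rough sketch of the sequence (workspace, table, and column names are hypothetical):

```sql
-- 1) Have writers hash-distribute rows on the partition key, so each
--    distinct value lands in one fragment; file count drops from C*N
--    toward C (at the price of extra network/CPU cost).
ALTER SESSION SET `store.partition.hash_distribute` = true;

-- 2) Re-create the partitioned table with this setting in effect.
CREATE TABLE dfs.tmp.`events_by_key`
PARTITION BY (col1, col2)
AS SELECT * FROM dfs.`/data/events`;

-- 3) Build the metadata cache so planning does not have to open every
--    parquet footer on each query.
REFRESH TABLE METADATA dfs.tmp.`events_by_key`;
```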