If the partitioning column has small cardinality yet you still end up with 50k small files, you likely have many parallel writer minor fragments (threads). By default, each writer minor fragment works independently, so with cardinality C and N writer minor fragments you can end up with up to C*N small files. For example, C = 50 distinct partition values written by N = 1,000 minor fragments can produce up to 50,000 files.
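For reference, the file explosion typically comes from a partitioned CTAS like the following sketch (workspace, table, and column names are hypothetical):

```sql
-- Each of the N writer fragments emits its own file per distinct
-- (col1, col2) value it sees, so up to C * N files can result.
CREATE TABLE dfs.tmp.`events_partitioned`
PARTITION BY (col1, col2)
AS SELECT col1, col2, payload
FROM dfs.`/data/events`;
```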
There are two possible solutions:

1) Consider setting the following option to true. This adds network communication/CPU cost, but it reduces the number of files to C:

alter session set `store.partition.hash_distribute` = true; -- default is false

2) Reduce the number of parallel writer minor fragments by tuning other parameters before you run the CTAS partition statement.

For partition pruning, Drill works at the row-group level, not at the page level.

On Fri, Sep 1, 2017 at 9:02 AM, Padma Penumarthy <ppenumar...@mapr.com> wrote:

> Have you tried building the metadata cache file using the "refresh table metadata" command? That will help reduce the planning time. Is most of the time spent in planning or in execution?
>
> Pruning is done at the row-group level, i.e. at the file level (we create one file per row group). We do not support pruning at the page level.
> I am thinking that if it created 50K files, your cardinality is high. You might want to consider putting a directory hierarchy in place; for example, you can create a directory for each unique value of column 1 and a file for each unique value of column 2 underneath. If partitioning is done correctly then, depending on the filters, we should not read more row groups than needed.
>
> Thanks,
> Padma
>
> On Sep 1, 2017, at 6:54 AM, Damien Profeta <damien.prof...@amadeus.com> wrote:
>
> Hello,
>
> I have a dataset that I always query on 2 columns that don't have high cardinality. To benefit from pruning, I tried to partition the files on these keys, but I ended up with 50k different small files (30 MB each), and queries on them spend most of their time in the planning phase: decoding the metadata file, resolving the absolute paths, etc.
>
> By looking at the parquet file structure, I saw that there are statistics at both the page level and the chunk level. So I tried to generate parquet files where a page is dedicated to one value of the 2 partition columns.
> By using the statistics, Drill could drop the page/chunk. But it seems Drill makes no use of the statistics in the parquet file, because whatever query I run, I see no change in the number of pages loaded.
>
> Do you confirm my conclusion? What would be the best way to organize the data so that Drill doesn't read the data that can easily be pruned?
>
> Thanks
> Damien
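Putting the suggestions from this thread together, a rough sketch of the sequence (workspace, table, and column names are hypothetical):

```sql
-- 1) Have writers hash-distribute rows on the partition key, so each
--    distinct value lands in one fragment; file count drops from C*N
--    toward C (at the price of extra network/CPU cost).
ALTER SESSION SET `store.partition.hash_distribute` = true;

-- 2) Re-create the partitioned table with this setting in effect.
CREATE TABLE dfs.tmp.`events_by_key`
PARTITION BY (col1, col2)
AS SELECT * FROM dfs.`/data/events`;

-- 3) Build the metadata cache so planning does not have to open every
--    parquet footer on each query.
REFRESH TABLE METADATA dfs.tmp.`events_by_key`;
```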