Re: Limit the number of output parquet files in CTAS

François Méthot Tue, 01 Nov 2016 17:24:37 -0700

Thanks Andries,

I experimented with the ORDER BY and it works as you mentioned.
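
For reference, the kind of statement I mean looks roughly like the sketch below. The workspace, path, table and column names are placeholders rather than our real ones, and the output is only collapsed because the final sort funnels the rows into a single stream, as you described.

```sql
-- Sketch only: dfs.tmp, report_sorted, /data/report_input and the
-- column names are hypothetical placeholders.
-- The ORDER BY forces one output stream, so the CTAS should write far
-- fewer files than one per scanner fragment.
CREATE TABLE dfs.tmp.`report_sorted` AS
SELECT report_day, item_id, amount
FROM dfs.`/data/report_input`
ORDER BY report_day;
```

The scan can still run at full width; only the write at the end is serialized, so the sort column is best chosen to suit later reads.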


I will do some reading and experimentation with
store.partition.hash_distribute.
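
Presumably that would look something like the sketch below. I have not run it yet, all table, path and column names are placeholders, and I am assuming the option only matters for a CTAS that uses PARTITION BY.

```sql
-- Sketch only, untested: report_by_day, /data/report_input and the
-- column names are hypothetical placeholders.
-- The option is still listed as alpha, so set it per session rather
-- than system-wide.
ALTER SESSION SET `store.partition.hash_distribute` = true;

-- Assumption: rows are hash-distributed on the partition key before
-- the write, so each key lands in fewer files.
CREATE TABLE dfs.tmp.`report_by_day`
PARTITION BY (report_day) AS
SELECT report_day, item_id, amount
FROM dfs.`/data/report_input`;
```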

Francois




On Mon, Oct 31, 2016 at 4:24 PM, Andries Engelbrecht <
[email protected]> wrote:

> You can try and set store.partition.hash_distribute to true, but it is
> still listed as an alpha feature.
>
> You can also add a sort operation (order by) to the CTAS statement to
> force a single data stream at output. I believe this was discussed a while
> back on the user list.
>
> Ideally you want to look at the data set size and how much parallelism
> would work best in your environment for reading the data later.
>
> --Andries
>
>
> > On Oct 31, 2016, at 12:57 PM, François Méthot <[email protected]> wrote:
> >
> > Hi,
> >
> > Is there a way to limit the number of files produced by a CTAS query?
> > I would like the speed benefits of having hundreds of scanner fragments
> > but don't want to deal with hundreds of output files.
> >
> > Our use case right now uses 880 threads to scan and produce a report
> > output spread over... 880 Parquet files.
> > Each resulting file is ~7M.
> >
> > The only way I found to reduce those files to a smaller set is to perform
> > a second CTAS query on the aggregated files with
> > planner.width.max_per_query set to a smaller number.
> >
> > Any possible way to do this in one query?
> >
> > Thanks
> > Francois
>
>
