Thanks Andries, I experimented with the ORDER BY and it works as you mentioned.
I will do some reading and experimentation with
store.partition.hash_distribute.

Francois

On Mon, Oct 31, 2016 at 4:24 PM, Andries Engelbrecht <[email protected]> wrote:

> You can try setting store.partition.hash_distribute to true, but it is
> still listed as an alpha feature.
>
> You can also add a sort operation (ORDER BY) to the CTAS statement to
> force a single data stream at output. I believe this was discussed a while
> back on the user list.
>
> Ideally you want to look at the data set size and how much parallelism
> would work best in your environment for reading the data later.
>
> --Andries
>
>
> > On Oct 31, 2016, at 12:57 PM, François Méthot <[email protected]> wrote:
> >
> > Hi,
> >
> > Is there a way to limit the number of files produced by a CTAS query?
> > I would like the speed benefits of having hundreds of scanner fragments
> > but don't want to deal with hundreds of output files.
> >
> > Our use case right now is using 880 threads to scan and produce a report
> > output spread over... 880 Parquet files.
> > Each resulting file is ~7M.
> >
> > The only way I found to reduce those files to a smaller set is to perform
> > a second CTAS query on the aggregated files with
> > planner.width.max_per_query set to a smaller number.
> >
> > Is there any possible way to do this in one query?
> >
> > Thanks
> > Francois
