See inline.
On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <[email protected]> wrote:

> ...
> @Ted,
>
> > If you have new lines in your files then the files becomes unsuitable for
> > splitting. This means that the only parallelism available in a ctas
> > statement is multiple files
>
> Does it means newlines are incompatible with drill's distributed calculus
> ?
>

What it means is that the entire CSV file has to be read by a single
thread. If you don't mind waiting while that happens, you would get the
same result, just more slowly, without parallelism.

> > Do you have a fair number of files?
>
> I have one 30GB csv file. I don't know how many parquet file it could
> create as process crashes because of newlines.
> I can imagine approx 5 parquet files 500 MB.
>

That is reasonably small. But as Jacques says later in the thread, this is
future work.
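
To illustrate why embedded newlines rule out splitting, here is a small Python sketch. It is not Drill code, and the sample data is made up; it only shows that cutting a CSV at raw newlines gives a different answer than parsing it sequentially from the start with quote tracking, which is why a reader cannot safely begin in the middle of such a file.

```python
import csv
import io

# Hypothetical sample: the second record has a newline inside a quoted field.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Naive approach: treat every raw newline as a record boundary. This is
# roughly what you would need in order to hand arbitrary chunks of the
# file to separate threads without parsing from the beginning.
naive_rows = data.splitlines()

# Sequential approach: parse from the start, tracking quote state, so a
# newline inside quotes is kept as part of the field.
parsed_rows = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_rows))   # number of raw line fragments
print(len(parsed_rows))  # number of actual CSV records
```

The naive split sees one more "row" than the parser does, because it breaks the quoted field in two. A reader that starts mid-file has no way to know whether it is inside a quoted field, so the whole file ends up being read by one thread.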
