See inline.


On Mon, Feb 1, 2016 at 7:53 AM, Nicolas Paris <nipari...@gmail.com> wrote:

> ...
> @Ted,
>
> > If you have new lines in your files then the files become unsuitable for
> > splitting.  This means that the only parallelism available in a ctas
> > statement is multiple files
>
> Does it mean newlines are incompatible with Drill's distributed computation?
>

What it means is that the entire CSV file has to be read by a single
thread.  If you don't mind waiting while that happens, you will get the same
result, just more slowly because there is no parallelism.
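
To illustrate why, here is a minimal Python sketch. It is not Drill code;
the sample data and the function names are made up for illustration. A
reader that cuts the input at raw newlines, as a splitter starting in the
middle of a file would have to, sees a different number of records than a
parser that starts at the beginning and tracks quote state.

```python
import csv
import io

# Hypothetical sample: record 2 has a quoted field with an embedded newline.
data = 'id,comment\n1,"plain"\n2,"line one\nline two"\n3,"last"\n'


def naive_line_split(text):
    """Cut at every raw newline, ignoring whether it sits inside quotes."""
    return text.splitlines()


def parse_sequentially(text):
    """Parse from the start so the quote state is always known."""
    return list(csv.reader(io.StringIO(text)))


naive_rows = naive_line_split(data)
real_rows = parse_sequentially(data)

print(len(naive_rows), "pieces when cutting at raw newlines")
print(len(real_rows), "records when parsing from the start")
```

The two counts differ because the naive cut breaks record 2 in half. Only
the sequential parse knows that the newline was inside a quoted field,
which is why one thread ends up reading the whole file.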


> Do you have a fair number of files?
> I have one 30GB csv file. I don't know how many parquet files it would
> create, as the process crashes because of the newlines.
> I would guess approx. 5 parquet files of 500 MB.
>

That is reasonably small.

But as Jacques says later in the thread, this is future work.
