Re: Foreman Parallelizer not working with compressed csv file?

Ted Dunning Thu, 23 Jul 2015 14:25:28 -0700

On Thu, Jul 23, 2015 at 2:19 PM, Juergen Kneissl <her...@gmx.net> wrote:

> On 07/23/15 22:04, Jason Altekruse wrote:
> > I'm very glad to hear that it exceeded your expectations. An important
> > point I would like to add: when you unzipped the file, you likely allowed
> > Drill to read not only on both nodes, but also on multiple threads on
> > each node. When the file was compressed, only a single thread was reading
> > and processing it.
>
>
> Also, bzip2 does not work out of the box in Drill. Parallelization does
> not seem possible.
>
> So, when compression is needed, it seems that either Parquet is required
> or further tests are needed on how to calculate a query plan for a
> compressed file (if this is even possible at all).
>
> Anyway, thanks for the help. Using uncompressed CSV did the trick for my
> first problem.

Parquet would help a bit with compression.

Another alternative is to put uncompressed CSV on a file system that does
transparent compression.  The MapR distribution supports that, for
instance.  I am sure that there are others.  If you use such a file system,
Drill wouldn't know that the file is anything but ordinary CSV.

With Parquet, transparent compression should have much less impact.
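
For anyone wondering why a gzipped CSV ends up on a single reader, here is a
rough Python sketch of the general idea. It is not Drill code; the helper
names and the sample data are made up for illustration. A reader dropped into
the middle of a plain CSV can skip to the next newline and carry on, while a
reader dropped into the middle of a gzip stream finds no header to start
from.

```python
import gzip
import zlib


def build_csv(rows: int) -> bytes:
    """Build a small CSV payload (hypothetical sample data)."""
    return "".join(f"{i},value{i}\n" for i in range(rows)).encode()


def read_plain_from(data: bytes, offset: int) -> list:
    """Start at an arbitrary byte offset in plain CSV.

    Skip the partial line we landed in, then return the complete
    lines that follow, as a second reader of the same file could.
    """
    newline = data.find(b"\n", offset)
    return data[newline + 1:].splitlines()


def can_read_gzip_from(data: bytes, offset: int) -> bool:
    """Try to start gzip decoding at an arbitrary byte offset."""
    try:
        # wbits=31 tells zlib to expect a gzip header at the start.
        zlib.decompressobj(31).decompress(data[offset:])
        return True
    except zlib.error:
        return False


plain = build_csv(1000)
packed = gzip.compress(plain)

# A second reader can pick up the plain file from the midpoint.
tail = read_plain_from(plain, len(plain) // 2)
print(len(tail), "complete rows readable from the midpoint of the plain file")

# The gzip stream decodes from the start but not from the midpoint.
print("gzip from start:   ", can_read_gzip_from(packed, 0))
print("gzip from midpoint:", can_read_gzip_from(packed, len(packed) // 2))
```

Columnar formats such as Parquet avoid this by compressing in independent
chunks inside the file, so several readers can each take their own piece.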
