Re: Drill + gzipped-CSV performance

Jason Altekruse Wed, 07 Oct 2015 14:04:21 -0700

The issue you are likely hitting is not that the query is CPU bound, but
that it is under-parallelized. Gzip-compressed files are not splittable in
HDFS, so we will be reading each whole file on a single thread.


Plain text files, as well as files compressed with a splittable
compression codec, will be read in parallel.
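
To illustrate why a gzip file ends up on one thread, here is a minimal
Python sketch (not Drill code; the sample data and the helper names
`can_decode_from` and `first_full_row_after` are made up for the
illustration). It shows that a gzip stream can only be decoded from its
first byte, while a reader dropped into the middle of plain text can resync
at the next newline.

```python
import gzip
import zlib

# Hypothetical sample data: 10,000 small CSV rows.
ROWS = b"".join(b"%d,value-%d\n" % (i, i) for i in range(10_000))
COMPRESSED = gzip.compress(ROWS)


def can_decode_from(offset: int) -> bool:
    """Return True if gzip decoding succeeds when started at `offset`."""
    try:
        # 16 + MAX_WBITS tells zlib to expect gzip framing (header + deflate).
        zlib.decompress(COMPRESSED[offset:], 16 + zlib.MAX_WBITS)
        return True
    except zlib.error:
        return False


def first_full_row_after(offset: int) -> bytes:
    """Skip the (possibly partial) row at `offset`; return the next complete row."""
    return ROWS[offset:].split(b"\n", 2)[1]


if __name__ == "__main__":
    print("gzip from start: ", can_decode_from(0))
    print("gzip from middle:", can_decode_from(len(COMPRESSED) // 2))
    print("text from middle:", first_full_row_after(len(ROWS) // 2))
```

A splittable codec works because its format has block boundaries a reader
can locate mid-file, which plays the same role as the newline does for
plain text here.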

Here is a presentation with some helpful information (I haven't read all
of it, but the table on slide 7 gives a nice overview of the features of
each codec).

http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2

I am skeptical of their assertion that only bzip2 is splittable. This page
from the Cloudera docs claims that only gzip is not splittable. You might
have to try out a few codecs and see what results you get.

http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/admin_data_compression_performance.html
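
If you do try other codecs, one way to produce test copies of the data is
to re-compress the existing gzip files. A rough Python sketch using only the
standard library follows; the function name `gzip_to_bzip2` and the sample
file names are hypothetical, and whether Drill actually splits the resulting
bzip2 files is the thing to measure, not something this sketch guarantees.

```python
import bz2
import gzip
import os
import shutil
import tempfile


def gzip_to_bzip2(src_path: str, dst_path: str) -> None:
    """Stream-decompress a .gz file and re-compress it as .bz2."""
    with gzip.open(src_path, "rb") as src, bz2.open(dst_path, "wb") as dst:
        # Copy in chunks so the whole file never has to fit in memory.
        shutil.copyfileobj(src, dst)


if __name__ == "__main__":
    # Self-contained demo on a throwaway file; names are for illustration only.
    with tempfile.TemporaryDirectory() as tmp:
        gz_path = os.path.join(tmp, "sample.csv.gz")
        bz_path = os.path.join(tmp, "sample.csv.bz2")
        with gzip.open(gz_path, "wb") as f:
            f.write(b"a,b\n1,2\n")
        gzip_to_bzip2(gz_path, bz_path)
        with bz2.open(bz_path, "rb") as f:
            print(f.read())
```

Pointing the same `select count(columns[0])` query at a directory of the
converted files should then show whether the scan time changes.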

On Wed, Oct 7, 2015 at 1:51 PM, Andy Pernsteiner <[email protected]>
wrote:

> Ya that makes sense.  I’ll check the system next time I run this to see
> how much CPU the drill bits wind up taking.  For now I’ll just accept the
> penalty :)
>
>
>
>  Andy Pernsteiner
>  Manager, Field Enablement
> ph: 206.228.0737
>
> www.mapr.com
> Now Available - Free Hadoop On-Demand Training
>
>
>
> From: Alexander Reshetov <[email protected]>
> Reply: [email protected] <[email protected]>>
> Date: October 7, 2015 at 4:37:29 PM
> To: [email protected] <[email protected]>>
> Subject:  Re: Drill + gzipped-CSV performance
>
> Hi Andy,
>
> I think that in your specific setup the CPU becomes the bottleneck, which
> leads to slower query times. You can try the query on another system with
> a faster CPU, and/or try a lower compression ratio.
>
> On Wed, Oct 7, 2015 at 9:15 PM, Andy Pernsteiner
> <[email protected]> wrote:
> > In thinking this through, it probably is somewhat expected to see a
> > slowdown when having to decompress data (esp gzip) as part of running a
> > Drill query.
> >
> >
> >
> >
> >
> >
> > From: Andy Pernsteiner <[email protected]>
> > Reply: Andy Pernsteiner <[email protected]>>
> > Date: October 7, 2015 at 11:27:47 AM
> > To: [email protected] <[email protected]>>
> > Subject: Drill + gzipped-CSV performance
> >
> > I'm running some experimental queries, both against CSV, and against
> > Gzipped-CSV (same data, same file-count, etc).
> >
> > I'm doing a simple :
> >
> >> select count(columns[0]) from dfs.workspace.`/csv`
> >
> > and
> >
> >> select count(columns[0]) from dfs.workspace.`/gz`
> >
> > Here are my results:
> >
> > 70-files, plain-CSV, 5GB on disk: 4.8s
> >
> > 70-files, gzipped-CSV, 1.7GB on disk (5GB uncompressed): 30.4s
> >
> >
> > When looking at profiles, it would appear that most of the time is spent
> > on the TEXT_SUB_SCAN operation. Both queries spawn the same # of
> > minor-fragments for this phase (68), but the process_time for those minor
> > fragments is an average of 24s for the GZ data (most of the fragments are
> > pretty close to each other in terms of deviation), and 700ms average for
> > the plain CSV data.
> >
> > Is this expected?
> >
> >
> >
>
