The issue you are likely hitting is not that you are CPU bound, but that the read is under-parallelized. Gzip-compressed files are not splittable in HDFS, so each file will be read in full on a single thread.
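To make the splittability point concrete, here is a rough Python sketch (my own illustration, not how Drill or Hadoop actually implement input splits). The sample rows and the helper names `read_plain_split` and `can_decompress_from` are made up for the example. It shows that a plain text file can be cut into byte ranges that are each read independently, while a gzip stream can only be decoded from its first byte.

```python
import gzip
import zlib

# Hypothetical sample data standing in for one CSV file.
rows = [f"{i},value{i}" for i in range(10000)]
plain = ("\n".join(rows) + "\n").encode("utf-8")
packed = gzip.compress(plain)


def read_plain_split(data, start, end):
    """Return the lines that begin inside the byte range [start, end).

    A reader can do this without looking at earlier bytes: it skips
    forward to the first line boundary, then reads whole lines, finishing
    the last one even if it runs past `end`.
    """
    if start == 0:
        pos = 0
    else:
        # A line starts right after a newline, so look for the first
        # newline at or after start - 1.
        nl = data.find(b"\n", start - 1)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines


def can_decompress_from(data, offset):
    """Try to decode a gzip stream starting at an arbitrary byte offset."""
    try:
        # 16 + MAX_WBITS tells zlib to expect a gzip header.
        zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(data[offset:])
        return True
    except zlib.error:
        return False


if __name__ == "__main__":
    n = len(plain)
    bounds = [0, n // 4, n // 2, 3 * n // 4, n]
    for s, e in zip(bounds, bounds[1:]):
        print(f"plain split [{s}, {e}): {len(read_plain_split(plain, s, e))} lines")
    print("gzip from offset 0:", can_decompress_from(packed, 0))
    print("gzip from the middle:", can_decompress_from(packed, len(packed) // 2))
```

The four plain-text ranges could each go to a separate reader, whereas the gzip data has to be processed start to finish by one reader.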
Plain text files, as well as files compressed with a splittable compression codec, will be read in parallel. Here is a presentation with some helpful information (I haven't read all of it, but the table on slide 7 gives a nice overview of the features of each codec):

http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2

I am skeptical of their assertion that only bzip2 is splittable. This page from the Cloudera docs claims that only gzip is not splittable. You might have to try out a few and see what results you get.

http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/admin_data_compression_performance.html

On Wed, Oct 7, 2015 at 1:51 PM, Andy Pernsteiner <[email protected]> wrote:

> Ya that makes sense. I’ll check the system next time I run this to see
> how much CPU the drill bits wind up taking. For now I’ll just accept the
> penalty :)
>
> Andy Pernsteiner
> Manager, Field Enablement
> ph: 206.228.0737
>
> www.mapr.com
> Now Available - Free Hadoop On-Demand Training
>
> From: Alexander Reshetov <[email protected]>
> Reply: [email protected] <[email protected]>
> Date: October 7, 2015 at 4:37:29 PM
> To: [email protected] <[email protected]>
> Subject: Re: Drill + gzipped-CSV performance
>
> Hi Andy,
>
> I think that in your specific setup CPU becomes the bottleneck, which
> leads to slower query time. You can try query on other system with
> faster CPU. And/or try lower compression ratio.
>
> On Wed, Oct 7, 2015 at 9:15 PM, Andy Pernsteiner
> <[email protected]> wrote:
> > In thinking this through, it probably is somewhat expected to see a
> > slowdown when having to decompress data (esp gzip) as part of running a
> > Drill query.
> >
> > From: Andy Pernsteiner <[email protected]>
> > Reply: Andy Pernsteiner <[email protected]>
> > Date: October 7, 2015 at 11:27:47 AM
> > To: [email protected] <[email protected]>
> > Subject: Drill + gzipped-CSV performance
> >
> > I'm running some experimental queries, both against CSV, and against
> > Gzipped-CSV (same data, same file-count, etc).
> >
> > I'm doing a simple:
> >
> >> select count(columns[0]) from dfs.workspace.`/csv`
> >
> > and
> >
> >> select count(columns[0]) from dfs.workspace.`/gz`
> >
> > Here are my results:
> >
> > 70-files, plain-CSV, 5GB on disk: 4.8s
> >
> > 70-files, gzipped-CSV, 1.7GB on disk (5GB uncompressed): 30.4s
> >
> > When looking at profiles, it would appear that most of the time is spent
> > on the TEXT_SUB_SCAN operation. Both queries spawn the same # of
> > minor-fragments for this phase (68), but the process_time for those minor
> > fragments is an average of 24s for the GZ data (most of the fragments are
> > pretty close to each other in terms of deviation), and 700ms average for
> > the plain CSV data.
> >
> > Is this expected?
