I'm running some experimental queries against both plain CSV and
gzipped CSV (same data, same file count, etc.).

I'm doing a simple:

> select count(columns[0]) from dfs.workspace.`/csv`

and

> select count(columns[0]) from dfs.workspace.`/gz`

Here are my results:

70 files, plain CSV, 5GB on disk: *4.8s*

70 files, gzipped CSV, 1.7GB on disk (5GB uncompressed): *30.4s*


Looking at the profiles, it appears that most of the time is spent in
the TEXT_SUB_SCAN operator. Both queries spawn the same number of
minor fragments for this phase (68), but the average process_time for
those fragments is ~24s for the gzipped data (with little deviation
across fragments), versus ~700ms for the plain CSV data.
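
Back-of-the-envelope: each fragment covers roughly 5GB / 68 ≈ 75MB
uncompressed (~25MB gzipped), so 24s per fragment works out to only
about 3MB/s of uncompressed output. To check whether single-threaded
gzip decompression alone could account for that, I put together a
rough micro-benchmark (just a sketch; it streams one of the .gz files
through java.util.zip.GZIPInputStream, which I'm assuming is at least
comparable to the JVM-side codec path the scan uses, and reports
throughput):

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class GzipScanBench {
    public static void main(String[] args) throws Exception {
        // Path to one of the gzipped CSV files, passed on the command line.
        String path = args[0];
        byte[] buf = new byte[64 * 1024];
        long totalBytes = 0;
        long start = System.nanoTime();
        // Stream the file through a single-threaded GZIP decoder,
        // counting uncompressed bytes as they come out.
        try (InputStream in = new GZIPInputStream(new FileInputStream(path), buf.length)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                totalBytes += n;
            }
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d bytes uncompressed in %.2fs (%.1f MB/s)%n",
                totalBytes, secs, totalBytes / secs / (1024 * 1024));
    }
}

Running that against one of the ~25MB .gz files gives a per-file
decompression baseline to compare against the 24s process_time.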

Is this expected?

-- 
 Andy Pernsteiner
 Manager, Field Enablement
ph: 206.228.0737

www.mapr.com
