Hi Jason,

On 07/23/15 18:53, Jason Altekruse wrote:
> I could be wrong, but I believe that gzip is not a compression format that
> can be split; you must read and decompress the file from start to end. In
> this case we cannot parallelize the read. This Stack Overflow article
> mentions bzip2 as an alternative compression used by Hadoop to solve this
> problem and allow parallel reads. I do not know whether this is supported
> in Drill; I do not think we have tested it.
> 
> http://stackoverflow.com/questions/14225751/random-access-to-gzipped-files
That makes sense, thank you very much.
I unzipped the file and now both drillbits are clearly busy (I can hear
them working).

I have to add: performance is now way beyond what I expected from my
testing hardware! This is awesome and a lot of fun, thank you very much,
Drill team.

I will try whether I get these parallel reads with bzip2 instead and will
report back in this thread what happened and whether it works with bzip2
in an HDFS environment.
(Without compression, what I am seeing here is clearly good.)
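For reference, here is a minimal sketch of what I plan to try, assuming I
recompress the data as mon_ew_xt_uni_bus_11.csv.bz2 (that file name is just
my assumption; whether Drill actually splits the bzip2 read across several
fragments is exactly what I want to find out):

  -- same query as before, hypothetically pointed at a bzip2-compressed copy
  SELECT columns[4] stichtag, columns[10] geschlecht,
         count(columns[0]) anzahl
  FROM dfs.`/mon_ew_xt_uni_bus_11.csv.bz2`
  WHERE columns[23] = 1 AND columns[5] = 'AA'
  GROUP BY columns[4], columns[10];

I would then compare the number of minor fragments of the scan in the query
profile against the gzip run.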


> 
> If you want to compress your data, we would recommend converting it to a
> more suitable format, such as Parquet. Parquet applies compression inside
> the file, which lets the metadata stay uncompressed and gives direct
> pointers to the separate portions of the file that can be read independently.
For now Parquet is not an option for me, but I will give it a shot later
on. Given the amount of (unzipped) data I am dealing with in my tests,
some form of compression is definitely something to think about for a
"real-life scenario".
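If I do try Parquet later, I assume the conversion could be a simple CTAS
along these lines (the target table name and the column aliases are only
placeholders I made up; dfs.tmp is the writable tmp workspace from my
storage configuration):

  -- make CTAS write Parquet files (this should be the default store.format anyway)
  ALTER SESSION SET `store.format` = 'parquet';

  -- convert the uncompressed CSV into a Parquet table in the writable workspace
  CREATE TABLE dfs.tmp.`mon_ew_xt_uni_bus_11_parquet` AS
  SELECT columns[0]  AS id,
         columns[4]  AS stichtag,
         columns[5]  AS region,
         columns[10] AS geschlecht,
         columns[23] AS flag
  FROM dfs.`/mon_ew_xt_uni_bus_11.csv`;

Queries would then go against dfs.tmp.`mon_ew_xt_uni_bus_11_parquet` and
could use the named columns directly instead of the columns[] array.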

> 
> On Thu, Jul 23, 2015 at 9:44 AM, Juergen Kneissl <her...@gmx.net> wrote:
> 
>> Yes of course:
>>
>> I have added the SQL and the output of EXPLAIN PLAN FOR:
>>
>> -----------------------------------------
>>
>> jdbc:drill:schema=dfs> explain plan for SELECT columns[4] stichtag,
>> columns[10] geschlecht, count(columns[0]) anzahl  FROM
>> dfs.`/mon_ew_xt_uni_bus_11.csv.gz` where 1 = 1 and columns[23] = 1  and
>> columns[5] = 'AA' group by columns[4], columns[10];
>>
>> +------+------+
>> | text | json |
>> +------+------+
>> | 00-00    Screen
>> 00-01      Project(stichtag=[$0], geschlecht=[$1], anzahl=[$2])
>> 00-02        UnionExchange
>> 01-01          HashAgg(group=[{0, 1}], anzahl=[$SUM0($2)])
>> 01-02            Project(stichtag=[$0], geschlecht=[$1], anzahl=[$2])
>> 01-03              HashToRandomExchange(dist0=[[$0]], dist1=[[$1]])
>> 02-01                UnorderedMuxExchange
>> 03-01                  Project(stichtag=[$0], geschlecht=[$1],
>> anzahl=[$2], E_X_P_R_H_A_S_H_F_I_E_L_D=[castInt(hash64AsDouble($1,
>> hash64AsDouble($0)))])
>> 03-02                    HashAgg(group=[{0, 1}], anzahl=[COUNT($2)])
>> 03-03                      Project(stichtag=[$0], geschlecht=[$1],
>> $f2=[$2])
>> 03-04                        SelectionVectorRemover
>> 03-05                          Filter(condition=[AND(true, =($3, 1),
>> =($4, 'AA'))])
>> 03-06                            Project(ITEM=[ITEM($0, 4)],
>> ITEM1=[ITEM($0, 10)], ITEM2=[ITEM($0, 0)], ITEM3=[ITEM($0, 23)],
>> ITEM4=[ITEM($0, 5)])
>> 03-07                              Scan(groupscan=[EasyGroupScan
>> [selectionRoot=hdfs://master:8020/mon_ew_xt_uni_bus_11.csv.gz,
>> numFiles=1, columns=[`columns`[4], `columns`[10], `columns`[0],
>> `columns`[23], `columns`[5]],
>> files=[hdfs://master:8020/mon_ew_xt_uni_bus_11.csv.gz]]])
>>  | {
>>   "head" : {
>>     "version" : 1,
>>     "generator" : {
>>       "type" : "ExplainHandler",
>>       "info" : ""
>>     },
>>     "type" : "APACHE_DRILL_PHYSICAL",
>>     "options" : [ {
>>       "name" : "planner.slice_target",
>>       "kind" : "LONG",
>>       "type" : "SESSION",
>>       "num_val" : 10000
>>     } ],
>>     "queue" : 0,
>>     "resultMode" : "EXEC"
>>   },
>>   "graph" : [ {
>>     "pop" : "fs-scan",
>>     "@id" : 196615,
>>     "userName" : "drill",
>>     "files" : [ "hdfs://master:8020/mon_ew_xt_uni_bus_11.csv.gz" ],
>>     "storage" : {
>>       "type" : "file",
>>       "enabled" : true,
>>       "connection" : "hdfs://master:8020",
>>       "workspaces" : {
>>         "root" : {
>>           "location" : "/srv/hadoop/dfs/data",
>>           "writable" : true,
>>           "defaultInputFormat" : null
>>         },
>>         "tmp" : {
>>           "location" : "/tmp",
>>           "writable" : true,
>>           "defaultInputFormat" : null
>>         }
>>       },
>>       "formats" : {
>>         "psv" : {
>>           "type" : "text",
>>           "extensions" : [ "tbl" ],
>>           "delimiter" : "|"
>>         },
>>         "csv" : {
>>           "type" : "text",
>>           "extensions" : [ "csv" ],
>>           "delimiter" : ";"
>>         },
>>         "tsv" : {
>>           "type" : "text",
>>           "extensions" : [ "tsv" ],
>>           "delimiter" : "\t"
>>         },
>>         "parquet" : {
>>           "type" : "parquet"
>>         },
>>         "json" : {
>>           "type" : "json"
>>         },
>>         "avro" : {
>>           "type" : "avro"
>>         }
>>       }
>>     },
>>     "format" : {
>>       "type" : "text",
>>       "extensions" : [ "csv" ],
>>       "delimiter" : ";"
>>     },
>>     "columns" : [ "`columns`[4]", "`columns`[10]", "`columns`[0]",
>> "`columns`[23]", "`columns`[5]" ],
>>     "selectionRoot" : "hdfs://master:8020/mon_ew_xt_uni_bus_11.csv.gz",
>>     "cost" : 1.4826017E7
>>   }, {
>>     "pop" : "project",
>>     "@id" : 196614,
>>     "exprs" : [ {
>>       "ref" : "`ITEM`",
>>       "expr" : "`columns`[4]"
>>     }, {
>>       "ref" : "`ITEM1`",
>>       "expr" : "`columns`[10]"
>>     }, {
>>       "ref" : "`ITEM2`",
>>       "expr" : "`columns`[0]"
>>     }, {
>>       "ref" : "`ITEM3`",
>>       "expr" : "`columns`[23]"
>>     }, {
>>       "ref" : "`ITEM4`",
>>       "expr" : "`columns`[5]"
>>     } ],
>>     "child" : 196615,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 1.4826017E7
>>   }, {
>>     "pop" : "filter",
>>     "@id" : 196613,
>>     "child" : 196614,
>>     "expr" : "booleanAnd(true, equal(`ITEM3`, 1) , equal(`ITEM4`, 'AA')
>> ) ",
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 333585.3825
>>   }, {
>>     "pop" : "selection-vector-remover",
>>     "@id" : 196612,
>>     "child" : 196613,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 333585.3825
>>   }, {
>>     "pop" : "project",
>>     "@id" : 196611,
>>     "exprs" : [ {
>>       "ref" : "`stichtag`",
>>       "expr" : "`ITEM`"
>>     }, {
>>       "ref" : "`geschlecht`",
>>       "expr" : "`ITEM1`"
>>     }, {
>>       "ref" : "`$f2`",
>>       "expr" : "`ITEM2`"
>>     } ],
>>     "child" : 196612,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 333585.3825
>>   }, {
>>     "pop" : "hash-aggregate",
>>     "@id" : 196610,
>>     "child" : 196611,
>>     "cardinality" : 1.0,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "groupByExprs" : [ {
>>       "ref" : "`stichtag`",
>>       "expr" : "`stichtag`"
>>     }, {
>>       "ref" : "`geschlecht`",
>>       "expr" : "`geschlecht`"
>>     } ],
>>     "aggrExprs" : [ {
>>       "ref" : "`anzahl`",
>>       "expr" : "count(`$f2`) "
>>     } ],
>>     "cost" : 250189.036875
>>   }, {
>>     "pop" : "project",
>>     "@id" : 196609,
>>     "exprs" : [ {
>>       "ref" : "`stichtag`",
>>       "expr" : "`stichtag`"
>>     }, {
>>       "ref" : "`geschlecht`",
>>       "expr" : "`geschlecht`"
>>     }, {
>>       "ref" : "`anzahl`",
>>       "expr" : "`anzahl`"
>>     }, {
>>       "ref" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>>       "expr" : "castint(hash64asdouble(`geschlecht`,
>> hash64asdouble(`stichtag`) ) ) "
>>     } ],
>>     "child" : 196610,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 33358.53825
>>   }, {
>>     "pop" : "unordered-mux-exchange",
>>     "@id" : 131073,
>>     "child" : 196609,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 33358.53825
>>   }, {
>>     "pop" : "hash-to-random-exchange",
>>     "@id" : 65539,
>>     "child" : 131073,
>>     "expr" : "`E_X_P_R_H_A_S_H_F_I_E_L_D`",
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 33358.53825
>>   }, {
>>     "pop" : "project",
>>     "@id" : 65538,
>>     "exprs" : [ {
>>       "ref" : "`stichtag`",
>>       "expr" : "`stichtag`"
>>     }, {
>>       "ref" : "`geschlecht`",
>>       "expr" : "`geschlecht`"
>>     }, {
>>       "ref" : "`anzahl`",
>>       "expr" : "`anzahl`"
>>     } ],
>>     "child" : 65539,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 33358.53825
>>   }, {
>>     "pop" : "hash-aggregate",
>>     "@id" : 65537,
>>     "child" : 65538,
>>     "cardinality" : 1.0,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "groupByExprs" : [ {
>>       "ref" : "`stichtag`",
>>       "expr" : "`stichtag`"
>>     }, {
>>       "ref" : "`geschlecht`",
>>       "expr" : "`geschlecht`"
>>     } ],
>>     "aggrExprs" : [ {
>>       "ref" : "`anzahl`",
>>       "expr" : "$sum0(`anzahl`) "
>>     } ],
>>     "cost" : 25018.9036875
>>   }, {
>>     "pop" : "union-exchange",
>>     "@id" : 2,
>>     "child" : 65537,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 3335.8538249999997
>>   }, {
>>     "pop" : "project",
>>     "@id" : 1,
>>     "exprs" : [ {
>>       "ref" : "`stichtag`",
>>       "expr" : "`stichtag`"
>>     }, {
>>       "ref" : "`geschlecht`",
>>       "expr" : "`geschlecht`"
>>     }, {
>>       "ref" : "`anzahl`",
>>       "expr" : "`anzahl`"
>>     } ],
>>     "child" : 2,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 3335.8538249999997
>>   }, {
>>     "pop" : "screen",
>>     "@id" : 0,
>>     "child" : 1,
>>     "initialAllocation" : 1000000,
>>     "maxAllocation" : 10000000000,
>>     "cost" : 3335.8538249999997
>>   } ]
>> } |
>> +------+------+
>> 1 row selected (0,764 seconds)
>>
>>
>> On 07/23/15 18:10, Abdel Hakim Deneche wrote:
>>> Hi Juergen,
>>>
>>> can you share the query you tried to run ?
>>>
>>> Thanks
>>>
>>> On Thu, Jul 23, 2015 at 9:10 AM, Juergen Kneissl <her...@gmx.net> wrote:
>>>
>>>> Hi everybody,
>>>>
>>>> I installed and configured a small cluster of two GNU/Linux machines
>>>> with the following setup:
>>>>
>>>> ZooKeeper 3.4.6, Drill 1.1.0, and Hadoop 2.7.1 with HDFS as the
>>>> distributed filesystem.
>>>>
>>>> So I am playing around a bit, but what I still do not understand is
>>>> why my Drill Foreman, bit1 (or whichever drillbit takes that role), is
>>>> not "really" parallelizing my request. (Or am I expecting something
>>>> from the architecture that is not intended?)
>>>>
>>>>
>>>> I select and aggregate over a 1.4 GB gzipped CSV file, and I thought
>>>> at least part of the query would be processed on the other drillbit
>>>> (bit2).
>>>>
>>>> For instance, in the profiles I see that Major Fragment 01 was divided
>>>> into four Minor Fragments (of which two were forwarded to bit 2)
>>>>
>>>> If I check the drillbit.log file of bit2 (in the above configuration),
>>>> a debug message tells me that the incoming record count is 0.
>>>>
>>>> The question is: what am I doing wrong in my configuration? Does it
>>>> have something to do with using a CSV file?
>>>>
>>>> The query is also written so that the whole file clearly has to be
>>>> read into memory. That does not concern me much; for now I just wanted
>>>> to see how the Foreman handles the parallelization.
>>>>
>>>> Best Regards & Thanks for any hint
>>>>
>>>>
>>>> Juergen
>>>>
>>>
>>>
>>>
>>
> 
