How to improve the performance of Pig joins

byambajargal Sun, 17 Apr 2011 06:03:42 -0700

Hello ...

I have a cluster with 11 nodes, each with 16 GB RAM, a 6-core CPU and a 1 TB HDD, and I use the Cloudera distribution CDH3B4 with Pig. I have two Pig join queries: a parallel version and a replicated version of the same join.

Theoretically, a replicated join should be faster than a parallel join, but in my case the parallel join is faster. I am wondering why the replicated join is so slow. I want to improve the performance of both queries. Could you check the details of the queries?


thanks

Byambajargal

ANNO = load '/datastorm/task3/obr_pm_annotation.txt' using PigStorage(',') AS (element_id:long, concept_id:long);
REL = load '/datastorm/task3/obs_relation.txt' using PigStorage(',') AS (id:long, concept_id:long, parent_concept_id:long);
ISA_ANNO = join ANNO by concept_id, REL by concept_id PARALLEL 10;
ISA_ANNO_T = GROUP ISA_ANNO ALL;
ISA_ANNO_C = foreach ISA_ANNO_T generate COUNT($1);
dump ISA_ANNO_C;

HadoopVersion  PigVersion    UserId  StartedAt            FinishedAt           Features
0.20.2-CDH3B4  0.8.0-CDH3B4  haisen  2011-04-15 10:31:36  2011-04-15 10:43:22  HASH_JOIN,GROUP_BY


Success!

Job Stats (time in seconds):

JobId                  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias                  Feature            Outputs
job_201103122121_0084  277   10       15          5           11          417            351            379            ANNO,ISA_ANNO,REL      HASH_JOIN
job_201103122121_0085  631   1        10          5           7           242            242            242            ISA_ANNO_C,ISA_ANNO_T  GROUP_BY,COMBINER  hdfs://haisen11:54310/tmp/temp281466632/tmp-171526868,


Input(s):
Successfully read 24153638 records from: "/datastorm/task3/obs_relation.txt"

Successfully read 442049697 records from: "/datastorm/task3/obr_pm_annotation.txt"


Output(s):

Successfully stored 1 records (14 bytes) in: "hdfs://haisen11:54310/tmp/temp281466632/tmp-171526868"


Counters:
Total records written : 1
Total bytes written : 14
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 41
Total records proactively spilled: 8781684

Job DAG:
job_201103122121_0084   ->      job_201103122121_0085,
job_201103122121_0085

2011-04-15 10:43:22,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2011-04-15 10:43:22,419 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2011-04-15 10:43:22,419 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(844872046)


Using the replicated version:

ANNO = load '/datastorm/task3/obr_pm_annotation.txt' using PigStorage(',') AS (element_id:long, concept_id:long);
REL = load '/datastorm/task3/obs_relation.txt' using PigStorage(',') AS (id:long, concept_id:long, parent_concept_id:long);
ISA_ANNO = join ANNO by concept_id, REL by concept_id USING 'replicated';
ISA_ANNO_T = GROUP ISA_ANNO ALL;
ISA_ANNO_C = foreach ISA_ANNO_T generate COUNT($1);
dump ISA_ANNO_C;

HadoopVersion  PigVersion    UserId  StartedAt            FinishedAt           Features
0.20.2-CDH3B4  0.8.0-CDH3B4  haisen  2011-04-15 10:57:37  2011-04-15 11:26:32  REPLICATED_JOIN,GROUP_BY


Success!

Job Stats (time in seconds):

JobId                  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias                                 Feature                            Outputs
job_201103122121_0088  11    0        11          5           9           0              0              0              REL                                   MAP_ONLY
job_201103122121_0089  266   1        151         101         123         1566           1566           1566           ANNO,ISA_ANNO,ISA_ANNO_C,ISA_ANNO_T  REPLICATED_JOIN,GROUP_BY,COMBINER  hdfs://haisen11:54310/tmp/temp-1729753626/tmp-61569771,


Input(s):

Successfully read 442049697 records (17809735666 bytes) from: "/datastorm/task3/obr_pm_annotation.txt"
Successfully read 24153638 records (691022731 bytes) from: "/datastorm/task3/obs_relation.txt"


Output(s):

Successfully stored 1 records (14 bytes) in: "hdfs://haisen11:54310/tmp/temp-1729753626/tmp-61569771"


Counters:
Total records written : 1
Total bytes written : 14
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201103122121_0088   ->      job_201103122121_0089,
job_201103122121_0089

2011-04-15 11:26:32,751 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2011-04-15 11:26:32,889 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2011-04-15 11:26:32,899 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(844872046)

ANNO = load '/datastorm/task3/obr_pm_annotation.txt' using PigStorage(',') AS (element_id:long, concept_id:long);
REL = load '/datastorm/task3/obs_relation.txt' using PigStorage(',') AS (id:long, concept_id:long, parent_concept_id:long);
ISA_ANNO = join ANNO by concept_id, REL by concept_id PARALLEL 10;
store ISA_ANNO into 'outputdel';

HadoopVersion  PigVersion    UserId  StartedAt            FinishedAt           Features
0.20.2-CDH3B4  0.8.0-CDH3B4  haisen  2011-04-15 16:08:52  2011-04-15 16:16:26  HASH_JOIN


Success!

Job Stats (time in seconds):

JobId                  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias              Feature    Outputs
job_201103122121_0090  277   10       15          6           11          432            353            394            ANNO,ISA_ANNO,REL  HASH_JOIN  hdfs://haisen11:54310/user/haisen/outputdel,


Input(s):
Successfully read 24153638 records from: "/datastorm/task3/obs_relation.txt"

Successfully read 442049697 records from: "/datastorm/task3/obr_pm_annotation.txt"


Output(s):

Successfully stored 844872046 records (34500196186 bytes) in: "hdfs://haisen11:54310/user/haisen/outputdel"


Counters:
Total records written : 844872046
Total bytes written : 34500196186
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 41
Total records proactively spilled: 8537764

Job DAG:
job_201103122121_0090

2011-04-15 16:16:26,320 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
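
For a sense of scale, the record and byte counts printed by the run directly above can be turned into a join fan-out and an average output row size. This is only a rough sketch in Python that redoes arithmetic on numbers copied from the log. The variable names are made up for illustration and are not part of Pig.

```python
# Figures copied from the "Input(s)", "Output(s)" and "Counters" sections
# of the run above. All names below are illustrative only.
ANNO_RECORDS = 442049697      # records read from obr_pm_annotation.txt
REL_RECORDS = 24153638        # records read from obs_relation.txt
JOINED_RECORDS = 844872046    # records stored by the join
JOINED_BYTES = 34500196186    # bytes stored by the join

# Average number of joined rows produced per ANNO input row.
rows_per_anno_record = JOINED_RECORDS / ANNO_RECORDS

# Average size in bytes of one stored output row.
bytes_per_joined_row = JOINED_BYTES / JOINED_RECORDS

if __name__ == "__main__":
    print(f"joined rows per ANNO row: {rows_per_anno_record:.2f}")
    print(f"bytes per joined row:     {bytes_per_joined_row:.1f}")
```

So the join emits a little under two output rows per ANNO row, at roughly 41 bytes per row.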

ANNO = load '/datastorm/task3/obr_pm_annotation.txt' using PigStorage(',') AS (element_id:long, concept_id:long);
REL = load '/datastorm/task3/obs_relation.txt' using PigStorage(',') AS (id:long, concept_id:long, parent_concept_id:long);
ISA_ANNO = join ANNO by concept_id, REL by concept_id USING 'replicated';
store ISA_ANNO into 'outputdel';

HadoopVersion  PigVersion    UserId  StartedAt            FinishedAt           Features
0.20.2-CDH3B4  0.8.0-CDH3B4  haisen  2011-04-15 16:32:20  2011-04-15 17:02:16  REPLICATED_JOIN


Success!

Job Stats (time in seconds):

JobId                  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias          Feature                   Outputs
job_201103122121_0093  11    0        10          5           9           0              0              0              REL            MAP_ONLY
job_201103122121_0094  266   0        156         96          128         0              0              0              ANNO,ISA_ANNO  REPLICATED_JOIN,MAP_ONLY  hdfs://haisen11:54310/user/haisen/outputdel1,


Input(s):

Successfully read 24153638 records (691022731 bytes) from: "/datastorm/task3/obs_relation.txt"
Successfully read 442049697 records (17809735666 bytes) from: "/datastorm/task3/obr_pm_annotation.txt"


Output(s):

Successfully stored 844872046 records (34500196186 bytes) in: "hdfs://haisen11:54310/user/haisen/outputdel1"


Counters:
Total records written : 844872046
Total bytes written : 34500196186
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201103122121_0093   ->      job_201103122121_0094,
job_201103122121_0094

2011-04-15 17:02:16,651 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
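
To make the comparison concrete, the StartedAt and FinishedAt values of the four runs above can be converted into elapsed wall-clock times. The following is a minimal sketch in Python. The timestamps are copied from the run summaries, while the run labels and helper names are invented here for illustration.

```python
from datetime import datetime

# (label, StartedAt, FinishedAt) copied from the four run summaries above.
# The labels are descriptive names chosen for this sketch.
RUNS = [
    ("hash join PARALLEL 10 + COUNT", "2011-04-15 10:31:36", "2011-04-15 10:43:22"),
    ("replicated join + COUNT", "2011-04-15 10:57:37", "2011-04-15 11:26:32"),
    ("hash join PARALLEL 10 + STORE", "2011-04-15 16:08:52", "2011-04-15 16:16:26"),
    ("replicated join + STORE", "2011-04-15 16:32:20", "2011-04-15 17:02:16"),
]

FMT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(started: str, finished: str) -> int:
    """Return the whole seconds between two timestamps in FMT."""
    delta = datetime.strptime(finished, FMT) - datetime.strptime(started, FMT)
    return int(delta.total_seconds())


# Map each run label to its elapsed wall-clock time in seconds.
ELAPSED = {label: elapsed_seconds(start, end) for label, start, end in RUNS}

if __name__ == "__main__":
    for label, secs in ELAPSED.items():
        print(f"{label}: {secs} s ({secs // 60} min {secs % 60} s)")
```

By these timestamps the parallel runs took about 12 and 8 minutes, and the replicated runs about 29 and 30 minutes.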
