Hello,
I have a cluster of 11 nodes, each with 16 GB RAM, a 6-core CPU, and a 1 TB HDD, and I use the Cloudera distribution CDH3B4 with Pig. I have two Pig join queries: a parallel version and a replicated version of the same join.

In theory the replicated join should be faster than the parallel join, but in my case the parallel join is faster. I am wondering why the replicated join is so slow. I want to improve the performance of both queries. Could you check the details of the queries below?

Thanks,

Byambajargal


ANNO = load '/datastorm/task3/obr_pm_annotation.txt' using PigStorage(',') AS (element_id:long,concept_id:long);
REL = load '/datastorm/task3/obs_relation.txt' using PigStorage(',') AS (id:long,concept_id:long,parent_concept_id:long);
ISA_ANNO = join ANNO by concept_id, REL by concept_id PARALLEL 10;
ISA_ANNO_T = GROUP ISA_ANNO ALL;
ISA_ANNO_C = foreach ISA_ANNO_T generate COUNT($1);
dump ISA_ANNO_C;

HadoopVersion   PigVersion     UserId   StartedAt             FinishedAt            Features
0.20.2-CDH3B4   0.8.0-CDH3B4   haisen   2011-04-15 10:31:36   2011-04-15 10:43:22   HASH_JOIN,GROUP_BY

Success!

Job Stats (time in seconds):
JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
job_201103122121_0084  277  10  15  5  11  417  351  379  ANNO,ISA_ANNO,REL  HASH_JOIN
job_201103122121_0085  631  1  10  5  7  242  242  242  ISA_ANNO_C,ISA_ANNO_T  GROUP_BY,COMBINER  hdfs://haisen11:54310/tmp/temp281466632/tmp-171526868,

Input(s):
Successfully read 24153638 records from: "/datastorm/task3/obs_relation.txt"
Successfully read 442049697 records from: "/datastorm/task3/obr_pm_annotation.txt"

Output(s):
Successfully stored 1 records (14 bytes) in: "hdfs://haisen11:54310/tmp/temp281466632/tmp-171526868"

Counters:
Total records written : 1
Total bytes written : 14
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 41
Total records proactively spilled: 8781684

Job DAG:
job_201103122121_0084   ->      job_201103122121_0085,
job_201103122121_0085


2011-04-15 10:43:22,403 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2011-04-15 10:43:22,419 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2011-04-15 10:43:22,419 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(844872046)


Using replicated version

ANNO = load '/datastorm/task3/obr_pm_annotation.txt' using PigStorage(',') AS (element_id:long,concept_id:long);
REL = load '/datastorm/task3/obs_relation.txt' using PigStorage(',') AS (id:long,concept_id:long,parent_concept_id:long);
ISA_ANNO = join ANNO by concept_id, REL by concept_id USING 'replicated';
ISA_ANNO_T = GROUP ISA_ANNO ALL;
ISA_ANNO_C = foreach ISA_ANNO_T generate COUNT($1);
dump ISA_ANNO_C;

HadoopVersion   PigVersion     UserId   StartedAt             FinishedAt            Features
0.20.2-CDH3B4   0.8.0-CDH3B4   haisen   2011-04-15 10:57:37   2011-04-15 11:26:32   REPLICATED_JOIN,GROUP_BY

Success!

Job Stats (time in seconds):
JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
job_201103122121_0088  11  0  11  5  9  0  0  0  REL  MAP_ONLY
job_201103122121_0089  266  1  151  101  123  1566  1566  1566  ANNO,ISA_ANNO,ISA_ANNO_C,ISA_ANNO_T  REPLICATED_JOIN,GROUP_BY,COMBINER  hdfs://haisen11:54310/tmp/temp-1729753626/tmp-61569771,

Input(s):
Successfully read 442049697 records (17809735666 bytes) from: "/datastorm/task3/obr_pm_annotation.txt"
Successfully read 24153638 records (691022731 bytes) from: "/datastorm/task3/obs_relation.txt"

Output(s):
Successfully stored 1 records (14 bytes) in: "hdfs://haisen11:54310/tmp/temp-1729753626/tmp-61569771"

Counters:
Total records written : 1
Total bytes written : 14
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201103122121_0088   ->      job_201103122121_0089,
job_201103122121_0089


2011-04-15 11:26:32,751 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2011-04-15 11:26:32,889 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2011-04-15 11:26:32,899 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(844872046)

ANNO = load '/datastorm/task3/obr_pm_annotation.txt' using PigStorage(',') AS (element_id:long,concept_id:long);
REL = load '/datastorm/task3/obs_relation.txt' using PigStorage(',') AS (id:long,concept_id:long,parent_concept_id:long);
ISA_ANNO = join ANNO by concept_id, REL by concept_id PARALLEL 10;
store ISA_ANNO into 'outputdel';

HadoopVersion   PigVersion     UserId   StartedAt             FinishedAt            Features
0.20.2-CDH3B4   0.8.0-CDH3B4   haisen   2011-04-15 16:08:52   2011-04-15 16:16:26   HASH_JOIN

Success!

Job Stats (time in seconds):
JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
job_201103122121_0090  277  10  15  6  11  432  353  394  ANNO,ISA_ANNO,REL  HASH_JOIN  hdfs://haisen11:54310/user/haisen/outputdel,

Input(s):
Successfully read 24153638 records from: "/datastorm/task3/obs_relation.txt"
Successfully read 442049697 records from: "/datastorm/task3/obr_pm_annotation.txt"

Output(s):
Successfully stored 844872046 records (34500196186 bytes) in: "hdfs://haisen11:54310/user/haisen/outputdel"

Counters:
Total records written : 844872046
Total bytes written : 34500196186
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 41
Total records proactively spilled: 8537764

Job DAG:
job_201103122121_0090

2011-04-15 16:16:26,320 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!


ANNO = load '/datastorm/task3/obr_pm_annotation.txt' using PigStorage(',') AS (element_id:long,concept_id:long);
REL = load '/datastorm/task3/obs_relation.txt' using PigStorage(',') AS (id:long,concept_id:long,parent_concept_id:long);
ISA_ANNO = join ANNO by concept_id, REL by concept_id USING 'replicated';
store ISA_ANNO into 'outputdel';


HadoopVersion   PigVersion     UserId   StartedAt             FinishedAt            Features
0.20.2-CDH3B4   0.8.0-CDH3B4   haisen   2011-04-15 16:32:20   2011-04-15 17:02:16   REPLICATED_JOIN

Success!

Job Stats (time in seconds):
JobId  Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias  Feature  Outputs
job_201103122121_0093  11  0  10  5  9  0  0  0  REL  MAP_ONLY
job_201103122121_0094  266  0  156  96  128  0  0  0  ANNO,ISA_ANNO  REPLICATED_JOIN,MAP_ONLY  hdfs://haisen11:54310/user/haisen/outputdel1,

Input(s):
Successfully read 24153638 records (691022731 bytes) from: "/datastorm/task3/obs_relation.txt"
Successfully read 442049697 records (17809735666 bytes) from: "/datastorm/task3/obr_pm_annotation.txt"

Output(s):
Successfully stored 844872046 records (34500196186 bytes) in: "hdfs://haisen11:54310/user/haisen/outputdel1"

Counters:
Total records written : 844872046
Total bytes written : 34500196186
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_201103122121_0093   ->      job_201103122121_0094,
job_201103122121_0094


2011-04-15 17:02:16,651 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
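
To make the comparison easier to read, the wall-clock time of each of the four runs can be worked out from the StartedAt and FinishedAt values in the run summaries. The following is only a small Python sketch of that arithmetic; the run labels and the names RUNS, ELAPSED and elapsed_seconds are chosen here for illustration and are not part of the Pig output.

```python
from datetime import datetime

# Timestamp layout used in the StartedAt / FinishedAt columns.
FMT = "%Y-%m-%d %H:%M:%S"

# (StartedAt, FinishedAt) pairs copied from the four run summaries.
# The keys are illustrative labels, not names printed by Pig.
RUNS = {
    "hash join + count": ("2011-04-15 10:31:36", "2011-04-15 10:43:22"),
    "replicated join + count": ("2011-04-15 10:57:37", "2011-04-15 11:26:32"),
    "hash join + store": ("2011-04-15 16:08:52", "2011-04-15 16:16:26"),
    "replicated join + store": ("2011-04-15 16:32:20", "2011-04-15 17:02:16"),
}


def elapsed_seconds(started, finished):
    """Return the wall-clock seconds between two summary timestamps."""
    delta = datetime.strptime(finished, FMT) - datetime.strptime(started, FMT)
    return int(delta.total_seconds())


# Elapsed seconds per run, keyed by the same illustrative labels.
ELAPSED = {name: elapsed_seconds(*times) for name, times in RUNS.items()}

if __name__ == "__main__":
    for name, secs in ELAPSED.items():
        print(f"{name}: {secs} s ({secs // 60} min {secs % 60} s)")
```

By this arithmetic the replicated runs take roughly 29 to 30 minutes, against roughly 8 to 12 minutes for the runs using PARALLEL 10.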
