Hi Rohini:
  From the web UI, all the tasks are executed in parallel.
  After investigating the logs, I found the following points about the L9 failure.
L9.pig
register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views'
    using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
B = order A by query_term parallel 40;
store B into 'L9out';

There are 3 map-reduce jobs (scope-23, scope-26, scope-41) in this case; the MR plan is shown below.
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-23
Map Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.io.InterStorage) - scope-24
|
|---A: New For Each(false,false,false,false,false,false,false,false,false)[bag] - scope-19
    |   |
    |   Project[bytearray][0] - scope-1
    |   |
    |   Project[bytearray][1] - scope-3
    |   |
    |   Project[bytearray][2] - scope-5
    |   |
    |   Project[bytearray][3] - scope-7
    |   |
    |   Project[bytearray][4] - scope-9
    |   |
    |   Project[bytearray][5] - scope-11
    |   |
    |   Project[bytearray][6] - scope-13
    |   |
    |   Project[bytearray][7] - scope-15
    |   |
    |   Project[bytearray][8] - scope-17
    |
    |---A: Load(hdfs://bdpe16.sh.intel.com:8020/user/pig/tests/data/pigmix/page_views:org.apache.pig.test.pigmix.udf.PigPerformanceLoader) - scope-0--------
Global sort: false
----------------

MapReduce node scope-26
Map Plan
B: Local Rearrange[tuple]{tuple}(false) - scope-30
|   |
|   Constant(all) - scope-29
|
|---New For Each(false)[tuple] - scope-28
    |   |
    |   Project[bytearray][3] - scope-27
    |
    |---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.impl.io.InterStorage','100')) - scope-25--------
Reduce Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp610018336:org.apache.pig.impl.io.InterStorage) - scope-39
|
|---New For Each(false)[tuple] - scope-38
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - scope-37
    |   |
    |   |---Project[tuple][*] - scope-36
    |
    |---New For Each(false,false)[tuple] - scope-35
        |   |
        |   Constant(10) - scope-34
        |   |
        |   Project[bag][1] - scope-32
        |
        |---Package(Packager)[tuple]{chararray} - scope-31--------
Global sort: false
Secondary sort: true
----------------

MapReduce node scope-41
Map Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-42
|   |
|   Project[bytearray][3] - scope-20
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.io.InterStorage) - scope-40--------
Reduce Plan
B: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-22
|
|---New For Each(true)[tuple] - scope-45
    |   |
    |   Project[bag][1] - scope-44
    |
    |---Package(LitePackager)[tuple]{bytearray} - scope-43--------
Global sort: true
Quantile file: hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp610018336


Scope-26 does the sampling and generates the quantile file. It is always scope-26 that fails.
#hadoop job -history job_1469651298110_0002-1469672332355-root-PigLatin%3AL9.pig-1469678558094-6414-0-FAILED-default-1469672377395.jhist
Hadoop job: job_1469651298110_0002
=====================================
User: root
JobName: PigLatin:L9.pig
JobConf: hdfs://bdpe41:8020/tmp/hadoop-yarn/staging/root/.staging/job_1469651298110_0002/job.xml
Submitted At: 27-Jul-2016 22:18:52
Launched At: 27-Jul-2016 22:19:37 (45sec)
Finished At: 28-Jul-2016 00:02:38 (1hrs, 43mins, 0sec)
Status: FAILED

=====================================

Task Summary
============================
Kind    Total   Successful   Failed   Killed   StartTime              FinishTime

Setup   0       0            0        0
Map     7197    6414         572      211      27-Jul-2016 22:19:41   28-Jul-2016 00:02:40 (1hrs, 42mins, 59sec)
Reduce  1       0            0        1        27-Jul-2016 22:21:20   28-Jul-2016 00:02:40 (1hrs, 41mins, 19sec)
Cleanup 0       0            0        0

When querying the logs for why the reduce failed, the only message I found was "Task KILL is received. Killing attempt!". It is not clear why the reduce task was killed.

{"type":"REDUCE_ATTEMPT_KILLED","event":{"org.apache.hadoop.mapreduce.jobhistory.TaskAttemptUnsuccessfulCompletion":{"taskid":"task_1469651298110_0002_r_000000","taskType":"REDUCE","attemptId":"attempt_1469651298110_0002_r_000000_0","finishTime":1469678560791,"hostname":"bdpe15","port":41213,"rackname":"/default-rack","status":"KILLED","error":"Task
 KILL is received. Killing 
attempt!","counters":{"org.apache.hadoop.mapreduce.jobhistory.JhCounters":{"name":"COUNTERS","groups":[{"name":"org.apache.hadoop.mapreduce.FileSystemCounter","displayName":"File
 System Counters","counts":[{"name":"FILE_BYTES_READ","displayName":"FILE: 
Number of bytes 
read","value":0},{"name":"FILE_BYTES_WRITTEN","displayName":"FILE: Number of 
bytes written","value":169316},{"name":"FILE_READ_OPS","displayName":"FILE: 
Number of read 
operations","value":0},{"name":"FILE_LARGE_READ_OPS","displayName":"FILE: 
Number of large read 
operations","value":0},{"name":"FILE_WRITE_OPS","displayName":"FILE: Number of 
write operations","value":0},{"name":"HDFS_BYTES_READ","displayName":"HDFS: 
Number of bytes 
read","value":0},{"name":"HDFS_BYTES_WRITTEN","displayName":"HDFS: Number of 
bytes written","value":0},{"name":"HDFS_READ_OPS","displayName":"HDFS: Number 
of read 
operations","value":0},{"name":"HDFS_LARGE_READ_OPS","displayName":"HDFS: 
Number of large read 
operations","value":0},{"name":"HDFS_WRITE_OPS","displayName":"HDFS: Number of 
write 
operations","value":0}]},{"name":"org.apache.hadoop.mapreduce.TaskCounter","displayName":"Map-Reduce
 Framework","counts":[{"name":"COMBINE_INPUT_RECORDS","displayName":"Combine 
input 
records","value":0},{"name":"COMBINE_OUTPUT_RECORDS","displayName":"Combine 
output records","value":0},{"name":"REDUCE_INPUT_GROUPS","displayName":"Reduce 
input groups","value":0},{"name":"REDUCE_SHUFFLE_BYTES","displayName":"Reduce 
shuffle 
bytes","value":21039704},{"name":"REDUCE_INPUT_RECORDS","displayName":"Reduce 
input records","value":0},{"name":"REDUCE_OUTPUT_RECORDS","displayName":"Reduce 
output records","value":0},{"name":"SPILLED_RECORDS","displayName":"Spilled 
Records","value":0},{"name":"SHUFFLED_MAPS","displayName":"Shuffled Maps 
","value":6405},{"name":"FAILED_SHUFFLE","displayName":"Failed 
Shuffles","value":0},{"name":"MERGED_MAP_OUTPUTS","displayName":"Merged Map 
outputs","value":0},{"name":"GC_TIME_MILLIS","displayName":"GC time elapsed 
(ms)","value":3617},{"name":"CPU_MILLISECONDS","displayName":"CPU time spent 
(ms)","value":148570},{"name":"PHYSICAL_MEMORY_BYTES","displayName":"Physical 
memory (bytes) 
snapshot","value":346775552},{"name":"VIRTUAL_MEMORY_BYTES","displayName":"Virtual
 memory (bytes) 
snapshot","value":2975604736},{"name":"COMMITTED_HEAP_BYTES","displayName":"Total
 committed heap usage (bytes)","value":1490026496}]},{"name":"Shuffle 
Errors","displayName":"Shuffle 
Errors","counts":[{"name":"BAD_ID","displayName":"BAD_ID","value":0},{"name":"CONNECTION","displayName":"CONNECTION","value":0},{"name":"IO_ERROR","displayName":"IO_ERROR","value":0},{"name":"WRONG_LENGTH","displayName":"WRONG_LENGTH","value":0},{"name":"WRONG_MAP","displayName":"WRONG_MAP","value":0},{"name":"WRONG_REDUCE","displayName":"WRONG_REDUCE","value":0}]},{"name":"org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter","displayName":"File
 Output Format Counters 
","counts":[{"name":"BYTES_WRITTEN","displayName":"Bytes 
Written","value":0}]}]}},"clockSplits":[363810,597022,913686,4199950,340,339,340,340,339,340,340,340],"cpuUsages":[14016,15693,22227,96634,0,0,0,0,0,0,0,0],"vMemKbytes":[2635265,2905863,2905864,2905863,2905864,2905863,2905864,2905863,2905864,2905864,2905863,2905864],"physMemKbytes":[534798,640361,522500,355737,338648,338647,338648,338647,338648,338648,338647,338648]}}}
{"type":"TASK_FAILED","event":{"org.apache.hadoop.mapreduce.jobhistory.TaskFailed":{"taskid":"task_1469651298110_0002_r_000000","taskType":"REDUCE","finishTime":1469678560792,"error":"","failedDueToAttempt":null,"status":"KILLED","counters":{"org.apache.hadoop.mapreduce.jobhistory.JhCounters":{"name":"COUNTERS","groups":[{"name":"org.apache.hadoop.mapreduce.TaskCounter","displayName":"Map-Reduce
 Framework","counts":[{"name":"CPU_MILLISECONDS","displayName":"CPU time spent 
(ms)","value":0},{"name":"PHYSICAL_MEMORY_BYTES","displayName":"Physical memory 
(bytes) 
snapshot","value":0},{"name":"VIRTUAL_MEMORY_BYTES","displayName":"Virtual 
memory (bytes) snapshot","value":0}]}]}}}}}
{"type":"JOB_FAILED","event":{"org.apache.hadoop.mapreduce.jobhistory.JobUnsuccessfulCompletion":{"jobid":"job_1469651298110_0002","finishTime":1469678558094,"finishedMaps":6414,"finishedReduces":0,"jobStatus":"FAILED","diagnostics":{"string":"Task
 failed task_1469651298110_0002_m_003030\nJob failed as tasks failed. 
failedMaps:1 failedReduces:0"}}}}
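The JOB_FAILED diagnostics above blame map task task_1469651298110_0002_m_003030 (failedMaps:1), so presumably the reduce attempt was only killed because the job was already failing on that map. A small companion sketch (same one-event-per-line assumption as above, file name passed on the command line) could be used to look up that map task's failed attempt and its actual error:

# find_failed_map.py -- companion sketch: print the error text of the failed
# attempts of the map task named in the JOB_FAILED diagnostics above.
import json
import sys

FAILED_TASK = "task_1469651298110_0002_m_003030"  # from the JOB_FAILED diagnostics

with open(sys.argv[1]) as f:  # same .jhist file as above
    for line in f:
        line = line.strip()
        if FAILED_TASK not in line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue
        if event.get("type") not in ("MAP_ATTEMPT_FAILED", "TASK_FAILED"):
            continue
        payload = list(event["event"].values())[0]
        print("%s %s %s" % (event["type"],
                            payload.get("attemptId", payload.get("taskid", "")),
                            payload.get("error", "")))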




Kelly Zhang/Zhang,Liyun
Best Regards





From: Rohini Palaniswamy [mailto:rohini.adi...@gmail.com]
Sent: Tuesday, July 26, 2016 9:58 PM
To: Zhang, Liyun
Cc: pig-...@hadoop.apache.org; Daniel Dai (da...@hortonworks.com)
Subject: Re: Can anyone who has the experience on pigmix share configuration 
and expected results?

Let us just take one script L9 for analysis.
    - What was the failure error/stack trace? We run Pigmix with just 1G of 
heap. So it cannot be going out of memory.
    - Where was the 6 hours spent? Can you give a breakdown? Are all the 
reducer tasks being launched in parallel? For eg: If a reducer normally takes 
30 mins, if it is launched in 6 waves it can take 3 hrs.  Try lowering reducer 
memory from -Xmx3276m to -Xmx2048m or -Xmx1638m if that is the case.



On Tue, Jul 26, 2016 at 12:18 AM, Zhang, Liyun 
<liyun.zh...@intel.com> wrote:
Hi all:
  Now I'm using pigmix to test the performance of Pig on Spark (PIG-4937<https://issues.apache.org/jira/browse/PIG-4937>). The test data is 1TB. After generating all the test data, I ran the first round of tests in mr mode.
The cluster has 8 nodes (each node has 40 cores and 60g memory; 28 cores and 56g are assigned to the nodemanager on each node). The total for the cluster is 224 cores and 448g of memory.

The snippet of yarn-site.xml:
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>57344</value>
    <description>the amount of memory on the NodeManager in MB</description>
  </property>
   <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>28</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>57344</value>
  </property>
    <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for 
containers</description>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting 
memory limits for containers</description>
  </property>

The snippet of mapred-site.xml is
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1638m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx3276m</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>820</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>1200000</value>
  </property>

The snippet of hdfs-site.xml
<property>
    <name>dfs.blocksize</name>
    <value>1124217344</value>
  </property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
<name>dfs.socket.timeout</name>
<value>1200000</value>
</property>
<property>
<name>dfs.datanode.socket.write.timeout</name>
<value>1200000</value>
</property>

The results of the last run of pigmix in mr mode are below (L9, L10, L13, L14, L17 fail). They show that the average time spent on one script is nearly 6 hours. I don't know whether it really needs so much time to run L1~L17. Can anyone who has experience with pigmix share his/her configuration and expected results with me?



Script   MR(sec)
L1       21544
L2       20482
L3       21629
L4       20905
L5       20738
L6       24131
L7       21983
L8       24549
L9       6585 (Fail)
L10      22286 (Fail)
L11      21849
L12      21266
L13      11099 (Fail)
L14      43 (Fail)
L15      23808
L16      42889
L17      10 (Fail)




Kelly Zhang/Zhang,Liyun
Best Regards

