Hi Rohini,

I checked the web UI, and all the tasks are executed in parallel. After investigating the logs, I found the following points about the L9 failure.

L9.pig:

register pigperf.jar;
A = load '/user/pig/tests/data/pigmix/page_views' using org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp, estimated_revenue, page_info, page_links);
B = order A by query_term parallel 40;
store B into 'L9out';

There are 3 map-reduce jobs (scope-23, scope-26, scope-41) in this case.

#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-23
Map Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.io.InterStorage) - scope-24
|
|---A: New For Each(false,false,false,false,false,false,false,false,false)[bag] - scope-19
    |   |
    |   Project[bytearray][0] - scope-1
    |   |
    |   Project[bytearray][1] - scope-3
    |   |
    |   Project[bytearray][2] - scope-5
    |   |
    |   Project[bytearray][3] - scope-7
    |   |
    |   Project[bytearray][4] - scope-9
    |   |
    |   Project[bytearray][5] - scope-11
    |   |
    |   Project[bytearray][6] - scope-13
    |   |
    |   Project[bytearray][7] - scope-15
    |   |
    |   Project[bytearray][8] - scope-17
    |
    |---A: Load(hdfs://bdpe16.sh.intel.com:8020/user/pig/tests/data/pigmix/page_views:org.apache.pig.test.pigmix.udf.PigPerformanceLoader) - scope-0--------
Global sort: false
----------------

MapReduce node scope-26
Map Plan
B: Local Rearrange[tuple]{tuple}(false) - scope-30
|   |
|   Constant(all) - scope-29
|
|---New For Each(false)[tuple] - scope-28
    |   |
    |   Project[bytearray][3] - scope-27
    |
    |---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.impl.io.InterStorage','100')) - scope-25--------
Reduce Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp610018336:org.apache.pig.impl.io.InterStorage) - scope-39
|
|---New For Each(false)[tuple] - scope-38
    |   |
    |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - scope-37
    |   |
    |   |---Project[tuple][*] - scope-36
    |
    |---New For Each(false,false)[tuple] - scope-35
        |   |
        |   Constant(10) - scope-34
        |   |
        |   Project[bag][1] - scope-32
        |
        |---Package(Packager)[tuple]{chararray} - scope-31--------
Global sort: false
Secondary sort: true
----------------

MapReduce node scope-41
Map Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-42
|   |
|   Project[bytearray][3] - scope-20
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp1627657499:org.apache.pig.impl.io.InterStorage) - scope-40--------
Reduce Plan
B: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-22
|
|---New For Each(true)[tuple] - scope-45
    |   |
    |   Project[bag][1] - scope-44
    |
    |---Package(LitePackager)[tuple]{bytearray} - scope-43--------
Global sort: true
Quantile file: hdfs://zly1.sh.intel.com:8020/tmp/temp-287979498/tmp610018336

Scope-26 does the sampling and generates the quantile file. It is always scope-26 that fails.

# hadoop job -history job_1469651298110_0002-1469672332355-root-PigLatin%3AL9.pig-1469678558094-6414-0-FAILED-default-1469672377395.jhist

Hadoop job: job_1469651298110_0002
=====================================
User: root
JobName: PigLatin:L9.pig
JobConf: hdfs://bdpe41:8020/tmp/hadoop-yarn/staging/root/.staging/job_1469651298110_0002/job.xml
Submitted At: 27-Jul-2016 22:18:52
Launched At: 27-Jul-2016 22:19:37 (45sec)
Finished At: 28-Jul-2016 00:02:38 (1hrs, 43mins, 0sec)
Status: FAILED
=====================================

Task Summary
============================
Kind     Total  Successful  Failed  Killed  StartTime             FinishTime
Setup    0      0           0       0
Map      7197   6414        572     211     27-Jul-2016 22:19:41  28-Jul-2016 00:02:40 (1hrs, 42mins, 59sec)
Reduce   1      0           0       1       27-Jul-2016 22:21:20  28-Jul-2016 00:02:40 (1hrs, 41mins, 19sec)
Cleanup  0      0           0       0

I searched the log for the reason the reduce fails, but only found “Task KILL is received. Killing attempt!”. I do not know why the reduce task was killed.

{"type":"REDUCE_ATTEMPT_KILLED","event":{"org.apache.hadoop.mapreduce.jobhistory.TaskAttemptUnsuccessfulCompletion":{"taskid":"task_1469651298110_0002_r_000000","taskType":"REDUCE","attemptId":"attempt_1469651298110_0002_r_000000_0","finishTime":1469678560791,"hostname":"bdpe15","port":41213,"rackname":"/default-rack","status":"KILLED","error":"Task KILL is received.
Killing attempt!","counters":{"org.apache.hadoop.mapreduce.jobhistory.JhCounters":{"name":"COUNTERS","groups":[{"name":"org.apache.hadoop.mapreduce.FileSystemCounter","displayName":"File System Counters","counts":[{"name":"FILE_BYTES_READ","displayName":"FILE: Number of bytes read","value":0},{"name":"FILE_BYTES_WRITTEN","displayName":"FILE: Number of bytes written","value":169316},{"name":"FILE_READ_OPS","displayName":"FILE: Number of read operations","value":0},{"name":"FILE_LARGE_READ_OPS","displayName":"FILE: Number of large read operations","value":0},{"name":"FILE_WRITE_OPS","displayName":"FILE: Number of write operations","value":0},{"name":"HDFS_BYTES_READ","displayName":"HDFS: Number of bytes read","value":0},{"name":"HDFS_BYTES_WRITTEN","displayName":"HDFS: Number of bytes written","value":0},{"name":"HDFS_READ_OPS","displayName":"HDFS: Number of read operations","value":0},{"name":"HDFS_LARGE_READ_OPS","displayName":"HDFS: Number of large read operations","value":0},{"name":"HDFS_WRITE_OPS","displayName":"HDFS: Number of write operations","value":0}]},{"name":"org.apache.hadoop.mapreduce.TaskCounter","displayName":"Map-Reduce Framework","counts":[{"name":"COMBINE_INPUT_RECORDS","displayName":"Combine input records","value":0},{"name":"COMBINE_OUTPUT_RECORDS","displayName":"Combine output records","value":0},{"name":"REDUCE_INPUT_GROUPS","displayName":"Reduce input groups","value":0},{"name":"REDUCE_SHUFFLE_BYTES","displayName":"Reduce shuffle bytes","value":21039704},{"name":"REDUCE_INPUT_RECORDS","displayName":"Reduce input records","value":0},{"name":"REDUCE_OUTPUT_RECORDS","displayName":"Reduce output records","value":0},{"name":"SPILLED_RECORDS","displayName":"Spilled Records","value":0},{"name":"SHUFFLED_MAPS","displayName":"Shuffled Maps ","value":6405},{"name":"FAILED_SHUFFLE","displayName":"Failed Shuffles","value":0},{"name":"MERGED_MAP_OUTPUTS","displayName":"Merged Map outputs","value":0},{"name":"GC_TIME_MILLIS","displayName":"GC time elapsed 
(ms)","value":3617},{"name":"CPU_MILLISECONDS","displayName":"CPU time spent (ms)","value":148570},{"name":"PHYSICAL_MEMORY_BYTES","displayName":"Physical memory (bytes) snapshot","value":346775552},{"name":"VIRTUAL_MEMORY_BYTES","displayName":"Virtual memory (bytes) snapshot","value":2975604736},{"name":"COMMITTED_HEAP_BYTES","displayName":"Total committed heap usage (bytes)","value":1490026496}]},{"name":"Shuffle Errors","displayName":"Shuffle Errors","counts":[{"name":"BAD_ID","displayName":"BAD_ID","value":0},{"name":"CONNECTION","displayName":"CONNECTION","value":0},{"name":"IO_ERROR","displayName":"IO_ERROR","value":0},{"name":"WRONG_LENGTH","displayName":"WRONG_LENGTH","value":0},{"name":"WRONG_MAP","displayName":"WRONG_MAP","value":0},{"name":"WRONG_REDUCE","displayName":"WRONG_REDUCE","value":0}]},{"name":"org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter","displayName":"File Output Format Counters ","counts":[{"name":"BYTES_WRITTEN","displayName":"Bytes Written","value":0}]}]}},"clockSplits":[363810,597022,913686,4199950,340,339,340,340,339,340,340,340],"cpuUsages":[14016,15693,22227,96634,0,0,0,0,0,0,0,0],"vMemKbytes":[2635265,2905863,2905864,2905863,2905864,2905863,2905864,2905863,2905864,2905864,2905863,2905864],"physMemKbytes":[534798,640361,522500,355737,338648,338647,338648,338647,338648,338648,338647,338648]}}} {"type":"TASK_FAILED","event":{"org.apache.hadoop.mapreduce.jobhistory.TaskFailed":{"taskid":"task_1469651298110_0002_r_000000","taskType":"REDUCE","finishTime":1469678560792,"error":"","failedDueToAttempt":null,"status":"KILLED","counters":{"org.apache.hadoop.mapreduce.jobhistory.JhCounters":{"name":"COUNTERS","groups":[{"name":"org.apache.hadoop.mapreduce.TaskCounter","displayName":"Map-Reduce Framework","counts":[{"name":"CPU_MILLISECONDS","displayName":"CPU time spent (ms)","value":0},{"name":"PHYSICAL_MEMORY_BYTES","displayName":"Physical memory (bytes) 
snapshot","value":0},{"name":"VIRTUAL_MEMORY_BYTES","displayName":"Virtual memory (bytes) snapshot","value":0}]}]}}}}}
{"type":"JOB_FAILED","event":{"org.apache.hadoop.mapreduce.jobhistory.JobUnsuccessfulCompletion":{"jobid":"job_1469651298110_0002","finishTime":1469678558094,"finishedMaps":6414,"finishedReduces":0,"jobStatus":"FAILED","diagnostics":{"string":"Task failed task_1469651298110_0002_m_003030\nJob failed as tasks failed. failedMaps:1 failedReduces:0"}}}}

Kelly Zhang/Zhang,Liyun
Best Regards

From: Rohini Palaniswamy [mailto:rohini.adi...@gmail.com]
Sent: Tuesday, July 26, 2016 9:58 PM
To: Zhang, Liyun
Cc: pig-...@hadoop.apache.org; Daniel Dai (da...@hortonworks.com)
Subject: Re: Can anyone who has the experience on pigmix share configuration and expected results?

Let us just take one script, L9, for analysis.

- What was the failure error/stack trace? We run Pigmix with just 1G of heap, so it cannot be going out of memory.
- Where was the 6 hours spent? Can you give a breakdown? Are all the reducer tasks being launched in parallel? For example, if a reducer normally takes 30 mins and the reducers are launched in 6 waves, the job can take 3 hrs. Try lowering reducer memory from -Xmx3276m to -Xmx2048m or -Xmx1638m if that is the case.

On Tue, Jul 26, 2016 at 12:18 AM, Zhang, Liyun <liyun.zh...@intel.com> wrote:

Hi all,

I am now using pigmix to test the performance of Pig on Spark (PIG-4937 <https://issues.apache.org/jira/browse/PIG-4937>). The test data is 1TB. After generating all the test data, I ran the first round of tests in mr mode. The cluster has 8 nodes (each node has 40 cores and 60g memory, of which 28 cores and 56g are assigned to the NodeManager on the node). The total for the cluster is 224 cores and 448g memory.
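
As a rough sanity check on these totals and on the question of task waves, here is a minimal Python sketch. It uses only figures quoted in this thread: the node count and NodeManager limits, the map and reduce container sizes from mapred-site.xml (2048 MB and 4096 MB), the 40 reducers requested by L9, and the 7197 map tasks reported for the sampling job. The function names are made up for illustration. It assumes one vcore per container, ignores the ApplicationMaster container, and looks at maps and reducers separately, so real concurrency is likely somewhat lower.

```python
import math

# Figures quoted in this thread.
NODES = 8
NM_MEMORY_MB = 57344   # yarn.nodemanager.resource.memory-mb
NM_VCORES = 28         # yarn.nodemanager.resource.cpu-vcores
MAP_MB = 2048          # mapreduce.map.memory.mb
REDUCE_MB = 4096       # mapreduce.reduce.memory.mb


def concurrent_containers(container_mb, vcores_per_container=1):
    """Upper bound on same-sized containers running at once on the cluster.

    Assumption: each container takes one vcore, and a node is limited by
    whichever of memory or vcores runs out first.
    """
    per_node = min(NM_MEMORY_MB // container_mb,
                   NM_VCORES // vcores_per_container)
    return NODES * per_node


def waves(tasks, slots):
    """Number of scheduling rounds needed to run `tasks` on `slots` slots."""
    return math.ceil(tasks / slots)


map_slots = concurrent_containers(MAP_MB)        # 8 * min(28, 28) = 224
reduce_slots = concurrent_containers(REDUCE_MB)  # 8 * min(14, 28) = 112

print("cluster memory (GB):", NODES * NM_MEMORY_MB // 1024)  # 448
print("map slots:", map_slots)
print("reduce slots:", reduce_slots)
print("waves for 40 reducers:", waves(40, reduce_slots))     # 1
print("waves for 7197 maps:", waves(7197, map_slots))        # 33
```

Under these assumptions the 40 reducers of L9 fit in a single wave, while the 7197 map tasks of the sampling job would need about 33 waves.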

The snippet of yarn-site.xml:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>
  <description>the amount of memory on the NodeManager in MB</description>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>28</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>2048</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>57344</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
  <description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
  <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>

The snippet of mapred-site.xml:

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>820</value>
</property>
<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value>
</property>

The snippet of hdfs-site.xml:

<property>
  <name>dfs.blocksize</name>
  <value>1124217344</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
<property>
  <name>dfs.socket.timeout</name>
  <value>1200000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>1200000</value>
</property>

The results of the last run of pigmix in mr mode follow (L9, L10, L13, L14 and L17 fail). They show that the average time spent on one script is nearly 6 hours. I don’t know whether it really needs so much time to run L1~L17.
Can anyone who has experience with pigmix share his/her configuration and expected results with me?

       MR(sec)
L1     21544
L2     20482
L3     21629
L4     20905
L5     20738
L6     24131
L7     21983
L8     24549
L9     6585(Fail)
L10    22286(Fail)
L11    21849
L12    21266
L13    11099(Fail)
L14    43(Fail)
L15    23808
L16    42889
L17    10(Fail)

Kelly Zhang/Zhang,Liyun
Best Regards