[jira] [Updated] (TEZ-1493) WordCount example fails in recovery
[ https://issues.apache.org/jira/browse/TEZ-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeff Zhang updated TEZ-1493:
Description:
{code}
14/08/25 17:37:03 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1408499461970_0053, dagName=WordCount
14/08/25 17:37:03 INFO impl.YarnClientImpl: Submitted application application_1408499461970_0053
14/08/25 17:37:03 INFO client.TezClient: The url to track the Tez AM: http://jzhangMBPr.local:8088/proxy/application_1408499461970_0053/
14/08/25 17:37:03 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/08/25 17:37:03 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200
14/08/25 17:37:03 INFO rpc.DAGClientRPCImpl: Waiting for DAG to start running
14/08/25 17:37:07 INFO rpc.DAGClientRPCImpl: DAG: State: RUNNING Progress: 0% TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0
14/08/25 17:37:15 INFO rpc.DAGClientRPCImpl: DAG: State: RUNNING Progress: 50% TotalTasks: 2 Succeeded: 1 Running: 0 Failed: 0 Killed: 0
14/08/25 17:37:17 INFO rpc.DAGClientRPCImpl: DAG completed. FinalState=SUBMITTED
WordCount failed with diagnostics: []
{code}
The client reports that the job failed, but the logs show that recovery worked on the server side and the job eventually finished successfully.
was: (the same log output, without the closing paragraph)
WordCount example fails in recovery
---
Key: TEZ-1493
URL: https://issues.apache.org/jira/browse/TEZ-1493
Project: Apache Tez
Issue Type: Bug
Reporter: Jeff Zhang
Assignee: Jeff Zhang
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1493) WordCount example fails in recovery
[ https://issues.apache.org/jira/browse/TEZ-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1493: Attachment: Tez-1493.patch WordCount example fails in recovery --- Key: TEZ-1493 URL: https://issues.apache.org/jira/browse/TEZ-1493 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: Tez-1493.patch
[jira] [Commented] (TEZ-1493) WordCount example fails in recovery
[ https://issues.apache.org/jira/browse/TEZ-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109015#comment-14109015 ] Jeff Zhang commented on TEZ-1493: - Attached the patch. The cause of this issue is that when the second AM attempt starts, DAGClient first fetches the status via the AM, which is in the SUBMITTED state. WordCount example fails in recovery --- Key: TEZ-1493 URL: https://issues.apache.org/jira/browse/TEZ-1493 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Attachments: Tez-1493.patch
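The failure mode described in this comment can be pictured with a small polling loop. This is only a sketch under stated assumptions, not the attached patch: the State enum and waitForCompletion helper are hypothetical stand-ins and do not use the real Tez DAGClient API. The point is that a SUBMITTED status seen after a new AM attempt starts should not be treated as a final state.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.Set;

class DagStatusPollSketch {
    // Hypothetical state set for illustration; not the real Tez DAGStatus API.
    enum State { SUBMITTED, INITING, RUNNING, SUCCEEDED, FAILED, KILLED, ERROR }

    // Only these states mean the DAG has actually finished.
    static final Set<State> TERMINAL =
        EnumSet.of(State.SUCCEEDED, State.FAILED, State.KILLED, State.ERROR);

    // Consume status reports until a terminal state arrives.
    static State waitForCompletion(Iterator<State> reports) {
        State last = null;
        while (reports.hasNext()) {
            last = reports.next();
            if (TERMINAL.contains(last)) {
                return last;
            }
            // A non-terminal report (for example SUBMITTED from a recovering
            // AM attempt) is ignored and polling continues.
        }
        return last; // reports ran out before a terminal state was seen
    }

    public static void main(String[] args) {
        Iterator<State> reports = Arrays.asList(
            State.RUNNING, State.RUNNING, State.SUBMITTED, State.RUNNING, State.SUCCEEDED
        ).iterator();
        System.out.println(waitForCompletion(reports)); // prints SUCCEEDED
    }
}
```

With a loop of this shape, the client in the log above would have kept polling past the SUBMITTED report instead of printing "DAG completed".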
[jira] [Created] (TEZ-1494) DAG hangs waiting for ShuffleManager.getNextInput()
Rajesh Balamohan created TEZ-1494:
Summary: DAG hangs waiting for ShuffleManager.getNextInput()
Key: TEZ-1494
URL: https://issues.apache.org/jira/browse/TEZ-1494
Project: Apache Tez
Issue Type: Bug
Reporter: Rajesh Balamohan
Assignee: Rajesh Balamohan
Attaching the DAG and the stack trace of the hung process.
digraph rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1 {
graph [ label=rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1, fontsize=24, fontname=Helvetica];
node [fontsize=12, fontname=Helvetica];
edge [fontsize=9, fontcolor=blue, fontname=Arial];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_1" [ label = "Map_1[MapTezProcessor]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_1" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_2" [ label = "[input=UnorderedKVOutput,\n output=UnorderedKVInput,\n dataMovement=BROADCAST,\n schedulingType=SEQUENTIAL]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_7" [ label = "Map_7[MapTezProcessor]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_7" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_5" [ label = "[input=UnorderedKVOutput,\n output=UnorderedKVInput,\n dataMovement=BROADCAST,\n schedulingType=SEQUENTIAL]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Reducer_6_out_Reducer_6" [ label = "Reducer_6[out_Reducer_6]", shape = "box" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_8" [ label = "Map_8[MapTezProcessor]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_8" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_2" [ label = "[input=UnorderedKVOutput,\n output=UnorderedKVInput,\n dataMovement=BROADCAST,\n schedulingType=SEQUENTIAL]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_4_date_dim" [ label = "Map_4[date_dim]", shape = "box" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_4_date_dim" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_4" [ label = "Input [inputClass=MRInputLegacy,\n initializer=HiveSplitGenerator]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_5" [ label = "Map_5[MapTezProcessor]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_5" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Reducer_6" [ label = "[input=OrderedPartitionedKVOutput,\n output=OrderedGroupedKVInput,\n dataMovement=SCATTER_GATHER,\n schedulingType=SEQUENTIAL]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_8_customer_address" [ label = "Map_8[customer_address]", shape = "box" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_8_customer_address" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_8" [ label = "Input [inputClass=MRInputLegacy,\n initializer=HiveSplitGenerator]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_2_store_sales" [ label = "Map_2[store_sales]", shape = "box" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_2_store_sales" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_2" [ label = "Input [inputClass=MRInputLegacy,\n initializer=HiveSplitGenerator]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_1_household_demographics" [ label = "Map_1[household_demographics]", shape = "box" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_1_household_demographics" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_1" [ label = "Input [inputClass=MRInputLegacy,\n initializer=HiveSplitGenerator]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_9" [ label = "Map_9[MapTezProcessor]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_9" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_2" [ label = "[input=UnorderedKVOutput,\n output=UnorderedKVInput,\n dataMovement=BROADCAST,\n schedulingType=SEQUENTIAL]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_5_customer" [ label = "Map_5[customer]", shape = "box" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_5_customer" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_5" [ label = "Input [inputClass=MRInputLegacy,\n initializer=HiveSplitGenerator]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_7_current_addr" [ label = "Map_7[current_addr]", shape = "box" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_7_current_addr" -> "rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Map_7" [ label = "Input [inputClass=MRInputLegacy,\n initializer=HiveSplitGenerator]" ];
"rajesh_20140825050909_6206d911_7de1_47aa_8788_dd9ffcc9ad36_1.Reducer_6" [ label = "Reducer_6[ReduceTezProcessor]" ];
[jira] [Updated] (TEZ-1494) DAG hangs waiting for ShuffleManager.getNextInput()
[ https://issues.apache.org/jira/browse/TEZ-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan updated TEZ-1494:
Description: Attaching the DAG and the stack trace of the hung process.
Thread 30071: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame)
 - org.apache.tez.runtime.library.shuffle.common.impl.ShuffleManager.getNextInput() @bci=67, line=610 (Interpreted frame)
 - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.moveToNextInput() @bci=26, line=176 (Interpreted frame)
 - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.next() @bci=30, line=117 (Interpreted frame)
was: Attaching the DAG and the stack trace of the hung process. (followed by the inline DOT source of the DAG, as in the original description)
[jira] [Updated] (TEZ-1494) DAG hangs waiting for ShuffleManager.getNextInput()
[ https://issues.apache.org/jira/browse/TEZ-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1494: -- Attachment: TEZ-1494-DAG.dot DAG hangs waiting for ShuffleManager.getNextInput() --- Key: TEZ-1494 URL: https://issues.apache.org/jira/browse/TEZ-1494 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1494-DAG.dot
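The stack trace in the TEZ-1494 description shows the reader parked inside LinkedBlockingQueue.take(), which waits indefinitely when no producer ever adds an element. The sketch below illustrates that behaviour with plain java.util.concurrent; the completedInputs queue and the nextOrNull helper are hypothetical names, and this is not the ShuffleManager code. Using poll with a timeout is shown only as one way to make such a wait observable, not as the fix chosen for this issue.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

class BlockingTakeSketch {
    // Returns the next item, or null if nothing arrives within the timeout.
    static String nextOrNull(LinkedBlockingQueue<String> queue, long timeoutMs) {
        try {
            // take() would park the caller indefinitely on an empty queue;
            // poll(timeout) gives up so the caller can log or abort.
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        }
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<String> completedInputs = new LinkedBlockingQueue<>();
        completedInputs.add("input-0");
        System.out.println(nextOrNull(completedInputs, 100)); // available immediately
        System.out.println(nextOrNull(completedInputs, 100)); // no producer: times out, null
    }
}
```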
[jira] [Commented] (TEZ-1490) dagid reported is incorrect in TezClient.java
[ https://issues.apache.org/jira/browse/TEZ-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109136#comment-14109136 ] Prakash Ramachandran commented on TEZ-1490: --- Thanks for the patch. Yes, it's mostly a display issue as of now. dagid reported is incorrect in TezClient.java - Key: TEZ-1490 URL: https://issues.apache.org/jira/browse/TEZ-1490 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Jonathan Eagles Attachments: TEZ-1490-v1.patch The format used to get the dagid and appid in TezClient.java does not match the one used in TezDagId.java. For example, TezClient.java reports the dagid as dag_1408740248751_3_01, while the dagid reported in the logs is dag_1408740248751_0003_1.
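The mismatch in the example above comes down to which field is zero-padded. The following sketch reproduces both strings with String.format; the method names and format strings are assumptions made for illustration and are not copied from TezClient.java or TezDagId.java.

```java
class DagIdFormatSketch {
    // Style seen in the logs: app id zero-padded to 4 digits, DAG index unpadded.
    static String logStyle(long clusterTimestamp, int appId, int dagIndex) {
        return String.format("dag_%d_%04d_%d", clusterTimestamp, appId, dagIndex);
    }

    // Style that reproduces the mismatched client-side string:
    // app id unpadded, DAG index zero-padded to 2 digits.
    static String clientStyle(long clusterTimestamp, int appId, int dagIndex) {
        return String.format("dag_%d_%d_%02d", clusterTimestamp, appId, dagIndex);
    }

    public static void main(String[] args) {
        System.out.println(logStyle(1408740248751L, 3, 1));    // dag_1408740248751_0003_1
        System.out.println(clientStyle(1408740248751L, 3, 1)); // dag_1408740248751_3_01
    }
}
```

Building both strings through one shared formatter is the usual way to keep such identifiers from drifting apart.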
[jira] [Updated] (TEZ-1486) TezUncheckedException when using dynamic partition pruning
[ https://issues.apache.org/jira/browse/TEZ-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated TEZ-1486:
Attachment: TEZ-1486.1.txt
Patch skips routing events if target vertex parallelism is 0. I've left the exception in there for now; it is technically possible to route events and generate an empty list, but that is highly unlikely. [~bikassaha] - review please.
TezUncheckedException when using dynamic partition pruning
--
Key: TEZ-1486
URL: https://issues.apache.org/jira/browse/TEZ-1486
Project: Apache Tez
Issue Type: Bug
Affects Versions: 0.5.0
Reporter: Gunther Hagleitner
Assignee: Siddharth Seth
Attachments: TEZ-1486.1.txt
I'm working on using the AM event mechanism to dynamically prune partitions at DAG runtime for certain queries. The query is:
select count(*) from srcpart join srcpart_double_hour on (srcpart.hr*2 = srcpart_double_hour.hr) where srcpart_double_hour.hour = 11;
This will result in two vertices connected through a broadcast edge. The first vertex prepares two things: the list of partition keys (hr) that are sent to the AM for dynamic pruning, and the records to be used in the hash join. The second vertex will block until all events are received (initializer), then it will load and process the hash join. It's possible for queries like this to result in zero splits on the second vertex (i.e. no matching rows for the join).
The exception I get when this is run is:
org.apache.tez.dag.api.TezUncheckedException: Event must be routed. sourceVertex=vertex_1408686217936_0003_3_00 srcIndex = 0 destAttemptId=vertex_1408686217936_0003_3_01 edgeManager=org.apache.tez.dag.app.dag.impl.BroadcastEdgeManager Event type=DATA_MOVEMENT_EVENT
at org.apache.tez.dag.app.dag.impl.Edge.sendTezEventToDestinationTasks(Edge.java:371)
at org.apache.tez.dag.app.dag.impl.VertexImpl$RouteEventTransition.transition(VertexImpl.java:3372)
at org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1088)
at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleVertexTasks(VertexManager.java:111)
at org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.onVertexStarted(ImmediateStartVertexManager.java:49)
at org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:244)
at org.apache.tez.dag.app.dag.impl.VertexImpl.startVertex(VertexImpl.java:2923)
at org.apache.tez.dag.app.dag.impl.VertexImpl.access$5900(VertexImpl.java:169)
at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:2914)
at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:2906)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1355)
at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:168)
at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1650)
at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1636)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:695)
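The behaviour described for the patch (skip routing when the target vertex has zero tasks, but keep the exception for any other empty routing result) can be sketched as follows. This is a simplified illustration with hypothetical method names and a plain task-index list; it is not the code in Edge.java or in TEZ-1486.1.txt.

```java
import java.util.ArrayList;
import java.util.List;

class RouteEventSketch {
    // Broadcast routing: every destination task receives the event.
    static List<Integer> routeBroadcast(int destinationTaskCount) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < destinationTaskCount; i++) {
            targets.add(i);
        }
        return targets;
    }

    // Returns the destination task indices the event was sent to.
    static List<Integer> sendToDestinationTasks(int destinationTaskCount) {
        if (destinationTaskCount == 0) {
            // Target vertex has parallelism 0: nothing to route to, so skip.
            return new ArrayList<>();
        }
        List<Integer> targets = routeBroadcast(destinationTaskCount);
        if (targets.isEmpty()) {
            // Kept as a guard: with at least one task an empty result is unexpected.
            throw new IllegalStateException("Event must be routed.");
        }
        return targets;
    }

    public static void main(String[] args) {
        System.out.println(sendToDestinationTasks(0)); // []
        System.out.println(sendToDestinationTasks(3)); // [0, 1, 2]
    }
}
```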
[jira] [Commented] (TEZ-1083) Enable IFile RLE for DefaultSorter
[ https://issues.apache.org/jira/browse/TEZ-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109453#comment-14109453 ] Gopal V commented on TEZ-1083: -- This looks alright - this just needs a roll-over check for the sameKey long variable. The worst-case value for that is near O(n^2), so it might overflow before totalKeys does. For performance, it can be assumed that if sameKeys is 0, isRLENeeded == true - instead of checking within the loop. Enable IFile RLE for DefaultSorter -- Key: TEZ-1083 URL: https://issues.apache.org/jira/browse/TEZ-1083 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Gopal V Attachments: TEZ-1083.1.patch Generate RLE IFiles for DefaultSorter and use it to fast-forward map-side merge. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1494) DAG hangs waiting for ShuffleManager.getNextInput()
[ https://issues.apache.org/jira/browse/TEZ-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109454#comment-14109454 ] Hitesh Shah commented on TEZ-1494: -- [~rajesh.balamohan] Is this an issue present in the 0.5.0 RC? DAG hangs waiting for ShuffleManager.getNextInput() --- Key: TEZ-1494 URL: https://issues.apache.org/jira/browse/TEZ-1494 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1494-DAG.dot
[jira] [Updated] (TEZ-1493) WordCount example fails in recovery
[ https://issues.apache.org/jira/browse/TEZ-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1493: - Priority: Blocker (was: Major) Target Version/s: 0.5.1 WordCount example fails in recovery --- Key: TEZ-1493 URL: https://issues.apache.org/jira/browse/TEZ-1493 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: Tez-1493.patch
[jira] [Resolved] (TEZ-1473) TEZ_RUNTIME_SHUFFLE_BUFFER is too large by default
[ https://issues.apache.org/jira/browse/TEZ-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth resolved TEZ-1473. - Resolution: Invalid TEZ_RUNTIME_SHUFFLE_BUFFER is too large by default -- Key: TEZ-1473 URL: https://issues.apache.org/jira/browse/TEZ-1473 Project: Apache Tez Issue Type: Improvement Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: TEZ-1473.1.patch TEZ_RUNTIME_SHUFFLE_BUFFER is 8GB by default, while TEZ_TASK_RESOURCE_MEMORY_MB_DEFAULT is 1GB. This leads to OOM errors or to the container being killed.
[jira] [Commented] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109461#comment-14109461 ] Gopal V commented on TEZ-1492: -- The BufferUtils class needs re-namespacing as well, as part of this patch. IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1489) Broadcast Shuffle should call freeResources() on FetchedInput
[ https://issues.apache.org/jira/browse/TEZ-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109472#comment-14109472 ] Gopal V commented on TEZ-1489: -- The buffer is being cleared up correctly - but the unreserve() is not getting called, so the internal check switches to Disk even though buffers are unused. Broadcast Shuffle should call freeResources() on FetchedInput - Key: TEZ-1489 URL: https://issues.apache.org/jira/browse/TEZ-1489 Project: Apache Tez Issue Type: Bug Reporter: Gopal V BroadcastShuffle does not seem to free up the buffer space allocated by the FetchedInputs during the task runtime. SimpleFetchedInputAllocator::freeResources is never called as per my logging. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1489) Broadcast Shuffle should call freeResources() on FetchedInput
[ https://issues.apache.org/jira/browse/TEZ-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109476#comment-14109476 ] Gopal V commented on TEZ-1489: -- Maybe UnorderedKVReader::moveToNextInput() is a good place for this? Broadcast Shuffle should call freeResources() on FetchedInput - Key: TEZ-1489 URL: https://issues.apache.org/jira/browse/TEZ-1489 Project: Apache Tez Issue Type: Bug Reporter: Gopal V
[jira] [Commented] (TEZ-1489) Broadcast Shuffle should call freeResources() on FetchedInput
[ https://issues.apache.org/jira/browse/TEZ-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109481#comment-14109481 ] Siddharth Seth commented on TEZ-1489: - Do you have some logs which can be looked at - the code flow seems like it'll end up calling unreserve on SimpleFetchedInputAllocator. Broadcast Shuffle should call freeResources() on FetchedInput - Key: TEZ-1489 URL: https://issues.apache.org/jira/browse/TEZ-1489 Project: Apache Tez Issue Type: Bug Reporter: Gopal V BroadcastShuffle does not seem to free up the buffer space allocated by the FetchedInputs during the task runtime. SimpleFetchedInputAllocator::freeResources is never called as per my logging. -- This message was sent by Atlassian JIRA (v6.2#6252)
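The symptom discussed in this thread (buffers are released, but the allocator still switches to disk because unreserve() is not called) can be pictured with a small bookkeeping sketch. The class below is hypothetical and heavily simplified; it is not SimpleFetchedInputAllocator, and the field names and limit are assumptions made for illustration.

```java
class FetchAllocatorSketch {
    enum Target { MEMORY, DISK }

    private final long memoryLimit;
    private long usedMemory;

    FetchAllocatorSketch(long memoryLimit) {
        this.memoryLimit = memoryLimit;
    }

    // Reserve memory if the input fits under the limit, otherwise fall back to disk.
    Target allocate(long size) {
        if (usedMemory + size <= memoryLimit) {
            usedMemory += size;
            return Target.MEMORY;
        }
        return Target.DISK;
    }

    // Must be called once a fetched input has been consumed; otherwise the
    // accounting still counts the space as used even if the buffer is gone.
    void unreserve(long size) {
        usedMemory = Math.max(0, usedMemory - size);
    }

    public static void main(String[] args) {
        FetchAllocatorSketch allocator = new FetchAllocatorSketch(100);
        System.out.println(allocator.allocate(60)); // MEMORY
        System.out.println(allocator.allocate(60)); // DISK: first reservation never released
        allocator.unreserve(60);
        System.out.println(allocator.allocate(60)); // MEMORY again after unreserve
    }
}
```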
[jira] [Commented] (TEZ-1360) Provide vertex parallelism to each vertex task
[ https://issues.apache.org/jira/browse/TEZ-1360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109484#comment-14109484 ] Siddharth Seth commented on TEZ-1360: - +1. Committing. Thanks [~gopalv], [~oae], [~rajesh.balamohan] Provide vertex parallelism to each vertex task -- Key: TEZ-1360 URL: https://issues.apache.org/jira/browse/TEZ-1360 Project: Apache Tez Issue Type: Bug Reporter: Johannes Zillmann Assignee: Gopal V Fix For: 0.5.1 Attachments: TEZ-1360.1.patch, TEZ-1360.2.patch, TEZ-1360.4.patch, TEZ-1360.5.patch, TEZ-1360.6.patch It would be good for a task to get information about the total task count of its vertex. With this there would be an equivalent of MapReduce's {{mapred.map.tasks}} and {{mapred.reduce.tasks}}, and MR applications using these could be ported to Tez more easily.
[jira] [Commented] (TEZ-1471) Additional supplement for TEZ local mode document
[ https://issues.apache.org/jira/browse/TEZ-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109518#comment-14109518 ] Siddharth Seth commented on TEZ-1471: - [~airbots] - the changes look good. Could you please move some of them to a separate section though - Running a DAG in Local Mode should be limited to the changes users need to make; it's fairly cluttered at this point. Something like a things to watch out for section, which can contain LargeData belong here TezConfiguration.TEZ_AM_INLINE_TASK_EXECUTION_MAX_TASKS(tez.am.inline.task.execution.max-tasks) should not be changed (defaults to 1). tez.history.logging.service.class should be the default value: org.apache.tez.dag.history.logging.impl.SimpleHistoryLoggingService. It means ATS is disabled in current Local Mode. I don't think we need to call out NodeBlacklisting being disabled. Otherwise move it to the section about moving to a real cluster. Additional supplement for TEZ local mode document - Key: TEZ-1471 URL: https://issues.apache.org/jira/browse/TEZ-1471 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.4.0 Reporter: Chen He Assignee: Chen He Attachments: TEZ-1471-2.patch, TEZ-1471-3.patch, TEZ-1471-4.patch, TEZ-1471.patch some supplements for Local mode document -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TEZ-1495) ATS integration for TezClient
Prakash Ramachandran created TEZ-1495: - Summary: ATS integration for TezClient Key: TEZ-1495 URL: https://issues.apache.org/jira/browse/TEZ-1495 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Tez client should automatically redirect to ATS when the AM is not running. All APIs exposed ( DAG status, counters, etc ) from the DAGClient should continue to work after the AM has shut down. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1495) ATS integration for TezClient
[ https://issues.apache.org/jira/browse/TEZ-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prakash Ramachandran updated TEZ-1495: -- Attachment: TEZ-1495.WIP.1.patch Changes in TEZ-1495.WIP.1.patch - getDagStatus and getVertexStatus fall back to ATS ATS integration for TezClient - Key: TEZ-1495 URL: https://issues.apache.org/jira/browse/TEZ-1495 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Prakash Ramachandran Attachments: TEZ-1495.WIP.1.patch Tez client should automatically redirect to ATS when the AM is not running. All APIs exposed ( DAG status, counters, etc ) from the DAGClient should continue to work after the AM has shut down. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1476) DAGClient waitForCompletion output is confusing
[ https://issues.apache.org/jira/browse/TEZ-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109649#comment-14109649 ] Siddharth Seth commented on TEZ-1476: - +1. Looks good. DAGClient waitForCompletion output is confusing --- Key: TEZ-1476 URL: https://issues.apache.org/jira/browse/TEZ-1476 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Assignee: Jonathan Eagles Priority: Critical Attachments: TEZ-1476-v1.patch, TEZ-1476-v2.patch When a DAG is submitted - 2014-08-21 16:38:06,153 INFO [main] rpc.DAGClientRPCImpl (DAGClientRPCImpl.java:log(428)) - Waiting for DAG to start running is logged. After this, nothing seems to get logged till the first task completes. It would be useful to log when the state changes to RUNNING - as well as at least one line stating the number of tasks, etc (0% progress line). Also, progress could be logged every few seconds irrespective of whether it has changed or not to give the impression that the job has not just gotten stuck. -- This message was sent by Atlassian JIRA (v6.2#6252)
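The logging behaviour requested in TEZ-1476 (log on every state or progress change, and also every few seconds when nothing has changed) could be sketched roughly as below. This is only an illustration under stated assumptions: ProgressLogThrottle and shouldLog are hypothetical names, not part of DAGClient or of the attached patches, and the status line is treated as an opaque string.

```java
// Hypothetical helper, not Tez code: decides when a polling client should
// print a progress line.
final class ProgressLogThrottle {
    private final long intervalMillis;
    private String lastLine;      // null until the first line has been logged
    private long lastLogMillis;

    ProgressLogThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns true when the line should be logged: the first call, any change
    // in the status line, or when intervalMillis has passed since the last log.
    boolean shouldLog(String statusLine, long nowMillis) {
        boolean changed = !statusLine.equals(lastLine);
        boolean stale = nowMillis - lastLogMillis >= intervalMillis;
        if (changed || stale) {
            lastLine = statusLine;
            lastLogMillis = nowMillis;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        ProgressLogThrottle throttle = new ProgressLogThrottle(5000);
        String[] polled = {"RUNNING 0%", "RUNNING 0%", "RUNNING 50%"};
        long now = 0;
        for (String line : polled) {
            if (throttle.shouldLog(line, now)) {
                System.out.println(line);
            }
            now += 1000; // pretend the client polls once per second
        }
    }
}
```

Passing the clock value in as a parameter keeps the decision easy to test; a real client would presumably pass System.currentTimeMillis() from its polling loop.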
[jira] [Commented] (TEZ-1433) Invalid credentials can be used when a DAG is submitted to a session which has timed out
[ https://issues.apache.org/jira/browse/TEZ-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109672#comment-14109672 ] Siddharth Seth commented on TEZ-1433: - [~jeagles] - it seems like we have some races here, which could be triggered if we get AppState as RUNNING, just before the AM session times out. We're not really handling that at the moment - but should in the future. In such cases, I think it's better to roll back changes - rather than end up having one-off failures, which could potentially be really difficult to debug. DAGSubmissionTimedOut - is another case where the client may time out, for whatever reason, despite the AM still being alive. I think a fix which is going to work when the error reporting and timeouts are handled differently, would be better. Invalid credentials can be used when a DAG is submitted to a session which has timed out Key: TEZ-1433 URL: https://issues.apache.org/jira/browse/TEZ-1433 Project: Apache Tez Issue Type: Bug Affects Versions: 0.4.0 Reporter: Siddharth Seth Assignee: Jonathan Eagles Attachments: TEZ-1433-v1.patch When a DAG is submitted to a session which has timed out, and the same DAG is then submitted to a new session - credentials associated with the old session can end up getting used. Before we know that the session is no longer valid, the DAG is modified to add local resources and credentials. On the next submission, since the DAG already has tokens (for HDFS for example) from the old session, the tokens are not updated. Meanwhile, the old token would end up being cancelled by the RM - since the application associated with the previous session has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1471) Additional supplement for TEZ local mode document
[ https://issues.apache.org/jira/browse/TEZ-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109800#comment-14109800 ] Siddharth Seth commented on TEZ-1471: - +1. Thanks [~airbots] Additional supplement for TEZ local mode document - Key: TEZ-1471 URL: https://issues.apache.org/jira/browse/TEZ-1471 Project: Apache Tez Issue Type: Sub-task Affects Versions: 0.4.0 Reporter: Chen He Assignee: Chen He Attachments: TEZ-1471-2.patch, TEZ-1471-3.patch, TEZ-1471-4.patch, TEZ-1471-5.patch, TEZ-1471.patch some supplements for Local mode document -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1486) TezUncheckedException when using dynamic partition pruning
[ https://issues.apache.org/jira/browse/TEZ-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109823#comment-14109823 ] Bikas Saha commented on TEZ-1486: - Looks good. Some comments. Log message needs fixing. Please also add which events are not being routed. {code} LOG.info(Num Destination Tasks is . Not routing events);{code} The if stmt can be made common outside the existing if-else block {code}if (isDataMovementEvent) { DataMovementEvent dmEvent = (DataMovementEvent) tezEvent.getEvent(); if (routingRequired) { edgeManager.routeDataMovementEventToDestination(dmEvent, srcTaskIndex, dmEvent.getSourceIndex(), destTaskAndInputIndices); } } else { if (routingRequired) { edgeManager.routeInputSourceTaskFailedEventToDestination(srcTaskIndex, destTaskAndInputIndices); } }{code} Extra whitespace {code} break;{code} TezUncheckedException when using dynamic partition pruning -- Key: TEZ-1486 URL: https://issues.apache.org/jira/browse/TEZ-1486 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.0 Reporter: Gunther Hagleitner Assignee: Siddharth Seth Attachments: TEZ-1486.1.txt I'm working on using the AM event mechanism to dynamically prune partitions at DAG runtime for certain queries. The query is: select count(*) from srcpart join srcpart_double_hour on (srcpart.hr*2 = srcpart_double_hour.hr) where srcpart_double_hour.hour = 11; This will result in two vertices connected through a broadcast edge. The vertex prepares two things: The list of partition keys (hr) that are being sent to the AM for dynamic pruning and the records to be used in the hash join. The second vertex will block until all events are received (initializer) then it will load and process the hash join. It's possible for queries like this to result in zero splits on the second vertex (i.e.: no matching rows for the join) The exception I get when this is run is: org.apache.tez.dag.api.TezUncheckedException: Event must be routed. 
sourceVertex=vertex_1408686217936_0003_3_00 srcIndex = 0 destAttemptId=vertex_1408686217936_0003_3_01 edgeManager=org.apache.tez.dag.app.dag.impl.BroadcastEdgeManager Event type=DATA_MOVEMENT_EVENT at org.apache.tez.dag.app.dag.impl.Edge.sendTezEventToDestinationTasks(Edge.java:371) at org.apache.tez.dag.app.dag.impl.VertexImpl$RouteEventTransition.transition(VertexImpl.java:3372) at org.apache.tez.dag.app.dag.impl.VertexImpl.scheduleTasks(VertexImpl.java:1088) at org.apache.tez.dag.app.dag.impl.VertexManager$VertexManagerPluginContextImpl.scheduleVertexTasks(VertexManager.java:111) at org.apache.tez.dag.app.dag.impl.ImmediateStartVertexManager.onVertexStarted(ImmediateStartVertexManager.java:49) at org.apache.tez.dag.app.dag.impl.VertexManager.onVertexStarted(VertexManager.java:244) at org.apache.tez.dag.app.dag.impl.VertexImpl.startVertex(VertexImpl.java:2923) at org.apache.tez.dag.app.dag.impl.VertexImpl.access$5900(VertexImpl.java:169) at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:2914) at org.apache.tez.dag.app.dag.impl.VertexImpl$StartTransition.transition(VertexImpl.java:2906) at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385) at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302) at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1355) at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:168) at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1650) at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1636) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at 
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:695) -- This message was sent by Atlassian JIRA (v6.2#6252)
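The review suggestion in the TEZ-1486 comment above (check routingRequired once, outside the if-else on the event type) might look roughly like the following. This is a sketch under assumptions, not the patch itself: RoutingSketch, EdgeRouter and their methods are hypothetical stand-ins for the edge manager calls quoted in the review, and the rule that routing is skipped only when the destination vertex has zero tasks is an assumption taken from the zero-splits scenario in the issue description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the Tez Edge class.
final class RoutingSketch {
    private RoutingSketch() {}

    // Illustrative stand-in for the edge manager routing calls.
    interface EdgeRouter {
        void routeDataMovement(int srcTaskIndex, List<Integer> destTaskIndices);
        void routeSourceTaskFailed(int srcTaskIndex, List<Integer> destTaskIndices);
    }

    // Returns true if the event was handed to the router, false if it was skipped.
    static boolean route(EdgeRouter router, boolean isDataMovementEvent,
                         int numDestinationTasks, int srcTaskIndex,
                         List<Integer> destTaskIndices) {
        // Assumption: routing is unnecessary only when there are no destination tasks.
        boolean routingRequired = numDestinationTasks > 0;
        if (!routingRequired) {
            // Say which event is being dropped, as the review asks.
            System.out.println("Num destination tasks is 0. Not routing "
                + (isDataMovementEvent ? "data movement" : "source task failed")
                + " event from source task " + srcTaskIndex);
            return false;
        }
        // The single routingRequired check above replaces one check per branch.
        if (isDataMovementEvent) {
            router.routeDataMovement(srcTaskIndex, destTaskIndices);
        } else {
            router.routeSourceTaskFailed(srcTaskIndex, destTaskIndices);
        }
        return true;
    }

    public static void main(String[] args) {
        EdgeRouter printing = new EdgeRouter() {
            public void routeDataMovement(int src, List<Integer> dest) {
                System.out.println("routed data movement from " + src);
            }
            public void routeSourceTaskFailed(int src, List<Integer> dest) {
                System.out.println("routed failure from " + src);
            }
        };
        route(printing, true, 0, 0, new ArrayList<>());  // skipped, zero destination tasks
        route(printing, true, 2, 0, new ArrayList<>());  // routed
    }
}
```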
[jira] [Created] (TEZ-1496) Multi MR inputs can not be configured without accessing internal proto structures
Vikram Dixit K created TEZ-1496: --- Summary: Multi MR inputs can not be configured without accessing internal proto structures Key: TEZ-1496 URL: https://issues.apache.org/jira/browse/TEZ-1496 Project: Apache Tez Issue Type: Bug Affects Versions: 0.5.1 Reporter: Vikram Dixit K Priority: Blocker With all the new API changes, the multi-mr input can no longer be configured cleanly without accessing internal structures in tez. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TEZ-1497) Add tez-broadcast-example into tez-examples/
Gopal V created TEZ-1497: Summary: Add tez-broadcast-example into tez-examples/ Key: TEZ-1497 URL: https://issues.apache.org/jira/browse/TEZ-1497 Project: Apache Tez Issue Type: Bug Reporter: Gopal V Assignee: Gopal V Modify https://github.com/t3rmin4t0r/tez-broadcast-example into a usable example inside tez-examples. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109874#comment-14109874 ] Rajesh Balamohan commented on TEZ-1492: --- [~gopalv] BufferUtils is placed in o.a.h.io as it relies on FastByteComparisons which is not a public class in Hadoop. IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109874#comment-14109874 ] Rajesh Balamohan edited comment on TEZ-1492 at 8/25/14 10:28 PM: - [~gopalv] BufferUtils is placed in o.a.h.io as it relies on FastByteComparisons which is a package-local class in Hadoop. was (Author: rajesh.balamohan): [~gopalv] BufferUtils is placed in o.a.h.io as it relies on FastByteComparisons which is not a public class in Hadoop. IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1494) DAG hangs waiting for ShuffleManager.getNextInput()
[ https://issues.apache.org/jira/browse/TEZ-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109872#comment-14109872 ] Siddharth Seth commented on TEZ-1494: - [~rajesh.balamohan] - have you investigated this any further ? Were all the DataMovementEvents received, was task retry in play ? etc DAG hangs waiting for ShuffleManager.getNextInput() --- Key: TEZ-1494 URL: https://issues.apache.org/jira/browse/TEZ-1494 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1494-DAG.dot Attaching the DAG and the stack trace of the hung process. Thread 30071: (state = BLOCKED) - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame) - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, line=186 (Interpreted frame) - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() @bci=42, line=2043 (Interpreted frame) - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 (Interpreted frame) - org.apache.tez.runtime.library.shuffle.common.impl.ShuffleManager.getNextInput() @bci=67, line=610 (Interpreted frame) - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.moveToNextInput() @bci=26, line=176 (Interpreted frame) - org.apache.tez.runtime.library.common.readers.UnorderedKVReader.next() @bci=30, line=117 (Interpreted frame) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1476) DAGClient waitForCompletion output is confusing
[ https://issues.apache.org/jira/browse/TEZ-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109873#comment-14109873 ] Jonathan Eagles commented on TEZ-1476: -- Thanks [~sseth] and [~zjffdu] for the reviews. I committed this to master and branch-0.5 DAGClient waitForCompletion output is confusing --- Key: TEZ-1476 URL: https://issues.apache.org/jira/browse/TEZ-1476 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Assignee: Jonathan Eagles Priority: Critical Attachments: TEZ-1476-v1.patch, TEZ-1476-v2.patch When a DAG is submitted - 2014-08-21 16:38:06,153 INFO [main] rpc.DAGClientRPCImpl (DAGClientRPCImpl.java:log(428)) - Waiting for DAG to start running is logged. After this, nothing seems to get logged till the first task completes. It would be useful to log when the state changes to RUNNING - as well as at least one line stating the number of tasks, etc (0% progress line). Also, progress could be logged every few seconds irrespective of whether it has changed or not to give the impression that the job has not just gotten stuck. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1044) Need to consider map/reduce java.opts values for container reuse
[ https://issues.apache.org/jira/browse/TEZ-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109951#comment-14109951 ] Hitesh Shah commented on TEZ-1044: -- [~yeshavora] Please provide more details on this issue. Need to consider map/reduce java.opts values for container reuse Key: TEZ-1044 URL: https://issues.apache.org/jira/browse/TEZ-1044 Project: Apache Tez Issue Type: Improvement Reporter: Yesha Vora Currently, mapreduce.map.java.opts and mapreduce.reduce.java.opts are not being considered for container reuse. These properties should be considered while reusing containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1496) Multi MR inputs can not be configured without accessing internal proto structures
[ https://issues.apache.org/jira/browse/TEZ-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1496: - Target Version/s: 0.5.0 Multi MR inputs can not be configured without accessing internal proto structures - Key: TEZ-1496 URL: https://issues.apache.org/jira/browse/TEZ-1496 Project: Apache Tez Issue Type: Bug Reporter: Vikram Dixit K Priority: Blocker With all the new API changes, the multi-mr input can no longer be configured cleanly without accessing internal structures in tez. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1496) Multi MR inputs can not be configured without accessing internal proto structures
[ https://issues.apache.org/jira/browse/TEZ-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14109960#comment-14109960 ] Hitesh Shah commented on TEZ-1496: -- [~vikram.dixit] Can you provide more details on what you are trying to configure and what APIs are lacking? Multi MR inputs can not be configured without accessing internal proto structures - Key: TEZ-1496 URL: https://issues.apache.org/jira/browse/TEZ-1496 Project: Apache Tez Issue Type: Bug Reporter: Vikram Dixit K Priority: Blocker With all the new API changes, the multi-mr input can no longer be configured cleanly without accessing internal structures in tez. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1496) Multi MR inputs can not be configured without accessing internal proto structures
[ https://issues.apache.org/jira/browse/TEZ-1496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated TEZ-1496: - Affects Version/s: (was: 0.5.1) Multi MR inputs can not be configured without accessing internal proto structures - Key: TEZ-1496 URL: https://issues.apache.org/jira/browse/TEZ-1496 Project: Apache Tez Issue Type: Bug Reporter: Vikram Dixit K Priority: Blocker With all the new API changes, the multi-mr input can no longer be configured cleanly without accessing internal structures in tez. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1476) DAGClient waitForCompletion output is confusing
[ https://issues.apache.org/jira/browse/TEZ-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Seth updated TEZ-1476: Fix Version/s: 0.5.1 DAGClient waitForCompletion output is confusing --- Key: TEZ-1476 URL: https://issues.apache.org/jira/browse/TEZ-1476 Project: Apache Tez Issue Type: Bug Reporter: Siddharth Seth Assignee: Jonathan Eagles Priority: Critical Fix For: 0.5.1 Attachments: TEZ-1476-v1.patch, TEZ-1476-v2.patch When a DAG is submitted - 2014-08-21 16:38:06,153 INFO [main] rpc.DAGClientRPCImpl (DAGClientRPCImpl.java:log(428)) - Waiting for DAG to start running is logged. After this, nothing seems to get logged till the first task completes. It would be useful to log when the state changes to RUNNING - as well as at least one line stating the number of tasks, etc (0% progress line). Also, progress could be logged every few seconds irrespective of whether it has changed or not to give the impression that the job has not just gotten stuck. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1490) dagid reported is incorrect in TezClient.java
[ https://issues.apache.org/jira/browse/TEZ-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110023#comment-14110023 ] Jonathan Eagles commented on TEZ-1490: -- [~bikassaha], could you give a review and provide input for target version as well? dagid reported is incorrect in TezClient.java - Key: TEZ-1490 URL: https://issues.apache.org/jira/browse/TEZ-1490 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Jonathan Eagles Attachments: TEZ-1490-v1.patch The format used to get the dagid and appid in TezClient.java does not match the one used in TezDagId.java. ex. TezClient.java reports dagid as dag_1408740248751_3_01 The dagid as reported in logs is dag_1408740248751_0003_1 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1497) Add tez-broadcast-example into tez-examples/
[ https://issues.apache.org/jira/browse/TEZ-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110042#comment-14110042 ] Bikas Saha commented on TEZ-1497: - The JoinExample in tez-examples is already illustrating the broadcast edge with a broadcast join user story. Does that suffice or is this example illustrating some other concepts? Add tez-broadcast-example into tez-examples/ Key: TEZ-1497 URL: https://issues.apache.org/jira/browse/TEZ-1497 Project: Apache Tez Issue Type: Bug Reporter: Gopal V Assignee: Gopal V Modify https://github.com/t3rmin4t0r/tez-broadcast-example into a usable example inside tez-examples. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1433) Invalid credentials can be used when a DAG is submitted to a session which has timed out
[ https://issues.apache.org/jira/browse/TEZ-1433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110053#comment-14110053 ] Jonathan Eagles commented on TEZ-1433: -- A true rollback can be difficult to implement (how to distinguish user changes vs client changes) without significant instrumentation or maintaining a copy of the unmodified DAG submitted. [~sseth], how would you feel if we change to submitting a copy of the client DAG? In that case it becomes safe to submit a DAG multiple times per session and even across sessions without any side effects. Invalid credentials can be used when a DAG is submitted to a session which has timed out Key: TEZ-1433 URL: https://issues.apache.org/jira/browse/TEZ-1433 Project: Apache Tez Issue Type: Bug Affects Versions: 0.4.0 Reporter: Siddharth Seth Assignee: Jonathan Eagles Attachments: TEZ-1433-v1.patch When a DAG is submitted to a session which has timed out, and the same DAG is then submitted to a new session - credentials associated with the old session can end up getting used. Before we know that the session is no longer valid, the DAG is modified to add local resources and credentials. On the next submission, since the DAG already has tokens (for HDFS for example) from the old session, the tokens are not updated. Meanwhile, the old token would end up being cancelled by the RM - since the application associated with the previous session has finished. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1490) dagid reported is incorrect in TezClient.java
[ https://issues.apache.org/jira/browse/TEZ-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110060#comment-14110060 ] Bikas Saha commented on TEZ-1490: - Clearly, the DO NOT CHANGE this comment did not help in maintaining the sync between the 2 identical codes. If we can make this common in the API project (and reference from the DAG project) that would be ideal. Otherwise the patch looks good to me as is. Given that the current RC is cancelled, we should probably bring this change all the way into 0.5.0. dagid reported is incorrect in TezClient.java - Key: TEZ-1490 URL: https://issues.apache.org/jira/browse/TEZ-1490 Project: Apache Tez Issue Type: Bug Reporter: Prakash Ramachandran Assignee: Jonathan Eagles Attachments: TEZ-1490-v1.patch The format used to get the dagid and appid in TezClient.java does not match the one used in TezDagId.java. ex. TezClient.java reports dagid as dag_1408740248751_3_01 The dagid as reported in logs is dag_1408740248751_0003_1 -- This message was sent by Atlassian JIRA (v6.2#6252)
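To make the TEZ-1490 mismatch concrete, here is a small sketch of a single shared formatter of the kind suggested in the comment above. It is an illustration only: DagIdFormat is a hypothetical class, not the TezDAGID implementation, and the padding widths (application sequence number padded to at least four digits, DAG index unpadded) are inferred from the two example ids quoted in the issue.

```java
import java.util.Locale;

// Hypothetical shared helper, not Tez code: one place that defines the
// textual form of a DAG id, so client and AM cannot drift apart.
final class DagIdFormat {
    private DagIdFormat() {}

    // Assumed layout, inferred from the log example dag_1408740248751_0003_1.
    static String format(long clusterTimestamp, int appSequence, int dagIndex) {
        return String.format(Locale.ROOT, "dag_%d_%04d_%d",
            clusterTimestamp, appSequence, dagIndex);
    }

    public static void main(String[] args) {
        // Prints dag_1408740248751_0003_1, the form reported in the logs.
        System.out.println(format(1408740248751L, 3, 1));
    }
}
```

With the padding widths swapped (no padding on the application sequence number, two digits on the DAG index) the same inputs would give dag_1408740248751_3_01, which is the form the issue says TezClient.java reports.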
[jira] [Commented] (TEZ-1430) Javadoc generation should not generate docs for classes annotated as private
[ https://issues.apache.org/jira/browse/TEZ-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110226#comment-14110226 ] Jonathan Eagles commented on TEZ-1430: -- [~hitesh], can you please review as you added the original javadoc generation to tez? Javadoc generation should not generate docs for classes annotated as private Key: TEZ-1430 URL: https://issues.apache.org/jira/browse/TEZ-1430 Project: Apache Tez Issue Type: Improvement Reporter: Siddharth Seth Assignee: Jonathan Eagles Attachments: TEZ-1430-v2.patch mvn javadoc:javadoc generates javadoc for everything. Haven't tried mvn site though. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1492: -- Attachment: TEZ-1492.3.patch Moved BufferUtils and FastByteComparisons to o.a.t.runtime.library.utils in the latest patch. IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch, TEZ-1492.3.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1493) Tez examples fail in recovery sometimes
[ https://issues.apache.org/jira/browse/TEZ-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-1493: Summary: Tez examples fail in recovery sometimes (was: WordCount example fails in recovery) Tez examples fail in recovery sometimes --- Key: TEZ-1493 URL: https://issues.apache.org/jira/browse/TEZ-1493 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Blocker Attachments: Tez-1493.patch {code} 14/08/25 17:37:03 INFO client.TezClient: Submitting DAG to YARN, applicationId=application_1408499461970_0053, dagName=WordCount 14/08/25 17:37:03 INFO impl.YarnClientImpl: Submitted application application_1408499461970_0053 14/08/25 17:37:03 INFO client.TezClient: The url to track the Tez AM: http://jzhangMBPr.local:8088/proxy/application_1408499461970_0053/ 14/08/25 17:37:03 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032 14/08/25 17:37:03 INFO client.AHSProxy: Connecting to Application History server at /0.0.0.0:10200 14/08/25 17:37:03 INFO rpc.DAGClientRPCImpl: Waiting for DAG to start running 14/08/25 17:37:07 INFO rpc.DAGClientRPCImpl: DAG: State: RUNNING Progress: 0% TotalTasks: 2 Succeeded: 0 Running: 0 Failed: 0 Killed: 0 14/08/25 17:37:15 INFO rpc.DAGClientRPCImpl: DAG: State: RUNNING Progress: 50% TotalTasks: 2 Succeeded: 1 Running: 0 Failed: 0 Killed: 0 14/08/25 17:37:17 INFO rpc.DAGClientRPCImpl: DAG completed. FinalState=SUBMITTED WordCount failed with diagnostics: [] {code} The client reports that the job failed, but the logs show that recovery works on the server side and that the job eventually finishes successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (TEZ-1498) Usage info is not printed when error number of arguments for JoinExample
Jeff Zhang created TEZ-1498: --- Summary: Usage info is not printed when error number of arguments for JoinExample Key: TEZ-1498 URL: https://issues.apache.org/jira/browse/TEZ-1498 Project: Apache Tez Issue Type: Bug Reporter: Jeff Zhang Assignee: Jeff Zhang -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14110264#comment-14110264 ] Gopal V commented on TEZ-1492: -- Thanks [~rajesh.balamohan], can I have the same diff with git mv instead of the big change-sets? IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch, TEZ-1492.3.patch, TEZ-1492.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (TEZ-1492) IFile RLE not kicking in due to bug in BufferUtils.compare()
[ https://issues.apache.org/jira/browse/TEZ-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-1492: -- Attachment: TEZ-1492.4.patch No code change (but regenerated the patch with git diff -M --no-prefix HEAD) IFile RLE not kicking in due to bug in BufferUtils.compare() Key: TEZ-1492 URL: https://issues.apache.org/jira/browse/TEZ-1492 Project: Apache Tez Issue Type: Bug Reporter: Rajesh Balamohan Assignee: Rajesh Balamohan Labels: performance Attachments: TEZ-1492.1.patch, TEZ-1492.2.patch, TEZ-1492.3.patch, TEZ-1492.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (TEZ-1497) Add tez-broadcast-example into tez-examples/
[ https://issues.apache.org/jira/browse/TEZ-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gopal V resolved TEZ-1497. -- Resolution: Not a Problem Add tez-broadcast-example into tez-examples/ Key: TEZ-1497 URL: https://issues.apache.org/jira/browse/TEZ-1497 Project: Apache Tez Issue Type: Bug Reporter: Gopal V Assignee: Gopal V Modify https://github.com/t3rmin4t0r/tez-broadcast-example into a usable example inside tez-examples. -- This message was sent by Atlassian JIRA (v6.2#6252)