[jira] [Commented] (TEZ-3096) Statemachine: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS fails
[ https://issues.apache.org/jira/browse/TEZ-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184580#comment-15184580 ]

Zhiyuan Yang commented on TEZ-3096:
---

Sorry, it may take some time for me to figure it out because I'm still a beginner with Tez and I'm not familiar with Hive either.

> Statemachine: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS fails
> ---
>
> Key: TEZ-3096
> URL: https://issues.apache.org/jira/browse/TEZ-3096
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Gopal V
> Assignee: Zhiyuan Yang
>
> Tasks are failing exactly 300ms into running due to a FileSystem error.
> {code}
> 2016-02-04 05:05:56,853 [ERROR] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: Can't handle this event at current state for attempt_1454544113740_0027_1_00_03_3
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:795)
>     at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:120)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:2180)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:2165)
>     at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>     at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
>     at java.lang.Thread.run(Thread.java:745)
> 2016-02-04 05:05:56,903 [ERROR] [Dispatcher thread {Central}] |impl.TaskAttemptImpl|: Can't handle this event at current state for attempt_1454544113740_0027_1_00_00_3
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS
>     at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>     at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>     at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>     at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:795)
>     at org.apache.tez.dag.app.dag.impl.TaskAttemptImpl.handle(TaskAttemptImpl.java:120)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:2180)
>     at org.apache.tez.dag.app.DAGAppMaster$TaskAttemptEventDispatcher.handle(DAGAppMaster.java:2165)
>     at org.apache.tez.common.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
>     at org.apache.tez.common.AsyncDispatcher$1.run(AsyncDispatcher.java:114)
>     at java.lang.Thread.run(Thread.java:745)
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
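The trace in the issue description shows a table-driven state machine rejecting TA_TEZ_EVENT_UPDATE because no transition is registered for it in the KILL_IN_PROGRESS state. As a rough illustration of that failure mode, and of one common remedy (registering a self-transition so that late events are ignored), here is a minimal sketch. It is not Tez's TaskAttemptImpl, not the YARN StateMachineFactory API, and not the fix from any attached patch: the class `TinyStateMachine` and its methods are hypothetical, and only the state and event names that appear in the log are taken from the issue.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of a table-driven state machine.
// Only KILL_IN_PROGRESS and TA_TEZ_EVENT_UPDATE come from the log;
// the other names are illustrative.
class TinyStateMachine {
    enum State { RUNNING, KILL_IN_PROGRESS, KILLED }
    enum Event { TA_TEZ_EVENT_UPDATE, KILL_REQUESTED, CONTAINER_GONE }

    // Transition table: for each state, the events it accepts and the next state.
    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current;

    TinyStateMachine(State initial) {
        this.current = initial;
    }

    // Registers a transition and returns this machine so calls can be chained.
    TinyStateMachine addTransition(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<>(Event.class)).put(on, to);
        return this;
    }

    // Applies the event, or fails if the current state has no entry for it.
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = row.get(event);
        return current;
    }

    public static void main(String[] args) {
        // With a self-transition registered, a late update is simply absorbed.
        TinyStateMachine sm = new TinyStateMachine(State.KILL_IN_PROGRESS)
            .addTransition(State.KILL_IN_PROGRESS, Event.TA_TEZ_EVENT_UPDATE, State.KILL_IN_PROGRESS)
            .addTransition(State.KILL_IN_PROGRESS, Event.CONTAINER_GONE, State.KILLED);
        System.out.println(sm.handle(Event.TA_TEZ_EVENT_UPDATE));
        System.out.println(sm.handle(Event.CONTAINER_GONE));
    }
}
```

Whether ignoring the event is the right behavior for Tez is a separate question that the issue itself has to settle; the sketch only shows why an unregistered (state, event) pair surfaces as an exception on the dispatcher thread.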
[jira] [Commented] (TEZ-3096) Statemachine: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS fails
[ https://issues.apache.org/jira/browse/TEZ-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184219#comment-15184219 ]

Zhiyuan Yang commented on TEZ-3096:
---

No problem. I'll look at your JIRA soon and give you the feedback.
[jira] [Comment Edited] (TEZ-3145) Reduce message size when empty partitions is high
[ https://issues.apache.org/jira/browse/TEZ-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180722#comment-15180722 ]

Jonathan Eagles edited comment on TEZ-3145 at 3/8/16 12:26 AM:
---

The slim DME idea is to only send the empty partition across in the DME if the destination is going to read the source partition.

was (Author: jeagles):
The SLIM DME idea is to only send the empty partition across in the DME if the destination is going to read the source partition.

> Reduce message size when empty partitions is high
> ---
>
> Key: TEZ-3145
> URL: https://issues.apache.org/jira/browse/TEZ-3145
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Attachments: TEZ-3145.2.patch

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
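One possible reading of the "slim DME" comment is that the empty-partition information carried in an event is filtered per destination, so that a destination only receives the bits for partitions it will actually read. The sketch below illustrates that reading with plain `java.util.BitSet` operations. It is an assumption about the idea, not the contents of TEZ-3145.2.patch, and `SlimEmptyPartitionSketch` and its method are hypothetical names that do not exist in Tez.

```java
import java.util.BitSet;

// Hypothetical sketch: none of these names come from Tez.
class SlimEmptyPartitionSketch {
    /**
     * Returns the empty-partition bits worth sending to one destination:
     * only partitions that are both empty and read by that destination.
     * Neither input is modified.
     */
    static BitSet emptyPartitionsForDestination(BitSet emptyPartitions, BitSet readByDestination) {
        BitSet relevant = (BitSet) emptyPartitions.clone();
        relevant.and(readByDestination); // keep only the intersection
        return relevant;
    }

    public static void main(String[] args) {
        BitSet empty = new BitSet();
        empty.set(0);
        empty.set(2);
        empty.set(5);
        BitSet read = new BitSet();
        read.set(2);
        read.set(3);
        System.out.println(emptyPartitionsForDestination(empty, read));
    }
}
```

When most partitions are empty but each destination reads only a few of them, the filtered bit set is much smaller than the full one, which is the kind of message-size reduction the issue title describes.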
[jira] [Updated] (TEZ-3145) Reduce message size when empty partitions is high
[ https://issues.apache.org/jira/browse/TEZ-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Eagles updated TEZ-3145:
---
Attachment: TEZ-3145.2.patch
[jira] [Updated] (TEZ-3145) Reduce message size when empty partitions is high
[ https://issues.apache.org/jira/browse/TEZ-3145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Eagles updated TEZ-3145:
---
Attachment: (was: TEZ-3145.SLIM-DME.patch)
[jira] [Commented] (TEZ-2863) Container, node, and logs not available in UI for tasks that fail to launch
[ https://issues.apache.org/jira/browse/TEZ-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184052#comment-15184052 ]

Hitesh Shah commented on TEZ-2863:
---

+1.

> Container, node, and logs not available in UI for tasks that fail to launch
> ---
>
> Key: TEZ-2863
> URL: https://issues.apache.org/jira/browse/TEZ-2863
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Attachments: TEZ-2863.1.patch, TEZ-2863.2-branch-0.7.patch, TEZ-2863.2.patch, TEZ-2863.3-branch-0.7.patch, TEZ-2863.3-branch-0.7.patch.addendum, TEZ-2863.3.patch, TEZ-2863.3.patch.addendum, TEZ-2863.4-branch-0.7.patch, TEZ-2863.4.patch, TEZ-2863.5-branch-0.7.patch, TEZ-2863.5.patch
>
> While running a sample tez job
> {noformat}
> tez-examples-*.jar orderedwordcount -Dtez.task.resource.memory.mb=1 -Dtez.task.launch.cmd-opts="-Xmx1m" input output
> {noformat}
> It was noticed that the Tez UI task attempt http://timelineserverhost:port/ws/v1/timeline/TEZ_TASK_ATTEMPT_ID/attempt_id was missing the TEZ_ATTEMPT_STARTED event
> {noformat}
> 2015-10-01 10:03:55,344 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_STARTED]: vertexName=Tokenizer, taskId=task_1443711816411_0001_1_00_00, scheduledTime=1443711835342, launchTime=1443711835342
> 2015-10-01 10:03:55,346 [INFO] [Dispatcher thread {Central}] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:55,356 [INFO] [TaskSchedulerEventHandlerThread] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:55,364 [INFO] [TaskSchedulerEventHandlerThread] |rm.YarnTaskSchedulerService|: Allocation request for task: attempt_1443711816411_0001_1_00_00_0 with request: Capability[ vCores:1>]Priority[2] host: localhost rack: null
> 2015-10-01 10:03:56,639 [INFO] [AMRM Heartbeater thread] |impl.AMRMClientImpl|: Received new token for : localhost:57381
> 2015-10-01 10:03:56,646 [INFO] [AMRM Callback Handler Thread] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:56,648 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: Assigning container to task: containerId=container_1443711816411_0001_01_02, task=attempt_1443711816411_0001_1_00_00_0, containerHost=localhost:57381, containerPriority= 2, containerResources=, localityMatchType=NodeLocal, matchedLocation=localhost, honorLocalityFlags=true, reusedContainer=false, delayedContainers=0
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:56,686 [INFO] [TaskSchedulerAppCaller #0] |node.AMNodeTracker|: Adding new node: localhost:57381
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |launcher.ContainerLauncherImpl|: Launching container_1443711816411_0001_01_02
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |impl.ContainerManagementProtocolProxy|: Opening proxy : localhost:57381
> 2015-10-01 10:03:56,741 [INFO] [ContainerLauncher #0] |history.HistoryEventHandler|: [HISTORY][DAG:N/A][Event:CONTAINER_LAUNCHED]: containerId=container_1443711816411_0001_01_02, launchTime=1443711836741
> 2015-10-01 10:03:57,647 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated container completed:container_1443711816411_0001_01_02 last allocated to task: attempt_1443711816411_0001_1_00_00_0
> 2015-10-01 10:03:57,648 [INFO] [Dispatcher thread {Central}] |container.AMContainerImpl|: Container container_1443711816411_0001_01_02 exited with diagnostics set to Container failed, exitCode=1. Exception from container-launch.
> Container id: container_1443711816411_0001_01_02
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.j
[jira] [Commented] (TEZ-3096) Statemachine: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS fails
[ https://issues.apache.org/jira/browse/TEZ-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184011#comment-15184011 ]

Tsuyoshi Ozawa commented on TEZ-3096:
---

I've also uploaded the patch, so I'd appreciate it if you could take a look. Thanks!
[jira] [Commented] (TEZ-3096) Statemachine: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS fails
[ https://issues.apache.org/jira/browse/TEZ-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184009#comment-15184009 ]

Tsuyoshi Ozawa commented on TEZ-3096:
---

[~gopalv] [~aplusplus] could this be a duplicate of TEZ-3148?
[jira] [Commented] (TEZ-3155) Support a way to submit DAGs to a session where the DAG plan exceeds hadoop ipc limits
[ https://issues.apache.org/jira/browse/TEZ-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183558#comment-15183558 ]

Hitesh Shah commented on TEZ-3155:
---

bq. this addition seems to have no relation to the proto being modified - why was this needed?

Please ignore. Surprising that findbugs has not reported this earlier for other patches.

> Support a way to submit DAGs to a session where the DAG plan exceeds hadoop ipc limits
> ---
>
> Key: TEZ-3155
> URL: https://issues.apache.org/jira/browse/TEZ-3155
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Hitesh Shah
> Assignee: Zhiyuan Yang
> Attachments: TEZ-3155.1.patch, TEZ-3155.2.patch, TEZ-3155.3.patch, TEZ-3155.4.patch
>
> Currently, dag submissions fail if the dag plan exceeds the hadoop ipc limits. One option would be to fall back to local resources if the dag plan is too large.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-3155) Support a way to submit DAGs to a session where the DAG plan exceeds hadoop ipc limits
[ https://issues.apache.org/jira/browse/TEZ-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183553#comment-15183553 ]

Hitesh Shah commented on TEZ-3155:
---

Thanks for addressing the previous comments. Some more comments based on patch 4:

{code}
47
48
49
50
{code}
- this addition seems to have no relation to the proto being modified - why was this needed?

TezClient:

{code}
private FileSystem fs = null;
{code}
- rename this to something like stagingFs. Also, this should be initialized once in init() and re-used.

{code}
private static final int gapToMaxIPCSize = 5 * 1024 * 1024;
private AtomicInteger serializedDAGPlanCounter = new AtomicInteger(0);
{code}
- the above need code comments to describe what the vars are.
- might be good to make gapToMaxIPCSize configurable (with a default of 5 MB). Mark the new config property as Private though.

{code}
dagClientConf.getInt(CommonConfigurationKeys.IPC_MAXIMUM_DATA_LENGTH,
    CommonConfigurationKeys.IPC_MAXIMUM_DATA_LENGTH_DEFAULT)) {
{code}
- this should be a class member var and initialized once. Also, it should use the main tezconf and not dagclientconf.

{code}
TezConfiguration tezConf = amConfig.getTezConfiguration();
{code}
- no need to create an extra local var. Just use "amConfig.getTezConfiguration()" directly.

{code}
/* we need manually delete the serialized dagplan since staging path here won't be destroyed */
Path dagPlanPath = new Path(request.getSerializedDagPlanPath());
FileSystem fs = dagPlanPath.getFileSystem(conf);
fs.delete(dagPlanPath, false);
{code}
- this is not reliable if there is a test failure or an exception is thrown
- staging dir should be set to target and also use the local fs
- using the local fs could be done by having a package-private method to override the stagingFs in TezClient with the value of FileSystem::getLocal
- for the dag plan file, use deleteOnExit()

TestTezClient:

{code}
int maxIPCMsgSize = 1024;
conf.setInt(CommonConfigurationKeys.IPC_MAXIMUM_DATA_LENGTH, maxIPCMsgSize);
processorDescriptor.setUserPayload(UserPayload.create(ByteBuffer.allocate(2*maxIPCMsgSize)));
{code}
- processorDescriptor.setUserPayload() is not being invoked for the largeDagPlan false case?
- shouldn't it always be set to, say, 2 MB in both scenarios, with the max limit changed to 1 MB in one scenario and, say, 8 MB (+5 for the overhead check) in the other scenario? This can be played around with to address my following comments on the buffer and additional resources checks.
- how is the 5 MB buffer check being tested?
- also, there is no test for the case where additionalResources (or a combination of dag plan + additional rsrcs) exceeds ipc limits?

DAGClientAMProtocolBlockingPBServerImpl:
- fs can be initialized in the ctor itself

{code}
try (FSDataInputStream fsDataInputStream = fs.open(requestPath)) {
173   dagPlan = DAGPlan.parseFrom(fsDataInputStream);
174 } catch (IOException e) {
175   throw wrapException(e);
176 }
{code}
- won't the exception thrown in line 173 be caught by the catch in line 186?

testSubmitDagInSessionWithLargeDagPlan
- test could be enhanced to verify the payload contents after deserialization
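The size check discussed in the review (compare the serialized plan plus a safety gap against the IPC limit, and fall back to a file when it does not fit) can be sketched without any Hadoop or Tez dependencies. Everything below is hypothetical and only illustrates the arithmetic: `PlanSubmissionSketch`, `exceedsIpcBudget` and `spillIfTooLarge` are made-up names, the limit and the gap are passed in as parameters rather than read from configuration, and a local temp file stands in for whatever staging file system the real patch uses.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical sketch; not the TezClient code from the patch under review.
class PlanSubmissionSketch {
    /** True when the plan plus the reserved headroom does not fit in one IPC message. */
    static boolean exceedsIpcBudget(long planBytes, long maxIpcBytes, long headroomBytes) {
        return planBytes + headroomBytes > maxIpcBytes;
    }

    /**
     * Writes the plan to a temp file under stagingDir when it is too large to send inline.
     * Returns the file path, or empty when the plan can be sent inline.
     */
    static Optional<Path> spillIfTooLarge(byte[] serializedPlan, long maxIpcBytes,
                                          long headroomBytes, Path stagingDir) throws IOException {
        if (!exceedsIpcBudget(serializedPlan.length, maxIpcBytes, headroomBytes)) {
            return Optional.empty();
        }
        Path planFile = Files.createTempFile(stagingDir, "dagplan-", ".bin");
        planFile.toFile().deleteOnExit(); // best-effort cleanup, as suggested in the review
        Files.write(planFile, serializedPlan);
        return Optional.of(planFile);
    }

    public static void main(String[] args) throws IOException {
        long mb = 1024L * 1024L;
        Path staging = Files.createTempDirectory("staging-");
        staging.toFile().deleteOnExit();
        byte[] plan = new byte[(int) (2 * mb)];
        // 2 MB plan, 1 MB limit: does not fit, so it is spilled to a file.
        System.out.println(spillIfTooLarge(plan, 1 * mb, 0, staging).isPresent());
        // 2 MB plan, 8 MB limit, 5 MB headroom: 7 MB fits, so it stays inline.
        System.out.println(spillIfTooLarge(plan, 8 * mb, 5 * mb, staging).isPresent());
    }
}
```

The two calls in `main` mirror the test scenarios proposed in the review: a 2 MB payload against a 1 MB limit, and the same payload against an 8 MB limit with 5 MB reserved for overhead.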
[jira] [Commented] (TEZ-2954) Container launch timeouts should count towards node blacklisting
[ https://issues.apache.org/jira/browse/TEZ-2954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183434#comment-15183434 ]

Siddharth Seth commented on TEZ-2954:
---

[~ozawa] - I'll try looking at the patch by the end of the week.

> Container launch timeouts should count towards node blacklisting
> ---
>
> Key: TEZ-2954
> URL: https://issues.apache.org/jira/browse/TEZ-2954
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Siddharth Seth
> Assignee: Tsuyoshi Ozawa
> Attachments: TEZ-2954.001.patch
>
> Currently, only task failures count towards blacklisting. A container timing out should do the same.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
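The behavior requested in the issue description (container launch timeouts feeding the same per-node failure count as task failures) can be illustrated with a small counter. This is only a sketch under assumptions: `NodeBlacklistSketch`, its methods and the fixed threshold are hypothetical and do not reflect how Tez tracks nodes or what TEZ-2954.001.patch does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not Tez's node tracking or blacklisting code.
class NodeBlacklistSketch {
    private final int maxFailuresPerNode;
    private final Map<String, Integer> failuresByNode = new HashMap<>();

    NodeBlacklistSketch(int maxFailuresPerNode) {
        this.maxFailuresPerNode = maxFailuresPerNode;
    }

    /** Records a task failure; returns true if the node is now blacklisted. */
    boolean recordTaskFailure(String nodeId) {
        return recordFailure(nodeId);
    }

    /** Records a container launch timeout; it counts exactly like a task failure. */
    boolean recordLaunchTimeout(String nodeId) {
        return recordFailure(nodeId);
    }

    boolean isBlacklisted(String nodeId) {
        return failuresByNode.getOrDefault(nodeId, 0) >= maxFailuresPerNode;
    }

    // Shared counter so both failure kinds move a node towards the threshold.
    private boolean recordFailure(String nodeId) {
        int count = failuresByNode.merge(nodeId, 1, Integer::sum);
        return count >= maxFailuresPerNode;
    }

    public static void main(String[] args) {
        NodeBlacklistSketch tracker = new NodeBlacklistSketch(2);
        System.out.println(tracker.recordTaskFailure("node-1"));
        System.out.println(tracker.recordLaunchTimeout("node-1"));
    }
}
```

Routing both events through one private method keeps the threshold logic in a single place, so a node that never runs a task because its containers keep timing out still reaches the limit.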
[jira] [Commented] (TEZ-3096) Statemachine: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS fails
[ https://issues.apache.org/jira/browse/TEZ-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183340#comment-15183340 ]

Zhiyuan Yang commented on TEZ-3096:
---

I would like to take this task. If there is no problem, I will assign this JIRA to myself.
[jira] [Assigned] (TEZ-3096) Statemachine: TA_TEZ_EVENT_UPDATE at KILL_IN_PROGRESS fails
[ https://issues.apache.org/jira/browse/TEZ-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zhiyuan Yang reassigned TEZ-3096:
---
Assignee: Zhiyuan Yang
[jira] [Updated] (TEZ-3140) Reduce AM memory usage while serialization
[ https://issues.apache.org/jira/browse/TEZ-3140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated TEZ-3140:
---
Attachment: TEZ-3140-3.patch

Committed to branch-0.7 and master. Attaching the final patch TEZ-3140-3.patch which moves the test to TestEntityDescriptor.java. Thanks [~sseth] for the review.

> Reduce AM memory usage while serialization
> ---
>
> Key: TEZ-3140
> URL: https://issues.apache.org/jira/browse/TEZ-3140
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Assignee: Rohini Palaniswamy
> Fix For: 0.7.1, 0.8.3
> Attachments: TEZ-3140-1.patch, TEZ-3140-2.patch, TEZ-3140-3.patch
>
> There is an unnecessary copy of userpayload byte array during serialization.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2863) Container, node, and logs not available in UI for tasks that fail to launch
[ https://issues.apache.org/jira/browse/TEZ-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183162#comment-15183162 ]

Jonathan Eagles commented on TEZ-2863:
--

[~hitesh], can you have another look at this patch?

> Container, node, and logs not available in UI for tasks that fail to launch
> ---
>
> Key: TEZ-2863
> URL: https://issues.apache.org/jira/browse/TEZ-2863
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Jonathan Eagles
> Assignee: Jonathan Eagles
> Attachments: TEZ-2863.1.patch, TEZ-2863.2-branch-0.7.patch, TEZ-2863.2.patch, TEZ-2863.3-branch-0.7.patch, TEZ-2863.3-branch-0.7.patch.addendum, TEZ-2863.3.patch, TEZ-2863.3.patch.addendum, TEZ-2863.4-branch-0.7.patch, TEZ-2863.4.patch, TEZ-2863.5-branch-0.7.patch, TEZ-2863.5.patch
>
>
> While running a sample tez job
> {noformat}
> tez-examples-*.jar orderedwordcount -Dtez.task.resource.memory.mb=1 -Dtez.task.launch.cmd-opts="-Xmx1m" input output
> {noformat}
> It was noticed that the Tez UI task attempt http://timelineserverhost:port/ws/v1/timeline/TEZ_TASK_ATTEMPT_ID/attempt_id was missing the TEZ_ATTEMPT_STARTED event
> {noformat}
> 2015-10-01 10:03:55,344 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1443711816411_0001_1][Event:TASK_STARTED]: vertexName=Tokenizer, taskId=task_1443711816411_0001_1_00_00, scheduledTime=1443711835342, launchTime=1443711835342
> 2015-10-01 10:03:55,346 [INFO] [Dispatcher thread {Central}] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:55,356 [INFO] [TaskSchedulerEventHandlerThread] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:55,364 [INFO] [TaskSchedulerEventHandlerThread] |rm.YarnTaskSchedulerService|: Allocation request for task: attempt_1443711816411_0001_1_00_00_0 with request: Capability[ vCores:1>]Priority[2] host: localhost rack: null
> 2015-10-01 10:03:56,639 [INFO] [AMRM Heartbeater thread] |impl.AMRMClientImpl|: Received new token for : localhost:57381
> 2015-10-01 10:03:56,646 [INFO] [AMRM Callback Handler Thread] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:56,648 [INFO] [DelayedContainerManager] |rm.YarnTaskSchedulerService|: Assigning container to task: containerId=container_1443711816411_0001_01_02, task=attempt_1443711816411_0001_1_00_00_0, containerHost=localhost:57381, containerPriority= 2, containerResources=, localityMatchType=NodeLocal, matchedLocation=localhost, honorLocalityFlags=true, reusedContainer=false, delayedContainers=0
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:56,649 [INFO] [DelayedContainerManager] |util.RackResolver|: Resolved localhost to /default-rack
> 2015-10-01 10:03:56,686 [INFO] [TaskSchedulerAppCaller #0] |node.AMNodeTracker|: Adding new node: localhost:57381
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |launcher.ContainerLauncherImpl|: Launching container_1443711816411_0001_01_02
> 2015-10-01 10:03:56,700 [INFO] [ContainerLauncher #0] |impl.ContainerManagementProtocolProxy|: Opening proxy : localhost:57381
> 2015-10-01 10:03:56,741 [INFO] [ContainerLauncher #0] |history.HistoryEventHandler|: [HISTORY][DAG:N/A][Event:CONTAINER_LAUNCHED]: containerId=container_1443711816411_0001_01_02, launchTime=1443711836741
> 2015-10-01 10:03:57,647 [INFO] [AMRM Callback Handler Thread] |rm.YarnTaskSchedulerService|: Allocated container completed:container_1443711816411_0001_01_02 last allocated to task: attempt_1443711816411_0001_1_00_00_0
> 2015-10-01 10:03:57,648 [INFO] [Dispatcher thread {Central}] |container.AMContainerImpl|: Container container_1443711816411_0001_01_02 exited with diagnostics set to Container failed, exitCode=1. Exception from container-launch.
> Container id: container_1443711816411_0001_01_02
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1:
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>     at org.apache.hadoop.util.Shell.run(Shell.java:455)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>     at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>     at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concu