[jira] [Commented] (TEZ-1560) Invalid state machine transition in recovery
[ https://issues.apache.org/jira/browse/TEZ-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395421#comment-14395421 ]

Carter Shanklin commented on TEZ-1560:
--------------------------------------

Here's what I did:
* Used the Hortonworks Sandbox based on HDP 2.2.3.
* Kicked off a Hive query and waited for some mappers to start running.
* Ran "tc qdisc add dev lo root netem loss 66%". This causes 66% packet loss on loopback, so we can expect a lot of strange failures to start happening.
* Waited about 5 minutes.
* Ran "tc qdisc delete dev lo root netem loss 66%", so now there is no packet loss.
* After about a minute, the job failed with the error below:

{code}
Status: Failed
Invalid event V_INTERNAL_ERROR on Vertex vertex_1427920581283_0018_12_01
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask
{code}

> Invalid state machine transition in recovery
> --------------------------------------------
>
>                 Key: TEZ-1560
>                 URL: https://issues.apache.org/jira/browse/TEZ-1560
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: failed_tez_job.txt.gz
>
> {code}
> 2014-09-04 16:08:25,504 INFO [main] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1409818083015_0001_1 transitioned from NEW to RUNNING
> 2014-09-04 16:08:25,504 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Recovered Vertex State, vertexId=vertex_1409818083015_0001_1_00 [v1], state=NEW, numInitedSourceVertices=0, numStartedSourceVertices=0, numRecoveredSourceVertices=0, recoveredEvents=0, tasksIsNull=false, numTasks=0
> 2014-09-04 16:08:25,505 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Root Inputs exist for Vertex: v1 : {Input={InputName=Input}, {Descriptor=ClassName=org.apache.tez.test.dag.MultiAttemptDAG$NoOpInput, hasPayload=false}, {ControllerDescriptor=ClassName=org.apache.tez.test.dag.MultiAttemptDAG$TestRootInputInitializer, hasPayload=false}}
> 2014-09-04 16:08:25,505 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Starting root input initializer for input: Input, with class: [org.apache.tez.test.dag.MultiAttemptDAG$TestRootInputInitializer]
> 2014-09-04 16:08:25,506 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Setting user vertex manager plugin: org.apache.tez.test.dag.MultiAttemptDAG$FailOnAttemptVertexManagerPlugin on vertex: v1
> 2014-09-04 16:08:25,508 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Creating 2 for vertex: vertex_1409818083015_0001_1_00 [v1]
> 2014-09-04 16:08:25,518 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Starting root input initializers: 1
> 2014-09-04 16:08:25,520 INFO [InputInitializer [v1] #0] org.apache.tez.dag.app.dag.RootInputInitializerManager: Starting InputInitializer for Input: Input on vertex vertex_1409818083015_0001_1_00 [v1]
> 2014-09-04 16:08:25,522 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.RootInputInitializerManager: Succeeded InputInitializer for Input: Input on vertex vertex_1409818083015_0001_1_00 [v1]
> 2014-09-04 16:08:25,523 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: vertex_1409818083015_0001_1_00 [v1] transitioned from NEW to INITIALIZING due to event V_INIT
> 2014-09-04 16:08:25,523 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Recovered Vertex State, vertexId=vertex_1409818083015_0001_1_01 [v2], state=NEW, numInitedSourceVertices0, numStartedSourceVertices=0, numRecoveredSourceVertices=1, tasksIsNull=false, numTasks=0
> 2014-09-04 16:08:25,523 ERROR [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Can't handle Invalid event V_SOURCE_VERTEX_RECOVERED on vertex v2 with vertexId vertex_1409818083015_0001_1_01 at current state NEW
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: V_SOURCE_VERTEX_RECOVERED at NEW
>   at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1344)
>   at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1)
>   at org.apache.tez.dag.app.DAGAppMaster$Vert
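The reproduction steps in the comment above boil down to a matched pair of tc commands. As a minimal sketch (the helper name `netem_loss_commands` is made up for illustration, and nothing is executed here, since running tc needs root), the pair can be generated from one function so the inject and clear commands cannot drift apart:

```python
import shlex


def netem_loss_commands(device, loss_percent):
    """Build the (inject, clear) tc argument lists for packet loss on `device`.

    Hypothetical helper: it only constructs the two commands quoted in the
    comment above; it does not run them.
    """
    if not 0 < loss_percent <= 100:
        raise ValueError("loss_percent must be in (0, 100]")
    spec = ["netem", "loss", f"{loss_percent}%"]
    inject = ["tc", "qdisc", "add", "dev", device, "root", *spec]
    clear = ["tc", "qdisc", "delete", "dev", device, "root", *spec]
    return inject, clear


if __name__ == "__main__":
    inject, clear = netem_loss_commands("lo", 66)
    print(shlex.join(inject))  # run as root, then wait (the comment waited ~5 minutes)
    print(shlex.join(clear))   # run as root to restore loopback
```

Returning argument lists rather than shell strings keeps the sketch usable with `subprocess.run` without any quoting concerns, should someone choose to automate the test.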
[jira] [Updated] (TEZ-1560) Invalid state machine transition in recovery
[ https://issues.apache.org/jira/browse/TEZ-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carter Shanklin updated TEZ-1560:
---------------------------------
    Attachment: failed_tez_job.txt.gz

Logs of the failed job.

> Invalid state machine transition in recovery
> --------------------------------------------
>
>                 Key: TEZ-1560
>                 URL: https://issues.apache.org/jira/browse/TEZ-1560
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jeff Zhang
>            Assignee: Jeff Zhang
>            Priority: Critical
>         Attachments: failed_tez_job.txt.gz
>
> {code}
> 2014-09-04 16:08:25,504 INFO [main] org.apache.tez.dag.app.dag.impl.DAGImpl: dag_1409818083015_0001_1 transitioned from NEW to RUNNING
> 2014-09-04 16:08:25,504 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Recovered Vertex State, vertexId=vertex_1409818083015_0001_1_00 [v1], state=NEW, numInitedSourceVertices=0, numStartedSourceVertices=0, numRecoveredSourceVertices=0, recoveredEvents=0, tasksIsNull=false, numTasks=0
> 2014-09-04 16:08:25,505 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Root Inputs exist for Vertex: v1 : {Input={InputName=Input}, {Descriptor=ClassName=org.apache.tez.test.dag.MultiAttemptDAG$NoOpInput, hasPayload=false}, {ControllerDescriptor=ClassName=org.apache.tez.test.dag.MultiAttemptDAG$TestRootInputInitializer, hasPayload=false}}
> 2014-09-04 16:08:25,505 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Starting root input initializer for input: Input, with class: [org.apache.tez.test.dag.MultiAttemptDAG$TestRootInputInitializer]
> 2014-09-04 16:08:25,506 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Setting user vertex manager plugin: org.apache.tez.test.dag.MultiAttemptDAG$FailOnAttemptVertexManagerPlugin on vertex: v1
> 2014-09-04 16:08:25,508 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Creating 2 for vertex: vertex_1409818083015_0001_1_00 [v1]
> 2014-09-04 16:08:25,518 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Starting root input initializers: 1
> 2014-09-04 16:08:25,520 INFO [InputInitializer [v1] #0] org.apache.tez.dag.app.dag.RootInputInitializerManager: Starting InputInitializer for Input: Input on vertex vertex_1409818083015_0001_1_00 [v1]
> 2014-09-04 16:08:25,522 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.RootInputInitializerManager: Succeeded InputInitializer for Input: Input on vertex vertex_1409818083015_0001_1_00 [v1]
> 2014-09-04 16:08:25,523 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: vertex_1409818083015_0001_1_00 [v1] transitioned from NEW to INITIALIZING due to event V_INIT
> 2014-09-04 16:08:25,523 INFO [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Recovered Vertex State, vertexId=vertex_1409818083015_0001_1_01 [v2], state=NEW, numInitedSourceVertices0, numStartedSourceVertices=0, numRecoveredSourceVertices=1, tasksIsNull=false, numTasks=0
> 2014-09-04 16:08:25,523 ERROR [AsyncDispatcher event handler] org.apache.tez.dag.app.dag.impl.VertexImpl: Can't handle Invalid event V_SOURCE_VERTEX_RECOVERED on vertex v2 with vertexId vertex_1409818083015_0001_1_01 at current state NEW
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: V_SOURCE_VERTEX_RECOVERED at NEW
>   at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>   at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>   at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>   at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1344)
>   at org.apache.tez.dag.app.dag.impl.VertexImpl.handle(VertexImpl.java:1)
>   at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1641)
>   at org.apache.tez.dag.app.DAGAppMaster$VertexEventDispatcher.handle(DAGAppMaster.java:1)
>   at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>   at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>   at java.lang.Thread.run(Thread.java:745)
> 2014-09-04 16:08:25,524 FATAL [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Error in dispatcher thread
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
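The stack trace above shows a table-driven state machine rejecting an event for which no transition is registered in the current state. The following is a simplified model of that behaviour, not YARN's StateMachineFactory or Tez's VertexImpl: the class names are invented for illustration, and the table holds only the one transition that the log actually shows (NEW to INITIALIZING on V_INIT).

```python
class InvalidStateTransition(Exception):
    """Raised when no transition is registered for (state, event)."""


class VertexStateMachine:
    # Illustrative table: only the transition visible in the quoted log.
    # The real vertex state machine registers many more arcs.
    TRANSITIONS = {("NEW", "V_INIT"): "INITIALIZING"}

    def __init__(self, state="NEW"):
        self.state = state

    def handle(self, event):
        """Apply `event`, or raise if it is not valid in the current state."""
        try:
            self.state = self.TRANSITIONS[(self.state, event)]
        except KeyError:
            # The lookup failed before assignment, so the state is unchanged.
            raise InvalidStateTransition(
                f"Invalid event: {event} at {self.state}"
            ) from None
        return self.state


if __name__ == "__main__":
    v1 = VertexStateMachine()
    print(v1.handle("V_INIT"))
    v2 = VertexStateMachine()
    try:
        v2.handle("V_SOURCE_VERTEX_RECOVERED")
    except InvalidStateTransition as exc:
        print(exc)
```

In this model, v2 receives V_SOURCE_VERTEX_RECOVERED while still in NEW, which mirrors the ordering seen in the log; whether the fix belongs in the event ordering or in the transition table is not something the sketch can say.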
[jira] [Commented] (TEZ-1560) Invalid state machine transition in recovery
[ https://issues.apache.org/jira/browse/TEZ-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395398#comment-14395398 ]

Carter Shanklin commented on TEZ-1560:
--------------------------------------

I hit this too while simulating a network failure, Tez 0.5.2. Ping me offline for details if you want more.
[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395285#comment-14395285 ]

Siddharth Seth commented on TEZ-2237:
-------------------------------------

I've attached two more patches. noopexample_2237 essentially reproduces this hang:
Run with hadoop tez-examples.jar noop 3 3 ScatterGather 1 false | Doesn't start one of the outputs.
Run with hadoop tez-examples.jar noop 3 3 ScatterGather 1 true | Starts all outputs.

The updated patch (TEZ-2237.test*) is just a minor modification of the original patch, with some additional logging, which fixes the issue at least for the example.

[~cwensel] - From looking at the logs, not all outputs are being started by Cascading. As an example, look for "syslog_attempt_142732418_1908_1_51_33_0" in the logs posted by Cyrille. This attempt has two outputs, but only one instance of an output being started (as logged by Cascading; I believe Cascading always logs when it starts an output):

{code}
2015-03-31 12:27:37,730 INFO [TezChild] element.TezGroupGate: calling OrderedPartitionedKVOutput#start() on: GroupBy(_pipe_332+_pipe_333)[by:[{1}:'key']] DEF94DA9BECF4A5BA6C85388B1EAAD41
{code}

This is vertex AF538C3C515642AD98D7283120D61548, which has two OrderedPartitionedKVOutputs and two UnorderedKVInputs. Not starting one of the outputs, combined with Tez generating an empty event list for it, causes the next vertex to hang (it reads the output via OrderedGroupedKVInput, assuming a ScatterGather edge).

Note: this is all from application_142732418_1908.red.txt.

> Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>         Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, TEZ-2237.test.2_branch0.6.txt, all_stacks.lst, alloc_mem.png, alloc_vcores.png, application_142732418_1444.yarn-logs.red.txt.gz, application_142732418_1908.red.txt.bz2, appmastersyslog_dag_1427282048097_0215_1.red.txt.gz, appmastersyslog_dag_1427282048097_0237_1.red.txt.gz, gc_count_MRAppMaster.png, mem_free.png, noopexample_2237.txt, ordered-grouped-kv-input-traces.diff, start_containers.png, stop_containers.png, syslog_attempt_1427282048097_0215_1_21_14_0.red.txt.gz, syslog_attempt_1427282048097_0237_1_70_28_0.red.txt.gz, yarn_rm_flips.png
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG), after about an hour of processing, several BufferTooSmallExceptions are raised in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active", tying up memory and CPU resources as far as YARN is concerned, while little if any actual processing takes place.
> It seems two separate issues are at hand:
> 1. BufferTooSmallExceptions are raised even though, small as the actually allocated buffers seem to be (around a couple of megabytes were allotted whereas 100MiB were requested), the actual keys and values are never bigger than 24 and 1024 bytes respectively.
> 2. Once BufferTooSmallExceptions are raised, the DAG fails to stop (stop requests appear to be sent 7 hours after the BTSE exceptions are raised, but 9 hours after these stop requests, the DAG was still lingering on with all containers present, tying up memory and CPU allocations).
> The emergence of the BTSE prevents the Cascade from completing, which in turn prevents validating the results against traditional MR1-based results. The lack of conclusion renders the cluster queue unavailable.
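The check described in the comment above (count how many output start() calls a task attempt logged and compare that with the number of outputs the vertex declares) can be sketched as a small log scan. This assumes the log line format quoted in the comment; the function names and the `expected` mapping are hypothetical, and the expected counts would have to come from the DAG definition.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... element.TezGroupGate: calling OrderedPartitionedKVOutput#start() on: GroupBy(...)
START_RE = re.compile(r"calling (\w+)#start\(\) on: ")


def started_outputs(log_lines):
    """Count logged start() calls per output class name."""
    counts = Counter()
    for line in log_lines:
        match = START_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def missing_starts(log_lines, expected):
    """Return {output class: number of start() calls missing from the log}."""
    started = started_outputs(log_lines)
    return {
        name: want - started[name]
        for name, want in expected.items()
        if started[name] < want
    }


if __name__ == "__main__":
    attempt_log = [
        "2015-03-31 12:27:37,730 INFO [TezChild] element.TezGroupGate: calling "
        "OrderedPartitionedKVOutput#start() on: GroupBy(_pipe_332+_pipe_333)"
        "[by:[{1}:'key']] DEF94DA9BECF4A5BA6C85388B1EAAD41",
    ]
    # The comment says this vertex has two OrderedPartitionedKVOutputs.
    print(missing_starts(attempt_log, {"OrderedPartitionedKVOutput": 2}))
```

A non-empty result flags an attempt worth inspecting; it does not by itself prove the hang, since it relies on the framework logging every start() call.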
[jira] [Updated] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-2237:
--------------------------------
    Attachment: noopexample_2237.txt
[jira] [Updated] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Seth updated TEZ-2237:
--------------------------------
    Attachment: TEZ-2237.test.2_branch0.6.txt
Failed: TEZ-2234 PreCommit Build #389
Jira: https://issues.apache.org/jira/browse/TEZ-2234
Build: https://builds.apache.org/job/PreCommit-TEZ-Build/389/

### ## LAST 60 LINES OF THE CONSOLE ###
[...truncated 2261 lines...]

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12709340/TEZ-2234.1.patch
  against master revision 5e2a55f.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages.
        See https://builds.apache.org/job/PreCommit-TEZ-Build/389//artifact/patchprocess/diffJavadocWarnings.txt for details.

    {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in :
                  org.apache.tez.runtime.library.output.TestOnFileUnorderedKVOutput
                  org.apache.tez.runtime.library.output.TestOnFileSortedOutput

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/389//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/389//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/389//console

This message is automatically generated.

== ==
Adding comment to Jira.
== ==

Comment added.
7376f53f1ceaa359de174a43c4b3a3315ca582f0 logged out

== ==
Finished build.
== ==

Build step 'Execute shell' marked build as failure
Archiving artifacts
Sending artifact delta relative to PreCommit-TEZ-Build #387
Archived 44 artifacts
Archive block size is 32768
Received 0 blocks and 2722206 bytes
Compression is 0.0%
Took 0.83 sec
[description-setter] Could not determine description.
Recording test results
Email was triggered for: Failure
Sending email for trigger: Failure

### ## FAILED TESTS (if any) ##
26 tests failed.

REGRESSION: org.apache.tez.runtime.library.output.TestOnFileSortedOutput.baseTest[test[false, 1, -1]]

Error Message:
null

Stack Trace:
java.lang.NullPointerException: null
    at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:169)
    at org.apache.tez.runtime.library.output.TestOnFileSortedOutput.baseTest(TestOnFileSortedOutput.java:268)

REGRESSION: org.apache.tez.runtime.library.output.TestOnFileSortedOutput.testAllEmptyPartition[test[false, 1, -1]]

Error Message:
null

Stack Trace:
java.lang.NullPointerException: null
    at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:169)
    at org.apache.tez.runtime.library.output.TestOnFileSortedOutput.testAllEmptyPartition(TestOnFileSortedOutput.java:314)

REGRESSION: org.apache.tez.runtime.library.output.TestOnFileSortedOutput.testWithSomeEmptyPartition[test[false, 1, -1]]

Error Message:
null

Stack Trace:
java.lang.NullPointerException: null
    at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:169)
    at org.apache.tez.runtime.library.output.TestOnFileSortedOutput.testWithSomeEmptyPartition(TestOnFileSortedOutput.java:297)

REGRESSION: org.apache.tez.runtime.library.output.TestOnFileSortedOutput.baseTest[test[false, 1, 0]]

Error Message:
null

Stack Trace:
java.lang.NullPointerException: null
    at org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput.close(OrderedPartitionedKVOutput.java:169)
    at org.apache.tez.runtime.library.output.TestOnFileSortedOutput.baseTest(TestOnFileSortedOutput.java:268)

REGRESSION: org.apache.tez.runtime.library.output.TestOnFileSortedOutput.testAllEmptyPartition[test[false, 1, 0]]

Error Message:
null

Stack
[jira] [Commented] (TEZ-2234) Allow vertex managers to get output size per source vertex
[ https://issues.apache.org/jira/browse/TEZ-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395271#comment-14395271 ]

Hadoop QA commented on TEZ-2234:
--------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12709340/TEZ-2234.1.patch
  against master revision 5e2a55f.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages.
        See https://builds.apache.org/job/PreCommit-TEZ-Build/389//artifact/patchprocess/diffJavadocWarnings.txt for details.

    {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in :
                  org.apache.tez.runtime.library.output.TestOnFileUnorderedKVOutput
                  org.apache.tez.runtime.library.output.TestOnFileSortedOutput

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/389//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-TEZ-Build/389//artifact/patchprocess/newPatchFindbugsWarningstez-dag.html
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/389//console

This message is automatically generated.

> Allow vertex managers to get output size per source vertex
> ----------------------------------------------------------
>
>                 Key: TEZ-2234
>                 URL: https://issues.apache.org/jira/browse/TEZ-2234
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-2234.1.patch
>
> Vertex managers may need per source vertex output stats to make reconfiguration decisions.
[jira] [Updated] (TEZ-2234) Allow vertex managers to get output size per source vertex
[ https://issues.apache.org/jira/browse/TEZ-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-2234:
----------------------------
    Attachment: TEZ-2234.1.patch
[jira] [Commented] (TEZ-2236) TEZ UI: Support loading of all rows in dag -> tasks table.
[ https://issues.apache.org/jira/browse/TEZ-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395161#comment-14395161 ]

Hadoop QA commented on TEZ-2236:
--------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12709315/TEZ-2236.2.patch
  against master revision 5e2a55f.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/388//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/388//console

This message is automatically generated.

> TEZ UI: Support loading of all rows in dag -> tasks table.
> ----------------------------------------------------------
>
>                 Key: TEZ-2236
>                 URL: https://issues.apache.org/jira/browse/TEZ-2236
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Sreenath Somarajapuram
>            Assignee: Sreenath Somarajapuram
>         Attachments: TEZ-2236.1.patch, TEZ-2236.2.patch
>
> 1. The ember-table component was replaced with the basic-ember-table component. It's lightweight, easy to customize, uses pure CSS for layout, and supports cell-level lazy loading and pagination of the completely loaded data.
> 2. Load all rows in two phases: first load some rows for preview, then load all related records to be displayed.
> 3. Support caching of data across tabs.
Failed: TEZ-2236 PreCommit Build #388
Jira: https://issues.apache.org/jira/browse/TEZ-2236 Build: https://builds.apache.org/job/PreCommit-TEZ-Build/388/ ### ## LAST 60 LINES OF THE CONSOLE ### [...truncated 2782 lines...] {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12709315/TEZ-2236.2.patch against master revision 5e2a55f. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/388//testReport/ Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/388//console This message is automatically generated. == == Adding comment to Jira. == == Comment added. 384d5721422cd8187bdc014447e895726ece5971 logged out == == Finished build. == == Build step 'Execute shell' marked build as failure Archiving artifacts Sending artifact delta relative to PreCommit-TEZ-Build #387 Archived 44 artifacts Archive block size is 32768 Received 2 blocks and 2694675 bytes Compression is 2.4% Took 1.8 sec [description-setter] Could not determine description. Recording test results Email was triggered for: Failure Sending email for trigger: Failure ### ## FAILED TESTS (if any) ## All tests passed
[jira] [Created] (TEZ-2278) Tez UI start/end time and duration shown are wrong for tasks
Rohini Palaniswamy created TEZ-2278: --- Summary: Tez UI start/end time and duration shown are wrong for tasks Key: TEZ-2278 URL: https://issues.apache.org/jira/browse/TEZ-2278 Project: Apache Tez Issue Type: Bug Components: UI Affects Versions: 0.6.0 Reporter: Rohini Palaniswamy Observing a lot of time discrepancies between the vertex, task and swimlane views. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2236) TEZ UI: Support loading of all rows in dag -> tasks table.
[ https://issues.apache.org/jira/browse/TEZ-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395104#comment-14395104 ] Sreenath Somarajapuram commented on TEZ-2236: - Thanks [~pramachandran], patch 2 contains the changes. Please review. data-array-loader-mixin - All find code should have its respective error handler, but as a fallback I have added a catch at the end of the prototype chain. - limit: Thanks for pointing this out; all requests will go with a limit now. - array.length: As we depend only on the number of elements, wouldn't array.length be better than array.[]? general - columnselector: Was planned for a later patch, but is now included in patch 2. - Column resize: Was planned for a later patch, but is now included in patch 2. - Caching was added. - Default row count changed to 25. target version: Aiming for Dal. > TEZ UI: Support loading of all rows in dag -> tasks table. > -- > > Key: TEZ-2236 > URL: https://issues.apache.org/jira/browse/TEZ-2236 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Sreenath Somarajapuram >Assignee: Sreenath Somarajapuram > Attachments: TEZ-2236.1.patch, TEZ-2236.2.patch > > > 1. ember-table component was replaced with basic-ember-table component. Its > lightweight, easy to customize, uses pure css for layout and supports cell > level lazy loading and Pagination of complete loaded data. > 2. Load all rows in two phases - First load some rows for preview, then load > all related records to be displayed. > 3. Support caching of data across tabs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2236) TEZ UI: Support loading of all rows in dag -> tasks table.
[ https://issues.apache.org/jira/browse/TEZ-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreenath Somarajapuram updated TEZ-2236: Attachment: TEZ-2236.2.patch > TEZ UI: Support loading of all rows in dag -> tasks table. > -- > > Key: TEZ-2236 > URL: https://issues.apache.org/jira/browse/TEZ-2236 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Sreenath Somarajapuram >Assignee: Sreenath Somarajapuram > Attachments: TEZ-2236.1.patch, TEZ-2236.2.patch > > > 1. ember-table component was replaced with basic-ember-table component. Its > lightweight, easy to customize, uses pure css for layout and supports cell > level lazy loading and Pagination of complete loaded data. > 2. Load all rows in two phases - First load some rows for preview, then load > all related records to be displayed. > 3. Support caching of data across tabs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2232) Allow setParallelism to be called multiple times before tasks get scheduled
[ https://issues.apache.org/jira/browse/TEZ-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated TEZ-2232: Attachment: TEZ-2232.2.patch Attaching rebased patch. Thanks for the review. Committing. > Allow setParallelism to be called multiple times before tasks get scheduled > --- > > Key: TEZ-2232 > URL: https://issues.apache.org/jira/browse/TEZ-2232 > Project: Apache Tez > Issue Type: Bug >Reporter: Bikas Saha >Assignee: Bikas Saha > Attachments: TEZ-2232.1.patch, TEZ-2232.2.patch > > > Currently, this is allowed only once. It is harder to support this > after the vertex tasks have already started running. But allowing it before > tasks start running is actually trivial. This just allows VertexManagers to > change their minds multiple times before they start the vertex processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
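The rule described in TEZ-2232 (reconfiguration is accepted any number of times until tasks are scheduled, then rejected) can be illustrated with a minimal sketch. This is not the Tez implementation or the attached patch; the class `VertexSketch` and its methods are hypothetical names used only for illustration.

```java
// Hypothetical illustration of "allow setParallelism until tasks are scheduled".
// This is NOT Tez's VertexImpl; all names here are assumptions.
class VertexSketch {
    private int parallelism;
    private boolean tasksScheduled = false;

    VertexSketch(int initialParallelism) {
        this.parallelism = initialParallelism;
    }

    // May be called any number of times before scheduleTasks().
    synchronized void setParallelism(int newParallelism) {
        if (tasksScheduled) {
            throw new IllegalStateException(
                "Cannot change parallelism after tasks have been scheduled");
        }
        this.parallelism = newParallelism;
    }

    // After this point the task count is fixed.
    synchronized void scheduleTasks() {
        tasksScheduled = true;
    }

    synchronized int getParallelism() {
        return parallelism;
    }
}
```

In this sketch a vertex manager could call `setParallelism` repeatedly while it gathers statistics; only the last value before `scheduleTasks()` takes effect.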
[jira] [Commented] (TEZ-2159) Tez UI: download timeline data for offline use.
[ https://issues.apache.org/jira/browse/TEZ-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394896#comment-14394896 ] Sreenath Somarajapuram commented on TEZ-2159: - Build is failing in grunt less. Font awesome is not accessible from inside shared.less. > Tez UI: download timeline data for offline use. > --- > > Key: TEZ-2159 > URL: https://issues.apache.org/jira/browse/TEZ-2159 > Project: Apache Tez > Issue Type: Improvement > Components: UI >Reporter: Prakash Ramachandran >Assignee: Prakash Ramachandran > Attachments: TEZ-2159.1.patch, TEZ-2159.wip.1.patch > > > It is useful to have capability to download the timeline data for a dag for > offline analysis. for ex. TEZ-2076 uses the timeline data to do offline > analysis of a tez application run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2277) modifyACLsStr in DAGAccessControls does not take effect
Thejas M Nair created TEZ-2277: -- Summary: modifyACLsStr in DAGAccessControls does not take effect Key: TEZ-2277 URL: https://issues.apache.org/jira/browse/TEZ-2277 Project: Apache Tez Issue Type: Bug Affects Versions: 0.6.0 Reporter: Thejas M Nair Priority: Critical Even if modifyACLsStr in DAGAccessControls constructor is set and that access control is set for the DAG, it does not actually get set in access control at runtime. See comment in [HIVE-10145|https://issues.apache.org/jira/browse/HIVE-10145?focusedCommentId=14393933&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14393933] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394833#comment-14394833 ] Cyrille Chépélov commented on TEZ-2237: --- Yes, it was, as is the run in progress (with a little more memory beyond the heap). Am away from the cluster at the moment but will post updated logs ASAP. > Complex DAG freezes and fails (was BufferTooSmallException raised in > UnorderedPartitionedKVWriter then DAG lingers) > --- > > Key: TEZ-2237 > URL: https://issues.apache.org/jira/browse/TEZ-2237 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0 > Environment: Debian Linux "jessie" > OpenJDK Runtime Environment (build 1.8.0_40-internal-b27) > OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode) > 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system > disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware) > Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 > to run Cascading 3.0.0-wip-90 with TEZ 0.6.0 >Reporter: Cyrille Chépélov > Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, > all_stacks.lst, alloc_mem.png, alloc_vcores.png, > application_142732418_1444.yarn-logs.red.txt.gz, > application_142732418_1908.red.txt.bz2, > appmastersyslog_dag_1427282048097_0215_1.red.txt.gz, > appmastersyslog_dag_1427282048097_0237_1.red.txt.gz, > gc_count_MRAppMaster.png, mem_free.png, ordered-grouped-kv-input-traces.diff, > start_containers.png, stop_containers.png, > syslog_attempt_1427282048097_0215_1_21_14_0.red.txt.gz, > syslog_attempt_1427282048097_0237_1_70_28_0.red.txt.gz, yarn_rm_flips.png > > > On a specific DAG with many vertices (actually part of a larger meta-DAG), > after about a hour of processing, several BufferTooSmallException are raised > in UnorderedPartitionedKVWriter (about one every two or three spills). 
> Once these exceptions are raised, the DAG remains indefinitely "active", > tying up memory and CPU resources as far as YARN is concerned, while little > if any actual processing takes place. > It seems two separate issues are at hand: > 1. BufferTooSmallException are raised even though, small as the actually > allocated buffers seem to be (around a couple megabytes were allotted whereas > 100MiB were requested), the actual keys and values are never bigger than 24 > and 1024 bytes respectively. > 2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop > (stop requests appear to be sent 7 hours after the BTSE exceptions are > raised, but 9 hours after these stop requests, the DAG was still lingering on > with all containers present tying up memory and CPU allocations) > The emergence of the BTSE prevent the Cascade to complete, preventing from > validating the results compared to traditional MR1-based results. The lack of > conclusion renders the cluster queue unavailable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394735#comment-14394735 ] Siddharth Seth commented on TEZ-2237: - [~cchepelov] - quick note. Was the run with one of the attached patches ? Do you see the "Attempting to close output ... " message in the logs ? Could you attach the logs again please. Will look more a little later in the day. > Complex DAG freezes and fails (was BufferTooSmallException raised in > UnorderedPartitionedKVWriter then DAG lingers) > --- > > Key: TEZ-2237 > URL: https://issues.apache.org/jira/browse/TEZ-2237 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0 > Environment: Debian Linux "jessie" > OpenJDK Runtime Environment (build 1.8.0_40-internal-b27) > OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode) > 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system > disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware) > Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 > to run Cascading 3.0.0-wip-90 with TEZ 0.6.0 >Reporter: Cyrille Chépélov > Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, > all_stacks.lst, alloc_mem.png, alloc_vcores.png, > application_142732418_1444.yarn-logs.red.txt.gz, > application_142732418_1908.red.txt.bz2, > appmastersyslog_dag_1427282048097_0215_1.red.txt.gz, > appmastersyslog_dag_1427282048097_0237_1.red.txt.gz, > gc_count_MRAppMaster.png, mem_free.png, ordered-grouped-kv-input-traces.diff, > start_containers.png, stop_containers.png, > syslog_attempt_1427282048097_0215_1_21_14_0.red.txt.gz, > syslog_attempt_1427282048097_0237_1_70_28_0.red.txt.gz, yarn_rm_flips.png > > > On a specific DAG with many vertices (actually part of a larger meta-DAG), > after about a hour of processing, several BufferTooSmallException are raised > in UnorderedPartitionedKVWriter (about one every two or three spills). 
> Once these exceptions are raised, the DAG remains indefinitely "active", > tying up memory and CPU resources as far as YARN is concerned, while little > if any actual processing takes place. > It seems two separate issues are at hand: > 1. BufferTooSmallException are raised even though, small as the actually > allocated buffers seem to be (around a couple megabytes were allotted whereas > 100MiB were requested), the actual keys and values are never bigger than 24 > and 1024 bytes respectively. > 2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop > (stop requests appear to be sent 7 hours after the BTSE exceptions are > raised, but 9 hours after these stop requests, the DAG was still lingering on > with all containers present tying up memory and CPU allocations) > The emergence of the BTSE prevent the Cascade to complete, preventing from > validating the results compared to traditional MR1-based results. The lack of > conclusion renders the cluster queue unavailable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394697#comment-14394697 ] Cyrille Chépélov commented on TEZ-2237: --- No success. At one point, activity subsided while the DAGMaster was busy creating containers which kept dying somehow. Last breath from one of the containers was in the middle of spills from UnorderedPartitionedKVWriter… No explicit message from the NodeManager, except that the container got preempted. Retrying with "tez.task.resource.memory.mb" -> "1170", // default 1024 "tez.container.max.java.heap.fraction" -> "0.7", // default 0.8 > Complex DAG freezes and fails (was BufferTooSmallException raised in > UnorderedPartitionedKVWriter then DAG lingers) > --- > > Key: TEZ-2237 > URL: https://issues.apache.org/jira/browse/TEZ-2237 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0 > Environment: Debian Linux "jessie" > OpenJDK Runtime Environment (build 1.8.0_40-internal-b27) > OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode) > 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system > disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware) > Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 > to run Cascading 3.0.0-wip-90 with TEZ 0.6.0 >Reporter: Cyrille Chépélov > Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, > all_stacks.lst, alloc_mem.png, alloc_vcores.png, > application_142732418_1444.yarn-logs.red.txt.gz, > application_142732418_1908.red.txt.bz2, > appmastersyslog_dag_1427282048097_0215_1.red.txt.gz, > appmastersyslog_dag_1427282048097_0237_1.red.txt.gz, > gc_count_MRAppMaster.png, mem_free.png, ordered-grouped-kv-input-traces.diff, > start_containers.png, stop_containers.png, > syslog_attempt_1427282048097_0215_1_21_14_0.red.txt.gz, > syslog_attempt_1427282048097_0237_1_70_28_0.red.txt.gz, yarn_rm_flips.png > > > On a specific DAG with many vertices 
(actually part of a larger meta-DAG), > after about a hour of processing, several BufferTooSmallException are raised > in UnorderedPartitionedKVWriter (about one every two or three spills). > Once these exceptions are raised, the DAG remains indefinitely "active", > tying up memory and CPU resources as far as YARN is concerned, while little > if any actual processing takes place. > It seems two separate issues are at hand: > 1. BufferTooSmallException are raised even though, small as the actually > allocated buffers seem to be (around a couple megabytes were allotted whereas > 100MiB were requested), the actual keys and values are never bigger than 24 > and 1024 bytes respectively. > 2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop > (stop requests appear to be sent 7 hours after the BTSE exceptions are > raised, but 9 hours after these stop requests, the DAG was still lingering on > with all containers present tying up memory and CPU allocations) > The emergence of the BTSE prevent the Cascade to complete, preventing from > validating the results compared to traditional MR1-based results. The lack of > conclusion renders the cluster queue unavailable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
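The retry settings quoted in the comment above can be checked with a little arithmetic. The sketch below assumes the JVM heap is derived as container memory multiplied by the heap fraction (an assumption about how the two settings combine, not something stated in this thread); the class and method names are hypothetical.

```java
// Hypothetical helper: estimates heap and non-heap headroom for a container,
// ASSUMING heap = container memory * heap fraction (rounded to the nearest MB).
class HeapHeadroom {
    static int heapMb(int containerMb, double heapFraction) {
        return (int) Math.round(containerMb * heapFraction);
    }

    static int headroomMb(int containerMb, double heapFraction) {
        return containerMb - heapMb(containerMb, heapFraction);
    }

    public static void main(String[] args) {
        // Defaults quoted in the comment: 1024 MB and 0.8
        System.out.println(heapMb(1024, 0.8) + " MB heap, "
            + headroomMb(1024, 0.8) + " MB headroom");
        // Retry values quoted in the comment: 1170 MB and 0.7
        System.out.println(heapMb(1170, 0.7) + " MB heap, "
            + headroomMb(1170, 0.7) + " MB headroom");
    }
}
```

Under that assumption both configurations give roughly the same heap (about 819 MB), while the memory left outside the heap grows from about 205 MB to about 351 MB.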
[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394600#comment-14394600 ] Chris K Wensel commented on TEZ-2237: - Actually, looking at the code, outputs are started before inputs; then the wait for inputs starts. https://github.com/cwensel/cascading/blob/wip-3.0/cascading-hadoop2-tez/src/main/java/cascading/flow/tez/FlowProcessor.java#L131-131 The pipeline graph is walked in reverse topological order and all stages are initialized: sinks first, then the sources (and any intermediate resource that must be initialized before execution). So, to the point above, all outputs are started before any work begins. In particular, they are started even before the inputs, and before the inputs report their ready status. > Complex DAG freezes and fails (was BufferTooSmallException raised in > UnorderedPartitionedKVWriter then DAG lingers) > --- > > Key: TEZ-2237 > URL: https://issues.apache.org/jira/browse/TEZ-2237 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0 > Environment: Debian Linux "jessie" > OpenJDK Runtime Environment (build 1.8.0_40-internal-b27) > OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode) > 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system > disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware) > Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 > to run Cascading 3.0.0-wip-90 with TEZ 0.6.0 >Reporter: Cyrille Chépélov > Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, > all_stacks.lst, alloc_mem.png, alloc_vcores.png, > application_142732418_1444.yarn-logs.red.txt.gz, > application_142732418_1908.red.txt.bz2, > appmastersyslog_dag_1427282048097_0215_1.red.txt.gz, > appmastersyslog_dag_1427282048097_0237_1.red.txt.gz, > gc_count_MRAppMaster.png, mem_free.png, ordered-grouped-kv-input-traces.diff, > start_containers.png, stop_containers.png, > 
syslog_attempt_1427282048097_0215_1_21_14_0.red.txt.gz, > syslog_attempt_1427282048097_0237_1_70_28_0.red.txt.gz, yarn_rm_flips.png > > > On a specific DAG with many vertices (actually part of a larger meta-DAG), > after about a hour of processing, several BufferTooSmallException are raised > in UnorderedPartitionedKVWriter (about one every two or three spills). > Once these exceptions are raised, the DAG remains indefinitely "active", > tying up memory and CPU resources as far as YARN is concerned, while little > if any actual processing takes place. > It seems two separate issues are at hand: > 1. BufferTooSmallException are raised even though, small as the actually > allocated buffers seem to be (around a couple megabytes were allotted whereas > 100MiB were requested), the actual keys and values are never bigger than 24 > and 1024 bytes respectively. > 2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop > (stop requests appear to be sent 7 hours after the BTSE exceptions are > raised, but 9 hours after these stop requests, the DAG was still lingering on > with all containers present tying up memory and CPU allocations) > The emergence of the BTSE prevent the Cascade to complete, preventing from > validating the results compared to traditional MR1-based results. The lack of > conclusion renders the cluster queue unavailable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
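The initialization order described in the comment above (walk the pipeline graph in reverse topological order, so sinks are initialized before sources) can be sketched as follows. This is not the Cascading `FlowProcessor` code; the class name, the string-keyed graph representation and the stage names are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: compute a topological order with Kahn's algorithm,
// then reverse it so that sinks come first. Not the actual Cascading code.
class ReverseTopoInit {
    // edges maps each stage to the stages directly downstream of it.
    static List<String> initOrder(Map<String, List<String>> edges) {
        Map<String, Integer> inDegree = new HashMap<>();
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            inDegree.putIfAbsent(e.getKey(), 0);
            for (String to : e.getValue()) {
                inDegree.merge(to, 1, Integer::sum);
            }
        }
        // Stages with no upstream dependency (the sources) are ready first.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String stage = ready.remove();
            order.add(stage);
            for (String to : edges.getOrDefault(stage, Collections.emptyList())) {
                if (inDegree.merge(to, -1, Integer::sum) == 0) {
                    ready.add(to);
                }
            }
        }
        // Reverse: sinks first, sources last. (Stages on a cycle are omitted.)
        Collections.reverse(order);
        return order;
    }
}
```

For a simple chain source -> pipe -> sink, the method returns sink, pipe, source, which matches the "outputs are started before inputs" behaviour the comment describes.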
[jira] [Commented] (TEZ-2237) Complex DAG freezes and fails (was BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers)
[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14394472#comment-14394472 ] Cyrille Chépélov commented on TEZ-2237: --- Thanks [~sseth]. Trying that at the moment (on branch-0.6 as of 366583aade76901f93e15e33111ad7586326ce1e + my TEZ-2256 patch). The first DAGs of the cascade just completed successfully, awaiting the hard part. > Complex DAG freezes and fails (was BufferTooSmallException raised in > UnorderedPartitionedKVWriter then DAG lingers) > --- > > Key: TEZ-2237 > URL: https://issues.apache.org/jira/browse/TEZ-2237 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.6.0 > Environment: Debian Linux "jessie" > OpenJDK Runtime Environment (build 1.8.0_40-internal-b27) > OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode) > 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system > disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware) > Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220 > to run Cascading 3.0.0-wip-90 with TEZ 0.6.0 >Reporter: Cyrille Chépélov > Attachments: TEZ-2237-hack.branch6.txt, TEZ-2237-hack.master.txt, > all_stacks.lst, alloc_mem.png, alloc_vcores.png, > application_142732418_1444.yarn-logs.red.txt.gz, > application_142732418_1908.red.txt.bz2, > appmastersyslog_dag_1427282048097_0215_1.red.txt.gz, > appmastersyslog_dag_1427282048097_0237_1.red.txt.gz, > gc_count_MRAppMaster.png, mem_free.png, ordered-grouped-kv-input-traces.diff, > start_containers.png, stop_containers.png, > syslog_attempt_1427282048097_0215_1_21_14_0.red.txt.gz, > syslog_attempt_1427282048097_0237_1_70_28_0.red.txt.gz, yarn_rm_flips.png > > > On a specific DAG with many vertices (actually part of a larger meta-DAG), > after about a hour of processing, several BufferTooSmallException are raised > in UnorderedPartitionedKVWriter (about one every two or three spills). 
> Once these exceptions are raised, the DAG remains indefinitely "active", > tying up memory and CPU resources as far as YARN is concerned, while little > if any actual processing takes place. > It seems two separate issues are at hand: > 1. BufferTooSmallException are raised even though, small as the actually > allocated buffers seem to be (around a couple megabytes were allotted whereas > 100MiB were requested), the actual keys and values are never bigger than 24 > and 1024 bytes respectively. > 2. In the event BufferTooSmallExceptions are raised, the DAG fails to stop > (stop requests appear to be sent 7 hours after the BTSE exceptions are > raised, but 9 hours after these stop requests, the DAG was still lingering on > with all containers present tying up memory and CPU allocations) > The emergence of the BTSE prevent the Cascade to complete, preventing from > validating the results compared to traditional MR1-based results. The lack of > conclusion renders the cluster queue unavailable. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2276) TEZ UI: Move sorting to web worker for making the UI responsive while searching.
Sreenath Somarajapuram created TEZ-2276: --- Summary: TEZ UI: Move sorting to web worker for making the UI responsive while searching. Key: TEZ-2276 URL: https://issues.apache.org/jira/browse/TEZ-2276 Project: Apache Tez Issue Type: Sub-task Reporter: Sreenath Somarajapuram Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2275) TEZ UI: Remove counter serialization for all entities to make loading faster.
Sreenath Somarajapuram created TEZ-2275: --- Summary: TEZ UI: Remove counter serialization for all entities to make loading faster. Key: TEZ-2275 URL: https://issues.apache.org/jira/browse/TEZ-2275 Project: Apache Tez Issue Type: Sub-task Reporter: Sreenath Somarajapuram -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2274) TEZ UI: Make TEZ-2236 & TEZ-2273 available for all pages, except 'All Dags'.
Sreenath Somarajapuram created TEZ-2274: --- Summary: TEZ UI: Make TEZ-2236 & TEZ-2273 available for all pages, except 'All Dags'. Key: TEZ-2274 URL: https://issues.apache.org/jira/browse/TEZ-2274 Project: Apache Tez Issue Type: Sub-task Reporter: Sreenath Somarajapuram 1. Make all tables use ember-table component 2. Support loading of all rows with caching 3. Support searching & sorting -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2273) TEZ UI: Support client side, searching & sorting.
[ https://issues.apache.org/jira/browse/TEZ-2273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreenath Somarajapuram updated TEZ-2273: Summary: TEZ UI: Support client side, searching & sorting. (was: Support client side, searching & sorting.) > TEZ UI: Support client side, searching & sorting. > - > > Key: TEZ-2273 > URL: https://issues.apache.org/jira/browse/TEZ-2273 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Sreenath Somarajapuram > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2236) TEZ UI: Support loading of all rows in dag -> tasks table.
[ https://issues.apache.org/jira/browse/TEZ-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreenath Somarajapuram updated TEZ-2236: Summary: TEZ UI: Support loading of all rows in dag -> tasks table. (was: Support loading of all rows in dag -> tasks table.) > TEZ UI: Support loading of all rows in dag -> tasks table. > -- > > Key: TEZ-2236 > URL: https://issues.apache.org/jira/browse/TEZ-2236 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Sreenath Somarajapuram >Assignee: Sreenath Somarajapuram > Attachments: TEZ-2236.1.patch > > > 1. ember-table component was replaced with basic-ember-table component. Its > lightweight, easy to customize, uses pure css for layout and supports cell > level lazy loading and Pagination of complete loaded data. > 2. Load all rows in two phases - First load some rows for preview, then load > all related records to be displayed. > 3. Support caching of data across tabs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2273) Support client side, searching & sorting.
Sreenath Somarajapuram created TEZ-2273: --- Summary: Support client side, searching & sorting. Key: TEZ-2273 URL: https://issues.apache.org/jira/browse/TEZ-2273 Project: Apache Tez Issue Type: Sub-task Reporter: Sreenath Somarajapuram -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2236) Support loading of all rows in dag -> tasks table.
[ https://issues.apache.org/jira/browse/TEZ-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sreenath Somarajapuram updated TEZ-2236: Description: 1. ember-table component was replaced with basic-ember-table component. Its lightweight, easy to customize, uses pure css for layout and supports cell level lazy loading and Pagination of complete loaded data. 2. Load all rows in two phases - First load some rows for preview, then load all related records to be displayed. 3. Support caching of data across tabs. was: 1. ember-table component was replaced with basic-ember-table component. Its lightweight, easy to customize, uses pure css for layout and supports cell level lazy loading and Pagination of complete loaded data. 2. Load all rows in two phases - First load some rows for preview, then load all related records to be displayed. > Support loading of all rows in dag -> tasks table. > -- > > Key: TEZ-2236 > URL: https://issues.apache.org/jira/browse/TEZ-2236 > Project: Apache Tez > Issue Type: Sub-task >Reporter: Sreenath Somarajapuram >Assignee: Sreenath Somarajapuram > Attachments: TEZ-2236.1.patch > > > 1. ember-table component was replaced with basic-ember-table component. Its > lightweight, easy to customize, uses pure css for layout and supports cell > level lazy loading and Pagination of complete loaded data. > 2. Load all rows in two phases - First load some rows for preview, then load > all related records to be displayed. > 3. Support caching of data across tabs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2269) DAGAppMaster becomes unresponsive
[ https://issues.apache.org/jira/browse/TEZ-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Balamohan updated TEZ-2269: -- Attachment: TEZ-2269.test.patch The stack trace isn't revealing much about the deadlock, and I haven't been successful in finding which thread is holding the lock. Tried out the test patch attached here multiple times; it uses "tryLock" with a timeout in DAGImpl.getDAGStatus(). With the patch, the hang is not reproduced. [~sseth] - Thoughts? > DAGAppMaster becomes unresponsive > - > > Key: TEZ-2269 > URL: https://issues.apache.org/jira/browse/TEZ-2269 > Project: Apache Tez > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Rajesh Balamohan > Attachments: TEZ-2269.test.patch, > app_master_application_1428021179455_0001_jstack.txt, client_jstack.txt > > > Scenario: > - Run TPCH query20 @ 1 TB scale > - Tez master branch, Hive trunk > - auto-reduce parallelism is not an issue (happens with/without auto-reduce > parallelism) > 1 or 2 times in 10 runs, DAGAppMaster would freeze unexpectedly. There is no > pattern observed on which vertex it happens. But when this happens, only > option is to kill the application. I will attach the jstack soon, but that > doesn't seem to reveal much. > Need to debug more; Creating this JIRA for tracking purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
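The test patch itself is not reproduced in this message. As a general illustration of the "tryLock with timeout" idea it mentions, here is a minimal sketch using `java.util.concurrent.locks.ReentrantReadWriteLock`. The class `StatusHolder`, its fields and the choice to return null on timeout are assumptions for illustration, not the DAGImpl code.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for an object whose status is read under a lock.
// This is NOT DAGImpl; it only illustrates a timed lock acquisition.
class StatusHolder {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String status = "RUNNING";

    // Returns the status, or null if the read lock was not acquired in time.
    String getStatus(long timeoutMillis) {
        boolean acquired;
        try {
            acquired = lock.readLock().tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        }
        if (!acquired) {
            return null; // caller can retry instead of blocking indefinitely
        }
        try {
            return status;
        } finally {
            lock.readLock().unlock();
        }
    }

    void setStatus(String newStatus) {
        lock.writeLock().lock();
        try {
            status = newStatus;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The design point is that a status query gives up after the timeout rather than waiting forever behind a long-held lock, so the caller thread stays responsive. It would hide a lock-holding problem rather than explain it, which fits the comment's note that the root cause is still unknown.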
[jira] [Updated] (TEZ-2272) Should display committer for Output (Sink)
[ https://issues.apache.org/jira/browse/TEZ-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2272: Attachment: initializer_committer.png > Should display committer for Output (Sink) > -- > > Key: TEZ-2272 > URL: https://issues.apache.org/jira/browse/TEZ-2272 > Project: Apache Tez > Issue Type: Bug > Components: UI >Reporter: Jeff Zhang > Attachments: initializer_committer.png > > > On the page of Source & Sink, the initializer for Input is displayed but no > committer for Output. It is supposed to also display the committer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TEZ-2272) Should display committer for Output (Sink)
[ https://issues.apache.org/jira/browse/TEZ-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated TEZ-2272: Summary: Should display committer for Output (Sink) (was: Should add committer for Output (Sink)) > Should display committer for Output (Sink) > -- > > Key: TEZ-2272 > URL: https://issues.apache.org/jira/browse/TEZ-2272 > Project: Apache Tez > Issue Type: Bug > Components: UI >Reporter: Jeff Zhang > > On the page of Source & Sink, the initializer for Input is displayed but no > committer for Output. It is supposed to also display the committer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TEZ-2272) Should add committer for Output (Sink)
Jeff Zhang created TEZ-2272: --- Summary: Should add committer for Output (Sink) Key: TEZ-2272 URL: https://issues.apache.org/jira/browse/TEZ-2272 Project: Apache Tez Issue Type: Bug Components: UI Reporter: Jeff Zhang On the page of Source & Sink, the initializer for Input is displayed but no committer for Output. It is supposed to also display the committer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)