[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388170#comment-14388170 ]
Cyrille Chépélov commented on TEZ-2237:
---------------------------------------

Indeed, application_1427324000018_1444.yarn-logs.red.txt was produced using straight 0.6.0.

A later log file (application_1427324000018_1467.red.txt.gz), impractically big, was made with branch-0.6 (as of 66ca9655a4412e1c1db1d37e882a407706dbe3ad), which seems to include TEZ-1929. The run seemed to be frozen when I uploaded the log yesterday, and I had to free up the cluster, so I killed it in the end.

It seems I killed application_1427324000018_1467 too early yesterday. My updated plan is:
# run again using TEZ branch-0.6 as of 974588e180ab53ea3e7243f2dea29a5d8ef2416d ("TEZ-2240"), cascading-3.0.0-wip-92
# run again (if still failing) with ("tez.am.dag.scheduler.class" -> "org.apache.tez.dag.app.dag.impl.DAGSchedulerNaturalOrderControlled") set in the scalding.Job#config override method
# report

> Complex DAG freezes and fails (was BufferTooSmallException raised in
> UnorderedPartitionedKVWriter then DAG lingers)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>   Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system
> disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>         Attachments: all_stacks.lst, alloc_mem.png, alloc_vcores.png,
> application_1427324000018_1444.yarn-logs.red.txt.gz,
> appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz,
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz,
> gc_count_MRAppMaster.png, mem_free.png,
> ordered-grouped-kv-input-traces.diff,
> start_containers.png, stop_containers.png,
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz,
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz, yarn_rm_flips.png
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG),
> after about an hour of processing, several BufferTooSmallExceptions are raised
> in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active",
> tying up memory and CPU resources as far as YARN is concerned, while little
> if any actual processing takes place.
> It seems two separate issues are at hand:
> 1. BufferTooSmallExceptions are raised even though, small as the actually
> allocated buffers seem to be (around a couple of megabytes were allotted
> whereas 100MiB were requested), the actual keys and values are never bigger
> than 24 and 1024 bytes respectively.
> 2. Once BufferTooSmallExceptions are raised, the DAG fails to stop
> (stop requests appear to be sent 7 hours after the BTSEs are raised, but
> 9 hours after these stop requests, the DAG was still lingering on with all
> containers present, tying up memory and CPU allocations).
> The BTSEs prevent the Cascade from completing, which in turn prevents
> validating the results against traditional MR1-based results. Because the
> DAG never terminates, the cluster queue remains unavailable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)