[ https://issues.apache.org/jira/browse/TEZ-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14382911#comment-14382911 ]
Hitesh Shah commented on TEZ-2237:
----------------------------------

[~cchepelov] Ignore my earlier comment about catching/ignoring the exception.

{code}
try {
  keySerializer.serialize(key);
} catch (BufferTooSmallException e) {
  if (metaStart == 0) { // Started writing at the start of the buffer. Write Key to disk.
    // Key too large for any buffer. Write entire record to disk.
    currentBuffer.reset();
    writeLargeRecord(key, value, partition);
    return;
  } else { // Exceeded length on current buffer.
    // Try resetting the buffer to the next one, if this was not the start of a buffer,
    // and begin spilling the current buffer to disk if it has any records.
    setupNextBuffer();
    write(key, value, partition);
    return;
  }
}
{code}

I am guessing hadoop.TupleSerialization$SerializationElementWriter is hitting the BTSE, in which case the writer will fall back as shown in the code above.

A lot of this has to do with the fact that the output has a very small buffer. The 215 attempt log shows 191 spills to disk due to inadequate buffer size. The attempt took only around 3 minutes to run, though.
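To illustrate why a small buffer leads to that many spills, here is a minimal, self-contained sketch of the same fallback pattern. It is not the Tez implementation: the class SpillingWriterSketch, its counters, and the use of java.nio.ByteBuffer with BufferOverflowException in place of BufferTooSmallException are all assumptions made for illustration, and the "disk" writes are only counted, not performed.

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

// Hypothetical stand-in for the writer discussed above; NOT Tez code.
class SpillingWriterSketch {
    private final ByteBuffer buffer;
    int spills = 0;          // times a non-empty buffer was flushed because a record did not fit
    int largeRecords = 0;    // records that bypassed the buffer entirely
    int recordsInBuffer = 0; // records currently held in the buffer

    SpillingWriterSketch(int capacityBytes) {
        this.buffer = ByteBuffer.allocate(capacityBytes);
    }

    void write(byte[] key, byte[] value) {
        int start = buffer.position();
        try {
            // Length-prefixed key and value; any of these calls may overflow.
            buffer.putInt(key.length).put(key).putInt(value.length).put(value);
            recordsInBuffer++;
        } catch (BufferOverflowException e) {
            buffer.position(start); // discard the partially written record
            if (start == 0) {
                // The record does not fit even in an empty buffer:
                // it would be written straight to disk (only counted here).
                largeRecords++;
            } else {
                // The current buffer is full: spill it, then retry once on an empty buffer.
                spill();
                write(key, value);
            }
        }
    }

    private void spill() {
        spills++;            // a real writer would flush the buffer contents to disk here
        recordsInBuffer = 0;
        buffer.clear();
    }

    public static void main(String[] args) {
        SpillingWriterSketch writer = new SpillingWriterSketch(32);
        byte[] key = new byte[4];
        byte[] value = new byte[4];
        for (int i = 0; i < 100; i++) {
            writer.write(key, value);
        }
        System.out.println("spills=" + writer.spills + " largeRecords=" + writer.largeRecords);
    }
}
```

In this sketch the exception is part of normal operation: it only signals that the current buffer is full, so the number of spills grows as the buffer shrinks relative to the record size.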
> BufferTooSmallException raised in UnorderedPartitionedKVWriter then DAG lingers
> -------------------------------------------------------------------------------
>
>                 Key: TEZ-2237
>                 URL: https://issues.apache.org/jira/browse/TEZ-2237
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.6.0
>         Environment: Debian Linux "jessie"
> OpenJDK Runtime Environment (build 1.8.0_40-internal-b27)
> OpenJDK 64-Bit Server VM (build 25.40-b25, mixed mode)
> 7 * Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz, 16/24 GB RAM per node, 1*system
> disk + 4*1 or 2 TiB HDD for HDFS & local (on-prem, dedicated hardware)
> Scalding 0.13.1 modified with https://github.com/twitter/scalding/pull/1220
> to run Cascading 3.0.0-wip-90 with TEZ 0.6.0
>            Reporter: Cyrille Chépélov
>         Attachments: appmaster____syslog_dag_1427282048097_0215_1.red.txt.gz,
> appmaster____syslog_dag_1427282048097_0237_1.red.txt.gz,
> syslog_attempt_1427282048097_0215_1_21_000014_0.red.txt.gz,
> syslog_attempt_1427282048097_0237_1_70_000028_0.red.txt.gz
>
>
> On a specific DAG with many vertices (actually part of a larger meta-DAG),
> after about an hour of processing, several BufferTooSmallExceptions are raised
> in UnorderedPartitionedKVWriter (about one every two or three spills).
> Once these exceptions are raised, the DAG remains indefinitely "active",
> tying up memory and CPU resources as far as YARN is concerned, while little
> if any actual processing takes place.
> It seems two separate issues are at hand:
> 1. BufferTooSmallExceptions are raised even though the actual keys and values
> are never bigger than 24 and 1024 bytes respectively, small as the actually
> allocated buffers seem to be (around a couple of megabytes were allotted
> whereas 100 MiB were requested).
> 2. Once BufferTooSmallExceptions are raised, the DAG fails to stop
> (stop requests appear to be sent 7 hours after the BTSEs are raised, but
> 9 hours after these stop requests the DAG was still lingering on, with all
> containers present and tying up memory and CPU allocations).
> The BTSEs prevent the Cascade from completing, which in turn prevents
> validating its results against the traditional MR1-based results. Because the
> DAG never terminates, the cluster queue remains unavailable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)