[ https://issues.apache.org/jira/browse/SPARK-20250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenchen Fan resolved SPARK-20250. --------------------------------- Resolution: Fixed Fix Version/s: 2.2.0 2.1.2 2.0.3 > Improper OOM error when a task been killed while spilling data > -------------------------------------------------------------- > > Key: SPARK-20250 > URL: https://issues.apache.org/jira/browse/SPARK-20250 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 > Reporter: Feng Zhu > Fix For: 2.0.3, 2.1.2, 2.2.0 > > > When a task is calling spill() but it receives a killing request from > driver (e.g., speculative task), the TaskMemoryManager will throw an OOM > exception. > Then the executor takes it as UncaughtException, which will be handled by > SparkUncaughtExceptionHandler and the executor will consequently be shutdown. > However, this error may lead to the whole application failure due to the > "max number of executor failures (30) reached". > In our production environment, we have encountered a lot of such cases. > \\ > {noformat} > 17/04/05 06:41:27 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort > data of 928.0 MB to disk (1 time so far) > 17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Spill > file:/data/usercache/application_1482394966158_87487271/blockmgr-85c25fa8-06b4/32/temp_local_b731 > 17/04/05 06:41:27 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152 > 17/04/05 06:41:30 INFO executor.Executor: Executor is trying to kill task > 16.0 in stage 3.0 (TID 857) > 17/04/05 06:41:30 ERROR memory.TaskMemoryManager: error while calling spill() > on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed > java.nio.channels.ClosedByInterruptException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) > at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269) > at > org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228) > at > org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:302) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:346) > ................................................................ > 17/04/05 06:41:30 INFO sort.UnsafeExternalSorter: Thread 115 spilling sort > data of 928.0 MB to disk (2 times so far) > 17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Spill > file:/data/usercache/appcache/application_1482394966158_87487271/blockmgr-573312a3-bd46-4c5c-9293-1021cc34c77 > 17/04/05 06:41:30 INFO sort.UnsafeSorterSpillWriter: Write numRecords:2097152 > 17/04/05 06:41:31 ERROR memory.TaskMemoryManager: error while calling spill() > on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed > java.nio.channels.ClosedByInterruptException > at > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202) > at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:269) > at > org.apache.spark.storage.DiskBlockObjectWriter.updateBytesWritten(DiskBlockObjectWriter.scala:228) > at > org.apache.spark.storage.DiskBlockObjectWriter.recordWritten(DiskBlockObjectWriter.scala:207) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.write(UnsafeSorterSpillWriter.java:139) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:196) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:170) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83) > > ......................................................................... > 17/04/05 06:41:31 WARN memory.TaskMemoryManager: leak 32.0 KB memory from > org.apache.spark.shuffle.sort.ShuffleExternalSorter@513661a6 > 17/04/05 06:41:31 ERROR executor.Executor: Managed memory leak detected; size > = 26010016 bytes, TID = 857 > 17/04/05 06:41:31 ERROR executor.Executor: Exception in task 16.0 in stage > 3.0 (TID 857) > java.lang.OutOfMemoryError: error while calling spill() on > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed : > null > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:178) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83) > ...................................................... > 17/04/05 06:41:31 ERROR util.SparkUncaughtExceptionHandler: Uncaught > exception in thread Thread[Executor task launch worker-0,5,main] > java.lang.OutOfMemoryError: error while calling spill() on > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@43a122ed : > null > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:178) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:244) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:83) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:164) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:358) > .......................................... > 17/04/05 06:41:31 INFO storage.DiskBlockManager: Shutdown hook called > 17/04/05 06:41:31 INFO util.ShutdownHookManager: Shutdown hook called > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org