Hi, I'm regularly hitting "Unable to acquire memory" problems, but only when overflow pages come into play and only when running the full set of Spark tests; this happens across different platforms. The machines I'm using all have well over 10 GB of RAM, and I'm running with no changes to the pom.xml file, i.e. the standard 3 GB Java heap.
I'm working off this revision:

    commit 43e0135421b2262cbb0e06aae53523f663b4f959
    Author: Yin Huai <yh...@databricks.com>
    Date:   Thu Aug 20 15:30:31 2015 +0800

        [SPARK-10092] [SQL] Multi-DB support follow up.
        https://issues.apache.org/jira/browse/SPARK-10092
        This pr is a follow-up one for Multi-DB support. It has the following changes:
        * `HiveContext.refreshTable` now accepts `dbName.tableName`.

I've added prints in a variety of places. When we run just the one suite we don't hit the problem - but with the whole batch of tests, we do. Example below; note that it's always in the join31 test.

    cat CheckHashJoinFullBatch.txt | grep -C 10 "join31"

    - auto_join30
    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    - auto_join31
    - auto_join32
    - auto_join4
    - auto_join5
    - auto_join6
    - auto_join7
    - auto_join8
    - auto_join9
    04:53:44.685 WARN org.apache.spark.sql.hive.execution.HashJoinCompatibilitySuite: Simplifications made on unsupported operations for test auto_join_filters
    - auto_join_filters
    - auto_join_nulls
    --
    05:08:18.329 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 43.0 in stage 2993.0 (TID 130982, localhost): TaskKilled (killed intentionally)
    05:08:18.330 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 40.0 in stage 2993.0 (TID 130979, localhost): TaskKilled (killed intentionally)
    05:08:18.340 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 15.0 in stage 2993.0 (TID 130954, localhost): TaskKilled (killed intentionally)
    05:08:18.341 ERROR org.apache.spark.executor.Executor: Managed memory leak detected; size = 12582912 bytes, TID = 130985
    05:08:18.341 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 46.0 in stage 2993.0 (TID 130985, localhost): TaskKilled (killed intentionally)
    05:08:18.343 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 41.0 in stage 2993.0 (TID 130980, localhost): TaskKilled (killed intentionally)
    05:08:18.343 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 26.0 in stage 2993.0 (TID 130965, localhost): TaskKilled (killed intentionally)
    05:08:18.345 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 4.0 in stage 2993.0 (TID 130943, localhost): TaskKilled (killed intentionally)
    05:08:18.345 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 11.0 in stage 2993.0 (TID 130950, localhost): TaskKilled (killed intentionally)
    05:08:18.349 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 28.0 in stage 2993.0 (TID 130967, localhost): TaskKilled (killed intentionally)
    - join31 *** FAILED ***
      Failed to execute query using catalyst:
      Error: Job aborted due to stage failure: Task 42 in stage 2993.0 failed 1 times, most recent failure: Lost task 42.0 in stage 2993.0 (TID 130981, localhost): java.io.IOException: Unable to acquire 4194304 bytes of memory
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:371)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:350)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:489)
        at org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:138)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:477)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:610)
        at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
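My (possibly wrong) mental model of why this only shows up in the full batch: the shuffle memory manager appears to cap each task at a fair share of the pool, roughly poolSize / numActiveTasks, so with enough concurrent tasks a 4 MB page request is only partially granted, and the sorter treats a partial grant as a hard failure. Below is a self-contained toy sketch of that policy in Scala - my paraphrase under assumed numbers (the 64 MB pool and 48 tasks are made up for illustration), not Spark's actual code:

    import java.io.IOException

    // Toy model of a fair-share memory pool (my paraphrase, not the real
    // ShuffleMemoryManager): each task is capped at roughly
    // poolSize / numActiveTasks, so tryToAcquire can legitimately return
    // less than was asked for.
    class ToyShuffleMemoryPool(poolSize: Long) {
      private var used = 0L

      // Grant at most min(requested, fair-share headroom, remaining pool).
      def tryToAcquire(requested: Long, numActiveTasks: Int, alreadyHeld: Long): Long =
        synchronized {
          val perTaskCap = poolSize / numActiveTasks
          val headroom   = math.min(perTaskCap - alreadyHeld, poolSize - used)
          val granted    = math.max(0L, math.min(requested, headroom))
          used += granted
          granted
        }

      def release(bytes: Long): Unit = synchronized { used -= bytes }
    }

    object PartialGrantDemo extends App {
      val pool     = new ToyShuffleMemoryPool(poolSize = 64L << 20) // assumed 64 MB pool
      val pageSize = 4L << 20                                       // 4194304, as in my logs

      // With ~48 concurrent tasks the per-task cap is ~1.4 MB, below one
      // page, so the grant comes back short...
      val granted = pool.tryToAcquire(pageSize, numActiveTasks = 48, alreadyHeld = 0L)
      if (granted < pageSize) {
        // ...and (as I read it) the sorter gives the partial grant back and
        // fails hard, which would match the "memoryAcquired is: 1433178"
        // line in the failing output further down.
        pool.release(granted)
        throw new IOException(s"Unable to acquire $pageSize bytes of memory")
      }
    }

If that reading is right, it would also explain why a box with 10 GB+ free never helps: the limit being hit is the per-task share of the in-process pool, not machine memory.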
I run the test on its own with:

    mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -DwildcardSuites=org.apache.spark.sql.hive.execution.HashJoinCompatibilitySuite -fn test > CheckHashJoin.txt 2>&1

I run the whole batch with:

    mvn -Pyarn -Phadoop-2.2 -Phive -Phive-thriftserver -fn test > CheckHashJoinFullBatch.txt 2>&1

Java version:

    java version "1.7.0_65"
    OpenJDK Runtime Environment (rhel-2.5.1.2.el7_0-x86_64 u65-b17)
    OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)

Note the below is also the case when we have plenty of memory reported as free (10 GB+):

    free -m
                 total       used       free     shared    buffers     cached
    Mem:         11855      11389        466        668          0       3305
    -/+ buffers/cache:       8084       3771
    Swap:         6023         83       5940

Potentially useful debug info when it passes:

    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    - join31

When it fails, my printout shows that useOverflowPage is set to true. The output features:

    creating with existing in memory sorter, pageSizeBytes: 4194304
    Creating unsafe external sorter, pageSizeBytes: 4194304
    determined total space required is: 24
    *decided to use overflow page*
    *Required space (24) is less than free space in current page (0)*
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 1433178
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    07:41:01.442 ERROR org.apache.spark.executor.Executor: Managed memory leak detected; size = 8388608 bytes, TID = 230633
    Creating unsafe external sorter, pageSizeBytes: 4194304
    acquiring 4194304 from shuffle memory manager
    memoryAcquired is: 4194304
    creating with existing in memory sorter, pageSizeBytes: 4194304
    07:41:01.442 ERROR org.apache.spark.executor.Executor: Exception in task 4.0 in stage 6400.0 (TID 230633)
    java.io.IOException: Unable to acquire 4194304 bytes of memory

Note that I was originally hitting the unable-to-acquire-memory problems with the default pageSize; that was addressed by the helpful post here:
https://mail-archives.apache.org/mod_mbox/spark-user/201508.mbox/%3CCA+LY3qkm2fH_ioMN6a-f+YvFEhavskZR73wbKZaZ=wvf9+o...@mail.gmail.com%3E

Perhaps I need to set another option or change the value? My top-level pom.xml features <spark.buffer.pageSize>4m</spark.buffer.pageSize> for both Java and Scala tests.
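For anyone wanting to experiment outside the Maven test harness, the same property can be set programmatically. A minimal sketch - spark.buffer.pageSize is the property name my pom.xml already uses; the local[4] master, app name, and the "1m" value are just illustrative guesses at a smaller page, not a verified fix:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: set the same page-size property the pom.xml passes to
    // the test JVMs. "1m" is an illustrative smaller value, unverified.
    val conf = new SparkConf()
      .setAppName("page-size-repro")        // hypothetical app name
      .setMaster("local[4]")                // assumed local test setup
      .set("spark.buffer.pageSize", "1m")   // smaller page, for experiment
    val sc = new SparkContext(conf)

If the fair-share reading above is right, a smaller page should make it less likely that a task's share drops below one page, at the cost of more pages and more spilling - but I haven't confirmed it cures join31.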