[ https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944925#comment-14944925 ]
Hans van den Bogert edited comment on SPARK-10474 at 10/6/15 1:46 PM:
----------------------------------------------------------------------

One more debug println for the calculated cores (in contrast to numCores):
https://gist.github.com/hansbogert/cc2baf3995d4e37270a2

Relevant output (the output is the same for fine-grained and coarse-grained Mesos mode):
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/06 10:25:04 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
numCores:0 cores:12 1048576
15/10/06 10:25:05 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
...
{noformat}

The calculated 'cores' is 12, which is the number of cores on the local driver node; the Mesos cluster as a whole, however, has more than 40 cores. Either way, there is no difference between fine-grained and coarse-grained mode, for this method at least.

/update
I should've read the logs on the Mesos slaves as well; there is indeed a discrepancy between fine-grained mode and coarse-grained mode.

In fine-grained mode:
{noformat}
head /local/vdbogert/var/lib/mesos/slaves/20151006-105432-84120842-5050-17066-S3/frameworks/20151006-105432-84120842-5050-17066-0009/executors/20151006-105432-84120842-5050-17066-S3/runs/latest/stdout
2numCores:1 cores:1 67108864
{noformat}

And in coarse-grained mode:
{noformat}
head /local/vdbogert/var/lib/mesos/slaves/20151006-105432-84120842-5050-17066-S3/frameworks/20151006-105432-84120842-5050-17066-0010/executors/3/runs/latest/stdout
Registered executor on node326.ib.cluster
Starting task 3
sh -c ' "/var/scratch/vdbogert/src/spark-1.5.1/bin/spark-class" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url akka.tcp://sparkDriver@10.141.3.254:56069/user/CoarseGrainedScheduler --executor-id 20151006-105432-84120842-5050-17066-S3 --hostname node326.ib.cluster --cores 8 --app-id 20151006-105432-84120842-5050-17066-0010'
Forked command at 4378
numCores:8 cores:8 16777216
{noformat}

This is probably a different bug, specific to Mesos fine-grained mode. My current workaround is to set `spark.buffer.pageSize` to 16M, the value that coarse-grained mode uses automatically.

/update2
Even when allocating only 16MB, just as in coarse-grained mode (and going even lower, to 8MB), I'm *still* seeing this pop up.

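To make the three logged values easier to interpret, here is a minimal sketch of how a default page size could be derived from executor memory and core count. It is an assumption based on my reading of the numbers above, not a copy of the Spark source: the function names, the safety factor of 16, the 1 MiB / 64 MiB bounds and the 2 GiB memory figure are all illustrative.

```python
import os

# Assumed bounds: the logs show 1048576 (1 MiB) and 67108864 (64 MiB)
# as the smallest and largest values, so they are used as clamps here.
MIN_PAGE_SIZE = 1 * 1024 * 1024
MAX_PAGE_SIZE = 64 * MIN_PAGE_SIZE


def next_power_of_2(n: int) -> int:
    """Smallest power of two that is >= n (1 for n <= 1)."""
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()


def default_page_size(max_memory: int, num_cores: int, safety_factor: int = 16) -> int:
    """Hypothetical default page size in bytes.

    num_cores == 0 means "not told by the cluster manager", in which case
    the local processor count is used (this mirrors numCores:0 -> cores:12
    in the driver log above).
    """
    cores = num_cores if num_cores > 0 else (os.cpu_count() or 1)
    size = next_power_of_2(max_memory // cores // safety_factor)
    return min(MAX_PAGE_SIZE, max(MIN_PAGE_SIZE, size))


if __name__ == "__main__":
    two_gib = 2 * 1024 ** 3  # hypothetical executor memory
    print(default_page_size(two_gib, 1))  # fine-grained executor reports 1 core
    print(default_page_size(two_gib, 8))  # coarse-grained executor reports 8 cores
```

Under these assumptions, an executor that reports a single core ends up at the 64 MiB cap (67108864), while the same memory split over 8 cores gives 16 MiB (16777216). That matches the fine-grained and coarse-grained outputs above and would explain why fine-grained mode picks a page size four times larger.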
> TungstenAggregation cannot acquire memory for pointer array after switching to sort-based
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-10474
>                 URL: https://issues.apache.org/jira/browse/SPARK-10474
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yi Zhou
>            Assignee: Andrew Or
>            Priority: Blocker
>             Fix For: 1.5.1, 1.6.0
>
>
> In aggregation case, a Lost task happened with below error.
> {code}
> java.io.IOException: Could not acquire 65536 bytes of memory
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
>     at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
>     at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
>     at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>     at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>     at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1 THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3 THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5 THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7 THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9 THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2 THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4 THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6 THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8 THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> the store_sales is a big fact table and item is a small dimension table.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org