[jira] [Comment Edited] (SPARK-10474) TungstenAggregation cannot acquire memory for pointer array after switching to sort-based

Hans van den Bogert (JIRA) Tue, 06 Oct 2015 06:46:42 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944925#comment-14944925
 ]


Hans van den Bogert edited comment on SPARK-10474 at 10/6/15 1:46 PM:
----------------------------------------------------------------------

One more debug println for the calculated cores (in contrast to numCores):
https://gist.github.com/hansbogert/cc2baf3995d4e37270a2

Relevant output (output is the same for fine-grained as well as coarse-grained 
mesos):
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/06 10:25:04 WARN SparkConf: In Spark 1.0 and later spark.local.dir will 
be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
mesos/standalone and LOCAL_DIRS in YARN).
numCores:0
cores:12
1048576
15/10/06 10:25:05 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
...
{noformat}

The calculated 'cores' is 12, which is the number of cores on the local driver 
node; the Mesos cluster as a whole has more than 40 cores. Either way, there 
is no difference between fine-grained and coarse-grained mode, for this method 
at least.

/update 
I should have read the logs on the Mesos slaves as well. There is indeed a 
discrepancy between fine-grained mode and coarse-grained mode.
In fine-grained mode:
{noformat}
head 
/local/vdbogert/var/lib/mesos/slaves/20151006-105432-84120842-5050-17066-S3/frameworks/20151006-105432-84120842-5050-17066-0009/executors/20151006-105432-84120842-5050-17066-S3/runs/latest/stdout
2numCores:1
cores:1
67108864
{noformat}


And in coarse-grained mode:
{noformat}
head 
/local/vdbogert/var/lib/mesos/slaves/20151006-105432-84120842-5050-17066-S3/frameworks/20151006-105432-84120842-5050-17066-0010/executors/3/runs/latest/stdout
Registered executor on node326.ib.cluster
Starting task 3
sh -c ' "/var/scratch/vdbogert/src/spark-1.5.1/bin/spark-class" 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
akka.tcp://sparkDriver@10.141.3.254:56069/user/CoarseGrainedScheduler 
--executor-id 20151006-105432-84120842-5050-17066-S3 --hostname 
node326.ib.cluster --cores 8 --app-id 20151006-105432-84120842-5050-17066-0010'
Forked command at 4378
numCores:8
cores:8
16777216
{noformat}

This is probably a different bug, specific to Mesos fine-grained mode. My 
current workaround is to set `spark.buffer.pageSize` to 16M, the value that 
coarse-grained mode picks automatically.
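
The three page sizes printed above (1048576, 67108864 and 16777216) are consistent with a default that divides the available memory by the core count, rounds up to a power of two, and clamps the result. The following is only a sketch of that idea, not Spark's actual code: the safety factor of 16, the clamp bounds of 1 MB and 64 MB, and the memory figures in the usage lines are assumptions chosen because they reproduce the logged numbers.

```python
MIN_PAGE = 1 << 20          # 1 MB lower bound (assumed; matches the driver output)
MAX_PAGE = 64 * MIN_PAGE    # 64 MB upper bound (assumed; matches fine-grained output)


def next_power_of_2(n: int) -> int:
    """Return the smallest power of two that is >= n (1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def default_page_size(max_memory: int, num_cores: int,
                      available_processors: int,
                      safety_factor: int = 16) -> int:
    """Sketch of a core-dependent default page size.

    num_cores == 0 means "unknown", in which case the local processor
    count is used instead (as the 'numCores:0 / cores:12' output suggests).
    """
    cores = num_cores if num_cores > 0 else available_processors
    size = next_power_of_2(max_memory // cores // safety_factor)
    return min(MAX_PAGE, max(MIN_PAGE, size))


# Hypothetical memory budgets, picked only to reproduce the logged values.
driver = default_page_size(160 * 2**20, num_cores=0, available_processors=12)
fine = default_page_size(1536 * 2**20, num_cores=1, available_processors=12)
coarse = default_page_size(1536 * 2**20, num_cores=8, available_processors=12)
print(driver, fine, coarse)
```

Under these assumptions, an executor that reports a single core ends up with a page four times larger than one that reports eight cores with the same memory budget, which would explain why only fine-grained mode lands on 64 MB.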

/update2
Even when allocating only 16MB, just like in coarse-grained mode (and going 
even lower, to 8MB), I'm *still* seeing this pop up. So 


> TungstenAggregation cannot acquire memory for pointer array after switching 
> to sort-based
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-10474
>                 URL: https://issues.apache.org/jira/browse/SPARK-10474
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Yi Zhou
>            Assignee: Andrew Or
>            Priority: Blocker
>             Fix For: 1.5.1, 1.6.0
>
>
> In an aggregation case, a task was lost with the error below.
> {code}
>  java.io.IOException: Could not acquire 65536 bytes of memory
>         at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
>         at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
>         at org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
>         at org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>         at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>         at org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         at org.apache.spark.scheduler.Task.run(Task.scala:88)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Key SQL Query
> {code:sql}
> INSERT INTO TABLE test_table
> SELECT
>   ss.ss_customer_sk AS cid,
>   count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
>   count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
>   count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
>   count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
>   count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
>   count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
>   count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
>   count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
>   count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
>   count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
>   count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
>   count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
>   count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
>   count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
>   count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
> FROM store_sales ss
> INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
> WHERE i.i_category IN ('Books')
> AND ss.ss_customer_sk IS NOT NULL
> GROUP BY ss.ss_customer_sk
> HAVING count(ss.ss_item_sk) > 5
> {code}
> Note:
> store_sales is a large fact table and item is a small dimension table.



