Dear all,

We have run into failed jobs when processing a large amount of data.

A simple local test was prepared for this question at
https://gist.github.com/copy-of-rezo/6a137e13a1e4f841e7eb
It generates two sets of key-value pairs, joins them, selects the distinct
values, and finally counts the result.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits (join) on older Spark versions

object Spill {
  // 10 keys x 200 values = 2,000 (j, i) pairs per dataset
  def generate = {
    for {
      j <- 1 to 10
      i <- 1 to 200
    } yield (j, i)
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName(getClass.getSimpleName)
    conf.set("spark.shuffle.spill", "true")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)
    println(generate)

    val dataA = sc.parallelize(generate)
    val dataB = sc.parallelize(generate)
    // join on key j, drop duplicate (j, (iA, iB)) pairs, then count
    val dst = dataA.join(dataB).distinct().count()
    println(dst)
  }
}

We compiled it locally and ran it 3 times with different memory settings (an
example spark-submit invocation is shown after the list):
1) --executor-memory 10M --driver-memory 10M --num-executors 1 --executor-cores 1
It fails with "java.lang.OutOfMemoryError: GC overhead limit exceeded" at
.....
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)

2) --executor-memory 20M --driver-memory 20M --num-executors 1 --executor-cores 1
It works OK.

3) --executor-memory 10M --driver-memory 10M --num-executors 1 --executor-cores 1,
but with less data: i now ranges from 1 to 100 instead of 1 to 200. This halves
the input data and shrinks the joined data by a factor of 4:

  def generate = {
    for{
      j <- 1 to 10
      i <- 1 to 100   // previous value was 200 
    } yield(j, i)
  }
This code works OK. 
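
For reference, each run is launched roughly like this (the jar name and the
local master URL are placeholders on our side; only the memory flags change
between the three runs):

spark-submit --class Spill \
  --master local[1] \
  --executor-memory 10M --driver-memory 10M \
  --num-executors 1 --executor-cores 1 \
  spill-test.jar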

We don't understand why 10M is not enough for such a simple operation on
roughly 32,000 bytes of ints (2 * 10 * 200 * 2 * 4). 10M of RAM does work if
we halve the data volume (from 2,000 records of (int, int) per dataset to
1,000). Why doesn't spilling to disk cover this case?
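
For clarity, here is the back-of-the-envelope arithmetic behind the numbers
above; the per-record estimate of 2 ints * 4 bytes counts raw primitives only,
and the join counts follow directly from the generator:

// Rough size estimate for the raw input (primitive ints only):
val recordsPerDataset = 10 * 200                    // 2,000 (j, i) pairs
val rawInputBytes = 2 * recordsPerDataset * 2 * 4   // ~32,000 bytes for both datasets
// After the join, every key j pairs each of its 200 values from dataA
// with each of its 200 values from dataB:
val joinedRecords = 10 * 200 * 200                  // 400,000 (j, (iA, iB)) records
// With i limited to 100 the input halves and the join output shrinks 4x:
val joinedRecordsSmall = 10 * 100 * 100             // 100,000 records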




