[jira] [Updated] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching
[ https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-15796:
    Priority: Minor (was: Major)
    Issue Type: Improvement (was: Bug)

> Spark 1.6 default memory settings can cause heavy GC when caching
>
>             Key: SPARK-15796
>             URL: https://issues.apache.org/jira/browse/SPARK-15796
>         Project: Spark
>      Issue Type: Improvement
> Affects Versions: 1.6.0, 1.6.1
>        Reporter: Gabor Feher
>        Priority: Minor

While debugging performance issues in a Spark program, I found a simple way to slow down Spark 1.6 significantly by filling the RDD memory cache. This seems to be a regression, because setting "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro: a simple program that fills Spark's memory cache using a MEMORY_ONLY cached RDD (of course, the same thing comes up in more complex situations, too):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel

object CacheDemoApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Cache Demo Application")
    val sc = new SparkContext(conf)
    val startTime = System.currentTimeMillis()

    val cacheFiller = sc.parallelize(1 to 5, 1000)
      .mapPartitionsWithIndex {
        case (ix, it) =>
          println(s"CREATE DATA PARTITION ${ix}")
          val r = new scala.util.Random(ix)
          it.map(x => (r.nextLong, r.nextLong))
      }
    cacheFiller.persist(StorageLevel.MEMORY_ONLY)
    cacheFiller.foreach(identity)

    val finishTime = System.currentTimeMillis()
    val elapsedTime = (finishTime - startTime) / 1000
    println(s"TIME= $elapsedTime s")
  }
}
{code}

If I run it the following way, it completes in around 5 minutes on my laptop, often stopping for slow Full GC cycles. I can also see with jvisualvm (Visual GC plugin) that the old generation of the JVM is 96.8% filled.
{code}
sbt package
~/spark-1.6.0/bin/spark-submit \
  --class "CacheDemoApp" \
  --master "local[2]" \
  --driver-memory 3g \
  --driver-java-options "-XX:+PrintGCDetails" \
  target/scala-2.10/simple-project_2.10-1.0.jar
{code}

If I add any one of the flags below, the run time drops to around 40-50 seconds, and the difference comes from the drop in GC times:

--conf "spark.memory.fraction=0.6"
OR
--conf "spark.memory.useLegacyMode=true"
OR
--driver-java-options "-XX:NewRatio=3"

All the other cache types except DISK_ONLY produce similar symptoms. It looks like the problem is that the amount of data Spark wants to store long-term ends up being larger than the old generation of the JVM, which triggers Full GC repeatedly.

I did some research:
* In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It defaults to 0.75.
* In Spark 1.5, spark.storage.memoryFraction is the upper limit on cache size. It defaults to 0.6, and http://spark.apache.org/docs/1.5.2/configuration.html even says that it shouldn't be bigger than the size of the old generation.
* On the other hand, OpenJDK's default NewRatio is 2, which means an old generation size of 66% of the heap. Hence the default value in Spark 1.6 contradicts this advice.

http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old generation is running close to full, setting spark.memory.storageFraction to a lower value should help. I have tried spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is no surprise: http://spark.apache.org/docs/1.6.1/configuration.html explains that storageFraction is not an upper limit but more like a lower limit on the size of Spark's cache. The real upper limit is spark.memory.fraction.

To sum up my questions/issues:
* At least http://spark.apache.org/docs/1.6.1/tuning.html should be fixed.
  Maybe the old generation size should also be mentioned in configuration.html near spark.memory.fraction.
* Is it a goal for Spark to support heavy caching with default parameters and without GC breakdown? If so, then better default values are needed.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
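The clash between the defaults can be seen with a back-of-the-envelope calculation. This is only a sketch, not Spark's exact accounting: the 300 MB reserved-memory constant and the formula (heap - reserved) * spark.memory.fraction are my reading of the Spark 1.6 unified memory manager, and the old-generation size follows from the -XX:NewRatio definition (old/new = NewRatio):

```scala
object MemoryBudgetSketch {
  def main(args: Array[String]): Unit = {
    val heap     = 3L * 1024 * 1024 * 1024 // --driver-memory 3g
    val reserved = 300L * 1024 * 1024      // Spark 1.6 reserved memory (assumed constant)

    // Upper limit on Spark-managed memory (storage + execution) in 1.6:
    def sparkLimit(memoryFraction: Double): Long =
      ((heap - reserved) * memoryFraction).toLong

    // Old generation size under -XX:NewRatio=n: old = n/(n+1) of the heap
    def oldGen(newRatio: Int): Long = heap * newRatio / (newRatio + 1)

    val defaultLimit = sparkLimit(0.75) // spark.memory.fraction default in 1.6
    val defaultOld   = oldGen(2)        // OpenJDK default NewRatio

    println(f"cache upper limit: ${defaultLimit / 1e9}%.2f GB")
    println(f"old generation:    ${defaultOld / 1e9}%.2f GB")
    // With the defaults, long-lived cached data can exceed the old generation:
    println(defaultLimit > defaultOld)     // true -> repeated Full GC
    // Either workaround restores the invariant cacheLimit <= oldGen:
    println(sparkLimit(0.6) <= defaultOld) // spark.memory.fraction=0.6
    println(defaultLimit <= oldGen(3))     // -XX:NewRatio=3
  }
}
```

For a 3 GB heap this gives a cache upper limit of about 2.18 GB against an old generation of about 2.15 GB, which would explain why either lowering spark.memory.fraction to 0.6 or raising NewRatio to 3 makes the Full GC cycles disappear in the repro above.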