[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM
[ https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195967#comment-15195967 ]

Michael Armbrust commented on SPARK-12546:
------------------------------------------

There is no partitioning in that example, so it is not the same problem. Also, the stack trace is in Zeppelin code, not Spark.

> Writing to partitioned parquet table can fail with OOM
> ------------------------------------------------------
>
>                 Key: SPARK-12546
>                 URL: https://issues.apache.org/jira/browse/SPARK-12546
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Nong Li
>            Assignee: Michael Armbrust
>            Priority: Blocker
>              Labels: releasenotes
>             Fix For: 1.6.1, 2.0.0
>
> It is possible for jobs to fail with OOM when writing to a partitioned
> parquet table. While this was probably always possible, it is more likely in
> 1.6 due to the memory manager changes. The unified memory manager enables
> Spark to use more of the process memory (in particular, for execution), which
> gets us into this state more often. This issue can happen with libraries that
> consume a lot of memory, such as parquet. Prior to 1.6, these libraries would
> more likely use memory that Spark was not using (i.e. from the storage pool).
> In 1.6, this storage memory can now be used for execution.
> There are a couple of configs that can help with this issue:
> - parquet.memory.pool.ratio: a parquet config controlling how much of the
>   heap the parquet writers should use. It defaults to 0.95; consider a much
>   lower value (e.g. 0.1).
> - spark.memory.fraction: a Spark config controlling how much of the memory
>   should be allocated to Spark. Consider setting this to 0.6.
> This should cause jobs to potentially spill more but require less memory.
> More aggressive tuning will control this trade-off.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
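[Editor's note] As a sketch only, the two knobs described in the issue could be set as below when building the context. The values are the illustrative ones from the description, not recommendations, and whether parquet.memory.pool.ratio must go through the Hadoop configuration (as assumed here) or propagates from spark.hadoop.* properties may depend on the Spark and parquet-mr versions in use:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("partitioned-parquet-write")
  // Spark-side: shrink the unified (execution + storage) pool so more
  // heap is left for non-Spark consumers such as the parquet writers.
  .set("spark.memory.fraction", "0.6")

val sc = new SparkContext(conf)

// Parquet-side: parquet-mr reads this from the Hadoop configuration.
// Caps how much of the heap parquet's memory manager assumes it may
// use (default 0.95).
sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.1")
{code}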
[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM
[ https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15195157#comment-15195157 ]

Jakub Liska commented on SPARK-12546:
-------------------------------------

Hey, I migrated to 1.6.0. Is it possible that this somehow relates to the really weird problem below? There are plenty of resources and the sample data is really tiny:

{code}
val coreRdd = sc.textFile("s3n://gwiq-views-p/external/core/tsv/*.tsv")
  .map(_.split("\t"))
  .map(fields => Row(fields: _*))
val coreDataFrame = sqlContext.createDataFrame(coreRdd, schema)
coreDataFrame.registerTempTable("core")
coreDataFrame.persist(StorageLevel.DISK_ONLY)
{code}

{code}
SELECT COUNT(*) FROM core
{code}

{code}
-- Create new SparkContext spark://master:7077 ---
Exception in thread "pool-1-thread-5" java.lang.OutOfMemoryError: GC overhead limit exceeded
	at com.google.gson.internal.bind.ObjectTypeAdapter.read(ObjectTypeAdapter.java:66)
	at com.google.gson.internal.bind.ObjectTypeAdapter.read(ObjectTypeAdapter.java:69)
	at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.read(TypeAdapterRuntimeTypeWrapper.java:40)
	at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:188)
	at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.read(MapTypeAdapterFactory.java:146)
	at com.google.gson.Gson.fromJson(Gson.java:791)
	at com.google.gson.Gson.fromJson(Gson.java:757)
	at com.google.gson.Gson.fromJson(Gson.java:706)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.convert(RemoteInterpreterServer.java:417)
	at org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer.getProgress(RemoteInterpreterServer.java:384)
	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1376)
	at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Processor$getProgress.getResult(RemoteInterpreterService.java:1361)
	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
	at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM
[ https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157605#comment-15157605 ]

Apache Spark commented on SPARK-12546:
--------------------------------------

User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/11308
[jira] [Commented] (SPARK-12546) Writing to partitioned parquet table can fail with OOM
[ https://issues.apache.org/jira/browse/SPARK-12546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107258#comment-15107258 ]

Nong Li commented on SPARK-12546:
---------------------------------

A better workaround might be to limit the maximum number of concurrent output files to 1. This can be done by setting "spark.sql.sources.maxConcurrentWrites=1".
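[Editor's note] A sketch of applying this workaround before a partitioned write. The DataFrame (df), the partition column ("date"), and the output path are hypothetical placeholders; only the config key and value come from the comment above:

{code}
// Limit Spark SQL to one open output file per task, trading write
// parallelism within a task for a smaller writer memory footprint.
sqlContext.setConf("spark.sql.sources.maxConcurrentWrites", "1")

// A partitioned parquet write, the operation this issue is about.
df.write
  .partitionBy("date")
  .parquet("/path/to/output")
{code}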