[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even when no data is collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-12837:
----------------------------
    Fix Version/s:     (was: 2.2.1)
                       (was: 2.3.0)
                       2.2.0

> Spark driver requires large memory space for serialized results even when
> no data is collected to the driver
> -------------------------------------------------------------------------
>
>                 Key: SPARK-12837
>                 URL: https://issues.apache.org/jira/browse/SPARK-12837
>             Project: Spark
>          Issue Type: Question
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Tien-Dung LE
>            Assignee: Wenchen Fan
>            Priority: Critical
>             Fix For: 2.2.0
>
> Executing a SQL statement with a large number of partitions requires a large
> amount of driver memory, even when no data is collected back to the driver.
> Here are the steps to reproduce the issue.
> 1. Start the Spark shell with a small spark.driver.maxResultSize setting:
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the following code:
> {code:java}
> case class Toto(a: Int, b: Int)
> val df = sc.parallelize(1 to 1e6.toInt).map(i => Toto(i, i)).toDF
> sqlContext.setConf("spark.sql.shuffle.partitions", "200")
> df.groupBy("a").count().saveAsParquetFile("toto1") // OK
> sqlContext.setConf("spark.sql.shuffle.partitions", 1e3.toInt.toString)
> df.repartition(1e3.toInt).groupBy("a").count()
>   .repartition(1e3.toInt).saveAsParquetFile("toto2") // ERROR
> {code}
> The error message is:
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure:
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than
> spark.driver.maxResultSize (1024.0 KB)
> {code}
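A note for context, not part of the original report: spark.driver.maxResultSize caps the total size of the serialized results that all tasks ship back to the driver, and that total grows with the task count even when no rows are collect()ed. In the error above that works out to roughly 1025.9 KB / 393 tasks ≈ 2.6 KB per task, presumably mostly task status, metrics, and accumulator updates, so a job with ~1,000 shuffle partitions overruns a 1 MB cap on its own. A minimal workaround sketch, assuming Spark 1.6-era APIs (the app name, the local master, and the 2g value are illustrative choices, not from the report):

{code:java}
// Sketch only: spark.driver.maxResultSize is read from the SparkConf,
// so it must be set before the SparkContext is created.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("spark-12837-workaround")     // hypothetical app name
  .setMaster("local[*]")                    // local master for a self-contained test
  .set("spark.driver.maxResultSize", "2g")  // or "0" to disable the check
val sc = new SparkContext(conf)
{code}

The same effect can be had from the shell used in the repro steps, e.g. bin/spark-shell --conf spark.driver.maxResultSize=2g.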
[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even when no data is collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12837:
-------------------------------
    Target Version/s: 2.0.0
            Priority: Critical  (was: Major)
[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even when no data is collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu updated SPARK-12837:
-------------------------------
    Assignee: Wenchen Fan
[jira] [Updated] (SPARK-12837) Spark driver requires large memory space for serialized results even when no data is collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien-Dung LE updated SPARK-12837:
---------------------------------
    Description: retagged the description's {code:shell} and {code:scala} blocks as {code:java}; the description text, quoted in full in the first message above, is otherwise unchanged.