[ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152674#comment-17152674 ]
George George commented on SPARK-31635:
---------------------------------------

Hello [~Chen Zhang],

Thanks a lot for getting back on this. I would agree with you that it is an improvement. However, since it fails when using the DataFrame API and there is no documentation on this behaviour, I thought of it as a bug. Your suggestion sounds really good to me, and I think it's good to give the user the opportunity to configure this. Basically, the user can then decide whether to wait a little longer for the result or to put more pressure on the driver. I could also try to submit a PR, but I guess I would need a bit more time for it. Just let me know whether you would rather wait for my PR or do it yourself.

Best,
George

> Spark SQL Sort fails when sorting big data points
> -------------------------------------------------
>
>                 Key: SPARK-31635
>                 URL: https://issues.apache.org/jira/browse/SPARK-31635
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: George George
>            Priority: Major
>
> Please have a look at the example below:
> {code:java}
> case class Point(x: Double, y: Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a =>
>   Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
> test.toDF().as[Nested].sort("a").take(1)
> {code}
> *Sorting* big data objects using the *Spark DataFrame* API fails with the following exception:
> {code:java}
> 2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized
> results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize
> (100.0 MB)
> [Stage 0:======>          (12 + 3) / 100]
> org.apache.spark.SparkException: Job aborted due to stage failure: Total
> size of serialized results of 13 tasks (100.1 MB) is bigger than
> spark.driver.maxResu
> {code}
> However, using the *RDD API* works and no exception is thrown:
> {code:java}
> case class Point(x: Double, y: Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize((1L to 100L).map(a =>
>   Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
> test.sortBy(_.a).take(1)
> {code}
> For both code snippets we started the spark shell with exactly the same arguments:
> {code:java}
> spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
> {code}
> Even if we increase spark.driver.maxResultSize, the executors still get killed for our use case. The interesting thing is that when using the RDD API directly the problem is not there. *Looks like there is a bug in the DataFrame sort, because it is shuffling too much data to the driver?*
> Note: this is a small example and I reduced spark.driver.maxResultSize to a smaller value, but in our application I tried setting it to 8 GB and, as mentioned above, the job was still killed.
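[Editor's note, not part of the original report: a minimal sketch, assuming a spark-shell session against an affected 2.x release, of how one might inspect why the DataFrame path sends so much data to the driver, and a possible workaround that mirrors the reporter's working RDD snippet. The `explain()` call is only for inspection; in these versions a limit over a global sort is typically planned as TakeOrderedAndProject, which collects the top rows of every partition to the driver before merging, which would be consistent with the maxResultSize error above.]

{code:java}
// Hedged sketch (spark-shell); redefines the same case classes and RDD as the
// snippets above. spark-shell imports spark.implicits._ automatically.
case class Point(x: Double, y: Double)
case class Nested(a: Long, b: Seq[Point])

val test = spark.sparkContext.parallelize(
  (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)

// Inspect the physical plan: the sorted limit is typically planned as
// TakeOrderedAndProject, which pulls the top row of every partition to the
// driver before merging. With rows of this size, that alone can exceed
// spark.driver.maxResultSize.
test.toDF().as[Nested].sort("a").limit(1).explain()

// Possible workaround until the behaviour is configurable: drop back to the
// RDD API for the final step, so only the requested row reaches the driver.
test.toDF().as[Nested].rdd.sortBy(_.a).take(1)
{code}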