[ https://issues.apache.org/jira/browse/SPARK-31635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152674#comment-17152674 ]

George George commented on SPARK-31635:
---------------------------------------

Hello [~Chen Zhang], 

Thanks a lot for getting back on this. 

I would agree with you that it is an improvement. However, since it fails when 
using the DataFrame API and there is no documentation about this behavior, I 
initially considered it a bug.

Your suggestion sounds really good to me, and I think it makes sense to give the 
user the opportunity to configure this. That way the user can decide whether to 
wait a little longer for the result or to put more pressure on the driver.
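
Just to make that trade-off concrete, here is a rough sketch with the APIs that 
exist today (reusing the Point/Nested classes and the test RDD from the example 
quoted below; this is only an illustration of the two behaviours, not the 
proposed change):

{code:java}
// Sketch only, reusing `test` from the example in the issue description below.
val ds = test.toDF().as[Nested]

// DataFrame/Dataset path: each task ships its top row back to the driver,
// which appears to be what hits spark.driver.maxResultSize for rows this large.
ds.sort("a").take(1)

// RDD path: a range-partitioned sort plus an incremental take(1) that only
// pulls the requested rows to the driver, at the cost of a full shuffle.
ds.rdd.sortBy(_.a).take(1)
{code}

A configuration option along the lines you suggested would basically let the 
user pick between these two behaviours without rewriting the query.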

I could also try to submit a PR, but I guess I would need a bit more time for it. 
Just let me know if you would rather wait for my PR or do it yourself.

Best,

George

 

> Spark SQL Sort fails when sorting big data points
> -------------------------------------------------
>
>                 Key: SPARK-31635
>                 URL: https://issues.apache.org/jira/browse/SPARK-31635
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: George George
>            Priority: Major
>
>  Please have a look at the example below: 
> {code:java}
> case class Point(x:Double, y:Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize(
>   (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
> test.toDF().as[Nested].sort("a").take(1)
> {code}
> *Sorting* big data objects using the *Spark DataFrame* API fails with the 
> following exception: 
> {code:java}
> 2020-05-04 08:01:00 ERROR TaskSetManager:70 - Total size of serialized results of 14 tasks (107.8 MB) is bigger than spark.driver.maxResultSize (100.0 MB)
> [Stage 0:======>                                                 (12 + 3) / 100]org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 13 tasks (100.1 MB) is bigger than spark.driver.maxResu
> {code}
> However, using the *RDD API* works and no exception is thrown: 
> {code:java}
> case class Point(x:Double, y:Double)
> case class Nested(a: Long, b: Seq[Point])
> val test = spark.sparkContext.parallelize(
>   (1L to 100L).map(a => Nested(a, Seq.fill[Point](250000)(Point(1, 2)))), 100)
> test.sortBy(_.a).take(1)
> {code}
> For both code snippets we started the spark shell with exactly the same 
> arguments:
> {code:java}
> spark-shell --driver-memory 6G --conf "spark.driver.maxResultSize=100MB"
> {code}
> Even if we increase spark.driver.maxResultSize, the executors still get killed 
> for our use case. The interesting thing is that when using the RDD API directly 
> the problem is not there. *Could there be a bug in the DataFrame sort that 
> shuffles too much data to the driver?* 
> Note: this is a small example where I reduced spark.driver.maxResultSize to a 
> smaller value; in our real application I tried setting it to 8GB, but as 
> mentioned above the job was still killed. 
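> 
> As a pointer for anyone reproducing this: the physical plan shows which 
> operator pulls the rows to the driver (the exact plan depends on the Spark 
> version):
> {code:java}
> // Print the physical plan of the failing query; on 2.3.x the sorted take
> // should be planned as TakeOrderedAndProject, which collects the top rows of
> // every partition to the driver before picking the smallest one.
> test.toDF().as[Nested].sort("a").limit(1).explain()
> {code}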
>  


