akpatnam25 commented on pull request #33644:
URL: https://github.com/apache/spark/pull/33644#issuecomment-894694073


   For example, below is a snippet of the kind of case we want to safeguard against:
   ```
   def createValue() = new Array[Byte](180 * 1024 * 1024)
   def createRdd(n: Int) = sc.parallelize(0 until n, n).map(_ => createValue())
   val rdd = createRdd(13)
   rdd.treeAggregate(createValue())((v1: Array[Byte], v2: Array[Byte]) => v1, (v1: Array[Byte], v2: Array[Byte]) => v1, 4)
   ```
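
   For context, the reason this snippet pressures the driver is that `treeAggregate`'s final combine is a plain `reduce`, which fetches the surviving partitions' partial results back to the driver and merges them there. A rough sketch of that pattern (not the exact Spark internals):
   ```
   // Rough sketch only: after the executor-side tree levels, the remaining
   // partitions' ~180 MB arrays are all fetched to the driver, which has to hold
   // them simultaneously while merging. createRdd(13) stands in here for the
   // partially aggregated RDD.
   val partiallyAggregated = createRdd(13)
   val result = partiallyAggregated.reduce((v1, v2) => v1)  // driver-side merge of fetched partials
   ```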
   For the above snippet, we generate synthetic data so that as many large partition results as possible get pulled into the driver. This is just to illustrate what "bad behavior" might look like. With enough driver memory the job succeeds, but it fails when driver memory is not set high enough. The applications that have hit this issue have, of course, worked around it by increasing driver memory and similar tuning.
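
   As a concrete (purely illustrative) example of that workaround, the settings involved look roughly like the following; the values are made up, and `spark.driver.memory` has to be supplied before the driver JVM starts (e.g. via `spark-submit --driver-memory` or `spark-defaults.conf`), not on a live `SparkConf`:
   ```
   import org.apache.spark.SparkConf

   // Illustrative values only: give the driver enough headroom to hold several
   // ~180 MB partial results at once during the final reduce.
   val conf = new SparkConf()
     .set("spark.driver.memory", "8g")         // driver heap (must be set at launch time)
     .set("spark.driver.maxResultSize", "4g")  // cap on total serialized results fetched by actions
   ```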
   
   Given that, the overhead for most normally sized cases would just be the extra stage. In the synthetic workload above, there was no noticeable difference in time to complete regardless of how many partitions we pull into the last stage. That last reduce step would be happening on the driver side anyway, so the only real overhead is the shuffle and the scheduling.
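
   To make the extra-stage idea concrete, here is a rough sketch (not the API this PR adds) of pushing the final combine onto an executor so the driver only fetches a single result:
   ```
   // Sketch only: route all partial aggregates through one extra shuffle stage so
   // the combine runs on an executor, and the driver pulls back just one ~180 MB array.
   val partials = createRdd(13)
   val combined = partials
     .map(v => (0, v))                                          // single key: all partials meet in one reducer
     .reduceByKey((v1: Array[Byte], v2: Array[Byte]) => v1, 1)  // the extra stage, executed on an executor
     .values
     .first()                                                   // driver receives only the combined value
   ```
   The combine work itself is unchanged; it just runs on an executor instead of the driver, which is exactly the shuffle-plus-scheduling overhead described above.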

