ChenMichael commented on pull request #34684:
URL: https://github.com/apache/spark/pull/34684#issuecomment-975845896


   I'm not sure this is the best way to fix the bug, so I'll describe the two 
other solutions I could come up with and the possible problems with each.
   
   1. Materialize the cached RDD immediately (`.count`)
       - `buildBuffers` becomes a blocking call, which moves the cost of 
materialization from execution time to planning time.
       - Some code may assume that the cached RDD is not materialized until 
execution.
       - If someone obtains the cached RDD but never executes it, the work 
spent materializing it is wasted. From the code, obtaining the RDD seems to 
always be followed by submitting the job to the DAGScheduler, so I don't see 
when this would actually happen.
   
   2. Never use accumulator stats for `InMemoryRelation` when AQE is enabled
       - The accumulator stats should be more accurate than the estimated 
stats, so this could miss optimization opportunities.
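
   To make the trade-off between the two options concrete, here is a toy model 
in plain Python. It is not Spark code: `ToyCachedRelation`, `planning_size`, 
and all the numbers are hypothetical. It also assumes that the accumulator is 
updated partition by partition, so a read before the cache is fully built sees 
only a partial size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyCachedRelation:
    """Hypothetical stand-in for a cached relation; not Spark's InMemoryRelation."""
    partition_sizes: List[int]   # true size in bytes of each partition
    estimated_size: int          # planner's estimate before anything is cached
    accumulated_size: int = 0    # toy "accumulator": bytes cached so far
    cached_partitions: int = 0

    def cache_next_partition(self) -> None:
        # Cache one more partition and add its size to the accumulator.
        if self.cached_partitions < len(self.partition_sizes):
            self.accumulated_size += self.partition_sizes[self.cached_partitions]
            self.cached_partitions += 1

    def cache_all(self) -> None:
        # Option 1: block until every partition is cached (like calling .count).
        while self.cached_partitions < len(self.partition_sizes):
            self.cache_next_partition()


def planning_size(rel: ToyCachedRelation, use_accumulator: bool) -> int:
    # Prefer the accumulator once it holds data; otherwise use the estimate.
    if use_accumulator and rel.accumulated_size > 0:
        return rel.accumulated_size
    return rel.estimated_size


def make() -> ToyCachedRelation:
    return ToyCachedRelation(partition_sizes=[100, 200, 300], estimated_size=1000)


# Lazy caching, stats read after only one partition is cached (assumed scenario).
lazy = make()
lazy.cache_next_partition()
lazy_size = planning_size(lazy, use_accumulator=True)

# Option 1: materialize everything up front, then read the accumulator.
eager = make()
eager.cache_all()
eager_size = planning_size(eager, use_accumulator=True)

# Option 2: never consult the accumulator, even when it is complete.
no_acc = make()
no_acc.cache_all()
no_acc_size = planning_size(no_acc, use_accumulator=False)

print(lazy_size, eager_size, no_acc_size)  # 100 600 1000
```

   In this model, option 1 yields the exact size (600) at the price of a 
blocking `cache_all`, while option 2 always reports the estimate (1000) and so 
cannot benefit from the more accurate number.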


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


