ChenMichael commented on pull request #34684: URL: https://github.com/apache/spark/pull/34684#issuecomment-975845896
I'm not sure this is the best way to fix the bug, so I'll describe the other solutions I could come up with and the possible problems with each.

1. Materialize the cached RDD immediately (`.count`)
   - `buildBuffers` becomes a blocking call. This moves the cost of materialization from execution time to planning time.
   - There may be code that assumes the cached RDD is not materialized until execution.
   - If someone obtains the cached RDD but never executes it, the effort spent materializing it is wasted. From the code, obtaining the RDD always seems to be followed by submitting the job to the DAGScheduler, so I don't know why this would happen.
2. Never use accumulator stats for InMemoryRelation when AQE is on
   - The accumulator stats should be more accurate than the estimated stats, so optimization opportunities can be missed.
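
To make the trade-off between the two alternatives concrete, here is a toy model in plain Python. It is only a sketch and does not use Spark: `ToyCachedRelation`, `stats_eager`, and `stats_estimate_only` are hypothetical names, and the model assumes (my reading, not something stated above) that the size accumulator is only filled in as a side effect of materializing the cache.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyCachedRelation:
    """Hypothetical stand-in for a cached relation; not Spark's InMemoryRelation."""
    rows: List[bytes]
    estimated_size: int          # up-front estimate available at planning time
    accumulated_size: int = 0    # assumed to be filled in only while caching
    materialized: bool = False

    def materialize(self) -> None:
        """Build the cache and update the size accumulator as a side effect."""
        if not self.materialized:
            self.accumulated_size = sum(len(r) for r in self.rows)
            self.materialized = True


def stats_eager(rel: ToyCachedRelation) -> int:
    """Alternative 1: force materialization before reading stats.

    Accurate, but the caller blocks and pays the caching cost at planning time.
    """
    rel.materialize()
    return rel.accumulated_size


def stats_estimate_only(rel: ToyCachedRelation) -> int:
    """Alternative 2: ignore the accumulator and always use the estimate.

    Never blocks, but stays inaccurate even after the cache has been built.
    """
    return rel.estimated_size


if __name__ == "__main__":
    rel = ToyCachedRelation(rows=[b"abcd", b"ef"], estimated_size=100)
    print(stats_estimate_only(rel), rel.materialized)
    print(stats_eager(rel), rel.materialized)
```

In this model, alternative 1 always returns the accurate size but changes `materialized` as a side effect of a planning-time call, while alternative 2 leaves the relation untouched and keeps returning the estimate no matter how accurate the accumulator has become.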