GitHub user squito commented on the issue: https://github.com/apache/spark/pull/21456

Re: normalization -- if I understand correctly, it's not that you know normalization definitely *does* change the strings for the heap dump you have; it's to make sure your change stays effective even if normalization were to change things. In practice, I don't think Spark's usage should produce any de-normalized paths, but I think it's a good precaution.

Re: so many objects -- I don't think that's surprising, actually. Imagine a shuffle on a large cluster writing to 10k partitions. The shuffle-read side is going to make a lot of simultaneous requests to the same shuffle-write-side task; all of that data lives in the same file, just at different offsets.
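To illustrate the normalization point: this is a minimal sketch (not the PR's actual code) of what `java.nio.file.Path.normalize()` does to a de-normalized path. The path string below is hypothetical, chosen only to resemble a Spark shuffle file location.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class NormalizeDemo {
    public static void main(String[] args) {
        // A hypothetical de-normalized shuffle path with "." and ".." segments.
        Path raw = Paths.get("/data/spark/./blockmgr/../blockmgr/shuffle_0_1_0.data");
        // normalize() strips "." segments and resolves "name/.." pairs,
        // so two strings naming the same file compare equal afterwards.
        Path clean = raw.normalize();
        System.out.println(clean); // /data/spark/blockmgr/shuffle_0_1_0.data
    }
}
```

If Spark never constructs such paths, normalizing is a no-op, which is why it is a cheap precaution rather than a required fix.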
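On the "same file, different offsets" point, here is a hedged sketch (not Spark's implementation) of the layout being described: one map task's output holds all reduce partitions concatenated in a single file, and each reader seeks to its own byte range. File names and partition contents below are made up for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ShuffleOffsetsDemo {
    public static void main(String[] args) throws IOException {
        // One map task's output: all reduce partitions in one file.
        Path file = Files.createTempFile("shuffle", ".data");
        byte[][] partitions = {
            "part0-data".getBytes(StandardCharsets.UTF_8),
            "part1-longer-data".getBytes(StandardCharsets.UTF_8),
            "p2".getBytes(StandardCharsets.UTF_8),
        };
        // Record where each partition starts; offsets[n] is the file length.
        long[] offsets = new long[partitions.length + 1];
        try (RandomAccessFile out = new RandomAccessFile(file.toFile(), "rw")) {
            for (int i = 0; i < partitions.length; i++) {
                offsets[i] = out.getFilePointer();
                out.write(partitions[i]);
            }
            offsets[partitions.length] = out.getFilePointer();
        }
        // Each "reducer" opens the same file and reads only its byte range --
        // many simultaneous readers, one underlying file.
        for (int r = 0; r < partitions.length; r++) {
            try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
                in.seek(offsets[r]);
                byte[] buf = new byte[(int) (offsets[r + 1] - offsets[r])];
                in.readFully(buf);
                System.out.println("reducer " + r + " -> "
                        + new String(buf, StandardCharsets.UTF_8));
            }
        }
        Files.delete(file);
    }
}
```

With 10k reduce partitions, each map output file can be the target of thousands of such reads at once, which explains the object counts observed in the heap dump.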