Hello all,

I was wondering if it is possible to encounter out-of-memory exceptions on
Spark executors during an aggregation when the dataset is skewed.
Let's say we have a dataset with two columns:
- key : int
- value : float
And I want to aggregate values by key.
Let's say that we have tons of rows with key equal to 0, roughly as in the sketch below.
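
To make the setup concrete, here is a minimal sketch of what I have in mind
(the table construction and names are just for illustration, not my real job):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, sum, when}

    val spark = SparkSession.builder().appName("skewed-agg").getOrCreate()

    // Skewed dataset: the vast majority of rows carry key = 0.
    val df = spark.range(0, 100000000L).select(
      when(col("id") % 1000 === 0, (col("id") % 10) + 1).otherwise(0).cast("int").as("key"),
      (col("id") % 100).cast("float").as("value")
    )

    // The aggregation in question: sum values per key, which shuffles on "key",
    // so almost all rows end up in the partition that holds key 0.
    val result = df.groupBy("key").agg(sum("value").as("total"))
    result.show()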

Is it possible to encounter an out-of-memory exception after the shuffle?
My expectation would be that the executor responsible for aggregating the
'0' partition could indeed hit an OOM exception if it tries to load all
the shuffle files for this partition into memory before processing them.
But why would it need to hold them all in memory when doing an aggregation? It
looks to me like the aggregation can be performed in a streaming fashion, so I
would not expect any OOM at all..
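
Just to make explicit what I mean by "streaming fashion" (this is only my mental
model, not a claim about Spark's actual reducer implementation): for a sum, the
reducer only needs one running accumulator per distinct key, never the full list
of rows for that key.

    // Hypothetical reducer-side view of the aggregation: fold over the incoming
    // (key, value) stream, keeping a single partial sum per distinct key.
    def streamingSum(rows: Iterator[(Int, Float)]): Map[Int, Double] =
      rows.foldLeft(Map.empty[Int, Double]) { case (acc, (key, value)) =>
        acc.updated(key, acc.getOrElse(key, 0.0) + value)
      }

    // Even if every row has key = 0, memory stays bounded by the number of
    // distinct keys, not by the number of rows.
    val example = streamingSum(Iterator((0, 1.0f), (0, 2.5f), (1, 3.0f)))
    // example: Map(0 -> 3.5, 1 -> 3.0)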

Thank you in advance for your insights :)
David
