Why does Write() reshuffle before finalization?

Arwin Tio via user Mon, 28 Nov 2022 17:01:05 -0800

Hi team,

I am trying to debug a performance issue with WriteToParquet on Spark (
https://github.com/apache/beam/issues/24365) and was wondering if anybody
can shed some light on why Write() needs to trigger a shuffle before
finalization?


It is happening in WriteImpl:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L1156-L1157
and was introduced in this PR: https://github.com/apache/beam/pull/958

Particularly confusing to me is why we deliberately create a skewed shuffle
by pairing every element with the same key (None) before the GroupByKey:

  ...
  | 'Pair' >> core.Map(lambda x: (None, x))
  | core.GroupByKey()
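
To make the concern concrete, here is a minimal pure-Python sketch of what a single-key grouping does to the data distribution. It is not Beam code: partition_by_key, the worker count of 4, and the temp-file names are hypothetical stand-ins, and real runners may assign keys to workers differently. It only illustrates that when every element carries the key None, all values end up in one group on one worker.

```python
from collections import defaultdict


def partition_by_key(pairs, num_workers):
    """Hypothetical stand-in for a shuffle: route each (key, value) pair
    to a worker chosen by hashing the key, then group values per key."""
    workers = defaultdict(lambda: defaultdict(list))
    for key, value in pairs:
        workers[hash(key) % num_workers][key].append(value)
    return workers


# Hypothetical per-bundle write results.
write_results = [f"temp-file-{i}" for i in range(8)]

# Same shape as the 'Pair' step above: every element is keyed with None.
single_key = partition_by_key(
    [(None, x) for x in write_results], num_workers=4)

# For contrast: distinct integer keys spread the values across workers.
spread = partition_by_key(
    [(i, x) for i, x in enumerate(write_results)], num_workers=4)

print(len(single_key))  # number of workers that received data: 1
print(len(spread))      # number of workers that received data: 4
```

With the None key, one worker holds all eight values, which is the skew I am asking about.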

Thanks,

Arwin

