Hi team,

I am trying to debug a performance issue with WriteToParquet on Spark (https://github.com/apache/beam/issues/24365). Can anybody shed light on why Write() needs to trigger a shuffle before finalization?
The shuffle happens in WriteImpl:
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/iobase.py#L1156-L1157
and was introduced in this PR: https://github.com/apache/beam/pull/958

What confuses me most is why we deliberately do a skewed join, keying every element with None before grouping:

    ... | 'Pair' >> core.Map(lambda x: (None, x)) | core.GroupByKey()

Thanks,
Arwin
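
P.S. To make the skew concern concrete, here is a minimal plain-Python sketch of what the Pair + GroupByKey pattern does to the data. It is not Beam code: `group_by_key` is a hypothetical stand-in for a grouping step, and the `elements` list is placeholder data standing in for whatever flows through that part of WriteImpl.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Hypothetical stand-in for a group-by-key step: collect values per key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


# Placeholder elements (assumption: the real elements are whatever
# reaches the 'Pair' step in WriteImpl).
elements = ["result-0", "result-1", "result-2"]

# Same shape as: core.Map(lambda x: (None, x))
paired = [(None, x) for x in elements]

# Every element shares the key None, so the grouping yields exactly
# one group that holds all of the elements.
grouped = group_by_key(paired)
```

Because there is only one key, all of the grouped values end up in a single group, which is the behavior I am asking about.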