Kyle Winkelman created BEAM-6735:
------------------------------------
Summary: WriteFiles with runner-determined sharding is forced to
handle spilling
Key: BEAM-6735
URL: https://issues.apache.org/jira/browse/BEAM-6735
Project: Beam
Issue Type: Improvement
Components: sdk-java-core
Reporter: Kyle Winkelman
As a result of BEAM-2302, files in excess of WriteFiles maxNumWritersPerBundle
are shuffled to be written later. The downside to this is that even if you can
guarantee that maxNumWritersPerBundle is high enough to handle all writes you
still have to pay the overhead of this write now being a MultiOutput ParDo.
e.g. In the Spark Runner when a ParDo has multiple outputs the returned data is
cached and if using the disableCache pipeline option it would cause
recalculation and all the temp files would be written again.
I'm sure that the Spark Runner is not the only runner that would benefit from
an optional setting for WriteFiles that would skip this spilling and simplify
the pipeline.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)