Kyle Winkelman created BEAM-6735:
------------------------------------

             Summary: WriteFiles with runner-determined sharding is forced to handle spilling
                 Key: BEAM-6735
                 URL: https://issues.apache.org/jira/browse/BEAM-6735
             Project: Beam
          Issue Type: Improvement
          Components: sdk-java-core
            Reporter: Kyle Winkelman


As a result of BEAM-2302, writes in excess of WriteFiles maxNumWritersPerBundle 
are spilled: they are shuffled and written later. The downside is that even if 
you can guarantee that maxNumWritersPerBundle is high enough to handle all 
writes, you still pay the overhead of this write now being a multi-output ParDo.
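
To make the spilling behavior concrete, here is a minimal sketch in plain Java. It is a simplified model, not Beam code: the class SpillSketch, the BundleResult type, and the representation of a record as a {destination, value} pair are all assumptions made for illustration. It only shows why the step needs a second output for spilled records, even when the limit is never reached.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of one bundle of an unsharded write.
// This is NOT the Beam implementation; names and types are illustrative.
final class SpillSketch {

    /** Outcome of one bundle: records written per destination, plus spilled records. */
    static final class BundleResult {
        final Map<String, List<String>> written = new HashMap<>();
        final List<String> spilled = new ArrayList<>();
    }

    /**
     * Each record is a {destination, value} pair. A new writer is opened per
     * destination until maxNumWritersPerBundle writers exist. Records for any
     * further destination go to the secondary (spill) output instead.
     */
    static BundleResult processBundle(List<String[]> records, int maxNumWritersPerBundle) {
        BundleResult result = new BundleResult();
        for (String[] record : records) {
            String destination = record[0];
            List<String> writer = result.written.get(destination);
            if (writer == null) {
                if (result.written.size() >= maxNumWritersPerBundle) {
                    // Writer limit reached: spill to be shuffled and written later.
                    result.spilled.add(record[1]);
                    continue;
                }
                writer = new ArrayList<>();
                result.written.put(destination, writer);
            }
            writer.add(record[1]);
        }
        return result;
    }
}
```

In this model, a limit that is high enough leaves the spilled list empty, but the step still has two outputs, which is the overhead described above.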

For example, in the Spark Runner, when a ParDo has multiple outputs the 
returned data is cached. If the disableCache pipeline option is used, the data 
is recalculated instead, and all the temp files are written again.
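
The recalculation cost can be illustrated with a small analogy in plain Java. This is not Spark Runner code: the class RecomputeSketch, its methods, and the use of a Supplier to stand in for a lazily evaluated multi-output step are assumptions for illustration only. The counter stands in for the temp files being written.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical analogy for a lazily evaluated step with two outputs.
// This is NOT Spark or Beam code; all names are illustrative.
final class RecomputeSketch {

    // Counts how many times the simulated temp-file write has run.
    static final AtomicInteger tempFileWrites = new AtomicInteger();

    // Simulates the multi-output write step: a side effect plus two outputs.
    static String[] writeTempFilesAndSplit() {
        tempFileWrites.incrementAndGet();
        return new String[] {"written-results", "spilled-records"};
    }

    // Wraps a step so that its result is computed once and then reused.
    static <T> Supplier<T> cached(Supplier<T> source) {
        return new Supplier<T>() {
            private T value;
            private boolean computed;

            @Override
            public T get() {
                if (!computed) {
                    value = source.get();
                    computed = true;
                }
                return value;
            }
        };
    }

    // Reads each output separately, as two downstream consumers would,
    // and returns how many times the write step ran.
    static int consumeBothOutputs(Supplier<String[]> step) {
        tempFileWrites.set(0);
        String mainOutput = step.get()[0];
        String spilledOutput = step.get()[1];
        return tempFileWrites.get();
    }
}
```

Under these assumptions, consuming both outputs of the uncached step runs the write twice, while the cached step runs it once.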

I'm sure that the Spark Runner is not the only runner that would benefit from 
an optional setting for WriteFiles that would skip this spilling and simplify 
the pipeline.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
