For bounded data, each bundle becomes a file: https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L356
Kenn

On Mon, Mar 2, 2020 at 6:18 PM Kyle Weaver <kcwea...@google.com> wrote:

> As Luke and Robert indicated, unsetting num shards _may_ cause the runner
> to optimize it automatically.
>
> For example, the Flink [1] and Dataflow [2] runners override num shards.
>
> However, in the Spark runner, I don't see any such override. So I have two
> questions:
> 1. Does the Spark runner override num shards somehow?
> 2. How is num shards determined if it's set to 0 and not overridden by the
> runner?
>
> [1]
> https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
> [2]
> https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866
>
> On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> To let Dataflow choose the optimal number of shards and maximize
>> performance, it's often significantly better to simply leave it
>> unspecified. A higher numShards only helps if you have at least that
>> many workers.
>>
>> On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya <vivek....@gmail.com>
>> wrote:
>> >
>> > hi folks, I have this in code
>> >
>> > globalIndexJson.apply("GCSOutput",
>> > TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>> >
>> > the same code is executed for 50GB, 3TB, 5TB of data. I want to know if
>> > changing numShards for larger datasize will write to GCS faster?
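The trade-off discussed in the thread can be illustrated with a plain-Java sketch. This is not Beam's actual WriteFiles code; the class and method names below are hypothetical, and the round-robin assignment is just one simple way to distribute elements across a fixed shard count. The point it shows is why a fixed withNumShards(n) caps effective write parallelism at min(numShards, workers).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Beam's implementation: with a fixed shard
// count, every element is routed to exactly one of numShards output files.
// Shards beyond the number of available workers add files and small writes
// without adding parallelism.
public class FixedSharding {

    // Round-robin assignment of elements to shards, similar in spirit to
    // what forcing a fixed shard count makes a runner do.
    static List<List<String>> assignShards(List<String> elements, int numShards) {
        List<List<String>> shards = new ArrayList<>();
        for (int i = 0; i < numShards; i++) {
            shards.add(new ArrayList<>());
        }
        int next = 0;
        for (String e : elements) {
            shards.get(next).add(e);
            next = (next + 1) % numShards;
        }
        return shards;
    }

    // Writes proceed at most as fast as the smaller of the two numbers:
    // extra shards beyond the worker count cannot be written concurrently.
    static int effectiveParallelism(int numShards, int numWorkers) {
        return Math.min(numShards, numWorkers);
    }
}
```

Under this (simplified) model, withNumShards(500) on a 100-worker job gives at most 100 concurrent writes, which matches Robert's point that a higher numShards only helps when there are at least that many workers; leaving the shard count unset lets runners that support it pick one based on actual parallelism.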