For bounded data, each bundle becomes a file:
https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L356
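For context on what those shard files look like on disk: Beam's default filename
policy names shards with a zero-padded index and total count (a pattern like
"prefix-SSSSS-of-NNNNN.suffix"). The helper below is a self-contained sketch of
that naming scheme for illustration only; it is not Beam's actual FilenamePolicy
implementation, and the class and method names are made up here.

```java
// Illustrative sketch of a shard-naming scheme like Beam's default
// ("prefix-SSSSS-of-NNNNN.suffix"). Hypothetical helper, not Beam code.
public class ShardNames {
    // Zero-pad the shard index and the total shard count to five digits.
    static String shardName(String prefix, int shard, int numShards, String suffix) {
        return String.format("%s-%05d-of-%05d%s", prefix, shard, numShards, suffix);
    }

    public static void main(String[] args) {
        // With withNumShards(500), output files would be named along these lines:
        System.out.println(shardName("output", 0, 500, ".txt"));   // output-00000-of-00500.txt
        System.out.println(shardName("output", 499, 500, ".txt")); // output-00499-of-00500.txt
    }
}
```

When numShards is left unset for bounded data, the number of files instead
tracks the number of bundles the runner produced, as in the WriteFiles link
above.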

Kenn

On Mon, Mar 2, 2020 at 6:18 PM Kyle Weaver <kcwea...@google.com> wrote:

> As Luke and Robert indicated, unsetting num shards _may_ cause the runner
> to optimize it automatically.
>
> For example, the Flink [1] and Dataflow [2] runners override num shards.
>
> However, in the Spark runner, I don't see any such override. So I have two
> questions:
> 1. Does the Spark runner override num shards somehow?
> 2. How is num shards determined if it's set to 0 and not overridden by the
> runner?
>
> [1]
> https://github.com/apache/beam/blob/c2f0d282337f3ae0196a7717712396a5a41fdde1/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkStreamingPipelineTranslator.java#L240-L243
> [2]
> https://github.com/apache/beam/blob/a149b6b040e9573e53cd41b6bd69b7e7603ac2a2/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1853-L1866
>
> On Fri, Feb 14, 2020 at 10:09 AM Robert Bradshaw <rober...@google.com>
> wrote:
>
>> To let Dataflow choose the optimal number of shards and maximize
>> performance, it's often significantly better to simply leave it
>> unspecified. A higher numShards only helps if you have at least that
>> many workers.
>>
>> On Thu, Feb 13, 2020 at 10:24 PM vivek chaurasiya <vivek....@gmail.com>
>> wrote:
>> >
>> > hi folks, I have this in code
>> >
>> >             globalIndexJson.apply("GCSOutput",
>> >                 TextIO.write().to(fullGCSPath).withSuffix(".txt").withNumShards(500));
>> >
>> > the same code is executed for 50 GB, 3 TB, and 5 TB of data. I want to
>> > know whether increasing numShards for larger data sizes will make writes
>> > to GCS faster.
>>
>