As a follow-up, the pricing being based on the number of bytes written to + read from the
shuffle is confirmed.
However, we were able to figure out a way to lower shuffle costs, and things
are right in the world again.
Thanks, y'all!
Shannon
On Wed, Sep 18, 2019 at 4:52 PM Reuven Lax wrote:
I believe that the Total shuffle data processed counter counts the number of
bytes written to shuffle + the number of bytes read. So if you shuffle 1GB
of data, you should expect to see 2GB on the counter.
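(For example, with the numbers further down this thread: a 29.32 GB shuffle would be
reported as 29.32 GB written + 29.32 GB read ≈ 58.66 GB of total shuffle data processed.)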
On Wed, Sep 18, 2019 at 2:39 PM Shannon Duncan wrote:
Sorry, missed a part of the map output for flatten:
[image: image.png]
However, the shuffle does show only 29.32 GB going into it, but the output of
Total Shuffled data is 58.66 GB.
[image: image.png]
On Wed, Sep 18, 2019 at 4:39 PM Shannon Duncan wrote:
Ok, just ran the job on a small input and did not specify numShards. So it's
literally just:
.apply("WriteLines", TextIO.write().to(options.getOutput()));
Output of map for join:
[image: image.png]
Details of Shuffle:
[image: image.png]
Reported Bytes Shuffled:
[image: image.png]
On Wed, Sep 18, 2019 at 2:12 PM Shannon Duncan wrote:
> I will attempt to do without sharding (though I believe we did do a run
> without shards and it incurred the extra shuffle costs).
>
It shouldn't. There will be a shuffle, but that shuffle should contain a
small amount of data.
I will attempt to do without sharding (though I believe we did do a run
without shards and it incurred the extra shuffle costs).
The pipeline is simple.
The only shuffle that is explicitly defined is the shuffle after merging
files together into a single PCollection (Flatten transform).
In that case you should be able to leave sharding unspecified, and you
won't incur the extra shuffle. Specifying explicit sharding is generally
necessary only for streaming.
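As a purely hypothetical sketch of the streaming case (the Pub/Sub topic, window size,
paths, and shard count are all made up here), where an explicit shard count is still
needed because the input is unbounded:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class StreamingWriteSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical unbounded source; any streaming read would do.
    PCollection<String> lines =
        p.apply("ReadMessages",
            PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"));

    lines
        // Windowing is required before a file write on an unbounded PCollection.
        .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
        .apply("WriteWindowedFiles", TextIO.write()
            .to("gs://my-bucket/streaming-output/part")
            .withWindowedWrites()
            // For unbounded input the shard count has to be fixed explicitly,
            // which is why explicit sharding matters mainly for streaming.
            .withNumShards(10));

    p.run();
  }
}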
On Wed, Sep 18, 2019 at 2:06 PM Shannon Duncan wrote:
Batch, on DataflowRunner.
On Wed, Sep 18, 2019 at 4:05 PM Reuven Lax wrote:
Are you specifying the number of shards to write to?
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextIO.java#L859
If so, this will incur an additional shuffle to re-distribute data written
by all workers into the given number of shards before writing.
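For illustration only (the bucket paths and shard count below are made up), the
difference being described is roughly:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ShardingSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<String> lines =
        p.apply("ReadLines", TextIO.read().from("gs://my-bucket/input/*"));

    // Alternative 1: runner-determined sharding. Workers write the records they
    // already hold, so no extra shuffle is introduced just for the write.
    lines.apply("WriteUnsharded",
        TextIO.write().to("gs://my-bucket/output/unsharded/part"));

    // Alternative 2: fixed sharding. Records must first be redistributed into
    // exactly 10 groups, which adds a shuffle proportional to the data size.
    lines.apply("WriteSharded",
        TextIO.write().to("gs://my-bucket/output/sharded/part").withNumShards(10));

    p.run();
  }
}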
Are you using streaming or batch? Also which runner are you using?
On Wed, Sep 18, 2019 at 1:57 PM Shannon Duncan wrote:
What you propose with a writer per bundle is definitely possible, but I
expect the blocker is that in most cases the runner has control of bundle
sizes and there's nothing exposed to the user to control that. I've wanted
to do similar, but found average bundle sizes in my case on Dataflow to be
quite small.
So I followed up on why TextIO shuffles and dug into the code some. It is
using the shards and getting all the values into a keyed group to write to
a single file.
However... I wonder if there is a way to just take the records that are on a
worker and write them out, thus not needing a shard number.
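For what it's worth, here is a rough, purely hypothetical sketch of that idea (the class
name and output prefix are made up, and it deliberately skips the temp-file/finalize step
that TextIO's sharded write uses to stay correct when bundles fail or retry, so retried
bundles could leave duplicate or partial files):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.MimeTypes;

// Hypothetical sketch: write whatever records land in a bundle straight to one file,
// so no shard number (and no shuffle to group records by shard) is needed.
class WritePerBundleFn extends DoFn<String, Void> {
  private final String outputPrefix; // e.g. "gs://my-bucket/output/part" (made-up path)
  private transient OutputStream out;

  WritePerBundleFn(String outputPrefix) {
    this.outputPrefix = outputPrefix;
  }

  @StartBundle
  public void startBundle() throws IOException {
    // One new file per bundle; the runner decides how many bundles (and files) there are.
    String filename = outputPrefix + "-" + UUID.randomUUID() + ".txt";
    ResourceId resource = FileSystems.matchNewResource(filename, false /* isDirectory */);
    out = Channels.newOutputStream(FileSystems.create(resource, MimeTypes.TEXT));
  }

  @ProcessElement
  public void processElement(@Element String line) throws IOException {
    out.write(line.getBytes(StandardCharsets.UTF_8));
    out.write('\n');
  }

  @FinishBundle
  public void finishBundle() throws IOException {
    // No temp-file rename/finalize here; that correctness piece is what TextIO's
    // extra shuffle and keyed grouping machinery provide.
    out.close();
  }
}

It could be attached with something like .apply(ParDo.of(new WritePerBundleFn(outputPrefix)))
(again, hypothetical names), but as noted earlier in the thread, how well this works depends
entirely on the bundle sizes the runner chooses.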