Often only the metadata (i.e., the temp file names) is shuffled, except in the
"spilling" case (which should only happen when using dynamic destinations).

WriteFiles depends heavily on side inputs. How are side inputs implemented
in the Spark runner?

On Wed, Aug 22, 2018 at 8:21 AM Robert Bradshaw <rober...@google.com> wrote:

> Yes, I stand corrected, dynamic writes are now much more than the
> primitive window-based naming we used to have.
>
> It would be interesting to visualize how much of this codepath is
> metadata vs. the actual data.
>
> In the case of file writing, it seems one could (maybe?) avoid
> requiring a stable input, as shards are accepted as a whole (unlike,
> say, sinks where a deterministic uid is needed for deduplication on
> retry).
>
> On Wed, Aug 22, 2018 at 4:55 PM Reuven Lax <re...@google.com> wrote:
> >
> > Robert - much of the complexity isn't due to streaming, but rather
> because WriteFiles supports "dynamic" output (where the user can choose a
> destination file based on the input record). In practice, if a pipeline is
> not using dynamic destinations, the full graph is still generated, but much
> of that graph is never used (empty PCollections).
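> >
> > For reference, a dynamic-destinations write looks roughly like this at the
> > API level (a simplified sketch; Event, getCountry() and toCsvLine() are
> > made-up names, not from any real pipeline):
> >
> >   PCollection<Event> events = ...;
> >   events.apply(FileIO.<String, Event>writeDynamic()
> >       // The destination (here a country code) is derived from each record.
> >       .by((Event e) -> e.getCountry())
> >       .via(Contextful.fn(Event::toCsvLine), TextIO.sink())
> >       .withDestinationCoder(StringUtf8Coder.of())
> >       .to("hdfs:///tmp/output")
> >       .withNaming(c -> FileIO.Write.defaultNaming("events-" + c, ".csv")));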
> >
> > On Wed, Aug 22, 2018 at 3:12 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >>
> >> I agree that this is concerning. Some of the complexity may have also
> >> been introduced to accommodate writing files in Streaming mode, but it
> >> seems we should be able to execute this as a single Map operation.
> >>
> >> Have you profiled to see which stages and/or operations are taking up
> the time?
> >> On Wed, Aug 22, 2018 at 11:29 AM Tim Robertson
> >> <timrobertson...@gmail.com> wrote:
> >> >
> >> > Hi folks,
> >> >
> >> > I've recently been involved in projects rewriting Avro files and have
> discovered a concerning performance trait in Beam.
> >> >
> >> > I have observed Beam to be between 6x and 20x slower than native Spark
> or MapReduce code for a simple pipeline of read Avro, modify, write Avro.
> >> >
> >> >  - Rewriting 200TB of Avro files (big cluster): 14 hrs using
> Beam/Spark, 40 minutes with a map-only MR job
> >> >  - Rewriting 1.5TB Avro file (small cluster): 2 hrs using Beam/Spark,
> 18 minutes using vanilla Spark code. Test code available [1]
> >> >
> >> > These tests were run with Beam 2.6.0 on Cloudera 5.12.x clusters
> (Spark / YARN) on reference Dell / Cloudera hardware.
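> >> >
> >> > The pipeline itself is roughly this shape (simplified from the test code
> >> > in [1]; the DoFn name here is just illustrative):
> >> >
> >> >   Pipeline p = Pipeline.create(options);
> >> >   p.apply(AvroIO.readGenericRecords(schema).from(inputPath))
> >> >    .apply("Modify", ParDo.of(new ModifyRecordFn()))
> >> >    .apply(AvroIO.writeGenericRecords(schema).to(outputPrefix));
> >> >   p.run().waitUntilFinish();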
> >> >
> >> > I have only just started exploring, but I believe the cause is rooted
> in WriteFiles, which is used by all our file-based IO. WriteFiles is
> reasonably complex, with reshuffles, spilling to temporary files (presumably
> to accommodate varying bundle sizes / avoid small files), a union, a GBK, etc.
> >> >
> >> > Before I go too far with this exploration, I'd appreciate thoughts on
> whether we believe this is a concern (I do), whether we should explore
> optimisations, or any insight from previous work in this area.
> >> >
> >> > Thanks,
> >> > Tim
> >> >
> >> > [1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro
>
