Robert - much of the complexity isn't due to streaming, but rather because WriteFiles supports "dynamic" output (where the user can choose a destination file based on the input record). In practice if a pipeline is not using dynamic destinations the full graph is still generated, but much of that graph is never used (empty PCollections).
On Wed, Aug 22, 2018 at 3:12 AM Robert Bradshaw <rober...@google.com> wrote: > I agree that this is concerning. Some of the complexity may have also > been introduced to accommodate writing files in Streaming mode, but it > seems we should be able to execute this as a single Map operation. > > Have you profiled to see which stages and/or operations are taking up the > time? > On Wed, Aug 22, 2018 at 11:29 AM Tim Robertson > <timrobertson...@gmail.com> wrote: > > > > Hi folks, > > > > I've recently been involved in projects rewriting Avro files and have > discovered a concerning performance trait in Beam. > > > > I have observed Beam between 6-20x slower than native Spark or MapReduce > code for a simple pipeline of read Avro, modify, write Avro. > > > > - Rewriting 200TB of Avro files (big cluster): 14 hrs using Beam/Spark, > 40 minutes with a map-only MR job > > - Rewriting 1.5TB Avro file (small cluster): 2 hrs using Beam/Spark, 18 > minutes using vanilla Spark code. Test code available [1] > > > > These tests were running Beam 2.6.0 on Cloudera 5.12.x clusters (Spark / > YARN) on reference Dell / Cloudera hardware. > > > > I have only just started exploring but I believe the cause is rooted in > the WriteFiles which is used by all our file based IO. WriteFiles is > reasonably complex with reshuffles, spilling to temporary files (presumably > to accommodate varying bundle sizes/avoid small files), a union, a GBK etc. > > > > Before I go too far with exploration I'd appreciate thoughts on whether > we believe this is a concern (I do), if we should explore optimisations or > any insight from previous work in this area. > > > > Thanks, > > Tim > > > > [1] https://github.com/gbif/beam-perf/tree/master/avro-to-avro >