Re: Extremely Slow DirectRunner
Any further thoughts here? Or tips on profiling the Beam DirectRunner?

Thanks,
Evan

On Wed, May 12, 2021 at 6:22 PM Evan Galpin wrote:
> Ok gotcha. In my tests, all SDK versions 2.25.0 and higher exhibit slow
> behaviour regardless of use_deprecated_read. Not sure if that points to
> something different, then.
>
> Thanks,
> Evan
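(Short of attaching a JVM profiler, one cheap way to narrow down which step is slow on
the DirectRunner is Beam's Metrics API. The sketch below is illustrative only: the class,
namespace, and metric names are made up, and it only measures time spent inside your own
DoFn, not runner overhead between steps.)

import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical wrapper step: records per-element processing time as a distribution.
public class TimedStepFn extends DoFn<String, String> {
  private final Distribution elementMillis =
      Metrics.distribution("profiling", "timed-step-millis");

  @ProcessElement
  public void processElement(@Element String element, OutputReceiver<String> out) {
    long start = System.currentTimeMillis();
    // ... the step's real per-element work would go here ...
    out.output(element);
    elementMillis.update(System.currentTimeMillis() - start);
  }
}

// After p.run().waitUntilFinish(), the recorded values can be read back from the
// PipelineResult, e.g.:
//   result.metrics().queryMetrics(MetricsFilter.builder()
//       .addNameFilter(MetricNameFilter.inNamespace("profiling")).build());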
Re: Extremely Slow DirectRunner
Ok gotcha. In my tests, all SDK versions 2.25.0 and higher exhibit slow
behaviour regardless of use_deprecated_read. Not sure if that points to
something different, then.

Thanks,
Evan

On Wed, May 12, 2021 at 18:16 Steve Niemitz wrote:
> I think it was only broken in 2.29.
Re: Extremely Slow DirectRunner
I think it was only broken in 2.29.

On Wed, May 12, 2021 at 5:53 PM Evan Galpin wrote:
> Ah ok, thanks for that. Do you mean use_deprecated_read is broken
> specifically in 2.29.0 (a regression), or broken in all versions up to and
> including 2.29.0 (i.e. it never worked)?
Re: Extremely Slow DirectRunner
Ah ok, thanks for that. Do you mean use_deprecated_read is broken
specifically in 2.29.0 (a regression), or broken in all versions up to and
including 2.29.0 (i.e. it never worked)?

Thanks,
Evan

On Wed, May 12, 2021 at 17:12 Steve Niemitz wrote:
> Yeah, sorry, my email was confusing. use_deprecated_read is broken on the
> DirectRunner in 2.29.
Re: Extremely Slow DirectRunner
Yeah, sorry, my email was confusing. use_deprecated_read is broken on the
DirectRunner in 2.29.

The behavior you describe is exactly the behavior I ran into as well when
reading from pubsub with the new read method. I believe the default will soon
be reverted back to the old read method (not using SDFs), which should fix
your performance issue.

On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang wrote:
> Hi Evan,
>
> It seems like the slow step is not the read that use_deprecated_read
> targets. Would you like to share your pipeline code if possible?
Re: Extremely Slow DirectRunner
I'd be happy to share what I can. The applicable portion is this `expand`
method of a PTransform (which does nothing more complex than group these
other transforms together for re-use). The input to this PTransform is
pubsub message bodies as strings. I'll paste it as plain text.

@Override
public PCollection<KV<String, String>> expand(PCollection<String> input) {
  return input
      .apply(
          "Convert OCNs to wildcard paths",
          ParDo.of(new BlobPathToRootPath()))
      .apply(FileIO.matchAll().withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW))
      .apply(FileIO.readMatches().withCompression(Compression.UNCOMPRESSED))
      .apply(
          MapElements
              // kvs()/strings() are static imports from TypeDescriptors.
              .into(kvs(strings(), strings()))
              .via(
                  (ReadableFile f) -> {
                    try {
                      // Key each file's contents by an id taken from its GCS path.
                      String[] blobUri = f.getMetadata().resourceId().toString().split("/");
                      String doc_id = blobUri[4];
                      return KV.of(doc_id, f.readFullyAsUTF8String());
                    } catch (IOException ex) {
                      throw new RuntimeException("Failed to read the file", ex);
                    }
                  }));
}

Thanks,
Evan

On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang wrote:
> Hi Evan,
>
> It seems like the slow step is not the read that use_deprecated_read
> targets. Would you like to share your pipeline code if possible?
Re: Extremely Slow DirectRunner
Hi Evan,

It seems like the slow step is not the read that use_deprecated_read
targets. Would you like to share your pipeline code if possible?

On Wed, May 12, 2021 at 1:35 PM Evan Galpin wrote:
> I just tried with v2.29.0 and use_deprecated_read but unfortunately I
> observed slow behavior again. Is it possible that use_deprecated_read is
> broken in 2.29.0 as well?
Re: Extremely Slow DirectRunner
I just tried with v2.29.0 and use_deprecated_read but unfortunately I
observed slow behavior again. Is it possible that use_deprecated_read is
broken in 2.29.0 as well?

Thanks,
Evan

On Wed, May 12, 2021 at 3:21 PM Steve Niemitz wrote:
> Oops, sorry, I was off by 10... I meant 2.29, not 2.19.
Re: Extremely Slow DirectRunner
Oops, sorry, I was off by 10... I meant 2.29, not 2.19.

On Wed, May 12, 2021 at 2:55 PM Evan Galpin wrote:
> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for
> my use case at least) regardless of the use_deprecated_read setting.
Re: Extremely Slow DirectRunner
Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for my
use case at least) regardless of the use_deprecated_read setting.

Thanks,
Evan

On Wed, May 12, 2021 at 2:47 PM Steve Niemitz wrote:
> use_deprecated_read was broken in 2.19 on the direct runner and didn't do
> anything. [1] I don't think the fix is in 2.20 either, but it will be in
> 2.21.
>
> [1] https://github.com/apache/beam/pull/14469
Re: Extremely Slow DirectRunner
use_deprecated_read was broken in 2.19 on the direct runner and didn't do
anything. [1] I don't think the fix is in 2.20 either, but it will be in 2.21.

[1] https://github.com/apache/beam/pull/14469

On Wed, May 12, 2021 at 1:41 PM Evan Galpin wrote:
> I forgot to also mention that in all tests I was setting
> --experiments=use_deprecated_read
Re: Extremely Slow DirectRunner
I forgot to also mention that in all tests I was setting
--experiments=use_deprecated_read

Thanks,
Evan

On Wed, May 12, 2021 at 1:39 PM Evan Galpin wrote:
> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
> DirectRunner slowness, not just pubsub.
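(The experiment can also be set programmatically instead of on the command line. A minimal
sketch using Beam's ExperimentalOptions; the class name here is a placeholder and the rest
of the pipeline is omitted.)

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.ExperimentalOptions;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DeprecatedReadExample {
  public static void main(String[] args) {
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    // Equivalent to passing --experiments=use_deprecated_read on the command line.
    ExperimentalOptions.addExperiment(
        options.as(ExperimentalOptions.class), "use_deprecated_read");

    Pipeline p = Pipeline.create(options);
    // ... build the rest of the pipeline as usual ...
    p.run().waitUntilFinish();
  }
}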
Re: Extremely Slow DirectRunner
Hmm, I think I spoke too soon. I'm still seeing an issue of overall
DirectRunner slowness, not just pubsub. I have a pipeline like so:

Read pubsub | extract GCS glob patterns | FileIO.matchAll() |
FileIO.readMatches() | Read file contents | etc

I have temporarily set up a transform between each step to log what's going
on and illustrate the timing issues. I ran a series of tests changing only
the SDK version each time, since I hadn't noticed this performance issue with
2.19.0 (effectively a git-bisect). Before each test, I seeded the pubsub
subscription with the exact same contents.

SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't seem to
resolve) and onward show a significant slowdown.

Here is a snippet of logging from v2.25.0:

May 12, 2021 11:11:52 A.M. com.myOrg.myPipeline.PipelineLeg$1 processElement
INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:16:59 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file1.json
May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file2.json
May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file1.json
May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file2.json

Note that end-to-end, these steps took about 13 minutes. With SDK 2.23.0 and
identical user code, the same section of the pipeline took 2 seconds:

May 12, 2021 11:03:39 A.M. com.myOrg.myPipeline.PipelineLeg$1 processElement
INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:03:40 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file1.json
May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file2.json
May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file1.json
May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file2.json

Any thoughts on what could be causing this?

Thanks,
Evan

On Wed, May 12, 2021 at 9:53 AM Evan Galpin wrote:
> This seems to work for me, thanks for this! 👍
👍 > > >>> Seems like an instance of >>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858 >>> also reported in >>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E >>> >>> We should rollback using the SDF wrapper by default because of the >>> usability and performance issues reported. >>> >>> >>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin >>> wrote: >>> Hi all, I’m experiencing very slow performance and startup delay when testing a pipeline locally. I’m reading data from a Google PubSub subscription as the data source, and before each pipeline execution I ensure that data is present in the subscription (readable from GCP console). I’m seeing startup delay on the order of minutes with DirectRunner (5-10 min). Is that expected? I did find a Jira ticket[1] that at first seemed related, but I think it has more to do with BQ than DirectRunner. I’ve run the pipeline with a debugger connected and confirmed that it’s minutes before the first DoFn in my pipeline receives any data. Is there a way I can profile the direct runner to see what it’s churning on? Thanks, Evan [1] https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548 >>>
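A minimal sketch of the pipeline shape described in the message above (Read pubsub | extract GCS glob patterns | FileIO.matchAll() | FileIO.readMatches() | Read file contents), with simple logging between steps to reproduce the timing comparison. The subscription, bucket layout, and class names are hypothetical, not Evan's actual code:

import java.io.IOException;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class SlowReadRepro {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadPubsub",
            PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub")) // hypothetical
        .apply("ExtractGlob", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(@Element String msg, OutputReceiver<String> out) {
            // Each message carries a directory id; turn it into a GCS glob and log the step timing.
            String pattern = "gs://my-bucket/my-dir/" + msg + "/**"; // hypothetical layout
            System.out.println("Got file pattern: " + pattern);
            out.output(pattern);
          }
        }))
        .apply("MatchAll", FileIO.matchAll())
        .apply("ReadMatches", FileIO.readMatches())
        .apply("ReadContents", ParDo.of(new DoFn<FileIO.ReadableFile, String>() {
          @ProcessElement
          public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<String> out)
              throws IOException {
            System.out.println("Got ReadableFile: " + file.getMetadata().resourceId());
            out.output(file.readFullyAsUTF8String());
          }
        }));
    // ... downstream steps ("etc") omitted ...

    p.run().waitUntilFinish();
  }
}

Seeding the subscription with identical messages and running a sketch like this under different SDK versions is essentially the bisection described above.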
Re: Extremely Slow DirectRunner
On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang wrote:

> Hi Evan,
>
> What do you mean by startup delay? Is it the time from when you start the
> pipeline to the time that you notice the first output record from PubSub?

Yes, that's what I meant: the seemingly idle system waiting for pubsub output despite data being in the subscription at pipeline start time.

On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía wrote:

> Can you try running direct runner with the option
> `--experiments=use_deprecated_read`

This seems to work for me, thanks for this! 👍

> Seems like an instance of
> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
> also reported in
> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>
> We should rollback using the SDF wrapper by default because of the
> usability and performance issues reported.
>
> On Sat, May 8, 2021 at 12:57 AM Evan Galpin wrote:
>
>> Hi all,
>>
>> I’m experiencing very slow performance and startup delay when testing a
>> pipeline locally. I’m reading data from a Google PubSub subscription as the
>> data source, and before each pipeline execution I ensure that data is
>> present in the subscription (readable from GCP console).
>>
>> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
>> min). Is that expected? I did find a Jira ticket[1] that at first seemed
>> related, but I think it has more to do with BQ than DirectRunner.
>>
>> I’ve run the pipeline with a debugger connected and confirmed that it’s
>> minutes before the first DoFn in my pipeline receives any data. Is there a
>> way I can profile the direct runner to see what it’s churning on?
>>
>> Thanks,
>> Evan
>>
>> [1]
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
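One way to put a number on that idle period (a sketch assuming a Java pipeline, not code from this thread): record a wall-clock timestamp when the pipeline is constructed and have the first DoFn after the PubSub read log how long it took for the first element to arrive.

import org.apache.beam.sdk.transforms.DoFn;
import org.joda.time.Instant;

// Pass Instant.now() at pipeline-construction time, then apply right after the
// PubsubIO read, e.g.: .apply(ParDo.of(new FirstElementTimer(Instant.now())))
public class FirstElementTimer extends DoFn<String, String> {
  private final Instant launchTime;     // captured when the pipeline is built
  private transient boolean seenFirst;  // resets per deserialized DoFn instance

  public FirstElementTimer(Instant launchTime) {
    this.launchTime = launchTime;
  }

  @ProcessElement
  public void processElement(@Element String element, OutputReceiver<String> out) {
    if (!seenFirst) {
      seenFirst = true;
      long millis = Instant.now().getMillis() - launchTime.getMillis();
      // May print once per DoFn instance; on DirectRunner that is only a handful of lines.
      System.out.println("First element seen " + millis + " ms after pipeline construction");
    }
    out.output(element);
  }
}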
Re: Extremely Slow DirectRunner
Hi Evan,

What do you mean by startup delay? Is it the time from when you start the pipeline to the time that you notice the first output record from PubSub?

On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía wrote:

> Can you try running direct runner with the option
> `--experiments=use_deprecated_read`
>
> Seems like an instance of
> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
> also reported in
> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>
> We should rollback using the SDF wrapper by default because of the
> usability and performance issues reported.
>
> On Sat, May 8, 2021 at 12:57 AM Evan Galpin wrote:
>
>> Hi all,
>>
>> I’m experiencing very slow performance and startup delay when testing a
>> pipeline locally. I’m reading data from a Google PubSub subscription as the
>> data source, and before each pipeline execution I ensure that data is
>> present in the subscription (readable from GCP console).
>>
>> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
>> min). Is that expected? I did find a Jira ticket[1] that at first seemed
>> related, but I think it has more to do with BQ than DirectRunner.
>>
>> I’ve run the pipeline with a debugger connected and confirmed that it’s
>> minutes before the first DoFn in my pipeline receives any data. Is there a
>> way I can profile the direct runner to see what it’s churning on?
>>
>> Thanks,
>> Evan
>>
>> [1]
>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
Re: Extremely Slow DirectRunner
Can you try running direct runner with the option
`--experiments=use_deprecated_read`

Seems like an instance of
https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
also reported in
https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E

We should rollback using the SDF wrapper by default because of the usability and performance issues reported.

On Sat, May 8, 2021 at 12:57 AM Evan Galpin wrote:

> Hi all,
>
> I’m experiencing very slow performance and startup delay when testing a
> pipeline locally. I’m reading data from a Google PubSub subscription as the
> data source, and before each pipeline execution I ensure that data is
> present in the subscription (readable from GCP console).
>
> I’m seeing startup delay on the order of minutes with DirectRunner (5-10
> min). Is that expected? I did find a Jira ticket[1] that at first seemed
> related, but I think it has more to do with BQ than DirectRunner.
>
> I’ve run the pipeline with a debugger connected and confirmed that it’s
> minutes before the first DoFn in my pipeline receives any data. Is there a
> way I can profile the direct runner to see what it’s churning on?
>
> Thanks,
> Evan
>
> [1]
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
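On the profiling question: DirectRunner executes the whole pipeline inside the local JVM, so any ordinary JVM profiler or thread dump will show what it is churning on. As a rough, purely illustrative sketch (not something from this thread), a daemon thread that periodically prints all thread stacks while the pipeline looks idle:

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StackDumper {
  // Call e.g. StackDumper.start(60) just before p.run() to print a dump every minute.
  public static void start(long periodSeconds) {
    ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "stack-dumper");
      t.setDaemon(true); // do not keep the JVM alive after the pipeline finishes
      return t;
    });
    exec.scheduleAtFixedRate(() -> {
      for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
        System.out.println("=== " + e.getKey().getName() + " (" + e.getKey().getState() + ")");
        for (StackTraceElement frame : e.getValue()) {
          System.out.println("    at " + frame);
        }
      }
    }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
  }
}

Repeated dumps that keep landing in the same read-related code paths point at where the apparently idle time is going.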