Hi Evan, It seems like the slow step is not the read that use_deprecated_read targets for. Would you like to share your pipeline code if possible?
On Wed, May 12, 2021 at 1:35 PM Evan Galpin <evan.gal...@gmail.com> wrote: > I just tried with v2.29.0 and use_deprecated_read but unfortunately I > observed slow behavior again. Is it possible that use_deprecated_read is > broken in 2.29.0 as well? > > Thanks, > Evan > > On Wed, May 12, 2021 at 3:21 PM Steve Niemitz <sniem...@apache.org> wrote: > >> oops sorry I was off by 10...I meant 2.29 not 2.19. >> >> On Wed, May 12, 2021 at 2:55 PM Evan Galpin <evan.gal...@gmail.com> >> wrote: >> >>> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster" >>> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for >>> my use case at least) regardless of use_deprecated_read setting. >>> >>> Thanks, >>> Evan >>> >>> >>> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz <sniem...@apache.org> >>> wrote: >>> >>>> use_deprecated_read was broken in 2.19 on the direct runner and didn't >>>> do anything. [1] I don't think the fix is in 2.20 either, but will be in >>>> 2.21. >>>> >>>> [1] https://github.com/apache/beam/pull/14469 >>>> >>>> On Wed, May 12, 2021 at 1:41 PM Evan Galpin <evan.gal...@gmail.com> >>>> wrote: >>>> >>>>> I forgot to also mention that in all tests I was setting >>>>> --experiments=use_deprecated_read >>>>> >>>>> Thanks, >>>>> Evan >>>>> >>>>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin <evan.gal...@gmail.com> >>>>> wrote: >>>>> >>>>>> Hmm, I think I spoke too soon. I'm still seeing an issue of overall >>>>>> DirectRunner slowness, not just pubsub. I have a pipeline like so: >>>>>> >>>>>> Read pubsub | extract GCS glob patterns | FileIO.matchAll() | >>>>>> FileIO.readMatches() | Read file contents | etc >>>>>> >>>>>> I have temporarily set up a transform between each step to log what's >>>>>> going on and illustrate timing issues. I ran a series of tests changing >>>>>> only the SDK version each time since I hadn't noticed this performance >>>>>> issue with 2.19.0 (effectively git-bisect). Before each test, I seeded >>>>>> the >>>>>> pubsub subscription with the exact same contents. >>>>>> >>>>>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't >>>>>> seem to resolve) and onward show a significant slowdown. >>>>>> >>>>>> Here is a snippet of logging from v2.25.0: >>>>>> >>>>>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1 >>>>>> processElement >>>>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/** >>>>>> May 12, 2021 11:16:59 A.M. >>>>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process >>>>>> INFO: Matched 2 files for pattern >>>>>> gs://my-bucket/my-dir/5004728247517184/** >>>>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 >>>>>> processElement >>>>>> INFO: Got ReadableFile: my-file1.json >>>>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 >>>>>> processElement >>>>>> INFO: Got ReadableFile: my-file2.json >>>>>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4 >>>>>> processElement >>>>>> INFO: Got file contents for document_id my-file1.json >>>>>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4 >>>>>> processElement >>>>>> INFO: Got file contents for document_id my-file2.json >>>>>> >>>>>> Note that end-to-end, these steps took about *13 minutes*. With SDK >>>>>> 2.23.0 and identical user code, the same section of the pipeline took *2 >>>>>> seconds*: >>>>>> >>>>>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1 >>>>>> processElement >>>>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/** >>>>>> May 12, 2021 11:03:40 A.M. >>>>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process >>>>>> INFO: Matched 2 files for pattern >>>>>> gs://my-bucket/my-dir/5004728247517184/** >>>>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 >>>>>> processElement >>>>>> INFO: Got ReadableFile: my-file1.json >>>>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 >>>>>> processElement >>>>>> INFO: Got ReadableFile: my-file2.json >>>>>> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4 >>>>>> processElement >>>>>> INFO: Got file contents for document_id my-file1.json >>>>>> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4 >>>>>> processElement >>>>>> INFO: Got file contents for document_id my-file2.json >>>>>> >>>>>> Any thoughts on what could be causing this? >>>>>> >>>>>> Thanks, >>>>>> Evan >>>>>> >>>>>> On Wed, May 12, 2021 at 9:53 AM Evan Galpin <evan.gal...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> >>>>>>> >>>>>>> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang <boyu...@google.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Evan, >>>>>>>> >>>>>>>> What do you mean startup delay? Is it the time that from you start >>>>>>>> the pipeline to the time that you notice the first output record from >>>>>>>> PubSub? >>>>>>>> >>>>>>> >>>>>>> Yes that's what I meant, the seemingly idle system waiting for >>>>>>> pubsub output despite data being in the subscription at pipeline start >>>>>>> time. >>>>>>> >>>>>>> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía <ieme...@gmail.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Can you try running direct runner with the option >>>>>>>>> `--experiments=use_deprecated_read` >>>>>>>>> >>>>>>>> >>>>>>> This seems to work for me, thanks for this! 👍 >>>>>>> >>>>>>> >>>>>>>>> Seems like an instance of >>>>>>>>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858 >>>>>>>>> also reported in >>>>>>>>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E >>>>>>>>> >>>>>>>>> We should rollback using the SDF wrapper by default because of the >>>>>>>>> usability and performance issues reported. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin <evan.gal...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi all, >>>>>>>>>> >>>>>>>>>> I’m experiencing very slow performance and startup delay when >>>>>>>>>> testing a pipeline locally. I’m reading data from a Google PubSub >>>>>>>>>> subscription as the data source, and before each pipeline execution I >>>>>>>>>> ensure that data is present in the subscription (readable from GCP >>>>>>>>>> console). >>>>>>>>>> >>>>>>>>>> I’m seeing startup delay on the order of minutes with >>>>>>>>>> DirectRunner (5-10 min). Is that expected? I did find a Jira >>>>>>>>>> ticket[1] that >>>>>>>>>> at first seemed related, but I think it has more to do with BQ than >>>>>>>>>> DirectRunner. >>>>>>>>>> >>>>>>>>>> I’ve run the pipeline with a debugger connected and confirmed >>>>>>>>>> that it’s minutes before the first DoFn in my pipeline receives any >>>>>>>>>> data. >>>>>>>>>> Is there a way I can profile the direct runner to see what it’s >>>>>>>>>> churning >>>>>>>>>> on? >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Evan >>>>>>>>>> >>>>>>>>>> [1] >>>>>>>>>> >>>>>>>>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548 >>>>>>>>>> >>>>>>>>>