Hmm, I think I spoke too soon. I'm still seeing an issue of overall
DirectRunner slowness, not just pubsub. I have a pipeline like so:

Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()  |
FileIO.readMatches()  |  Read file contents  |  etc

I have temporarily set up a transform between each step to log what's going
on and illustrate timing issues.  I ran a series of tests changing only the
SDK version each time since I hadn't noticed this performance issue with
2.19.0 (effectively git-bisect). Before each test, I seeded the pubsub
subscription with the exact same contents.

SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't seem to
resolve) and onward show a significant slowdown.

Here is a snippet of logging from v2.25.0:

*May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
processElement
INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:16:59 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn
process
INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file1.json
May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file2.json
May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file1.json
*May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
processElement
INFO: Got file contents for document_id my-file2.json

Note that end-to-end, these steps took about *13 minutes*. With SDK 2.23.0
and identical user code, the same section of the pipeline took *2 seconds*:

*May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
processElement
INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:03:40 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn
process
INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file1.json
May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file2.json
May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file1.json
*May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
processElement
INFO: Got file contents for document_id my-file2.json

Any thoughts on what could be causing this?

Thanks,
Evan

On Wed, May 12, 2021 at 9:53 AM Evan Galpin <evan.gal...@gmail.com> wrote:

>
>
> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang <boyu...@google.com> wrote:
>
>> Hi Evan,
>>
>> What do you mean startup delay? Is it the time that from you start the
>> pipeline to the time that you notice the first output record from PubSub?
>>
>
> Yes that's what I meant, the seemingly idle system waiting for pubsub
> output despite data being in the subscription at pipeline start time.
>
> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>
>>> Can you try running direct runner with the option
>>> `--experiments=use_deprecated_read`
>>>
>>
> This seems to work for me, thanks for this! 👍
>
>
>>> Seems like an instance of
>>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
>>> also reported in
>>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>>>
>>> We should rollback using the SDF wrapper by default because of the
>>> usability and performance issues reported.
>>>
>>>
>>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin <evan.gal...@gmail.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I’m experiencing very slow performance and startup delay when testing a
>>>> pipeline locally. I’m reading data from a Google PubSub subscription as the
>>>> data source, and before each pipeline execution I ensure that data is
>>>> present in the subscription (readable from GCP console).
>>>>
>>>> I’m seeing startup delay on the order of minutes with DirectRunner
>>>> (5-10 min). Is that expected? I did find a Jira ticket[1] that at first
>>>> seemed related, but I think it has more to do with BQ than DirectRunner.
>>>>
>>>> I’ve run the pipeline with a debugger connected and confirmed that it’s
>>>> minutes before the first DoFn in my pipeline receives any data. Is there a
>>>> way I can profile the direct runner to see what it’s churning on?
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>>>>
>>>

Reply via email to