Sorry for the typo in your name. :-)
On 1/6/21 10:11 AM, Jan Lukavský wrote:
Hi Antonie,
yes, for instance. I'd just like to rule out the possibility that a single
DoFn processing multiple partitions (restrictions) brings some overhead in
your case.
Jan
On 12/31/20 10:36 PM, Antonio Si wrote:
Hi Jan,
Sorry for the late reply. My topic has 180 partitions. Do you mean running
with parallelism set to 900?
Thanks.
Antonio.
On 2020/12/23 20:30:34, Jan Lukavský <[email protected]> wrote:
OK,
could you run an experiment and increase the parallelism to something
significantly higher than the total number of partitions, say 5 times higher?
Would that have an impact on throughput in your case?
Jan
On 12/23/20 7:03 PM, Antonio Si wrote:
Hi Jan,
The performance data that I reported was run with parallelism = 8.
We also ran with parallelism = 15 and observed similar behavior,
although I don't have the exact numbers. I can get you the numbers
if needed.
Regarding the number of partitions: since we have multiple topics, the
number of partitions varies from 12 to 180. The highest-TPS topic has
180 partitions, while the lowest-TPS topic has 12 partitions.
Thanks.
Antonio.
On 2020/12/23 12:28:42, Jan Lukavský <[email protected]> wrote:
Hi Antonio,
can you please clarify a few things:
a) what parallelism you use for your sources
b) how many partitions there are in your topic(s)
Thanks,
Jan
On 12/22/20 10:07 PM, Antonio Si wrote:
Hi Boyuan,
Let me clarify: I have tried with and without the
--experiments=beam_fn_api,use_sdf_kafka_read option:
- with --experiments=use_deprecated_read --fasterCopy=true, I
am able to achieve 13K TPS
- with --experiments="beam_fn_api,use_sdf_kafka_read"
--fasterCopy=true, I am able to achieve 10K TPS
- with --fasterCopy=true alone, I am only able to achieve 5K TPS
In our test case, we have multiple topics and the checkpoint interval is
60s. Some topics have much higher traffic than others. We looked a little
at the case with the --experiments="beam_fn_api,use_sdf_kafka_read"
--fasterCopy=true options. Based on our observation, each consumer poll()
in ReadFromKafkaDoFn.processElement() takes about 0.8ms. So for a topic
with high traffic, it keeps going around the loop because every poll()
returns some records. Every poll returns about 200 records, so it takes
about 0.8ms for every 200 records. I am not sure if that is part of the
reason for the performance difference.
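(For scale, a back-of-the-envelope check on those numbers: 200 records per
0.8ms works out to roughly 250,000 records/s of raw poll throughput per
reader.)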
Thanks.
Antonio.
On 2020/12/21 19:03:19, Boyuan Zhang <[email protected]> wrote:
Hi Antonio,
Thanks for the data point. That's very valuable information!
We do have an SDF implementation of Kafka Read instead of using the
wrapper. Would you like to give it a try to see whether it helps improve
your situation? You can use --experiments=beam_fn_api,use_sdf_kafka_read
to switch to the Kafka SDF Read.
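In case it helps, here is a minimal sketch of passing those flags
programmatically (class name and pipeline body are placeholders; the flags
are the ones from this thread; it assumes the Flink runner dependency is on
the classpath):

  import org.apache.beam.sdk.Pipeline;
  import org.apache.beam.sdk.options.PipelineOptions;
  import org.apache.beam.sdk.options.PipelineOptionsFactory;

  public class SdfKafkaReadExample {
    public static void main(String[] args) {
      // Same flags as on the command line.
      String[] flags = {
        "--runner=FlinkRunner",
        "--experiments=beam_fn_api,use_sdf_kafka_read",
        "--fasterCopy=true"
      };
      PipelineOptions options =
          PipelineOptionsFactory.fromArgs(flags).withValidation().create();
      Pipeline pipeline = Pipeline.create(options);
      // ... build the same KafkaIO.read() pipeline as before ...
      pipeline.run();
    }
  }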
On Mon, Dec 21, 2020 at 10:54 AM Boyuan Zhang <[email protected]> wrote:
Hi Jan,
it seems that what we would want is to couple the lifecycle of the Reader
not with the restriction but with the particular instance of the
(Un)boundedSource (after being split). That could be done in the processing
DoFn, if it contained a cache mapping instances of the source to the
(possibly null, i.e. not yet opened) reader. In @NewTracker we could assign
(or create) the reader to the tracker, as the tracker is created for each
restriction.
WDYT?
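A rough sketch of that caching idea (made-up names, not actual Beam code)
might look like:

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;
  import javax.annotation.Nullable;
  import org.apache.beam.sdk.io.UnboundedSource;
  import org.apache.beam.sdk.options.PipelineOptions;

  // One reader per (split) source instance, opened lazily, so that later
  // restrictions reuse the already-open reader instead of creating a new one.
  class ReaderCache<OutputT, CheckpointT extends UnboundedSource.CheckpointMark> {

    private final Map<UnboundedSource<OutputT, CheckpointT>,
                      UnboundedSource.UnboundedReader<OutputT>> readers = new HashMap<>();

    UnboundedSource.UnboundedReader<OutputT> readerFor(
        UnboundedSource<OutputT, CheckpointT> source,
        @Nullable CheckpointT checkpoint,
        PipelineOptions options) throws IOException {
      UnboundedSource.UnboundedReader<OutputT> reader = readers.get(source);
      if (reader == null) {
        // The checkpoint only matters for this first open; afterwards the
        // reader stays coupled to the source instance, not the restriction.
        reader = source.createReader(options, checkpoint);
        readers.put(source, reader);
      }
      return reader;
    }
  }

@NewTracker would then consult such a cache when creating the tracker for a
restriction.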
I was thinking about this, but it seems like it is not applicable to the
way UnboundedSource and UnboundedReader work together.
Please correct me if I'm wrong. The UnboundedReader is created from the
UnboundedSource per CheckpointMark [1], which means that for certain
sources the CheckpointMark could affect attributes like the start position
of the reader when resuming. So a single UnboundedSource could be mapped to
multiple readers because of different instances of CheckpointMark. That's
also the reason why we use the CheckpointMark as the restriction.
Please let me know if I misunderstand your suggestion.
[1]
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java#L73-L78
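For reference, the declaration at [1] is roughly:

  public abstract UnboundedReader<OutputT> createReader(
      PipelineOptions options, @Nullable CheckpointMarkT checkpointMark) throws IOException;

i.e. a reader is created per source and per checkpoint mark, which is why
one source can end up backing several readers across restrictions.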
On Mon, Dec 21, 2020 at 9:18 AM Antonio Si <[email protected]> wrote:
Hi Boyuan,
Sorry for my late reply. I was off for a few days.
I didn't use DirectRunner. I am using FlinkRunner.
We measured the number of Kafka messages that we can process per second.
With Beam v2.26 with --experiments=use_deprecated_read and
--fasterCopy=true, we are able to consume 13K messages per second, but
with Beam v2.26 without the use_deprecated_read option, we are only able
to process 10K messages per second for the same pipeline.
Thanks and regards,
Antonio.
On 2020/12/11 22:19:40, Boyuan Zhang <[email protected]> wrote:
Hi Antonio,
Thanks for the details! Which version of the Beam SDK are you using? And
are you using --experiments=beam_fn_api with DirectRunner to launch your
pipeline?
For ReadFromKafkaDoFn.processElement(), it takes a Kafka topic+partition
as its input element, a KafkaConsumer is assigned to that topic+partition,
and records are then polled continuously. The read will return from the
process fn (to be resumed later) when:
- there are no records currently available (this is an SDF feature, the
self-initiated checkpoint), or
- the OutputAndTimeBoundedSplittableProcessElementInvoker issues a
checkpoint request to ReadFromKafkaDoFn to get partial results. The
checkpoint frequency for DirectRunner is every 100 output records or every
1 second.
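A very simplified sketch of the shape of that loop, just to show where the
two cases above come in (this is not the actual ReadFromKafkaDoFn code;
restriction setup and consumer creation are omitted, and names are
illustrative):

  import java.time.Duration;
  import org.apache.beam.sdk.io.range.OffsetRange;
  import org.apache.beam.sdk.transforms.DoFn;
  import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;
  import org.apache.kafka.clients.consumer.Consumer;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.ConsumerRecords;
  import org.apache.kafka.common.TopicPartition;

  // Illustrative only: @GetInitialRestriction, @NewTracker, etc. omitted.
  class SketchKafkaReadDoFn extends DoFn<TopicPartition, ConsumerRecord<byte[], byte[]>> {

    private transient Consumer<byte[], byte[]> consumer; // assumed set up in @Setup

    @ProcessElement
    public ProcessContinuation processElement(
        RestrictionTracker<OffsetRange, Long> tracker,
        OutputReceiver<ConsumerRecord<byte[], byte[]>> receiver) {
      while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(100));
        if (records.isEmpty()) {
          // Case 1: nothing available right now -- return resume() so the
          // runner can reschedule later (the self-initiated checkpoint).
          return ProcessContinuation.resume();
        }
        for (ConsumerRecord<byte[], byte[]> record : records) {
          if (!tracker.tryClaim(record.offset())) {
            // Case 2: the invoker split the restriction (on DirectRunner
            // roughly every 100 outputs or 1 second) -- stop here and let
            // the residual restriction be rescheduled.
            return ProcessContinuation.stop();
          }
          receiver.output(record);
        }
      }
    }
  }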
It seems like either the self-initiated checkpoint or the
DirectRunner-issued checkpoint gives you the performance regression, since
there is overhead when rescheduling residuals. In your case, it's more
likely that the checkpoint behavior of
OutputAndTimeBoundedSplittableProcessElementInvoker gives you 200 elements
per batch. I want to understand what kind of performance regression you are
noticing: is it slower to output the same amount of records?
On Fri, Dec 11, 2020 at 1:31 PM Antonio Si <[email protected]> wrote:
Hi Boyuan,
This is Antonio. I reported the KafkaIO.read() performance issue on the
Slack channel a few days ago.
I am not sure if this is helpful, but I have been doing some debugging on
the SDK KafkaIO performance issue for our pipeline and I would like to
provide some observations.
It looks like in my case the ReadFromKafkaDoFn.processElement() was invoked
within the same thread, and every time kafkaConsumer.poll() is called it
returns some records, from 1 up to 200, so it proceeds to run the pipeline
steps. Each kafkaConsumer.poll() takes about 0.8ms. So, in this case, the
polling and the running of the pipeline are executed sequentially within a
single thread; after processing a batch of records, it needs to wait about
0.8ms before it can process the next batch of records.
Any suggestions would be appreciated.
Hope that helps.
Thanks and regards,
Antonio.
On 2020/12/04 19:17:46, Boyuan Zhang <[email protected]> wrote:
Opened https://issues.apache.org/jira/browse/BEAM-11403 for
tracking.
On Fri, Dec 4, 2020 at 10:52 AM Boyuan Zhang <[email protected]> wrote:
Thanks for the pointer, Steve! I'll check it out. The execution paths for
UnboundedSource and the SDF wrapper are different. It's highly possible
that the regression comes either from the invocation path for the SDF
wrapper or from the implementation of the SDF wrapper itself.
On Fri, Dec 4, 2020 at 6:33 AM Steve Niemitz <[email protected]> wrote:
Coincidentally, someone else in the ASF Slack mentioned [1] yesterday that
they were seeing significantly reduced performance using KafkaIO.Read with
the SDF wrapper vs the unbounded source. They mentioned they were using
Flink 1.9.
[1] https://the-asf.slack.com/archives/C9H0YNP3P/p1607057900393900
On Thu, Dec 3, 2020 at 1:56 PM Boyuan Zhang <[email protected]> wrote:
Hi Steve,
I think the major performance regression comes from
OutputAndTimeBoundedSplittableProcessElementInvoker [1], which checkpoints
the DoFn based on time/output limits and uses timers/state to reschedule
work.
[1]
https://github.com/apache/beam/blob/master/runners/core-java/src/main/java/org/apache/beam/runners/core/OutputAndTimeBoundedSplittableProcessElementInvoker.java
On Thu, Dec 3, 2020 at 9:40 AM Steve Niemitz <[email protected]> wrote:
I have a pipeline that reads from pubsub, does some aggregation, and
writes to various places. Previously, in older versions of Beam, when
running this in the DirectRunner, messages would go through the pipeline
almost instantly, making it very easy to debug locally, etc.

However, after upgrading to Beam 2.25, I noticed that it could take on the
order of 5-10 minutes for messages to get from the pubsub read step to the
next step in the pipeline (deserializing them, etc.). The subscription
being read from has on the order of 100,000 elements/sec arriving in it.

Setting --experiments=use_deprecated_read fixes it, and makes the pipeline
behave as it did before.

It seems like the SDF implementation in the DirectRunner here is causing
some kind of issue, either buffering a very large amount of data before
emitting it in a bundle, or something else. Has anyone else run into this?