Hi Thomas,

Thanks for the further profiling information; it is good to see that we have now 
pinned down the location causing the regression.
I was also suspicious of #snapshotState in the earlier discussions, since it can 
indeed block normal operator processing for a long time.

Based on your feedback below, the sleep time during #snapshotState seems to be 
the main concern, so I also dug into the implementation of 
FlinkKinesisProducer#snapshotState:
// Flush loop in FlinkKinesisProducer#snapshotState: keep flushing and sleeping
// until the KPL reports no more outstanding records.
while (producer.getOutstandingRecordsCount() > 0) {
    producer.flush();
    try {
        Thread.sleep(500);
    } catch (InterruptedException e) {
        LOG.warn("Flushing was interrupted.");
        break;
    }
}
It seems that the total sleep time is mainly determined by how quickly the 
KinesisProducer implementation provided by amazonaws drains its outstanding 
records, which I am not very familiar with.
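One way to narrow this down could be to temporarily instrument that loop and log 
how the outstanding record count drains per iteration. This is a debugging sketch 
only, reusing the producer and LOG fields from the snippet above:

    long flushStart = System.currentTimeMillis();
    int iteration = 0;
    while (producer.getOutstandingRecordsCount() > 0) {
        producer.flush();
        iteration++;
        // Log how many records are still pending and how long we have been flushing.
        LOG.info("snapshotState flush iteration {}: {} records outstanding after {} ms",
                iteration, producer.getOutstandingRecordsCount(),
                System.currentTimeMillis() - flushStart);
        try {
            Thread.sleep(500);
        } catch (InterruptedException e) {
            LOG.warn("Flushing was interrupted.");
            break;
        }
    }

That would tell us whether the KPL drains records slowly (pointing at the KPL/SDK 
upgrades) or drains them quickly while the fixed 500 ms sleep granularity dominates 
the measured sleep time.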
Apart from that, I noticed two upgrades related to it in release-1.11.0: one 
upgrades amazon-kinesis-producer to 0.14.0 [1] and the other upgrades the 
aws-sdk-version to 1.11.754 [2].
You mentioned that you already reverted the SDK upgrade and the symptoms did not 
change. Did you also try reverting [1] to verify? A possible revert is sketched 
below the references.
[1] https://issues.apache.org/jira/browse/FLINK-17496
[2] https://issues.apache.org/jira/browse/FLINK-14881
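
For reference, a revert of [1] could look roughly like the following in 
flink-connector-kinesis/pom.xml, analogous to your SDK diff further below. I am 
assuming the pre-upgrade KPL version was 0.13.1; please double-check the exact 
value against the release-1.10 pom before rebuilding the connector:

--- a/flink-connectors/flink-connector-kinesis/pom.xml
+++ b/flink-connectors/flink-connector-kinesis/pom.xml
@@ ... @@
-               <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
+               <aws.kinesis-kpl.version>0.13.1</aws.kinesis-kpl.version>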

Best,
Zhijiang
------------------------------------------------------------------
From:Thomas Weise <t...@apache.org>
Send Time: July 17, 2020 (Friday) 05:29
To:dev <dev@flink.apache.org>
Cc:Zhijiang <wangzhijiang...@aliyun.com>; Stephan Ewen <se...@apache.org>; 
Arvid Heise <ar...@ververica.com>; Aljoscha Krettek <aljos...@apache.org>
Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0, release 
candidate #4)

Sorry for the delay.

I confirmed that the regression is due to the sink (unsurprising, since
another job with the same consumer, but not the producer, runs as expected).

As promised, I did CPU profiling on the problematic application, which gives
more insight into the regression [1].

The screenshots show that the average time for snapshotState increases from
~9s to ~28s. The data also shows the increase in sleep time during
snapshotState.

Does anyone, based on changes made in 1.11, have a theory why?

I had previously looked at the changes to the Kinesis connector and also
reverted the SDK upgrade, which did not change the situation.

It will likely be necessary to drill into the sink / checkpointing details
to understand the cause of the problem.

Let me know if anyone has specific questions that I can answer from the
profiling results.

Thomas

[1]
https://docs.google.com/presentation/d/159IVXQGXabjnYJk3oVm3UP2UW_5G-TGs_u9yzYb030I/edit?usp=sharing

On Mon, Jul 13, 2020 at 11:14 AM Thomas Weise <t...@apache.org> wrote:

> + dev@ for visibility
>
> I will investigate further today.
>
>
> On Wed, Jul 8, 2020 at 4:42 AM Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
>> On 06.07.20 20:39, Stephan Ewen wrote:
>> >    - Did sink checkpoint notifications change in a relevant way, for
>> example
>> > due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
>>
>> I think that's unrelated: the Kafka fixes were isolated in Kafka and the
>> one bug I discovered on the way was about the Task reaper.
>>
>>
>> On 07.07.20 17:51, Zhijiang wrote:
>> > Sorry for my misunderstood of the previous information, Thomas. I was
>> assuming that the sync checkpoint duration increased after upgrade as it
>> was mentioned before.
>> >
>> > If I remembered correctly, the memory state backend also has the same
>> issue? If so, we can dismiss the rocksDB state changes. As the slot sharing
>> enabled, the downstream and upstream should
>> > probably deployed into the same slot, then no network shuffle effect.
>> >
>> > I think we need to find out whether it has other symptoms changed
>> besides the performance regression to further figure out the scope.
>> > E.g. any metrics changes, the number of TaskManager and the number of
>> slots per TaskManager from deployment changes.
>> > 40% regression is really big, I guess the changes should also be
>> reflected in other places.
>> >
>> > I am not sure whether we can reproduce the regression in our AWS
>> environment by writing any Kinesis jobs, since there are also normal
>> Kinesis jobs as Thomas mentioned after upgrade.
>> > So it probably looks like to touch some corner case. I am very willing
>> to provide any help for debugging if possible.
>> >
>> >
>> > Best,
>> > Zhijiang
>> >
>> >
>> > ------------------------------------------------------------------
>> > From:Thomas Weise <t...@apache.org>
>> > Send Time: July 7, 2020 (Tuesday) 23:01
>> > To:Stephan Ewen <se...@apache.org>
>> > Cc:Aljoscha Krettek <aljos...@apache.org>; Arvid Heise <
>> ar...@ververica.com>; Zhijiang <wangzhijiang...@aliyun.com>
>> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
>> release candidate #4)
>> >
>> > We are deploying our apps with FlinkK8sOperator. We have one job that
>> works as expected after the upgrade and the one discussed here that has the
>> performance regression.
>> >
>> > "The performance regression is obvious caused by long duration of sync
>> checkpoint process in Kinesis sink operator, which would block the normal
>> data processing until back pressure the source."
>> >
>> > That's a constant. Before (1.10) and upgrade have the same sync
>> checkpointing time. The question is what change came in with the upgrade.
>> >
>> >
>> >
>> > On Tue, Jul 7, 2020 at 7:33 AM Stephan Ewen <se...@apache.org> wrote:
>> >
>> > @Thomas Just one thing real quick: Are you using the standalone setup
>> scripts (like start-cluster.sh, and the former "slaves" file) ?
>> > Be aware that this is now called "workers" because of avoiding
>> sensitive names.
>> > In one internal benchmark we saw quite a lot of slowdown initially,
>> before seeing that the cluster was not a distributed cluster any more ;-)
>> >
>> >
>> > On Tue, Jul 7, 2020 at 9:08 AM Zhijiang <wangzhijiang...@aliyun.com>
>> wrote:
>> > Thanks for this kickoff and help analysis, Stephan!
>> > Thanks for the further feedback and investigation, Thomas!
>> >
>> > The performance regression is obvious caused by long duration of sync
>> checkpoint process in Kinesis sink operator, which would block the normal
>> data processing until back pressure the source.
>> > Maybe we could dig into the process of sync execution in checkpoint.
>> E.g. break down the steps inside respective operator#snapshotState to
>> statistic which operation cost most of the time, then
>> > we might probably find the root cause to bring such cost.
>> >
>> > Look forward to the further progress. :)
>> >
>> > Best,
>> > Zhijiang
>> >
>> > ------------------------------------------------------------------
>> > From:Stephan Ewen <se...@apache.org>
>> > Send Time: July 7, 2020 (Tuesday) 14:52
>> > To:Thomas Weise <t...@apache.org>
>> > Cc:Stephan Ewen <se...@apache.org>; Zhijiang <
>> wangzhijiang...@aliyun.com>; Aljoscha Krettek <aljos...@apache.org>;
>> Arvid Heise <ar...@ververica.com>
>> > Subject:Re: Kinesis Performance Issue (was [VOTE] Release 1.11.0,
>> release candidate #4)
>> >
>> > Thank you for the digging so deeply.
>> > Mysterious think this regression.
>> >
>> > On Mon, Jul 6, 2020, 22:56 Thomas Weise <t...@apache.org> wrote:
>> > @Stephan: yes, I refer to sync time in the web UI (it is unchanged
>> between 1.10 and 1.11 for the specific pipeline).
>> >
>> > I verified that increasing the checkpointing interval does not make a
>> difference.
>> >
>> > I looked at the Kinesis connector changes since 1.10.1 and don't see
>> anything that could cause this.
>> >
>> > Another pipeline that is using the Kinesis consumer (but not the
>> producer) performs as expected.
>> >
>> > I tried reverting the AWS SDK version change, symptoms remain unchanged:
>> >
>> > diff --git a/flink-connectors/flink-connector-kinesis/pom.xml
>> b/flink-connectors/flink-connector-kinesis/pom.xml
>> > index a6abce23ba..741743a05e 100644
>> > --- a/flink-connectors/flink-connector-kinesis/pom.xml
>> > +++ b/flink-connectors/flink-connector-kinesis/pom.xml
>> > @@ -33,7 +33,7 @@ under the License.
>> >
>> <artifactId>flink-connector-kinesis_${scala.binary.version}</artifactId>
>> >          <name>flink-connector-kinesis</name>
>> >          <properties>
>> > -               <aws.sdk.version>1.11.754</aws.sdk.version>
>> > +               <aws.sdk.version>1.11.603</aws.sdk.version>
>> >
>> <aws.kinesis-kcl.version>1.11.2</aws.kinesis-kcl.version>
>> >
>> <aws.kinesis-kpl.version>0.14.0</aws.kinesis-kpl.version>
>> >
>> <aws.dynamodbstreams-kinesis-adapter.version>1.5.0</aws.dynamodbstreams-kinesis-adapter.version>
>> >
>> > I'm planning to take a look with a profiler next.
>> >
>> > Thomas
>> >
>> >
>> > On Mon, Jul 6, 2020 at 11:40 AM Stephan Ewen <se...@apache.org> wrote:
>> > Hi all!
>> >
>> > Forking this thread out of the release vote thread.
>> >  From what Thomas describes, it really sounds like a sink-specific
>> issue.
>> >
>> > @Thomas: When you say sink has a long synchronous checkpoint time, you
>> mean the time that is shown as "sync time" on the metrics and web UI? That
>> is not including any network buffer related operations. It is purely the
>> operator's time.
>> >
>> > Can we dig into the changes we did in sinks:
>> >    - Kinesis version upgrade, AWS library updates
>> >
>> >    - Could it be that some call (checkpoint complete) that was
>> previously (1.10) in a separate thread is not in the mailbox and this
>> simply reduces the number of threads that do the work?
>> >
>> >    - Did sink checkpoint notifications change in a relevant way, for
>> example due to some Kafka issues we addressed in 1.11 (@Aljoscha maybe?)
>> >
>> > Best,
>> > Stephan
>> >
>> >
>> > On Sun, Jul 5, 2020 at 7:10 AM Zhijiang 
>> > <wangzhijiang...@aliyun.com.invalid>
>> wrote:
>> > Hi Thomas,
>> >
>> >   Regarding [2], it has more detail infos in the Jira description (
>> https://issues.apache.org/jira/browse/FLINK-16404).
>> >
>> >   I can also give some basic explanations here to dismiss the concern.
>> >   1. In the past, the following buffers after the barrier will be
>> cached on downstream side before alignment.
>> >   2. In 1.11, the upstream would not send the buffers after the
>> barrier. When the downstream finishes the alignment, it will notify the
>> downstream of continuing sending following buffers, since it can process
>> them after alignment.
>> >   3. The only difference is that the temporary blocked buffers are
>> cached either on downstream side or on upstream side before alignment.
>> >   4. The side effect would be the additional notification cost for
>> every barrier alignment. If the downstream and upstream are deployed in
>> separate TaskManager, the cost is network transport delay (the effect can
>> be ignored based on our testing with 1s checkpoint interval). For sharing
>> slot in your case, the cost is only one method call in processor, can be
>> ignored also.
>> >
>> >   You mentioned "In this case, the downstream task has a high average
>> checkpoint duration(~30s, sync part)." This duration is not reflecting the
>> changes above, and it is only indicating the duration for calling
>> `Operation.snapshotState`.
>> >   If this duration is beyond your expectation, you can check or debug
>> whether the source/sink operations might take more time to finish
>> `snapshotState` in practice. E.g. you can
>> >   make the implementation of this method as empty to further verify the
>> effect.
>> >
>> >   Best,
>> >   Zhijiang
>> >
>> >
>> >   ------------------------------------------------------------------
>> >   From:Thomas Weise <t...@apache.org>
>> >   Send Time: July 5, 2020 (Sunday) 12:22
>> >   To:dev <dev@flink.apache.org>; Zhijiang <wangzhijiang...@aliyun.com>
>> >   Cc:Yingjie Cao <kevin.ying...@gmail.com>
>> >   Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >
>> >   Hi Zhijiang,
>> >
>> >   Could you please point me to more details regarding: "[2]: Delay send
>> the
>> >   following buffers after checkpoint barrier on upstream side until
>> barrier
>> >   alignment on downstream side."
>> >
>> >   In this case, the downstream task has a high average checkpoint
>> duration
>> >   (~30s, sync part). If there was a change to hold buffers depending on
>> >   downstream performance, could this possibly apply to this case (even
>> when
>> >   there is no shuffle that would require alignment)?
>> >
>> >   Thanks,
>> >   Thomas
>> >
>> >
>> >   On Sat, Jul 4, 2020 at 7:39 AM Zhijiang <wangzhijiang...@aliyun.com
>> .invalid>
>> >   wrote:
>> >
>> >   > Hi Thomas,
>> >   >
>> >   > Thanks for the further update information.
>> >   >
>> >   > I guess we can dismiss the network stack changes, since in your
>> case the
>> >   > downstream and upstream would probably be deployed in the same slot
>> >   > bypassing the network data shuffle.
>> >   > Also I guess release-1.11 will not bring general performance
>> regression in
>> >   > runtime engine, as we also did the performance testing for all
>> general
>> >   > cases by [1] in real cluster before and the testing results should
>> fit the
>> >   > expectation. But we indeed did not test the specific source and sink
>> >   > connectors yet as I known.
>> >   >
>> >   > Regarding your performance regression with 40%, I wonder it is
>> probably
>> >   > related to specific source/sink changes (e.g. kinesis) or
>> environment
>> >   > issues with corner case.
>> >   > If possible, it would be helpful to further locate whether the
>> regression
>> >   > is caused by kinesis, by replacing the kinesis source & sink and
>> keeping
>> >   > the others same.
>> >   >
>> >   > As you said, it would be efficient to contact with you directly
>> next week
>> >   > to further discuss this issue. And we are willing/eager to provide
>> any help
>> >   > to resolve this issue soon.
>> >   >
>> >   > Besides that, I guess this issue should not be the blocker for the
>> >   > release, since it is probably a corner case based on the current
>> analysis.
>> >   > If we really conclude anything need to be resolved after the final
>> >   > release, then we can also make the next minor release-1.11.1 come
>> soon.
>> >   >
>> >   > [1] https://issues.apache.org/jira/browse/FLINK-18433
>> >   >
>> >   > Best,
>> >   > Zhijiang
>> >   >
>> >   >
>> >   > ------------------------------------------------------------------
>> >   > From:Thomas Weise <t...@apache.org>
>> >   > Send Time: July 4, 2020 (Saturday) 12:26
>> >   > To:dev <dev@flink.apache.org>; Zhijiang <wangzhijiang...@aliyun.com
>> >
>> >   > Cc:Yingjie Cao <kevin.ying...@gmail.com>
>> >   > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   >
>> >   > Hi Zhijiang,
>> >   >
>> >   > It will probably be best if we connect next week and discuss the
>> issue
>> >   > directly since this could be quite difficult to reproduce.
>> >   >
>> >   > Before the testing result on our side comes out for your respective
>> job
>> >   > case, I have some other questions to confirm for further analysis:
>> >   >     -  How much percentage regression you found after switching to
>> 1.11?
>> >   >
>> >   > ~40% throughput decline
>> >   >
>> >   >     -  Are there any network bottleneck in your cluster? E.g. the
>> network
>> >   > bandwidth is full caused by other jobs? If so, it might have more
>> effects
>> >   > by above [2]
>> >   >
>> >   > The test runs on a k8s cluster that is also used for other
>> production jobs.
>> >   > There is no reason be believe network is the bottleneck.
>> >   >
>> >   >     -  Did you adjust the default network buffer setting? E.g.
>> >   > "taskmanager.network.memory.floating-buffers-per-gate" or
>> >   > "taskmanager.network.memory.buffers-per-channel"
>> >   >
>> >   > The job is using the defaults, i.e we don't configure the settings.
>> If you
>> >   > want me to try specific settings in the hope that it will help to
>> isolate
>> >   > the issue please let me know.
>> >   >
>> >   >     -  I guess the topology has three vertexes "KinesisConsumer ->
>> Chained
>> >   > FlatMap -> KinesisProducer", and the partition mode for
>> "KinesisConsumer ->
>> >   > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If so,
>> the edge
>> >   > connection is one-to-one, not all-to-all, then the above [1][2]
>> should no
>> >   > effects in theory with default network buffer setting.
>> >   >
>> >   > There are only 2 vertices and the edge is "forward".
>> >   >
>> >   >     - By slot sharing, I guess these three vertex parallelism task
>> would
>> >   > probably be deployed into the same slot, then the data shuffle is
>> by memory
>> >   > queue, not network stack. If so, the above [2] should no effect.
>> >   >
>> >   > Yes, vertices share slots.
>> >   >
>> >   >     - I also saw some Jira changes for kinesis in this release,
>> could you
>> >   > confirm that these changes would not effect the performance?
>> >   >
>> >   > I will need to take a look. 1.10 already had a regression
>> introduced by the
>> >   > Kinesis producer update.
>> >   >
>> >   >
>> >   > Thanks,
>> >   > Thomas
>> >   >
>> >   >
>> >   > On Thu, Jul 2, 2020 at 11:46 PM Zhijiang <
>> wangzhijiang...@aliyun.com
>> >   > .invalid>
>> >   > wrote:
>> >   >
>> >   > > Hi Thomas,
>> >   > >
>> >   > > Thanks for your reply with rich information!
>> >   > >
>> >   > > We are trying to reproduce your case in our cluster to further
>> verify it,
>> >   > > and  @Yingjie Cao is working on it now.
>> >   > >  As we have not kinesis consumer and producer internally, so we
>> will
>> >   > > construct the common source and sink instead in the case of
>> backpressure.
>> >   > >
>> >   > > Firstly, we can dismiss the rockdb factor in this release, since
>> you also
>> >   > > mentioned that "filesystem leads to same symptoms".
>> >   > >
>> >   > > Secondly, if my understanding is right, you emphasis that the
>> regression
>> >   > > only exists for the jobs with low checkpoint interval (10s).
>> >   > > Based on that, I have two suspicions with the network related
>> changes in
>> >   > > this release:
>> >   > >     - [1]: Limited the maximum backlog value (default 10) in
>> subpartition
>> >   > > queue.
>> >   > >     - [2]: Delay send the following buffers after checkpoint
>> barrier on
>> >   > > upstream side until barrier alignment on downstream side.
>> >   > >
>> >   > > These changes are motivated for reducing the in-flight buffers to
>> speedup
>> >   > > checkpoint especially in the case of backpressure.
>> >   > > In theory they should have very minor performance effect and
>> actually we
>> >   > > also tested in cluster to verify within expectation before
>> merging them,
>> >   > >  but maybe there are other corner cases we have not thought of
>> before.
>> >   > >
>> >   > > Before the testing result on our side comes out for your
>> respective job
>> >   > > case, I have some other questions to confirm for further analysis:
>> >   > >     -  How much percentage regression you found after switching
>> to 1.11?
>> >   > >     -  Are there any network bottleneck in your cluster? E.g. the
>> network
>> >   > > bandwidth is full caused by other jobs? If so, it might have more
>> effects
>> >   > > by above [2]
>> >   > >     -  Did you adjust the default network buffer setting? E.g.
>> >   > > "taskmanager.network.memory.floating-buffers-per-gate" or
>> >   > > "taskmanager.network.memory.buffers-per-channel"
>> >   > >     -  I guess the topology has three vertexes "KinesisConsumer ->
>> >   > Chained
>> >   > > FlatMap -> KinesisProducer", and the partition mode for
>> "KinesisConsumer
>> >   > ->
>> >   > > FlatMap" and "FlatMap->KinesisProducer" are both "forward"? If
>> so, the
>> >   > edge
>> >   > > connection is one-to-one, not all-to-all, then the above [1][2]
>> should no
>> >   > > effects in theory with default network buffer setting.
>> >   > >     - By slot sharing, I guess these three vertex parallelism
>> task would
>> >   > > probably be deployed into the same slot, then the data shuffle is
>> by
>> >   > memory
>> >   > > queue, not network stack. If so, the above [2] should no effect.
>> >   > >     - I also saw some Jira changes for kinesis in this release,
>> could you
>> >   > > confirm that these changes would not effect the performance?
>> >   > >
>> >   > > Best,
>> >   > > Zhijiang
>> >   > >
>> >   > >
>> >   > > ------------------------------------------------------------------
>> >   > > From:Thomas Weise <t...@apache.org>
>> >   > > Send Time: July 3, 2020 (Friday) 01:07
>> >   > > To:dev <dev@flink.apache.org>; Zhijiang <
>> wangzhijiang...@aliyun.com>
>> >   > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   > >
>> >   > > Hi Zhijiang,
>> >   > >
>> >   > > The performance degradation manifests in backpressure which leads
>> to
>> >   > > growing backlog in the source. I switched a few times between
>> 1.10 and
>> >   > 1.11
>> >   > > and the behavior is consistent.
>> >   > >
>> >   > > The DAG is:
>> >   > >
>> >   > > KinesisConsumer -> (Flat Map, Flat Map, Flat Map)   --------
>> forward
>> >   > > ---------> KinesisProducer
>> >   > >
>> >   > > Parallelism: 160
>> >   > > No shuffle/rebalance.
>> >   > >
>> >   > > Checkpointing config:
>> >   > >
>> >   > > Checkpointing Mode Exactly Once
>> >   > > Interval 10s
>> >   > > Timeout 10m 0s
>> >   > > Minimum Pause Between Checkpoints 10s
>> >   > > Maximum Concurrent Checkpoints 1
>> >   > > Persist Checkpoints Externally Enabled (delete on cancellation)
>> >   > >
>> >   > > State backend: rocksdb  (filesystem leads to same symptoms)
>> >   > > Checkpoint size is tiny (500KB)
>> >   > >
>> >   > > An interesting difference to another job that I had upgraded
>> successfully
>> >   > > is the low checkpointing interval.
>> >   > >
>> >   > > Thanks,
>> >   > > Thomas
>> >   > >
>> >   > >
>> >   > > On Wed, Jul 1, 2020 at 9:02 PM Zhijiang <
>> wangzhijiang...@aliyun.com
>> >   > > .invalid>
>> >   > > wrote:
>> >   > >
>> >   > > > Hi Thomas,
>> >   > > >
>> >   > > > Thanks for the efficient feedback.
>> >   > > >
>> >   > > > Regarding the suggestion of adding the release notes document,
>> I agree
>> >   > > > with your point. Maybe we should adjust the vote template
>> accordingly
>> >   > in
>> >   > > > the respective wiki to guide the following release processes.
>> >   > > >
>> >   > > > Regarding the performance regression, could you provide some
>> more
>> >   > details
>> >   > > > for our better measurement or reproducing on our sides?
>> >   > > > E.g. I guess the topology only includes two vertexes source and
>> sink?
>> >   > > > What is the parallelism for every vertex?
>> >   > > > The upstream shuffles data to the downstream via rebalance
>> partitioner
>> >   > or
>> >   > > > other?
>> >   > > > The checkpoint mode is exactly-once with rocksDB state backend?
>> >   > > > The backpressure happened in this case?
>> >   > > > How much percentage regression in this case?
>> >   > > >
>> >   > > > Best,
>> >   > > > Zhijiang
>> >   > > >
>> >   > > >
>> >   > > >
>> >   > > >
>> ------------------------------------------------------------------
>> >   > > > From:Thomas Weise <t...@apache.org>
>> >   > > > Send Time: July 2, 2020 (Thursday) 09:54
>> >   > > > To:dev <dev@flink.apache.org>
>> >   > > > Subject:Re: [VOTE] Release 1.11.0, release candidate #4
>> >   > > >
>> >   > > > Hi Till,
>> >   > > >
>> >   > > > Yes, we don't have the setting in flink-conf.yaml.
>> >   > > >
>> >   > > > Generally, we carry forward the existing configuration and any
>> change
>> >   > to
>> >   > > > default configuration values would impact the upgrade.
>> >   > > >
>> >   > > > Yes, since it is an incompatible change I would state it in the
>> release
>> >   > > > notes.
>> >   > > >
>> >   > > > Thanks,
>> >   > > > Thomas
>> >   > > >
>> >   > > > BTW I found a performance regression while trying to upgrade
>> another
>> >   > > > pipeline with this RC. It is a simple Kinesis to Kinesis job.
>> Wasn't
>> >   > able
>> >   > > > to pin it down yet, symptoms include increased checkpoint
>> alignment
>> >   > time.
>> >   > > >
>> >   > > > On Wed, Jul 1, 2020 at 12:04 AM Till Rohrmann <
>> trohrm...@apache.org>
>> >   > > > wrote:
>> >   > > >
>> >   > > > > Hi Thomas,
>> >   > > > >
>> >   > > > > just to confirm: When starting the image in local mode, then
>> you
>> >   > don't
>> >   > > > have
>> >   > > > > any of the JobManager memory configuration settings
>> configured in the
>> >   > > > > effective flink-conf.yaml, right? Does this mean that you have
>> >   > > explicitly
>> >   > > > > removed `jobmanager.heap.size: 1024m` from the default
>> configuration?
>> >   > > If
>> >   > > > > this is the case, then I believe it was more of an
>> unintentional
>> >   > > artifact
>> >   > > > > that it worked before and it has been corrected now so that
>> one needs
>> >   > > to
>> >   > > > > specify the memory of the JM process explicitly. Do you think
>> it
>> >   > would
>> >   > > > help
>> >   > > > > to explicitly state this in the release notes?
>> >   > > > >
>> >   > > > > Cheers,
>> >   > > > > Till
>> >   > > > >
>> >   > > > > On Wed, Jul 1, 2020 at 7:01 AM Thomas Weise <t...@apache.org>
>> wrote:
>> >   > > > >
>> >   > > > > > Thanks for preparing another RC!
>> >   > > > > >
>> >   > > > > > As mentioned in the previous RC thread, it would be super
>> helpful
>> >   > if
>> >   > > > the
>> >   > > > > > release notes that are part of the documentation can be
>> included
>> >   > [1].
>> >   > > > > It's
>> >   > > > > > a significant time-saver to have read those first.
>> >   > > > > >
>> >   > > > > > I found one more non-backward compatible change that would
>> be worth
>> >   > > > > > addressing/mentioning:
>> >   > > > > >
>> >   > > > > > It is now necessary to configure the jobmanager heap size in
>> >   > > > > > flink-conf.yaml (with either jobmanager.heap.size
>> >   > > > > > or jobmanager.memory.heap.size). Why would I not want to do
>> that
>> >   > > > anyways?
>> >   > > > > > Well, we set it dynamically for a cluster deployment via the
>> >   > > > > > flinkk8soperator, but the container image can also be used
>> for
>> >   > > testing
>> >   > > > > with
>> >   > > > > > local mode (./bin/jobmanager.sh start-foreground local).
>> That will
>> >   > > fail
>> >   > > > > if
>> >   > > > > > the heap wasn't configured and that's how I noticed it.
>> >   > > > > >
>> >   > > > > > Thanks,
>> >   > > > > > Thomas
>> >   > > > > >
>> >   > > > > > [1]
>> >   > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > >
>> >   >
>> https://ci.apache.org/projects/flink/flink-docs-release-1.11/release-notes/flink-1.11.html
>> >   > > > > >
>> >   > > > > > On Tue, Jun 30, 2020 at 3:18 AM Zhijiang <
>> >   > wangzhijiang...@aliyun.com
>> >   > > > > > .invalid>
>> >   > > > > > wrote:
>> >   > > > > >
>> >   > > > > > > Hi everyone,
>> >   > > > > > >
>> >   > > > > > > Please review and vote on the release candidate #4 for the
>> >   > version
>> >   > > > > > 1.11.0,
>> >   > > > > > > as follows:
>> >   > > > > > > [ ] +1, Approve the release
>> >   > > > > > > [ ] -1, Do not approve the release (please provide
>> specific
>> >   > > comments)
>> >   > > > > > >
>> >   > > > > > > The complete staging area is available for your review,
>> which
>> >   > > > includes:
>> >   > > > > > > * JIRA release notes [1],
>> >   > > > > > > * the official Apache source release and binary
>> convenience
>> >   > > releases
>> >   > > > to
>> >   > > > > > be
>> >   > > > > > > deployed to dist.apache.org [2], which are signed with
>> the key
>> >   > > with
>> >   > > > > > > fingerprint 2DA85B93244FDFA19A6244500653C0A2CEA00D0E [3],
>> >   > > > > > > * all artifacts to be deployed to the Maven Central
>> Repository
>> >   > [4],
>> >   > > > > > > * source code tag "release-1.11.0-rc4" [5],
>> >   > > > > > > * website pull request listing the new release and adding
>> >   > > > announcement
>> >   > > > > > > blog post [6].
>> >   > > > > > >
>> >   > > > > > > The vote will be open for at least 72 hours. It is
>> adopted by
>> >   > > > majority
>> >   > > > > > > approval, with at least 3 PMC affirmative votes.
>> >   > > > > > >
>> >   > > > > > > Thanks,
>> >   > > > > > > Release Manager
>> >   > > > > > >
>> >   > > > > > > [1]
>> >   > > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > >
>> >   >
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12346364
>> >   > > > > > > [2]
>> >   > https://dist.apache.org/repos/dist/dev/flink/flink-1.11.0-rc4/
>> >   > > > > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
>> >   > > > > > > [4]
>> >   > > > > > >
>> >   > > > >
>> >   > >
>> https://repository.apache.org/content/repositories/orgapacheflink-1377/
>> >   > > > > > > [5]
>> >   > > https://github.com/apache/flink/releases/tag/release-1.11.0-rc4
>> >   > > > > > > [6] https://github.com/apache/flink-web/pull/352
>> >   > > > > > >
>> >   > > > > > >
>> >   > > > > >
>> >   > > > >
>> >   > > >
>> >   > > >
>> >   > >
>> >   > >
>> >   >
>> >   >
>> >
>> >
>> >
>>
>>
