Re: Question about ZooKeeper HA new structure [FLINK-22636]

2021-09-03 Thread Till Rohrmann
For patch versions the Flink community is very careful not to introduce
breaking changes. Hence, for patch releases it should be possible to
upgrade via failover. However, I don't think that this is properly guarded
by tests at the moment, and it is also not an official guarantee that we give.

Cheers,
Till

On Fri, Sep 3, 2021 at 9:40 AM Juha Mynttinen
 wrote:

> OK, thanks,
>
> I see, this savepointing and creating a new cluster is the documented [1]
> way of upgrading the Flink version. However, I think at least for some version
> upgrades it has been fine to just switch the code to the new version. I
> might be wrong.
>
> What about patch versions like 1.13.X? The doc says "the general way of
> upgrading Flink across version". Can there be breaking changes in patch
> versions too?
>
> Regards,
> Juha
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/upgrading/#upgrading-the-flink-framework-version
>
>
>
> On Fri, Sep 3, 2021 at 9:49 AM Till Rohrmann  wrote:
>
> > Hi Juha,
> >
> > Flink does not give backwards compatibility wrt its internal data
> > structures. The recommended way is to stop the jobs with a savepoint and
> > then resume these jobs on the new Flink cluster. Failing over the
> processes
> > with a new version is not guaranteed to work atm. I hope this answers
> your
> > question.
> >
> > Cheers,
> > Till
> >
> > On Fri, Sep 3, 2021 at 7:35 AM Juha Mynttinen
> >  wrote:
> >
> > > Hello,
> > >
> > > I noticed there's a change [1] coming up in Flink 1.14.0 in the
> ZooKeeper
> > > tree structure ZooKeeper HA services maintains.
> > >
> > > I didn't spot any migration logic from the old (< 1.14.0) structure to
> > the
> > > new. Did I miss something?
> > >
> > > If you have a Flink cluster running with 1.13.X and let's say add a
> > > JobManager with 1.14.0 and terminate the original so that the new one
> > > becomes the leader, how's the new one going to understand the data in
> > > ZooKeeper? There are naturally more cases where the same compatibility
> > > issue should be handled; this example should illustrate the issue.
> > >
> > > Regards,
> > > Juha
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-22636
> > >
> >
>
>
> --
> Regards,
> Juha
>


Re: Question about ZooKeeper HA new structure [FLINK-22636]

2021-09-02 Thread Till Rohrmann
Hi Juha,

Flink does not give backwards compatibility wrt its internal data
structures. The recommended way is to stop the jobs with a savepoint and
then resume these jobs on the new Flink cluster. Failing over the processes
with a new version is not guaranteed to work atm. I hope this answers your
question.

Cheers,
Till

On Fri, Sep 3, 2021 at 7:35 AM Juha Mynttinen
 wrote:

> Hello,
>
> I noticed there's a change [1] coming up in Flink 1.14.0 in the ZooKeeper
> tree structure ZooKeeper HA services maintains.
>
> I didn't spot any migration logic from the old (< 1.14.0) structure to the
> new. Did I miss something?
>
> If you have a Flink cluster running with 1.13.X and let's say add a
> JobManager with 1.14.0 and terminate the original so that the new one
> becomes the leader, how's the new one going to understand the data in
> ZooKeeper? There are naturally more cases where the same compatibility
> issue should be handled; this example should illustrate the issue.
>
> Regards,
> Juha
>
> [1] https://issues.apache.org/jira/browse/FLINK-22636
>


[jira] [Created] (FLINK-24133) PartitionRequestClientFactoryTest.testThrowsWhenNetworkFailure fails on Azure

2021-09-02 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-24133:
-

 Summary: 
PartitionRequestClientFactoryTest.testThrowsWhenNetworkFailure fails on Azure
 Key: FLINK-24133
 URL: https://issues.apache.org/jira/browse/FLINK-24133
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Network
Affects Versions: 1.13.2, 1.12.5, 1.14.0, 1.15.0
Reporter: Till Rohrmann
 Fix For: 1.14.0, 1.12.6, 1.13.3


The test {{PartitionRequestClientFactoryTest.testThrowsWhenNetworkFailure}} 
fails on Azure with

{code}
2021-09-01T17:01:08.2338015Z Sep 01 17:01:08 [ERROR] Tests run: 7, Failures: 1, 
Errors: 0, Skipped: 0, Time elapsed: 1.465 s <<< FAILURE! - in 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactoryTest
2021-09-01T17:01:08.2341002Z Sep 01 17:01:08 [ERROR] 
testThrowsWhenNetworkFailure  Time elapsed: 0.02 s  <<< FAILURE!
2021-09-01T17:01:08.2341956Z Sep 01 17:01:08 java.lang.AssertionError: Expected 
exception: 
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException
2021-09-01T17:01:08.2342854Z Sep 01 17:01:08at 
org.junit.internal.runners.statements.ExpectException.evaluate(ExpectException.java:34)
2021-09-01T17:01:08.2343408Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2021-09-01T17:01:08.2343916Z Sep 01 17:01:08at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
2021-09-01T17:01:08.2348096Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
2021-09-01T17:01:08.2483997Z Sep 01 17:01:08at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
2021-09-01T17:01:08.2484833Z Sep 01 17:01:08at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
2021-09-01T17:01:08.2485521Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
2021-09-01T17:01:08.2486189Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
2021-09-01T17:01:08.2486892Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
2021-09-01T17:01:08.2487565Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
2021-09-01T17:01:08.2488276Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
2021-09-01T17:01:08.2488999Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2021-09-01T17:01:08.2489818Z Sep 01 17:01:08at 
org.junit.runners.ParentRunner.run(ParentRunner.java:413)
2021-09-01T17:01:08.2490535Z Sep 01 17:01:08at 
org.junit.runner.JUnitCore.run(JUnitCore.java:137)
2021-09-01T17:01:08.2491159Z Sep 01 17:01:08at 
org.junit.runner.JUnitCore.run(JUnitCore.java:115)
2021-09-01T17:01:08.2491860Z Sep 01 17:01:08at 
org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
2021-09-01T17:01:08.2492760Z Sep 01 17:01:08at 
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
2021-09-01T17:01:08.2493506Z Sep 01 17:01:08at 
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
2021-09-01T17:01:08.2494192Z Sep 01 17:01:08at 
java.util.Iterator.forEachRemaining(Iterator.java:116)
2021-09-01T17:01:08.2494910Z Sep 01 17:01:08at 
java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
2021-09-01T17:01:08.2495889Z Sep 01 17:01:08at 
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
2021-09-01T17:01:08.2496648Z Sep 01 17:01:08at 
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
2021-09-01T17:01:08.2497416Z Sep 01 17:01:08at 
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
2021-09-01T17:01:08.2498201Z Sep 01 17:01:08at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
2021-09-01T17:01:08.2498950Z Sep 01 17:01:08at 
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
2021-09-01T17:01:08.2499767Z Sep 01 17:01:08at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
2021-09-01T17:01:08.2500567Z Sep 01 17:01:08at 
org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
2021-09-01T17:01:08.2501381Z Sep 01 17:01:08at 
org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
2021-09-01T17:01:08.2502189Z Sep 01 17:01:08at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
2021-09-01T17:01:08.2503111Z Sep 01 17:01:08at 
org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
2021-09-01T17:01:08.2503992Z Sep 01 17:01:08at 
org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(De

[jira] [Created] (FLINK-24129) TopicRangeTest.rangeCreationHaveALimitedScope fails on Azure

2021-09-02 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-24129:
-

 Summary: TopicRangeTest.rangeCreationHaveALimitedScope fails on 
Azure
 Key: FLINK-24129
 URL: https://issues.apache.org/jira/browse/FLINK-24129
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Pulsar
Affects Versions: 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0


The test {{TopicRangeTest.rangeCreationHaveALimitedScope}} fails on Azure with

{code}

Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:126)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:84)
Sep 02 12:41:55 at java.util.ArrayList.forEach(ArrayList.java:1259)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.invokeAll(SameThreadHierarchicalTestExecutorService.java:38)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$5(NodeTestTask.java:143)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$7(NodeTestTask.java:129)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.Node.around(Node.java:137)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.lambda$executeRecursively$8(NodeTestTask.java:127)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.ThrowableCollector.execute(ThrowableCollector.java:73)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.executeRecursively(NodeTestTask.java:126)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.NodeTestTask.execute(NodeTestTask.java:84)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.SameThreadHierarchicalTestExecutorService.submit(SameThreadHierarchicalTestExecutorService.java:32)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.HierarchicalTestExecutor.execute(HierarchicalTestExecutor.java:57)
Sep 02 12:41:55 at 
org.junit.platform.engine.support.hierarchical.HierarchicalTestEngine.execute(HierarchicalTestEngine.java:51)
Sep 02 12:41:55 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
Sep 02 12:41:55 at 
org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
Sep 02 12:41:55 at 
org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
Sep 02 12:41:55 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
Sep 02 12:41:55 at 
org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
Sep 02 12:41:55 at 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150)
Sep 02 12:41:55 at 
org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:124)
Sep 02 12:41:55 at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
Sep 02 12:41:55 at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
Sep 02 12:41:55 at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
Sep 02 12:41:55 at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=23392&view=logs&j=fc5181b0-e452-5c8f-68de-1097947f6483&t=995c650b-6573-581c-9ce6-7ad4cc038461&l=24826





Re: [DISCUSS] Automated architectural tests

2021-09-02 Thread Till Rohrmann
If it is possible to automate these kinds of checks, then I am all for it
because everything else breaks eventually. So +1 for this proposal.

I don't have experience with what tools are available, though.

I would like to add a rule that every test class extends TestLogger, directly
or indirectly, because otherwise it is super hard to make sense of
the test logs (Arvid will probably chime in stating that this will be
solved with JUnit 5 eventually).

Not sure whether this is possible or not but if we can check that all
interfaces have properly defined JavaDocs on each method, then this could
also be helpful in my opinion.

Cheers,
Till

On Thu, Sep 2, 2021 at 11:16 AM Timo Walther  wrote:

> Hi Ingo,
>
> thanks for starting this discussion. Having more automation is
> definitely desirable. Esp. in the API / SDK areas where we frequently
> have to add similar comments to PRs. The more checks the better. We
> definitely also need more guidelines (e.g. how to develop a Flink
> connector) but automation is safer than long checklists that might be
> out of date quickly.
>
> +1 to the proposal. I don't have an opinion on the tool though.
>
> Regards,
> Timo
>
>
> On 01.09.21 11:03, Ingo Bürk wrote:
> > Hello everyone,
> >
> > I would like to start a discussion on introducing automated tests for
> more
> > architectural rather than stylistic topics. For example, here are a few
> > things that seem worth checking to me (this is Table-API-focused since it
> > is the subsystem I'm involved in):
> >
> > (a) All classes in o.a.f.table.api should be annotated with one
> > of @Internal, @PublicEvolving, or @Public.
> > (b) Classes whose name ends in *ConnectorOptions should be located in
> > o.a.f.connector.*.table
> > (c) Classes implementing DynamicSourceFactory / DynamicSinkFactory should
> > have no static members of type ConfigOption
> >
> > There are probably significantly more cases worth checking, and also more
> > involved ones (these are rather simple examples), like disallowing access
> > between certain packages etc. There are two questions I would like to ask
> > to the community:
> >
> > (1) Do you think such tests are useful in general?
> > (2) What use cases come to mind for you?
> >
> > If the idea finds consensus, I would like to use (2) to investigate which
> > tooling to use. An obvious candidate is Checkstyle, as this is already
> > used. It also has the advantage of being well integrated in the IDE.
> > However, it is limited to looking at single files only, and custom checks
> > are pretty complicated and involved to implement[1]. Another possible
> tool
> > is ArchUnit[2], which would be significantly easier to maintain and is
> more
> > powerful, but in turn requires tests to be executed. If you have further
> > suggestions (or thoughts) they would of course also be quite welcome,
> > though for now I would focus on (1) and (2) and go from there to
> evaluate.
> >
> > [1] https://checkstyle.sourceforge.io/writingchecks.html
> > [2] https://www.archunit.org/
> >
> >
> > Best
> > Ingo
> >
>
>
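
For readers who have not used ArchUnit [2]: a minimal sketch of what rules (a)
and (b) from Ingo's list, plus Till's TestLogger rule, could look like. The
Flink annotation and package names are taken from the thread; the rule
definitions themselves are an illustrative assumption, not an existing Flink
test suite.

{code}
import com.tngtech.archunit.core.domain.JavaClasses;
import com.tngtech.archunit.core.importer.ClassFileImporter;
import com.tngtech.archunit.lang.ArchRule;

import static com.tngtech.archunit.lang.syntax.ArchRuleDefinition.classes;

public class ArchitectureRulesSketch {

    // Rule (a): every class in o.a.f.table.api carries an API stability annotation.
    static final ArchRule API_CLASSES_ARE_ANNOTATED =
            classes()
                    .that().resideInAPackage("org.apache.flink.table.api..")
                    .should().beAnnotatedWith("org.apache.flink.annotation.Internal")
                    .orShould().beAnnotatedWith("org.apache.flink.annotation.PublicEvolving")
                    .orShould().beAnnotatedWith("org.apache.flink.annotation.Public");

    // Rule (b): *ConnectorOptions classes live under o.a.f.connector.*.table.
    static final ArchRule CONNECTOR_OPTIONS_LOCATION =
            classes()
                    .that().haveSimpleNameEndingWith("ConnectorOptions")
                    .should().resideInAPackage("org.apache.flink.connector..table");

    // Till's rule: every test class extends TestLogger, directly or indirectly.
    static final ArchRule TESTS_EXTEND_TEST_LOGGER =
            classes()
                    .that().haveSimpleNameEndingWith("Test")
                    .should().beAssignableTo("org.apache.flink.util.TestLogger");

    public static void main(String[] args) {
        // Import once, then check all rules against the whole class graph.
        JavaClasses flinkClasses =
                new ClassFileImporter().importPackages("org.apache.flink");
        API_CLASSES_ARE_ANNOTATED.check(flinkClasses);
        CONNECTOR_OPTIONS_LOCATION.check(flinkClasses);
        TESTS_EXTEND_TEST_LOGGER.check(flinkClasses);
    }
}
{code}

Unlike a Checkstyle check, such rules see the whole imported class graph at
once, which is what makes cross-file constraints like (b) straightforward.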


Re: Do we still maintain travis build system?

2021-09-01 Thread Till Rohrmann
Alright, then let's do it after the release.

Cheers,
Till

On Wed, Sep 1, 2021 at 11:09 AM Chesnay Schepler  wrote:

> Yes we could remove it, but I would have to update the CiBot source and
> redeploy the whole thing, which I don't wanna do during the release
> testing period.
>
> On 30/08/2021 10:42, Till Rohrmann wrote:
> > I think Flink 1.10.x used Travis. So I agree with Tison's proposal. +1
> for
> > removing the "@flinkbot run travis" from the command documentation.
> >
> > cc @Chesnay Schepler 
> >
> > Cheers,
> > Till
> >
> > On Sun, Aug 29, 2021 at 4:48 AM tison  wrote:
> >
> >> Hi,
> >>
> >> I can still see "@flinkbot run travis" in flinkbot's toast, but it seems
> >> we have already migrated to Azure Pipelines and this command has become
> >> invalid? If so, let's omit it from the toast.
> >>
> >> Best,
> >> tison.
> >>
>
>


Re: [ANNOUNCE] Apache Flink Stateful Functions 3.1.0 released

2021-09-01 Thread Till Rohrmann
Great news! Thanks a lot for all your work on the new release :-)

Cheers,
Till

On Wed, Sep 1, 2021 at 9:07 AM Johannes Moser  wrote:

> Congratulations, great job. 🎉
>
> On 31.08.2021, at 17:09, Igal Shilman  wrote:
>
> The Apache Flink community is very happy to announce the release of Apache
> Flink Stateful Functions (StateFun) 3.1.0.
>
> StateFun is a cross-platform stack for building Stateful Serverless
> applications, making it radically simpler to develop scalable, consistent,
> and elastic distributed applications.
>
> Please check out the release blog post for an overview of the release:
> https://flink.apache.org/news/2021/08/31/release-statefun-3.1.0.html
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Maven artifacts for StateFun can be found at:
> https://search.maven.org/search?q=g:org.apache.flink%20statefun
>
> Python SDK for StateFun published to the PyPI index can be found at:
> https://pypi.org/project/apache-flink-statefun/
>
> Official Docker images for StateFun are published to Docker Hub:
> https://hub.docker.com/r/apache/flink-statefun
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12350038&projectId=12315522
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Thanks,
> Igal
>
>
>


Re: Do we still maintain travis build system?

2021-08-30 Thread Till Rohrmann
I think Flink 1.10.x used Travis. So I agree with Tison's proposal. +1 for
removing the "@flinkbot run travis" from the command documentation.

cc @Chesnay Schepler 

Cheers,
Till

On Sun, Aug 29, 2021 at 4:48 AM tison  wrote:

> Hi,
>
> I can still see "@flinkbot run travis" in flinkbot's toast, but it seems we
> have already migrated to Azure Pipelines and this command has become
> invalid? If so, let's omit it from the toast.
>
> Best,
> tison.
>


[jira] [Created] (FLINK-24038) DispatcherResourceManagerComponent fails to deregister application if no leading ResourceManager

2021-08-28 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-24038:
-

 Summary: DispatcherResourceManagerComponent fails to deregister 
application if no leading ResourceManager
 Key: FLINK-24038
 URL: https://issues.apache.org/jira/browse/FLINK-24038
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0


With FLINK-21667 we introduced a change that can cause the 
{{DispatcherResourceManagerComponent}} to fail when trying to stop the 
application. The problem is that the {{DispatcherResourceManagerComponent}} 
needs a leading {{ResourceManager}} to successfully execute the stop/deregister 
application call. If this is not the case, then it will fail fatally. In the 
case of multiple standby JobManager processes it can happen that the leading 
{{ResourceManager}} runs somewhere else.

I do see two possible solutions:

1. Run the leader election process for the whole JobManager process
2. Move the registration/deregistration of the application out of the 
{{ResourceManager}} so that it can be executed w/o a leader





[jira] [Created] (FLINK-24015) Dispatcher does not log JobMaster initialization error on info level

2021-08-27 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-24015:
-

 Summary: Dispatcher does not log JobMaster initialization error on 
info level
 Key: FLINK-24015
 URL: https://issues.apache.org/jira/browse/FLINK-24015
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.2, 1.14.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.14.0, 1.13.3


The {{Dispatcher}} does not log JobMaster initialization errors. This can make 
it very hard to understand why a job has failed if the client does not receive 
the {{JobResult}}. Therefore, I propose to log the failure cause for a job when 
it finishes on the {{Dispatcher}}.





Re: 【Announce】Zeppelin 0.10.0 is released, Flink on Zeppelin Improved

2021-08-26 Thread Till Rohrmann
Cool, thanks for letting us know Jeff. Hopefully, many users use Zeppelin
together with Flink.

Cheers,
Till

On Thu, Aug 26, 2021 at 4:47 AM Leonard Xu  wrote:

> Thanks Jeff for the great work !
>
> Best,
> Leonard
>
> 在 2021年8月25日,22:48,Jeff Zhang  写道:
>
> Hi Flink users,
>
> We (the Zeppelin community) are very excited to announce Zeppelin 0.10.0 is
> officially released. In this version, we made several improvements to the Flink
> interpreter. Here are the main features of Flink on Zeppelin:
>
> - Support multiple versions of Flink
> - Support multiple versions of Scala
> - Support multiple languages
> - Support multiple execution modes
> - Support Hive
> - Interactive development
> - Enhancement on Flink SQL
> - Multi-tenancy
> - Rest API Support
>
> Take a look at this document for more details:
> https://zeppelin.apache.org/docs/0.10.0/interpreter/flink.html
> The quickest way to try Flink on Zeppelin is via its docker image
> https://zeppelin.apache.org/docs/0.10.0/interpreter/flink.html#play-flink-in-zeppelin-docker
>
> Besides these, here’s one blog about how to run the Flink SQL Cookbook on
> Zeppelin:
> https://medium.com/analytics-vidhya/learn-flink-sql-the-easy-way-d9d48a95ae57
> The easy way to learn Flink SQL.
>
> Hope it would be helpful for you and welcome to join our community to
> discuss with others. http://zeppelin.apache.org/community.html
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>
>


[jira] [Created] (FLINK-23965) E2E do not execute locally on MacOS

2021-08-25 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23965:
-

 Summary: E2E do not execute locally on MacOS
 Key: FLINK-23965
 URL: https://issues.apache.org/jira/browse/FLINK-23965
 Project: Flink
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.13.2, 1.14.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.14.0, 1.13.3


After FLINK-21346, the e2e tests no longer execute locally on macOS. The 
problem seems to be that the e2e tests configure a log directory that does not 
exist, which makes starting a Flink cluster fail.

I suggest changing the directory back to the old {{FLINK_DIR/log}} 
instead of {{FLINK_DIR/logs}}.





Re: Flink 1.14 Weekly 2021-08-24

2021-08-25 Thread Till Rohrmann
Thanks for the update Joe!

Cheers,
Till

On Tue, Aug 24, 2021 at 3:30 PM Johannes Moser  wrote:

> Dear Apache Flink community,
>
> Today we moved from the two-week sync rhythm to a one-week one, as we
> entered the stabilisation phase last week.
>
> *Status check:*
> All features got the state of "done" or "done done". There have been two
> features without a cross-team testing ticket, but those tickets will be
> clarified or created right now.
> We hope to keep the push towards "done done" up over the next couple of
> weeks.
>
> *Cross team testing*
> There has been some confusion around the cross team testing efforts.
> That's why we added the paragraph "Cross team testing" to the "Creating a
> Flink Release" page [1].
> We ask every community member to have a look at the board [2] and pick
> some of the testing issues.
>
> *Documentation*
> For the cross-team testing effort, documentation is required. That's why we
> are currently struggling a bit with this effort. We urge the authors of
> features to document their features as soon as possible.
> An overview of all outstanding documentation issues can be found on this
> board [3]
>
> *Blockers and critical issues*
> It looks like we got all blockers [4] assigned, and a big share of them are
> already in progress. Assignees are confident they can be resolved
> within this week. We also made good progress on critical issues [5]. See
> [6] for a detailed overview of blockers and critical issues.
>
> *Release candidate*
> We plan to cut the branch for 1.14 and create the first release
> candidate this weekend.
>
> *What can you contribute to this release?*
> - *** Complete feature documentation as soon as possible ***
> - Pick-up cross-team testing tickets
> - Test RC0 that will be created along with the 1.14 branch this weekend
>
> Let's make this happen,
> Xintong, Dawid, and Joe
>
> -
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release
> [2]
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=468&quickFilter=2115
> [3]
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=468&quickFilter=2176
> [4]
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=468&quickFilter=2120
> [5]
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=468&quickFilter=2121
> [6]
> https://cwiki.apache.org/confluence/display/FLINK/1.14+Release?focusedCommentId=186878532#comment-186878532


[jira] [Created] (FLINK-23947) DefaultDispatcherRunner does not log when the Dispatcher gains and loses leadership

2021-08-24 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23947:
-

 Summary: DefaultDispatcherRunner does not log when the Dispatcher 
gains and loses leadership
 Key: FLINK-23947
 URL: https://issues.apache.org/jira/browse/FLINK-23947
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.13.2, 1.12.5, 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0, 1.12.6, 1.13.3


The {{DefaultDispatcherRunner}} does not log when it gains and loses 
leadership. This can make it hard to understand the behaviour from Flink's logs 
because a {{*DispatcherLeaderProcess}} can suddenly be stopped without any 
indication why. I suggest improving the logging in the 
{{DefaultDispatcherRunner}} to make the logs easier to understand.
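
A sketch of what the proposed log lines could look like, hung on the
leader-election callbacks that the {{DefaultDispatcherRunner}} implements. The
wrapper class and the exact messages are illustrative, not merged code.

{code}
import java.util.UUID;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative only; the real class is Flink's DefaultDispatcherRunner. */
class LeadershipLoggingSketch {

    private static final Logger LOG =
            LoggerFactory.getLogger(LeadershipLoggingSketch.class);

    public void grantLeadership(UUID leaderSessionID) {
        LOG.info(
                "DefaultDispatcherRunner was granted leadership with leader session id {}. "
                        + "Creating a new DispatcherLeaderProcess.",
                leaderSessionID);
        // ... start the new DispatcherLeaderProcess here ...
    }

    public void revokeLeadership() {
        LOG.info(
                "DefaultDispatcherRunner was revoked leadership. "
                        + "Stopping the current DispatcherLeaderProcess.");
        // ... stop the current DispatcherLeaderProcess here ...
    }
}
{code}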





[jira] [Created] (FLINK-23946) Application mode fails fatally when being shut down

2021-08-24 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23946:
-

 Summary: Application mode fails fatally when being shut down
 Key: FLINK-23946
 URL: https://issues.apache.org/jira/browse/FLINK-23946
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.13.2, 1.12.5, 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0, 1.12.6, 1.13.3


The application mode fails fatally when being shut down (e.g. if the 
{{Dispatcher}} loses its leadership). The problem seems to be that the 
{{ApplicationDispatcherBootstrap}} cancels the {{applicationExecutionTask}} and 
{{applicationCompletionFuture}}, which can trigger the fatal exception handler 
from within the completion handler of the {{applicationCompletionFuture}}.





Re: Scala 2.12 End to End Tests

2021-08-24 Thread Till Rohrmann
Great, thanks Danny.

On Mon, Aug 23, 2021 at 9:41 PM Danny Cranmer 
wrote:

> Thanks Till!
>
> +1 making Scala 2.12 the default. I will raise a Jira tomorrow, not at the
> laptop today.
>
> On Mon, 23 Aug 2021, 09:43 Martijn Visser,  wrote:
>
> > +1 to switch to 2.12 by default
> >
> > On Sat, 21 Aug 2021 at 13:43, Chesnay Schepler 
> wrote:
> >
> > > I'm with Till that we should switch to 2.12 by default .
> > >
> > > On 21/08/2021 11:12, Till Rohrmann wrote:
> > > > Hi Danny,
> > > >
> > > > I think in the nightly builds we do run the e2e with Scala 2.12 [1].
> > The
> > > > way it is configured is via the PROFILE env variable if I am not
> > > mistaken.
> > > >
> > > > Independent of this we might wanna start a discussion whether we
> don't
> > > want
> > > > to switch to Scala 2.12 per default.
> > > >
> > > > [1]
> > > >
> > >
> >
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22579&view=logs&j=08866332-78f7-59e4-4f7e-49a56faa3179
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 20, 2021 at 10:59 PM Danny Cranmer <
> > dannycran...@apache.org>
> > > > wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> I am working on reviewing a PR [1] to add JSON support for the AWS
> > Glue
> > > >> Schema Registry integration. The module only builds for Scala 2.12
> due
> > > to a
> > > >> dependency incompatibility. The end-to-end tests have been
> implemented
> > > >> using the newer Java framework rather than bash scripts. However, it
> > > >> appears as though end-to-end tests are only run for Scala 2.11,
> > > therefore
> > > >> the new tests are not actually being exercised. Questions:
> > > >> - Please correct me if I am wrong here, E2E java tests are only run
> > for
> > > >> Scala 2.11 [2]?
> > > >> - Is there a reason we are not running tests for Scala 2.11 AND
> Scala
> > > 2.12
> > > >> - How do we proceed? I would be inclined towards number 1, unless there
> > is
> > > a
> > > >> good reason not to, besides increase in time (they already take a
> long
> > > >> time)
> > > >>1. Enable ALL Scala 2.12 tests?
> > > >>2. Just run the new tests with Scala 2.12?
> > > >>3. Do not run the new tests
> > > >>
> > > >> [1] https://github.com/apache/flink/pull/16513
> > > >> [2]
> > > >>
> > > >>
> > >
> >
> https://github.com/apache/flink/blob/master/flink-end-to-end-tests/run-nightly-tests.sh#L264
> > > >>
> > > >> Thanks,
> > > >> Danny Cranmer.
> > > >>
> > >
> > >
> >
>


[jira] [Created] (FLINK-23932) KafkaTableITCase.testKafkaSourceSinkWithKeyAndPartialValue hangs on Azure

2021-08-23 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23932:
-

 Summary: 
KafkaTableITCase.testKafkaSourceSinkWithKeyAndPartialValue hangs on Azure
 Key: FLINK-23932
 URL: https://issues.apache.org/jira/browse/FLINK-23932
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka, Table SQL / Ecosystem
Affects Versions: 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0


The test {{KafkaTableITCase.testKafkaSourceSinkWithKeyAndPartialValue}} hangs 
on Azure. Interestingly, the test case seems to spawn 400 
{{kafka-admin-client-thread | adminclient}} threads. I think there is something 
wrong with the test case setup.

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22674&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=15a22db7-8faa-5b34-3920-d33c9f0ca23c&l=22875





Re: Projection pushdown for metadata columns

2021-08-23 Thread Till Rohrmann
Does this also affect Flink 1.14.0? If yes, do we want to fix this issue
for the upcoming release? If yes, then please make this issue a blocker or
at least critical.

Cheers,
Till

On Mon, Aug 23, 2021 at 8:39 AM Ingo Bürk  wrote:

> Thanks Timo for the confirmation. I've also raised FLINK-23911[1] for this.
>
> [1] https://issues.apache.org/jira/browse/FLINK-23911
>
>
> Best
> Ingo
>
> On Mon, Aug 23, 2021 at 8:34 AM Timo Walther  wrote:
>
> > Hi everyone,
> >
> > this sounds definitely like a bug to me. Computing metadata might be
> > very expensive and a connector might expose a long list of metadata
> > keys. It was therefore intended to project the metadata if possible. I'm
> > pretty sure that this worked before (at least when implementing
> > SupportsProjectionPushDown). Maybe a bug was introduced when adding the
> > Spec support.
> >
> > Regards,
> > Timo
> >
> >
> > On 23.08.21 08:24, Ingo Bürk wrote:
> > > Hi Jingsong,
> > >
> > > thanks for your answer. Even if the source implements
> > > SupportsProjectionPushDown, #applyProjection will never be called with
> > > projections for metadata columns. For example, I have the following
> test:
> > >
> > > @Test
> > > def test(): Unit = {
> > >val tableId = TestValuesTableFactory.registerData(Seq())
> > >
> > >tEnv.createTemporaryTable("T",
> TableDescriptor.forConnector("values")
> > >  .schema(Schema.newBuilder()
> > >.column("f0", DataTypes.INT())
> > >.columnByMetadata("m1", DataTypes.STRING())
> > >.columnByMetadata("m2", DataTypes.STRING())
> > >.build())
> > >  .option("data-id", tableId)
> > >  .option("bounded", "true")
> > >  .option("readable-metadata", "m1:STRING,m2:STRING")
> > >  .build())
> > >
> > >tEnv.sqlQuery("SELECT f0, m1 FROM T").execute().collect().toList
> > > }
> > >
> > > Regardless of whether I select only f0 or f0 + m1,
> #applyReadableMetadata
> > > is always called with m1 + m2, and #applyProjection only ever sees f0.
> > So
> > > as far as I can tell, the source has no way of knowing which metadata
> > > columns are actually needed (under the projection), it always has to
> > > produce metadata for all metadata columns declared in the table's
> schema.
> > >
> > > In PushProjectIntoTableSourceScanRule I also haven't yet found anything
> > > that would suggest that metadata are first projected and only then
> pushed
> > > to the source. I think the correct behavior should be to call
> > > #applyReadableMetadata only after they have been considered in the
> > > projection.
> > >
> > >
> > > Best
> > > Ingo
> > >
> > >
> > > On Mon, Aug 23, 2021 at 5:05 AM Jingsong Li 
> > wrote:
> > >
> > >> Hi,
> > >>
> > >> I remember the projection only works with SupportsProjectionPushDown.
> > >>
> > >> You can take a look at
> > >> `PushProjectIntoTableSourceScanRuleTest.testNestProjectWithMetadata`.
> > >>
> > >> applyReadableMetadata will be applied again in the
> > PushProjectIntoTableSourceScanRule.
> > >>
> > >> But there may be a bug in
> > >> PushProjectIntoTableSourceScanRule.applyPhysicalAndMetadataPushDown:
> > >>
> > >> if (!usedMetadataNames.isEmpty()) {
> > >>  sourceAbilitySpecs.add(new ReadingMetadataSpec(usedMetadataNames,
> > >> newProducedType));
> > >> }
> > >>
> > >> If there is no meta column left, we should apply again: we should tell
> > >> the source that there are no meta columns left after projection.
> > >>
> > >> Best,
> > >> Jingsong
> > >>
> > >> On Fri, Aug 20, 2021 at 7:56 PM Ingo Bürk  wrote:
> > >>>
> > >>> Hi everyone,
> > >>>
> > >>> according to the SupportsReadableMetadata interface, the planner is
> > >>> supposed to project required metadata columns prior to applying them:
> > >>>
> >  The planner will select required metadata columns (i.e. perform
> > >>> projection push down) and will call applyReadableMetadata(List,
> > DataType)
> > >>> with a list of metadata keys.
> > >>>
> > >>> However, from my experiments it seems that this is not true:
> regardless
> > >> of
> > >>> what columns I select from a table, #applyReadableMetadata always
> seems
> > >> to
> > >>> be called with all metadata declared in the schema of the table.
> > Metadata
> > >>> columns are also excluded from
> > >> SupportsProjectionPushDown#applyProjection,
> > >>> so the source cannot perform the projection either.
> > >>>
> > >>> This is in Flink 1.13.2. Am I misreading the docs here or is this not
> > >>> working as intended?
> > >>>
> > >>>
> > >>> Best
> > >>> Ingo
> > >>
> > >>
> > >>
> > >> --
> > >> Best, Jingsong Lee
> > >>
> > >
> >
> >
>
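
For orientation, a condensed sketch of how a 1.13-style source implements the
two ability interfaces under discussion. The connector class is hypothetical;
the comments mark where the contract from the interface docs and the behaviour
reported in this thread diverge.

{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.connector.ChangelogMode;
import org.apache.flink.table.connector.source.DynamicTableSource;
import org.apache.flink.table.connector.source.ScanTableSource;
import org.apache.flink.table.connector.source.abilities.SupportsProjectionPushDown;
import org.apache.flink.table.connector.source.abilities.SupportsReadableMetadata;
import org.apache.flink.table.types.DataType;

/** Hypothetical source illustrating the planner contract discussed above. */
public final class MetadataAwareSource
        implements ScanTableSource, SupportsReadableMetadata, SupportsProjectionPushDown {

    private List<String> requestedMetadataKeys = Collections.emptyList();
    private int[][] projectedPhysicalFields = new int[0][];

    @Override
    public Map<String, DataType> listReadableMetadata() {
        // Metadata this source could compute; the planner selects from these keys.
        Map<String, DataType> metadata = new HashMap<>();
        metadata.put("m1", DataTypes.STRING());
        metadata.put("m2", DataTypes.STRING());
        return metadata;
    }

    @Override
    public void applyReadableMetadata(List<String> metadataKeys, DataType producedDataType) {
        // Per the interface docs, only the keys surviving projection should arrive
        // here; the thread reports that all declared metadata columns are passed.
        this.requestedMetadataKeys = metadataKeys;
    }

    @Override
    public boolean supportsNestedProjection() {
        return false;
    }

    @Override
    public void applyProjection(int[][] projectedFields) {
        // Physical columns only; metadata columns never show up in this call.
        this.projectedPhysicalFields = projectedFields;
    }

    @Override
    public ChangelogMode getChangelogMode() {
        return ChangelogMode.insertOnly();
    }

    @Override
    public ScanRuntimeProvider getScanRuntimeProvider(ScanContext runtimeProviderContext) {
        throw new UnsupportedOperationException("Omitted in this sketch.");
    }

    @Override
    public DynamicTableSource copy() {
        return new MetadataAwareSource();
    }

    @Override
    public String asSummaryString() {
        return "MetadataAwareSource";
    }
}
{code}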


[jira] [Created] (FLINK-23916) Add documentation for how to configure the CheckpointFailureManager

2021-08-23 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23916:
-

 Summary: Add documentation for how to configure the 
CheckpointFailureManager
 Key: FLINK-23916
 URL: https://issues.apache.org/jira/browse/FLINK-23916
 Project: Flink
  Issue Type: Improvement
  Components: Documentation, Runtime / Checkpointing
Affects Versions: 1.13.2, 1.12.5, 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0


With FLINK-12364 we introduced the {{CheckpointFailureManager}} that allows us 
to configure how to handle checkpoint failures. In order to properly use this 
feature, I think we should extend our documentation for how to configure it. 
Ideally, it would be added to this page: 
https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/
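
For reference while the documentation is being written, the user-facing knob
sits on the {{CheckpointConfig}}. A minimal sketch; the interval and threshold
values are arbitrary:

{code}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerableCheckpointFailuresExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000L); // checkpoint every 10 seconds

        // The CheckpointFailureManager fails the job once more than this many
        // consecutive checkpoint failures occur; with the default of 0, the
        // first checkpoint failure already fails the job.
        env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
    }
}
{code}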





[jira] [Created] (FLINK-23906) Increase akka.ask.timeout for tests using the MiniCluster

2021-08-21 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23906:
-

 Summary: Increase akka.ask.timeout for tests using the MiniCluster
 Key: FLINK-23906
 URL: https://issues.apache.org/jira/browse/FLINK-23906
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination, Tests
Affects Versions: 1.13.2, 1.12.5, 1.14.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.14.0, 1.12.6, 1.13.3


We have seen over the last couple of weeks/months an increased number of test 
failures caused by {{TimeoutException}}s that were triggered because the 
{{akka.ask.timeout}} was exceeded. The reason is that on our CI 
infrastructure there can be pauses of more than 10s (the exact cause is 
unclear).

In order to harden all tests relying on the {{MiniCluster}}, I propose 
increasing the {{akka.ask.timeout}} to a minute if nothing else has been 
configured.
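
A sketch of how a test can set the timeout explicitly today; the test class is
hypothetical, but {{MiniClusterWithClientResource}} from flink-test-utils and
{{AkkaOptions.ASK_TIMEOUT}} are the pieces whose default the proposal would
change.

{code}
import org.apache.flink.configuration.AkkaOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
import org.apache.flink.test.util.MiniClusterWithClientResource;
import org.junit.ClassRule;

public class MyMiniClusterTest {

    private static Configuration createConfiguration() {
        Configuration configuration = new Configuration();
        // Proposed hardening: a generous ask timeout unless the test overrides it.
        configuration.set(AkkaOptions.ASK_TIMEOUT, "60 s");
        return configuration;
    }

    @ClassRule
    public static final MiniClusterWithClientResource MINI_CLUSTER =
            new MiniClusterWithClientResource(
                    new MiniClusterResourceConfiguration.Builder()
                            .setConfiguration(createConfiguration())
                            .setNumberTaskManagers(1)
                            .setNumberSlotsPerTaskManager(1)
                            .build());
}
{code}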





Re: Scala 2.12 End to End Tests

2021-08-21 Thread Till Rohrmann
Hi Danny,

I think in the nightly builds we do run the e2e with Scala 2.12 [1]. The
way it is configured is via the PROFILE env variable if I am not mistaken.

Independent of this, we might want to start a discussion on whether we should
switch to Scala 2.12 per default.

[1]
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22579&view=logs&j=08866332-78f7-59e4-4f7e-49a56faa3179

Cheers,
Till

On Fri, Aug 20, 2021 at 10:59 PM Danny Cranmer 
wrote:

> Hello,
>
> I am working on reviewing a PR [1] to add JSON support for the AWS Glue
> Schema Registry integration. The module only builds for Scala 2.12 due to a
> dependency incompatibility. The end-to-end tests have been implemented
> using the newer Java framework rather than bash scripts. However, it
> appears as though end-to-end tests are only run for Scala 2.11, therefore
> the new tests are not actually being exercised. Questions:
> - Please correct me if I am wrong here, E2E java tests are only run for
> Scala 2.11 [2]?
> - Is there a reason we are not running tests for Scala 2.11 AND Scala 2.12?
> - How do we proceed? I would be inclined towards number 1, unless there is a
> good reason not to, besides the increase in time (they already take a long
> time)
>   1. Enable ALL Scala 2.12 tests?
>   2. Just run the new tests with Scala 2.12?
>   3. Do not run the new tests
>
> [1] https://github.com/apache/flink/pull/16513
> [2]
>
> https://github.com/apache/flink/blob/master/flink-end-to-end-tests/run-nightly-tests.sh#L264
>
> Thanks,
> Danny Cranmer.
>


Re: [DISCUSS] FLIP-171: Async Sink

2021-08-20 Thread Till Rohrmann
Thanks for the update Steffen. I'll try to take a look at it asap.

Cheers,
Till

On Fri, Aug 20, 2021 at 1:34 PM Hausmann, Steffen  wrote:

> Hi Till,
>
> I've updated the wiki page as per the discussion on flip-177. I hope it
> makes more sense now.
>
> Cheers, Steffen
>
> On 16.07.21, 18:28, "Till Rohrmann"  wrote:
>
>
> Sure, thanks for the pointers.
>
> Cheers,
> Till
>
> On Fri, Jul 16, 2021 at 6:19 PM Hausmann, Steffen
> 
> wrote:
>
> > Hi Till,
> >
> > You are right, I’ve left out some implementation details, which have
> > actually changed a couple of time as part of the ongoing discussion.
> You
> > can find our current prototype here [1] and a sample implementation
> of the
> > KPL free Kinesis sink here [2].
> >
> > I plan to update the FLIP. But I think would it be make sense to wait
> > until the implementation has stabilized enough before we update the
> FLIP to
> > the final state.
> >
> > Does that make sense?
> >
> > Cheers, Steffen
> >
> > [1]
> >
> https://github.com/sthm/flink/tree/flip-171-177/flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/sink
> > [2]
>     >
> https://github.com/sthm/flink/blob/flip-171-177/flink-connectors/flink-connector-kinesis-171/src/main/java/software/amazon/flink/connectors/AmazonKinesisDataStreamSink.java
> >
> > From: Till Rohrmann 
> > Date: Friday, 16. July 2021 at 18:10
> > To: Piotr Nowojski 
> > Cc: Steffen Hausmann , "dev@flink.apache.org" <
> > dev@flink.apache.org>, Arvid Heise 
> > Subject: RE: [EXTERNAL] [DISCUSS] FLIP-171: Async Sink
> >
> >
> >
> >
> > Hi Steffen,
> >
> > I've taken another look at the FLIP and I stumbled across a couple of
> > inconsistencies. I think it is mainly because of the lacking code.
> For
> > example, it is not fully clear to me based on the current FLIP how we
> > ensure that there are no in-flight requests when
> > AsyncSinkWriter.snapshotState is called. Also the concrete
> implementation
> > of the AsyncSinkCommitter could be helpful for understanding how the
> > AsyncSinkWriter works in the end. Do you plan to update the FLIP
> > accordingly?
> >
> > Cheers,
> > Till
> >
> > On Wed, Jun 30, 2021 at 8:36 AM Piotr Nowojski <pnowoj...@apache.org> wrote:
> > Thanks for addressing this issue :)
> >
> > Best, Piotrek
> >
> > On Tue, 29 Jun 2021 at 17:58, Hausmann, Steffen <shau...@amazon.de> wrote:
> > Hey Piotr,
> >
> > I've just adapted the FLIP and changed the signature for the
> > `submitRequestEntries` method:
> >
> > protected abstract void submitRequestEntries(List<RequestEntryT>
> > requestEntries, ResultFuture<RequestEntryT> requestResult);
> >
> > In addition, we are likely to use an AtomicLong to track the number
> of
> > outstanding requests, as you have proposed in 2b). I've already
> indicated
> > this in the FLIP, but it's not fully fleshed out. But as you have
> said,
> > that seems to be an implementation detail and the important part is
> the
> > change of the `submitRequestEntries` signature.
> >
> > Thanks for your feedback!
> >
> > Cheers, Steffen
> >
> >
> > On 25.06.21, 17:05, "Hausmann, Steffen" 
> wrote:
> >
> >
> >
> >
> > Hi Piotr,
> >
> > I’m happy to take your guidance on this. I need to think through
> your
> > proposals and I’ll follow-up on Monday with some more context so
> that we
> > can close the discussion on these details. But for now, I’
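
To make the quoted signature concrete, a hypothetical writer against it.
{{AsyncSinkWriter}} and {{ResultFuture}} below are stand-ins following the
FLIP-171 prototype naming from this thread; none of this is a released Flink
API.

{code}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Stand-ins following the FLIP-171 prototype naming; not a released Flink API.
interface ResultFuture<T> {
    /** Completed with the entries that failed and should be re-queued. */
    void complete(List<T> failedEntries);
}

abstract class AsyncSinkWriter<InputT, RequestEntryT> {
    protected abstract void submitRequestEntries(
            List<RequestEntryT> requestEntries, ResultFuture<RequestEntryT> requestResult);
}

class HttpBatchSinkWriter extends AsyncSinkWriter<String, String> {

    @Override
    protected void submitRequestEntries(
            List<String> requestEntries, ResultFuture<String> requestResult) {
        // Fire the batch asynchronously; report failed entries back so the base
        // class can re-queue them and track the number of in-flight requests.
        CompletableFuture
                .supplyAsync(() -> sendBatch(requestEntries))
                .thenAccept(requestResult::complete);
    }

    private List<String> sendBatch(List<String> batch) {
        // Placeholder for the actual client call; returns the entries to retry.
        return Collections.emptyList();
    }
}
{code}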

Re: [DISCUSS] Backport HybridSource to 1.13.x

2021-08-20 Thread Till Rohrmann
Hi everyone,

if the HybridSource is self-contained, then I don't see a lot of harm in
doing the backport. We should probably explicitly state in the docs that a
certain minor version is required for this source to be available because
our docs don't distinguish between minor versions and having to use a
specific minor version for a feature might be confusing.

Speaking of which, I would actually prefer it if we first completed the
pending documentation and e2e test task because I believe they are more
important than the backport. W/o documentation, the HybridSource is
effectively non-existent. Also if the e2e test adds better test coverage,
then we probably should wait a bit with the backport.

Cheers,
Till

On Thu, Aug 19, 2021 at 6:54 PM Thomas Weise  wrote:

> PR: https://github.com/apache/flink/pull/16899
>
>
> On Mon, Aug 16, 2021 at 3:49 PM Thomas Weise  wrote:
>
> > Hi Arvid,
> >
> > I would not recommend mixing minor Flink versions:
> >
> > 1. As you said, 1.14 isn't released yet and most users would prefer
> > to not depend on snapshots
> > 2. There could be unrelated changes between 1.13 and 1.14 that make the
> > use of 1.14 artifact with 1.13 impossible or lead to unforeseen side
> > effects due to transitive dependencies
> >
> > Thanks,
> > Thomas
> >
> > On Mon, Aug 16, 2021 at 12:07 PM Arvid Heise  wrote:
> >
> >> Hi Thomas,
> >>
> >> that's neat. I forgot for a moment that connector-base is not part of
> >> flink-dist.
> >>
> >> I guess in theory, we could also omit the backport and simply refer
> users
> >> to 1.14 version. I'm assuming you want to have it in 1.13 since 1.14
> still
> >> takes a bit. Am I correct?
> >>
> >> On Mon, Aug 16, 2021 at 7:43 PM Thomas Weise  wrote:
> >>
> >> > Hi Arvid,
> >> >
> >> > Thank you for the reply. Can you please explain a bit more
> >> > your concern regarding an earlier bugfix level?
> >> >
> >> > I should have maybe made clear that the HybridSource can be used by
> >> > updating flink-connector-base in the application jar. It does not
> >> require
> >> > any addition to the runtime and therefore would work on any 1.13.x
> dist.
> >> >
> >> > For reference, I use it internally on top of 1.12.4.
> >> >
> >> > Thanks,
> >> > Thomas
> >> >
> >> >
> >> > On Mon, Aug 16, 2021 at 10:13 AM Arvid Heise 
> wrote:
> >> >
> >> > > Hi Thomas,
> >> > >
> >> > > since the change didn't modify any existing classes, I'm weakly in
> >> favor
> >> > of
> >> > > backporting. My reluctance mainly stems from possible
> disappointments
> >> > from
> >> > > 1.13 users that use an earlier bugfix level. So we need to make
> >> > > documentation clear.
> >> > >
> >> > > In general, I'm seeing connectors as something extra (and plan to
> make
> >> > that
> >> > > more transparent), so I think we have more freedom for backports
> >> there in
> >> > > contrast to other components. But it would be good to hear other
> >> opinions
> >> > > on that matter.
> >> > >
> >> > > On Mon, Aug 16, 2021 at 5:26 PM Thomas Weise 
> wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > HybridSource [1] [2] was recently merged to the main branch. I
> would
> >> > like
> >> > > > to propose to backport it to release-1.13. Although it is a new
> >> > feature,
> >> > > it
> >> > > > is also strictly additive and does not affect any existing code.
> The
> >> > > > benefit is that we will have a released runtime version that the
> >> > feature
> >> > > > can be used with for those that are interested in it and dependent
> >> on a
> >> > > > distribution they cannot modify.
> >> > > >
> >> > > > Are there any concerns backporting the change?
> >> > > >
> >> > > > Thanks,
> >> > > > Thomas
> >> > > >
> >> > > > [1] https://issues.apache.org/jira/browse/FLINK-22670
> >> > > > [2] https://github.com/apache/flink/pull/15924
> >> > > >
> >> > >
> >> >
> >>
> >
>
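
Until the pending documentation lands, a sketch of the intended usage based on
the merged PR [2]: a bounded file source is read first, then the source
switches over to Kafka. The paths, topic, and broker address are placeholders.

{code}
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HybridSourceExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded history first...
        FileSource<String> fileSource =
                FileSource.forRecordStreamFormat(
                                new TextLineFormat(), new Path("s3://bucket/history"))
                        .build();

        // ...then switch over to the live stream.
        KafkaSource<String> kafkaSource =
                KafkaSource.<String>builder()
                        .setBootstrapServers("broker:9092")
                        .setTopics("events")
                        .setValueOnlyDeserializer(new SimpleStringSchema())
                        .build();

        HybridSource<String> hybridSource =
                HybridSource.builder(fileSource).addSource(kafkaSource).build();

        env.fromSource(hybridSource, WatermarkStrategy.noWatermarks(), "hybrid-source")
                .print();
        env.execute("HybridSource example");
    }
}
{code}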


Re: [DISCUSS] Merge FLINK-23757 after feature freeze

2021-08-18 Thread Till Rohrmann
@Dian Fu  could you assess how involved this change is?
If the change is not very involved and the risk is limited, then I'd be in
favour of merging it because feature parity of APIs is quite important for
our users.

Cheers,
Till

On Wed, Aug 18, 2021 at 1:46 PM Ingo Bürk  wrote:

> Hello dev,
>
> I was wondering whether we could also consider merging FLINK-23757[1][2]
> after the freeze. This is about exposing two built-in functions which we
> added to Table API & SQL prior to the freeze also for PyFlink. Meaning that
> the feature itself isn't new, we only expose it on the Python API, and as
> such it's also entirely isolated from the rest of PyFlink and Flink itself.
> As such I'm not sure this is considered a new feature, but I'd rather ask.
> The main motivation for this would be to retain parity on the APIs. Thanks!
>
> [1] https://issues.apache.org/jira/browse/FLINK-23757
> [2] https://github.com/apache/flink/pull/16874
>
>
> Best
> Ingo
>


Re: Configuring flink-python in IntelliJ

2021-08-18 Thread Till Rohrmann
Thanks for starting this discussion Ingo. I guess that Dian can probably
answer this question. I am big +1 for updating our documentation for how to
set up the IDE since this problem will probably be encountered a couple of
times.

Cheers,
Till

On Wed, Aug 18, 2021 at 11:03 AM Ingo Bürk  wrote:

> Hello @dev,
>
> like probably most of you, I am using IntelliJ to work on Flink. Lately I
> needed to also work with flink-python, and thus was wondering about how to
> properly set it up to work in IntelliJ. So far, I have done the following:
>
> 1. Install the Python plugin
> 2. Set up a custom Python SDK using a virtualenv
> 3. Configure the flink-python module to use this Python SDK (rather than a
> Java SDK)
> 4. Install as many of the Python dependencies as possible when IntelliJ
> prompted me to do so
>
> This got me into a mostly working state, for example I can run tests from
> the IDE, at least the ones I tried so far. However, there are two concerns:
>
> a) Step (3) will be undone on every Maven import since Maven sets the SDK
> for the module
> b) Step (4) installed most of the Python dependencies, but a couple could not
> be installed, though so far that hasn't caused noticeable problems
>
> I'm wondering if there are additional / different steps to do, specifically
> for (a), and maybe how the PyFlink developers are configuring this in their
> IDE. The IDE setup guide didn't seem to contain information about that
> (only about separately setting up this module in PyCharm). Ideally, we
> could even update the IDE setup guide.
>
>
> Best
> Ingo
>


Re: [DISCUSS] Merging FLINK-21867 after feature freeze

2021-08-17 Thread Till Rohrmann
One addition: The change only affects the job-exceptions component of
Flink's web UI and is, thus, quite isolated.

Cheers,
Till

On Tue, Aug 17, 2021 at 4:42 PM David Morávek  wrote:

> Hi,
>
> We have a small UI change [1][2] that we'd like to get merged into 1.14
> [3]. The change allows displaying concurrent exceptions alongside the main
> exception in job's exception history (these were already present in the
> Rest API). This could bring a nice improvement for debugging of failed
> jobs.
>
> Are there any objections on merging this improvement?
>
> [1] https://issues.apache.org/jira/browse/FLINK-21867
> [2] https://github.com/apache/flink/pull/16807
> [3]
>
> https://lists.apache.org/thread.html/rc6cd39f467c42873ca2e9fa31dbe117c267d22ee3aa69bd8071219ff%40%3Cdev.flink.apache.org%3E
>
> Best,
> D.
>


[jira] [Created] (FLINK-23785) SinkITCase.testMetrics fails with ConcurrentModification on Azure

2021-08-15 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23785:
-

 Summary: SinkITCase.testMetrics fails with ConcurrentModification 
on Azure
 Key: FLINK-23785
 URL: https://issues.apache.org/jira/browse/FLINK-23785
 Project: Flink
  Issue Type: Bug
  Components: API / DataStream
Affects Versions: 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0


The {{SinkITCase.testMetrics}} test fails with a {{ConcurrentModificationException}}:

{code}
Aug 15 14:26:40 java.util.ConcurrentModificationException
Aug 15 14:26:40 at 
java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
Aug 15 14:26:40 at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
Aug 15 14:26:40 at java.util.AbstractSet.removeAll(AbstractSet.java:174)
Aug 15 14:26:40 at 
org.apache.flink.runtime.testutils.InMemoryReporter.applyRemovals(InMemoryReporter.java:78)
Aug 15 14:26:40 at 
org.apache.flink.runtime.testutils.InMemoryReporterRule.afterTestSuccess(InMemoryReporterRule.java:61)
Aug 15 14:26:40 at 
org.apache.flink.util.ExternalResource$1.evaluate(ExternalResource.java:57)
Aug 15 14:26:40 at 
org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
Aug 15 14:26:40 at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Aug 15 14:26:40 at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
Aug 15 14:26:40 at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
Aug 15 14:26:40 at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
Aug 15 14:26:40 at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
Aug 15 14:26:40 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
Aug 15 14:26:40 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
Aug 15 14:26:40 at 
org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
Aug 15 14:26:40 at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
Aug 15 14:26:40 at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
Aug 15 14:26:40 at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
Aug 15 14:26:40 at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22215&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=86f654fa-ab48-5c1a-25f4-7e7f6afb9bba&l=5225
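
The mechanic behind this trace is the fail-fast iterator of the HashMap-backed
set that {{AbstractSet.removeAll}} walks: any structural modification while the
iteration is running throws. A minimal, Flink-independent demonstration:

{code}
import java.util.HashSet;
import java.util.Set;

public class FailFastIteratorDemo {

    public static void main(String[] args) {
        Set<Integer> metrics = new HashSet<>();
        for (int i = 0; i < 10; i++) {
            metrics.add(i);
        }
        // Structurally modifying the set while iterating it (which is what
        // removeAll does internally) throws ConcurrentModificationException
        // on the next call to Iterator.next().
        for (Integer metric : metrics) {
            if (metric == 5) {
                metrics.add(100);
            }
        }
    }
}
{code}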





[jira] [Created] (FLINK-23778) UpsertKafkaTableITCase.testKafkaSourceSinkWithKeyAndFullValue hangs on Azure

2021-08-15 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23778:
-

 Summary: 
UpsertKafkaTableITCase.testKafkaSourceSinkWithKeyAndFullValue hangs on Azure
 Key: FLINK-23778
 URL: https://issues.apache.org/jira/browse/FLINK-23778
 Project: Flink
  Issue Type: Bug
  Components: Connectors / Kafka
Affects Versions: 1.14.0
Reporter: Till Rohrmann
 Fix For: 1.14.0


The test {{UpsertKafkaTableITCase.testKafkaSourceSinkWithKeyAndFullValue}} 
hangs on Azure.

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=22123&view=logs&j=c5f0071e-1851-543e-9a45-9ac140befc32&t=15a22db7-8faa-5b34-3920-d33c9f0ca23c&l=8847





Re: [ANNOUNCE] Apache Flink 1.11.4 released

2021-08-10 Thread Till Rohrmann
This is great news. Thanks a lot for being our release manager Godfrey and
also to everyone who made this release possible.

Cheers,
Till

On Tue, Aug 10, 2021 at 11:09 AM godfrey he  wrote:

> The Apache Flink community is very happy to announce the release of Apache
> Flink 1.11.4, which is the fourth bugfix release for the Apache Flink 1.11
> series.
>
>
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data streaming
> applications.
>
>
>
> The release is available for download at:
>
> https://flink.apache.org/downloads.html
>
>
>
> Please check out the release blog post for an overview of the improvements
> for this bugfix release:
>
> https://flink.apache.org/news/2021/08/09/release-1.11.4.html
>
>
>
> The full release notes are available in Jira:
>
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12349404
>
>
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
>
>
> Regards,
>
> Godfrey
>


Re: [ANNOUNCE] Apache Flink 1.12.5 released

2021-08-10 Thread Till Rohrmann
This is great news. Thanks a lot for being our release manager Jingsong and
also to everyone who made this release possible.

Cheers,
Till

On Tue, Aug 10, 2021 at 10:57 AM Jingsong Lee 
wrote:

> The Apache Flink community is very happy to announce the release of Apache
> Flink 1.12.5, which is the fifth bugfix release for the Apache Flink 1.12
> series.
>
>
>
> Apache Flink® is an open-source stream processing framework for
> distributed, high-performing, always-available, and accurate data streaming
> applications.
>
>
>
> The release is available for download at:
>
> https://flink.apache.org/downloads.html
>
>
>
> Please check out the release blog post for an overview of the improvements
> for this bugfix release:
>
> https://flink.apache.org/news/2021/08/06/release-1.12.5.html
>
>
>
> The full release notes are available in Jira:
>
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12350166
>
>
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
>
>
> Regards,
>
> Jingsong Lee
>


Re: Incompatible RAW types in Table API

2021-08-09 Thread Till Rohrmann
Hi Dominik,

I am pulling in Timo who might know more about this.

Cheers,
Till

On Mon, Aug 9, 2021 at 3:21 PM Dominik Wosiński  wrote:

> Hey all,
>
> I think I've hit some weird issue in Flink TypeInformation generation. I
> have the following code:
>
> val stream: DataStream[Event] = ...
> tableEnv.createTemporaryView("TableName", stream)
> val table = tableEnv
>   .sqlQuery("SELECT id, timestamp, eventType FROM TableName")
> tableEnv.toAppendStream[NewEvent](table)
>
> In this particular example *Event* is an Avro-generated class and *NewEvent*
> is just a POJO. This is just a toy example so please ignore the fact that
> this operation doesn't make much sense.
>
> When I try to run the code I am getting the following error:
>
> org.apache.flink.table.api.ValidationException: Column types of query
> result and sink for unregistered table do not match.
> Cause: Incompatible types for sink column 'licence' at position 0.
> Query schema: [id: RAW('org.apache.avro.util.Utf8', '...'),
>   timestamp: BIGINT NOT NULL, kind: RAW('org.test.EventType', '...')]
> Sink schema: [id: RAW('org.apache.avro.util.Utf8', '?'),
>   timestamp: BIGINT, kind: RAW('org.test.EventType', '?')]
>
> So, it seems that the type is recognized correctly, but for some reason
> there is still a mismatch according to Flink, maybe because a different
> type serializer is used?
>
> Thanks in advance for any help,
> Best Regards,
> Dom.
>
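
For illustration, a minimal Java sketch of one possible workaround, under the
assumption that the RAW columns come from the Avro-specific field types (Utf8
and the generated enum) not being mapped to SQL types: converting those fields
to plain String/Long values before registering the view lets the planner
derive STRING/BIGINT columns on both sides. Event stands for the reporter's
Avro-generated class, and the getId()/getTimestamp()/getEventType() accessors
are hypothetical stand-ins for the real ones:

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

class RawTypeWorkaroundSketch {

    // Convert Avro-specific types to SQL-friendly ones before the Table API
    // derives the schema, so no RAW(...) columns appear in the comparison.
    // Event is the reporter's Avro-generated class.
    static Table toTable(StreamTableEnvironment tableEnv, DataStream<Event> stream) {
        DataStream<Tuple3<String, Long, String>> converted = stream
                .map(e -> Tuple3.of(
                        e.getId().toString(),        // Utf8 -> String
                        e.getTimestamp(),            // already a long
                        e.getEventType().name()))    // enum -> String
                .returns(Types.TUPLE(Types.STRING, Types.LONG, Types.STRING));
        // "ts" avoids the reserved SQL keyword "timestamp".
        tableEnv.createTemporaryView("ConvertedEvents", converted,
                $("id"), $("ts"), $("eventType"));
        return tableEnv.sqlQuery("SELECT id, ts, eventType FROM ConvertedEvents");
    }
}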


Re: [ANNOUNCE] Apache Flink 1.13.2 released

2021-08-09 Thread Till Rohrmann
Thanks Yun Tang for being our release manager and the great work! Also
thanks a lot to everyone who contributed to this release.

Cheers,
Till

On Mon, Aug 9, 2021 at 9:48 AM Yu Li  wrote:

> Thanks Yun Tang for being our release manager and everyone else who made
> the release possible!
>
> Best Regards,
> Yu
>
>
> On Fri, 6 Aug 2021 at 13:52, Yun Tang  wrote:
>
>>
>> The Apache Flink community is very happy to announce the release of
>> Apache Flink 1.13.2, which is the second bugfix release for the Apache
>> Flink 1.13 series.
>>
>> Apache Flink® is an open-source stream processing framework for
>> distributed, high-performing, always-available, and accurate data streaming
>> applications.
>>
>> The release is available for download at:
>> https://flink.apache.org/downloads.html
>>
>> Please check out the release blog post for an overview of the
>> improvements for this bugfix release:
>> https://flink.apache.org/news/2021/08/06/release-1.13.2.html
>>
>> The full release notes are available in Jira:
>>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12350218&styleName=&projectId=12315522
>>
>> We would like to thank all contributors of the Apache Flink community who
>> made this release possible!
>>
>> Regards,
>> Yun Tang
>>
>


Re: [VOTE] FLIP-180: Adjust StreamStatus and Idleness definition

2021-08-09 Thread Till Rohrmann
+1 (binding)

Cheers,
Till

On Thu, Aug 5, 2021 at 9:09 PM Arvid Heise  wrote:

> Dear devs,
>
> I'd like to open a vote on FLIP-180: Adjust StreamStatus and Idleness
> definition [1] which was discussed in this thread [2].
> The vote will be open for at least 72 hours unless there is an objection or
> not enough votes.
>
> Best,
>
> Arvid
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-180%3A+Adjust+StreamStatus+and+Idleness+definition
> [2]
>
> https://lists.apache.org/thread.html/r8357d64b9cfdf5a233c53a20d9ac62b75c07c925ce2c43e162f1e39c%40%3Cdev.flink.apache.org%3E
>


Re: Looking for Maintainers for Flink on YARN

2021-08-05 Thread Till Rohrmann
Great to hear from everybody. I've created an invite for Thu, Aug 11, 9am
CEST and invited Gabor and Yangze. Let me know if somebody else wants to
join as well. I'll then add them to the calendar invite.

Cheers,
Till

On Thu, Aug 5, 2021 at 3:50 AM Yangze Guo  wrote:

> Thanks Konstantin for bringing this up, and thanks Marton and Gabor
> for stepping up!
>
> Both dates work for me. If there is something our team can help with
> for the kick-off, please let me know.
>
> Best,
> Yangze Guo
>
> On Thu, Aug 5, 2021 at 9:39 AM Xintong Song  wrote:
> >
> > Yangze will join the kick-off on behalf of our team.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Wed, Aug 4, 2021 at 8:36 PM Gabor Somogyi 
> wrote:
> >>
> >> Both are good to me.
> >>
> >> G
> >>
> >>
> >> On Wed, Aug 4, 2021 at 9:30 AM Konstantin Knauf 
> wrote:
> >>
> >> > Hi everyone
> >> >
> >> > Thank you Marton & team for stepping up and thanks, Xintong, for
> offering
> >> > your continued support. I'd like to leave this thread "open" for a bit
> >> > longer so that others can continue to chime in. In the meantime, I
> would
> >> > already like to propose two dates for an informal kick-off:
> >> >
> >> > * Thu, Aug 11, 9am CEST
> >> > * Fri, Aug 12, 9am CEST
> >> >
> >> > Would any of these dates work for you? As an agenda we propose
> something
> >> > along the lines of:
> >> >
> >> > * Introductions
> >> > * Architecture Introduction by Till
> >> > * Handover of Ongoing Tickets
> >> > * Open Questions
> >> >
> >> > As I am out of office for a month starting tomorrow, Till will take
> this
> >> > thread over from hereon.
> >> >
> >> > Best,
> >> >
> >> > Konstantin
> >> >
> >> > Cheers,
> >> >
> >> > Konstantin
> >> >
> >> >
> >> > On Mon, Aug 2, 2021 at 9:36 AM Gabor Somogyi <
> gabor.g.somo...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hi All,
> >> > >
> >> > > Just arrived back from holiday and seen this thread.
> >> > > I'm working with Marton and you can ping me directly on jira/PR in
> need.
> >> > >
> >> > > BR,
> >> > > G
> >> > >
> >> > >
> >> > > On Mon, Aug 2, 2021 at 5:02 AM Xintong Song 
> >> > wrote:
> >> > >
> >> > > > Thanks Konstantin for bringing this up, and thanks Marton and
> your team
> >> > > for
> >> > > > volunteering.
> >> > > >
> >> > > > @Marton,
> >> > > > Our team at Alibaba will keep helping with the Flink on YARN
> >> > maintenance.
> >> > > > However, as Konstaintin said, the time our team dedicated to this
> will
> >> > > > probably be less than it used to be. Your offer of help is
> significant
> >> > > and
> >> > > > greatly appreciated. Please feel free to reach out to us if you
> need
> >> > any
> >> > > > help on this.
> >> > > >
> >> > > > Thank you~
> >> > > >
> >> > > > Xintong Song
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Thu, Jul 29, 2021 at 5:04 PM Márton Balassi <
> >> > balassi.mar...@gmail.com
> >> > > >
> >> > > > wrote:
> >> > > >
> >> > > > > Hi Konstantin,
> >> > > > >
> >> > > > > Thank you for raising this topic, our development team at
> Cloudera
> >> > > would
> >> > > > be
> >> > > > > happy to step up to address this responsibility.
> >> > > > >
> >> > > > > Best,
> >> > > > > Marton
> >> > > > >
> >> > > > > On Thu, Jul 29, 2021 at 10:15 AM Konstantin Knauf <
> kna...@apache.org
> >> > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Dear community,
> >> > > > > >
> >> > > > > > We are looking for community members, who would like to
> maintain
> >> > > > Flink's
> >> > > > > > YARN support going forward. So far, this has been handled by
> teams
> >> > at
> >> > > > > > Ververica & Alibaba. The focus of these teams has shifted
> over the
> >> > > past
> >> > > > > > months so that we only have little time left for this topic.
> Still,
> >> > > we
> >> > > > > > think, it is important to maintain high quality support for
> Flink
> >> > on
> >> > > > > YARN.
> >> > > > > >
> >> > > > > > What does "Maintaining Flink on YARN" mean? There are no known
> >> > bigger
> >> > > > > > efforts outstanding. We are mainly talking about addressing
> >> > > > > > "test-stability" issues, bugs, version upgrades, community
> >> > > > contributions
> >> > > > > &
> >> > > > > > smaller feature requests. The prioritization of these would
> be up
> >> > to
> >> > > > the
> >> > > > > > future maintainers, except "test-stability" issues which are
> >> > > important
> >> > > > to
> >> > > > > > address for overall productivity.
> >> > > > > >
> >> > > > > > If a group of community members forms itself, we are happy to
> give
> >> > an
> >> > > > > > introduction to relevant pieces of the code base, principles,
> >> > > > > assumptions,
> >> > > > > > ... and hand over open threads.
> >> > > > > >
> >> > > > > > If you would like to take on this responsibility or can join
> this
> >> > > > effort
> >> > > > > in
> >> > > > > > a supporting role, please reach out!
> >> > > > > >
> >> > > > > > Cheers,
> >> > > > > >
> >> > > > > > Konstantin
> >> > > > > > for the Deployment & Coordination Team 

Re: [FLINK-23555] needs your attention

2021-08-05 Thread Till Rohrmann
Hi,

At the moment many community members are busy with finishing their work for
the upcoming feature freeze. This can cause slow responsiveness on the JIRA
tickets. Once the feature freeze is past, people will surely get back to
you.

Cheers,
Till

On Mon, Aug 2, 2021 at 8:43 AM 卫博文  wrote:

> Hi, PMC & committers,
>
>
> I have opened an issue about common subexpression elimination:
> https://issues.apache.org/jira/browse/FLINK-23555
> I experimented with the flink-table-planner module to improve common
> subexpression elimination by using localRef,
> and the results show that my changes make a big difference.
> The issue was filed several days ago but has received little attention.
> I am not sure of the best way to share my code, or whether the approach
> is suitable. I would greatly appreciate your help and attention.
>
>
> Yours sincerely


Re: About [FLINK-22405] Support fixed-lengh chars in the LeadLag built-in function

2021-08-05 Thread Till Rohrmann
Hi Liwei,

Thanks for helping the community. At the moment most of the committers are
busy with finishing their work for the upcoming feature freeze. This might
cause some delay in responses. We'll get back to your PR as soon as someone
has capacity.

cc Jark and Timo.

Cheers,
Till

On Sun, Aug 1, 2021 at 8:31 AM liwei li  wrote:

> Hi:
> Flink provides very strong support for our project, and I want to give
> something back to the community. I selected an issue with a starter label
> and opened a pull request. If my submission process is incorrect, I am
> very sorry. Thanks.
>
> ISSUE:
> [FLINK-22405] Support fixed-lengh chars in the LeadLag built-in function -
> ASF JIRA (apache.org) 
> Request:
> [FLINK-22405] [Table SQL / API] Support fixed-lengh chars in the Lead… by
> hililiwei · Pull Request #16650 · apache/flink · GitHub
> 
>
>
> thx.
> liwei li
>


Re: [DISCUSS] FLIP-180: Adjust StreamStatus and Idleness definition

2021-08-05 Thread Till Rohrmann
Coming back to my previous comment: I would actually propose to separate
the discussion about whether to expose the WatermarkStatus in the sinks or
not from correcting the StreamStatus and Idleness definition in order to
keep the scope of this FLIP as small as possible. If there is a good reason
to expose the WatermarkStatus, then we can probably do it.

Cheers,
Till

On Fri, Jul 30, 2021 at 2:29 PM Arvid Heise  wrote:

> Hi Martijn,
>
> 1. Good question. The watermarks and statuses of the splits are first
> aggregated before emitted through the reader. The watermark strategy of the
> user is actually applied on all SourceOutputs (=splits). Since one split is
> active and one is idle, the watermark of the reader will not advance until
> the user-defined idleness is triggered on the idle split. At this point,
> the combined watermark solely depends on the active split. The combined
> status remains ACTIVE.
> 2. Kafka has no dynamic partitions. This is a complete misnomer on Flink
> side. In fact, if you search for Kafka and partition discovery, you will
> only find Flink resources. What we actually do is dynamic topic discovery
> and that can only be triggered through pattern afaik. We could go for topic
> discovery on all patterns by default if we don't do that already.
> 3. Yes, idleness on assigned partitions would even work with dynamic
> assignments. I will update the FLIP to reflect that.
> 4. Afaik it was only meant for scenario 2 (and your question 3) and it
> should be this way after the FLIP. I don't know of any source
> implementation that uses the user-specified idleness to handle scenario 3.
> The thing that is currently extra is that some readers go idle, when the
> reader doesn't have an active assignment.
>
> Best,
>
> Arvid
>
> On Fri, Jul 30, 2021 at 12:17 PM Martijn Visser 
> wrote:
>
> > Hi all,
> >
> > I have a couple of questions after studying the FLIP and the docs:
> >
> > 1. What happens when one of the readers has two splits assigned and one
> of
> > the splits actually receives data?
> >
> > 2. If I understand it correctly the Kinesis Source uses dynamic shard
> > discovery by default (so in case of idleness scenario 3 would happen
> there)
> > and the FileSource also has a dynamic assignment. The Kafka Source
> doesn't
> > use dynamic partition discovery by default (so scenario 2 would be the
> > default to happen there). Why did we choose to not enable dynamic
> partition
> > discovery by default and should we actually change that?
> >
> > 3. To be sure, is it correct that in case of a dynamic assignment and
> there
> > is temporarily no data, that scenario 2 is applicable?
> >
> > 4. Does WatermarkStrategy#withIdleness currently cover scenario 2, 3 and
> > the one from my 3rd question? (edited)
> >
> > Best regards,
> >
> > Martijn
> >
> > On Fri, 23 Jul 2021 at 15:57, Till Rohrmann 
> wrote:
> >
> > > Hi everyone,
> > >
> > > I would be in favour of what Arvid said about not exposing the
> > > WatermarkStatus to the Sink. Unless there is a very strong argument
> that
> > > this is required I think that keeping this concept internal seems to me
> > the
> > > better choice right now. Moreover, as Arvid said the downstream
> > application
> > > can derive the WatermarkStatus on their own depending on its business
> > > logic.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Fri, Jul 23, 2021 at 2:15 PM Arvid Heise  wrote:
> > >
> > > > Hi Eron,
> > > >
> > > > thank you very much for your feedback.
> > > >
> > > > Please mention that the "temporary status toggle" code will be
> removed.
> > > > >
> > > > This code is already removed but there is still some automation of
> > going
> > > > idle when temporarily no splits are assigned. I will include it in the
> > > FLIP.
> > > >
> > > > I agree with adding the markActive() functionality, for symmetry.
> > > Speaking
> > > > > of symmetry, could we now include the minor enhancement we
> discussed
> > in
> > > > > FLIP-167, the exposure of watermark status changes on the Sink
> > > interface.
> > > > > I drafted a PR and would be happy to revisit it.
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/streamnative/flink/pull/2/files#diff-64d9c652ffc3c60b6d838200a24b106eeeda4b2d853deae94dbbdf16d8d694c2R62-R70
> > > >
>
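
To make the two mechanisms discussed in this thread concrete, a minimal Java
sketch against the FLIP-27 Kafka source (1.14-era APIs assumed): the topic
pattern enables the dynamic topic discovery Arvid describes, and withIdleness
is the user-specified idleness from scenario 2. The broker address, pattern,
and durations are placeholder values:

import java.time.Duration;
import java.util.regex.Pattern;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

class IdlenessSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setGroupId("idleness-sketch")
                .setTopicPattern(Pattern.compile("events-.*"))       // dynamic topic discovery
                .setProperty("partition.discovery.interval.ms", "60000")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // A split that produces no records for one minute stops holding back
        // the combined watermark until it becomes active again.
        WatermarkStrategy<String> strategy = WatermarkStrategy
                .<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withIdleness(Duration.ofMinutes(1));

        env.fromSource(source, strategy, "kafka-source").print();
        env.execute("idleness-sketch");
    }
}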

[jira] [Created] (FLINK-23637) MeasurementInfoProviderTest.testNormalizingTags fails on Azure

2021-08-05 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23637:
-

 Summary: MeasurementInfoProviderTest.testNormalizingTags fails on 
Azure
 Key: FLINK-23637
 URL: https://issues.apache.org/jira/browse/FLINK-23637
 Project: Flink
  Issue Type: Technical Debt
  Components: Runtime / Metrics
Affects Versions: 1.13.1
Reporter: Till Rohrmann


The test {{MeasurementInfoProviderTest.testNormalizingTags}} fails on Azure with

{code}
Aug 04 15:10:31 [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time 
elapsed: 0.954 s <<< FAILURE! - in 
org.apache.flink.metrics.influxdb.MeasurementInfoProviderTest
Aug 04 15:10:31 [ERROR] 
testNormalizingTags(org.apache.flink.metrics.influxdb.MeasurementInfoProviderTest)
  Time elapsed: 0.007 s  <<< ERROR!
Aug 04 15:10:31 java.lang.UnsupportedOperationException: unexpected method call
Aug 04 15:10:31 at 
org.apache.flink.metrics.influxdb.MeasurementInfoProviderTest.lambda$testNormalizingTags$1(MeasurementInfoProviderTest.java:79)
Aug 04 15:10:31 at 
org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:103)
Aug 04 15:10:31 at 
org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
Aug 04 15:10:31 at 
org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:35)
Aug 04 15:10:31 at 
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:63)
Aug 04 15:10:31 at 
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:49)
Aug 04 15:10:31 at 
org.mockito.internal.creation.bytebuddy.MockMethodInterceptor$DispatcherDefaultingToRealMethod.interceptSuperCallable(MockMethodInterceptor.java:110)
Aug 04 15:10:31 at 
org.apache.flink.runtime.metrics.groups.FrontMetricGroup$MockitoMock$251976889.getLogicalScope(Unknown
 Source)
Aug 04 15:10:31 at 
org.apache.flink.metrics.influxdb.MeasurementInfoProvider.getLogicalScope(MeasurementInfoProvider.java:70)
Aug 04 15:10:31 at 
org.apache.flink.metrics.influxdb.MeasurementInfoProvider.getScopedName(MeasurementInfoProvider.java:65)
Aug 04 15:10:31 at 
org.apache.flink.metrics.influxdb.MeasurementInfoProvider.getMetricInfo(MeasurementInfoProvider.java:49)
Aug 04 15:10:31 at 
org.apache.flink.metrics.influxdb.MeasurementInfoProviderTest.testNormalizingTags(MeasurementInfoProviderTest.java:83)
Aug 04 15:10:31 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)
Aug 04 15:10:31 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
Aug 04 15:10:31 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
Aug 04 15:10:31 at java.lang.reflect.Method.invoke(Method.java:498)
Aug 04 15:10:31 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
Aug 04 15:10:31 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
Aug 04 15:10:31 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
Aug 04 15:10:31 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
Aug 04 15:10:31 at 
org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
Aug 04 15:10:31 at 
org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
Aug 04 15:10:31 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
Aug 04 15:10:31 at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
Aug 04 15:10:31 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
Aug 04 15:10:31 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
Aug 04 15:10:31 at 
org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
Aug 04 15:10:31 at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
Aug 04 15:10:31 at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
Aug 04 15:10:31 at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
Aug 04 15:10:31 at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
Aug 04 15:10:31 at 
org.junit.runners.ParentRunner.run(ParentRunner.java:363)
Aug 04 15:10:31 at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
Aug 04 15:10:31 at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
Aug 04 15:10:31 at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
Aug 04 15:10:31 at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
Aug 04 1

Re: [DISCUSS] FLIP-180: Adjust StreamStatus and Idleness definition

2021-07-23 Thread Till Rohrmann
Hi everyone,

I would be in favour of what Arvid said about not exposing the
WatermarkStatus to the Sink. Unless there is a very strong argument that
this is required I think that keeping this concept internal seems to me the
better choice right now. Moreover, as Arvid said the downstream application
can derive the WatermarkStatus on their own depending on its business logic.

Cheers,
Till

On Fri, Jul 23, 2021 at 2:15 PM Arvid Heise  wrote:

> Hi Eron,
>
> thank you very much for your feedback.
>
> Please mention that the "temporary status toggle" code will be removed.
> >
> This code is already removed but there is still some automation of going
> idle when temporarily no splits are assigned. I will include it in the FLIP.
>
> I agree with adding the markActive() functionality, for symmetry.  Speaking
> > of symmetry, could we now include the minor enhancement we discussed in
> > FLIP-167, the exposure of watermark status changes on the Sink interface.
> > I drafted a PR and would be happy to revisit it.
> >
> >
> https://github.com/streamnative/flink/pull/2/files#diff-64d9c652ffc3c60b6d838200a24b106eeeda4b2d853deae94dbbdf16d8d694c2R62-R70
>
> I'm still not sure if that's a good idea.
>
> If we have now refined idleness to be a user-specified,
> application-specific way to handle temporarily stalled partitions,
> then downstream applications should actually implement their own idleness
> definition. Let's see what other devs think. I'm pinging the ones that have
> been most involved in the discussion: @Stephan Ewen 
> @Till
> Rohrmann  @Dawid Wysakowicz 
> .
>
> The flip mentions a 'watermarkstatus' package for the WatermarkStatus
> > class.  Should it be 'eventtime' package?
> >
> Are you proposing org.apache.flink.api.common.eventtime? I was simply
> suggesting to rename
> org.apache.flink.streaming.runtime.streamstatus but I'm very open for other
> suggestions (given that there are only 2 classes in the package).
>
>
> > Regarding the change of 'streamStatus' to 'watermarkStatus', could you
> > spell out what the new method names will be on each interface? May I
> > suggest that Input.emitStreamStatus be Input.processStreamStatus?  This
> is
> > to help decouple the input's watermark status from the output's watermark
> > status.
> >
> I haven't found
> org.apache.flink.streaming.api.operators.Input#emitStreamStatus in master.
> Could you double-check if I'm looking at the correct class?
>
> The current idea was mainly to grep+replace /streamStatus/watermarkStatus/
> and /StreamStatus/WatermarkStatus/. But again I'm very open for more
> descriptive names. I can add an explicit list later. I'm assuming you are
> only interested in (semi-)public classes.
>
>
> > I observe that AbstractStreamOperator is hardcoded to derive the output
> > channel's status from the input channel's status.  May I suggest
> > we refactor "AbstractStreamOperator::emitStreamStatus(StreamStatus,int)"
> to
> > allow for the operator subclass to customize the processing of the
> > aggregated watermark and watermark status.
> >
> Can you add a motivation for that?
> @Dawid Wysakowicz  , I think you are the last
> person that touched the code. Do you have some example operators in
> mind that would change it?
>
> Maybe the FLIP should spell out the expected behavior of the generic
> > watermark generator (TimestampsAndWatermarksOperator).  Should the
> > generator ignore the upstream idleness signal?  I believe it propagates
> the
> > signal, even though it also generates its own signals.   Given that
> > source-based and generic watermark generation shouldn't be combined, one
> > could argue that the generic watermark generator should activate only
> when
> > its input channel's watermark status is idle.
> >
> I will add a section. In general, we assume that we only have source-based
> watermark generators once FLIP-27 is properly adopted.
>
> Best,
>
> Arvid
>
> On Wed, Jul 21, 2021 at 12:40 AM Eron Wright
>  wrote:
>
> > This proposal to narrow the definition of idleness to focus on the
> > event-time clock is great.
> >
> > Please mention that the "temporary status toggle" code will be removed.
> >
> > I agree with adding the markActive() functionality, for symmetry.
> Speaking
> > of symmetry, could we now include the minor enhancement we discussed in
> > FLIP-167, the exposure of watermark status changes on the Sink interface.
> I drafted a PR and would be happy to revisit it.
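
As a concrete illustration of the idleness definition being refined here, a
minimal Java sketch of a watermark generator that declares its output idle
after a quiet period; the 60-second threshold is an arbitrary example, and
emitting a watermark again implicitly reactivates the output (with
markActive() as the explicit, symmetric call proposed in the FLIP):

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

class IdleAwareGenerator<E> implements WatermarkGenerator<E> {

    private long maxTimestamp = Long.MIN_VALUE;
    private long lastEventNanos = System.nanoTime();

    @Override
    public void onEvent(E event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        lastEventNanos = System.nanoTime();
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        if (System.nanoTime() - lastEventNanos > 60_000_000_000L) {
            // No events for a minute: stop holding back downstream watermarks.
            output.markIdle();
        } else if (maxTimestamp != Long.MIN_VALUE) {
            output.emitWatermark(new Watermark(maxTimestamp));
        }
    }
}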

Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-23 Thread Till Rohrmann
 able to skip
> creating new transactions after finish() is called.
> > Perhaps we may treat the process as a protocol between the framework and
> the operators, and the operators
> > might need to follow the protocol. Of course it would be better that the
> framework could handle all the cases
> > elegantly, and put fewer implicit limitations on the operators, and at
> least we should guarantee not to cause compatibility
> > problems after changing the process.
> >
> > Many thanks for the careful checks on the whole process.
> >
> > Best,
> > Yun
> >
> >
> > --
> > From:Piotr Nowojski 
> > Send Time:2021 Jul. 22 (Thu.) 15:33
> > To:dev ; Yun Gao 
> > Cc:Till Rohrmann ; Yun Gao
> ; Piotr Nowojski 
> > Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks
> Finished
> >
> > Hi Guowei,
> >
> >> Thank Dawid and Piotr for sharing the problem. +1 to EndInput/Finish
> can becalled repeatedly.
> > Just to clarify. It's not about calling `finish()` and `endInput()`
> repeatedly, but about (from the perspective of operator's state)
> > 1. seeing `finish()`
> > 2. checkpoint X triggered and completed
> > 3. failure + recovery from X
> > 4. potentially processing more records
> > 5. another `finish()`
> >
> > But from the context of the remaining part of your message Guowei I
> presume that you have already got that point :)
> >
> > Yun:
> >
> >> For this issue perhaps we could explicitly requires the task to wait
> for a checkpoint triggered after finish()> method is called for all the
> operators ? We could be able to achieve this target by maintaining
> >> some state inside the task.
> > Isn't this exactly the "WAITING_FOR_FINAL_CP" from the FLIP document?
> That we always need to wait for a checkpoint triggered after `finish()` to
> complete, before shutting down a task?
> >
> > What Dawid was describing is a scenario where:
> > 1. task/operator received `finish()`
> > 2. checkpoint 42 triggered (not yet completed)
> > 3. checkpoint 43 triggered (not yet completed)
> > 4. checkpoint 44 triggered (not yet completed)
> > 5. notifyCheckpointComplete(43)
> >
> > And what should we do now? We can of course commit all transactions
> until checkpoint 43. But should we keep waiting for
> `notifyCheckpointComplete(44)`? What if in the meantime another checkpoint
> is triggered? We could end up waiting indefinitely.
> >
> Our proposal is to shut down the task immediately after seeing the first
> `notifyCheckpointComplete(X)`, where X is any triggered checkpoint AFTER
> `finish()`. This should be fine, as:
> > a) ideally there should be no new pending transactions opened after
> checkpoint 42
> > b) even if operator/function is opening some transactions for checkpoint
> 43 and checkpoint 44 (`FlinkKafkaProducer`), those transactions after
> checkpoint 42 should be empty
> >
> > Hence comment from Dawid
> >> We must make sure those checkpoints do not leave lingering external
> resources.
> After seeing 5. (notifyCheckpointComplete(43)), it should be good enough
> to:
> > - commit transactions from checkpoint 42, (and 43 if they were created,
> depends on the user code)
> > - close operator, aborting any pending transactions (for checkpoint 44
> if they were opened, depends on the user code)
> >
> > If checkpoint 44 completes afterwards, it will still be valid. Ideally
> we would recommend that after seeing `finish()` operators/functions should
> not be opening any new transactions, but that shouldn't be required.
> >
> > Best,
> > Piotrek
> On Thu, Jul 22, 2021 at 09:00, Yun Gao 
> wrote:
> > Hi Dawid, Piotr, Steven,
> >
>  Many thanks for pointing out these issues and for the
> discussion!
> >
> >  Failure before notifyCheckpointComplete()
> >
> >  For this issue I would agree with what Piotr has proposed. I tried to
> use some
>  operators like sink / window as examples and currently I have not
> found
> >  explicit scenarios that might cause problems if records are processed
> by a
> >  task that is assigned with states snapshotted after calling finish()
> before.
> >  For the future cases it seems users should be able to implement their
>  target logic by explicitly adding a flag regarding finished, and perhaps
> have
> >  different logic if this part of states are referred to. Besides, this
> case would
> >  only happen on rescaling or topology change, which embedded some kind
> of user
> >  knowledge inside the action. Thus it lo

Re: [VOTE] FLIP-183: Dynamic buffer size adjustment

2021-07-22 Thread Till Rohrmann
+1 (binding)

@Anton it is usually a good practice to start a new mailing list thread for
the vote. It should refer to the discussion thread and have a subject line
of the form "[VOTE] FLIP-183: Dynamic buffer size adjustment". Next time,
let's do it like this.

Cheers,
Till

On Wed, Jul 21, 2021 at 1:43 PM Yuan Mei  wrote:

> +1 (binding)
>
> Best
> Yuan
>
> On Wed, Jul 21, 2021 at 7:40 PM Piotr Nowojski 
> wrote:
>
> > +1 (binding)
> >
> > Piotrek
> >
> > On Wed, Jul 21, 2021 at 13:21, Anton Kalashnikov 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > I would like to start a vote on FLIP-183 [1] which was discussed in
> this
> > > thread [2].
> > > The vote will be open for at least 72 hours unless there is an
> objection
> > > or not enough votes.
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-183%3A+Dynamic+buffer+size+adjustment
> > > [2]
> > >
> > >
> >
> https://lists.apache.org/thread.html/r0d06131b35fe641df787c16e8bcd3784161f901062c25778ed92871b%40%3Cdev.flink.apache.org%3E
> > > --
> > > Best regards,
> > > Anton Kalashnikov
> > >
> >
>


Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-22 Thread Till Rohrmann
FLIP document? That
> we always need to wait for a checkpoint triggered after `finish()` to
> complete, before shutting down a task?
>
> What Dawid was describing is a scenario where:
> 1. task/operator received `finish()`
> 2. checkpoint 42 triggered (not yet completed)
> 3. checkpoint 43 triggered (not yet completed)
> 4. checkpoint 44 triggered (not yet completed)
> 5. notifyCheckpointComplete(43)
>
> And what should we do now? We can of course commit all transactions until
> checkpoint 43. But should we keep waiting for
> `notifyCheckpointComplete(44)`? What if in the meantime another checkpoint
> is triggered? We could end up waiting indefinitely.
>
> Our proposal is to shut down the task immediately after seeing the first
> `notifyCheckpointComplete(X)`, where X is any triggered checkpoint AFTER
> `finish()`. This should be fine, as:
> a) ideally there should be no new pending transactions opened after
> checkpoint 42
> b) even if operator/function is opening some transactions for checkpoint
> 43 and checkpoint 44 (`FlinkKafkaProducer`), those transactions after
> checkpoint 42 should be empty
>
> Hence comment from Dawid
> > We must make sure those checkpoints do not leave lingering external
> resources.
>
> After seeing 5. (notifyCheckpointComplete(43)), it should be good enough to:
> - commit transactions from checkpoint 42, (and 43 if they were created,
> depends on the user code)
> - close operator, aborting any pending transactions (for checkpoint 44 if
> they were opened, depends on the user code)
>
> If checkpoint 44 completes afterwards, it will still be valid. Ideally we
> would recommend that after seeing `finish()` operators/functions should not
> be opening any new transactions, but that shouldn't be required.
>
> Best,
> Piotrek
>
> On Thu, Jul 22, 2021 at 09:00, Yun Gao 
> wrote:
> Hi Dawid, Piotr, Steven,
>
> Many thanks for pointing out these issues and for the
> discussion!
>
> Failure before notifyCheckpointComplete()
>
> For this issue I would agree with what Piotr has proposed. I tried to use
> some
> operators like sink / window as examples and currently I have not found
> explicit scenarios that might cause problems if records are processed by a
> task that is assigned with states snapshotted after calling finish()
> before.
> For the future cases it seems users should be able to implement their
> target logic by explicitly adding a flag regarding finished, and perhaps have
> different logic if this part of states are referred to. Besides, this case
> would
> only happen on rescaling or topology change, which embedded some kind of
> user
> knowledge inside the action. Thus it looks acceptable that we still split
> the operators'
> state from the task lifecycle, and do not treat checkpoint after finish()
> differently.
>
> Finishing upon receiving notifyCheckpointComplete() of not the latest
> checkpoint
>
> For this issue perhaps we could explicitly requires the task to wait for a
> checkpoint triggered after finish()
> method is called for all the operators ? We could be able to achieve this
> target by maintaining
> some state inside the task.
>
> Checkpointing from a single subtask / UnionListState case
>
> This should indeed cause problems, and I also agree with that we could
> focus on this thread in the
> https://issues.apache.org/jira/browse/FLINK-21080.
>
> Best,
> Yun
>
>
> --
> From:Piotr Nowojski 
> Send Time:2021 Jul. 22 (Thu.) 02:46
> To:dev 
> Cc:Yun Gao ; Till Rohrmann ;
> Yun Gao ; Piotr Nowojski <
> pi...@ververica.com>
> Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks
> Finished
>
> Hi Steven,
>
> > I probably missed something here. Isn't this the case today already? Why is it
> a concern for the proposed change?
>
> The problem is with the newly added `finish()` method and the already
> existing `endInput()` call. Currently on master there are no issues,
> because we are not checkpointing any operators after some operators have
> finished. The purpose of this FLIP-147 is to exactly enable this and this
> opens a new problem described by Dawid.
>
> To paraphrase and to give a concrete example.  Assume we have an operator
> with parallelism of two. Subtask 0 and subtask 1.
>
> 1. Subtask 0 received both `endInput()` and `finish()`, but subtask 1
> hasn't (yet).
> 2. Checkpoint 42 is triggered, and it completes.
> 3. Job fails and is restarted, but at the same time it's rescaled. User
> has chosen to scale his operator down to 1.
>
> Now we have a pickle. We don't know if `notifyCheckpointComplete(42)` ha
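
A minimal Java sketch of the commit rule Piotr describes: stage one
transaction per triggered checkpoint, commit everything up to a completed
checkpoint, and abort whatever is still pending when the task shuts down.
Transaction is a hypothetical placeholder type, not a Flink API; only
CheckpointListener is:

import java.util.NavigableMap;
import java.util.TreeMap;

import org.apache.flink.api.common.state.CheckpointListener;

class CommitOnNotificationSketch implements CheckpointListener {

    interface Transaction {
        void commit();
        void abort();
    }

    private final NavigableMap<Long, Transaction> pending = new TreeMap<>();

    /** Called from snapshotState(checkpointId): stage the open transaction. */
    void stage(long checkpointId, Transaction txn) {
        pending.put(checkpointId, txn);
    }

    @Override
    public void notifyCheckpointComplete(long completedId) {
        // Commit all transactions staged for checkpoints <= completedId
        // (42 and possibly 43 in the example above)...
        NavigableMap<Long, Transaction> confirmed = pending.headMap(completedId, true);
        confirmed.values().forEach(Transaction::commit);
        confirmed.clear();
        // ...while later pending ones (e.g. 44) stay staged until their own
        // notification, or until close() aborts them.
    }

    /** Called when the task shuts down: abort anything never confirmed. */
    void close() {
        pending.values().forEach(Transaction::abort);
        pending.clear();
    }
}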

Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread Till Rohrmann
Thanks for your inputs Gen and Arnaud.

I do agree with you, Gen, that we need better guidance for our users on
when to change the heartbeat configuration. I think this should happen in
any case. I am, however, not so sure whether we can give hard threshold
like 5000 tasks, for example, because as Arnaud said it strongly depends on
the workload. Maybe we can explain it based on symptoms a user might
experience and what to do then.

Concerning your workloads, Arnaud, I'd be interested to learn a bit more.
The user code runs in its own thread. This means that its operation won't
block the main thread/heartbeat. The only thing that can happen is that the
user code starves the heartbeat in terms of CPU cycles or causes a lot of
GC pauses. If you are observing the former problem, then we might think
about changing the priorities of the respective threads. This should then
improve Flink's stability for these workloads and a shorter heartbeat
timeout should be possible.

Also for the RAM-cached repositories, what exactly is causing the heartbeat
to time out? Is it because you have a lot of GC or that the heartbeat
thread does not get enough CPU cycles?

Cheers,
Till

On Thu, Jul 22, 2021 at 9:16 AM LINZ, Arnaud 
wrote:

> Hello,
>
>
>
> From a user perspective: we have some (rare) use cases where we use
> “coarse grain” datasets, with big beans and tasks that do lengthy operation
> (such as ML training). In these cases we had to increase the time out to
> huge values (heartbeat.timeout: 50) so that our app is not killed.
>
> I’m aware this is not the way Flink was meant to be used, but it’s a
> convenient way to distribute our workload on datanodes without having to
> use another concurrency framework (such as M/R) that would require the
> recoding of sources and sinks.
>
>
>
> In some other (most common) cases, our tasks do some R/W accesses to
> RAM-cached repositories backed by a key-value storage such as Kudu (or
> Hbase). If most of those calls are very fast, sometimes when the system is
> under heavy load they may block more than a few seconds, and having our app
> killed because of a short timeout is not an option.
>
>
>
> That’s why I’m not in favor of very short timeouts… Because in my
> experience it really depends on what user code does in the tasks. (I
> understand that normally, as user code is not a JVM-blocking activity such
> as a GC, it should have no impact on heartbeats, but from experience, it
> really does)
>
>
>
> Cheers,
>
> Arnaud
>
>
>
>
>
> *From:* Gen Luo 
> *Sent:* Thursday, July 22, 2021 05:46
> *To:* Till Rohrmann 
> *Cc:* Yang Wang ; dev ;
> user 
> *Subject:* Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval
> default values
>
>
>
> Hi,
>
> Thanks for driving this @Till Rohrmann  . I would
> give +1 on reducing the heartbeat timeout and interval, though I'm not sure
> if 15s and 3s would be enough either.
>
>
>
> IMO, besides the standalone cluster, where the heartbeat mechanism in
> Flink is fully relied upon, reducing the heartbeat can also help the JM find
> out faster about TaskExecutors in abnormal conditions that cannot respond to
> heartbeat requests, e.g., a continuous Full GC, where the TaskExecutor
> process is still alive but its condition may not be known by the deployment
> system. Since there are cases that can benefit from this change, I think it
> could be done if it won't break the experience in other scenarios.
>
>
>
> If we can identify what blocks the main threads from processing
> heartbeats, or what enlarges the GC costs, we can try to get rid of those
> causes to achieve a more predictable heartbeat response time, or give some
> advice to users whose jobs may encounter these issues. For example, as far
> as I know the JM of a large-scale job will be busier and may not be able to
> process heartbeats in time; we could then advise users working with jobs
> larger than 5000 tasks to enlarge their heartbeat interval to 10s and the
> timeout to 50s. (The numbers here are just rough examples.)
>
>
>
> As for the issue in FLINK-23216, I think it should be fixed and may not be
> a main concern for this case.
>
>
>
> On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann 
> wrote:
>
> Thanks for sharing these insights.
>
>
>
> I think it is no longer true that the ResourceManager notifies the
> JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details.
>
>
>
> Given the GC pauses, would you then be ok with decreasing the heartbeat
> timeout to 20 seconds? This should give enough time to do the GC and then
> still send/receive a heartbeat request.
>
>
>
> I also wanted to add that we are about to get rid of one big cause of
> blocking I/O operations from the main thread.

Re: Flink 1.14. Bi-weekly 2021-07-20

2021-07-21 Thread Till Rohrmann
Thanks for the update, Joe, Xintong and Dawid. This is very helpful.

Cheers,
Till

On Wed, Jul 21, 2021 at 9:27 AM Johannes Moser  wrote:

> Dear Flink Community,
>
> here's the latest update from the bi-weekly.
> Short summary: still a lot of stuff is going on, the feature freeze has
> been
> pushed by two weeks to the >>>16th of August<<< and we are working
> on the release communication. Good stuff: the test and blockers are
> looking very good.
>
> Get all the updates on the release page. [1]
>
> *Feature freeze:*
> Initially we had been considering moving the feature freeze because of
> issues with FLIP-147; when we did a general update it became
> obvious that a lot of teams/efforts would benefit from more time.
> So we did what had to be done and moved it by two weeks.
>
> *On-going efforts:*
> What also became obvious is that the pure list of links to Jira issues
> doesn't provide a clear overview of the progress of the efforts.
> Often the linked issues have a subset of other issues that also
> went into other releases. Some of the subtasks are optional.
> We, as a community, need to improve that going forward. For now
> I introduced a traffic light style state for each row in the
> feature list as a quick fix. I will make sure this will be
> updated before the next bi-weekly.
> The update has shown that ~20 of 36 planned features have a
> positive state. 6 are rather looking bad.
>
> *Test instabilities:*
> The tests became more stable recently, which is a really, really good
> thing.
> Thanks for all the efforts that went into this.
>
> *Highlight features:*
> We are working on the release communication in general. So we are
> collecting
> the highlights of this release and did a quick brainstorming which
> brought us to this list: ML, Table API, Log based checkpoints...
> If there's something worth mentioning please add it on the
> release page.
>
> What you can do to make 1.14 a good release:
> * Get a realistic state of the ongoing efforts and update the page
> * Start testing the "done" features
> * Think of the things that go beyond the implementation: documentation and
>   release/feature evangelism.
>
> Sincerely, your release managers
> Xintong, Dawid & Joe
>
> -
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/1.14+Release
>


Re: [DISCUSS] Releasing Flink 1.11.4

2021-07-21 Thread Till Rohrmann
Great, thanks Godfrey.

Cheers,
Till

On Wed, Jul 21, 2021 at 7:42 AM godfrey he  wrote:

> Hi Till,
>
> Sorry for the late reply. During the previous period I was focused on other
> urgent work
> and suspended the release work. I've recently restarted it.
>
> Best,
> Godfrey
>
> Till Rohrmann  wrote on Tue, Jul 13, 2021 at 8:36 PM:
>
> > Hi Godfrey,
> >
> > Are you continuing with the 1.11.4 release process?
> >
> > Cheers,
> > Till
> >
> > On Tue, Jul 6, 2021 at 1:15 PM Chesnay Schepler 
> > wrote:
> >
> > > Since 1.11.4 is about releasing the commits we already have merged
> > > between 1.11.3 and 1.13.0, I would suggest to not add additional fixes.
> > >
> > > On 06/07/2021 12:47, Matthias Pohl wrote:
> > > > Hi Godfrey,
> > > > Thanks for volunteering to be the release manager for 1.11.4.
> > FLINK-21445
> > > > [1] has a backport PR for 1.11.4 [2] prepared. I wouldn't label it
> as a
> > > > blocker but it would be nice to have it included in 1.11.4
> considering
> > > that
> > > > it's quite unlikely to have another 1.11.5 release. Right now,
> AzureCI
> > is
> > > > running as a final step. I'm CC'ing Chesnay because he would be in
> > charge
> > > > of merging the PR.
> > > >
> > > > Matthias
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-21445
> > > > [2] https://github.com/apache/flink/pull/16387
> > > >
> > > > On Wed, Jun 30, 2021 at 2:15 PM godfrey he 
> > wrote:
> > > >
> > > >> Hi devs,
> > > >>
> > > >> As discussed in [1], I would like to start a discussion for
> releasing
> > > Flink
> > > >> 1.11.4.
> > > >>
> > > >> I would like to volunteer as the release manger for 1.11.4, and will
> > > start
> > > >> the release process on the next Wednesday (July 7th).
> > > >>
> > > >> There are 75 issues that have been closed or resolved [2],
> > > >> and no blocker issues left [3] so far.
> > > >>
> > > >> If any issues need to be marked as blocker for 1.11.4, please let me
> > > know
> > > >> in this thread!
> > > >>
> > > >> Best,
> > > >> Godfrey
> > > >>
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > >
> >
> https://lists.apache.org/thread.html/r40a541027c6a04519f37c61f2a6f3dabdb821b3760cda9cc6ebe6ce9%40%3Cdev.flink.apache.org%3E
> > > >> [2]
> > > >>
> > > >>
> > >
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.11.4%20AND%20status%20in%20(Closed%2C%20Resolved)
> > > >> [3]
> > > >>
> > > >>
> > >
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.11.4%20AND%20status%20not%20in%20(Closed%2C%20Resolved)%20ORDER%20BY%20priority%20DESC
> > > >>
> > >
> > >
> >
>


Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-21 Thread Till Rohrmann
Thanks for sharing these insights.

I think it is no longer true that the ResourceManager notifies the
JobMaster about lost TaskExecutors. See FLINK-23216 [1] for more details.

Given the GC pauses, would you then be ok with decreasing the heartbeat
timeout to 20 seconds? This should give enough time to do the GC and then
still send/receive a heartbeat request.

I also wanted to add that we are about to get rid of one big cause of
blocking I/O operations from the main thread. With FLINK-22483 [2] we will
get rid of Filesystem accesses to retrieve completed checkpoints. This
leaves us with one additional file system access from the main thread which
is the one completing a pending checkpoint. I think it should be possible
to get rid of this access because as Stephan said it only writes
information to disk that is already written before. Maybe solving these two
issues could ease concerns about long pauses of unresponsiveness of Flink.

[1] https://issues.apache.org/jira/browse/FLINK-23216
[2] https://issues.apache.org/jira/browse/FLINK-22483

Cheers,
Till

On Wed, Jul 21, 2021 at 4:58 AM Yang Wang  wrote:

> Thanks @Till Rohrmann   for starting this discussion
>
> Firstly, I tried to understand the benefit of a shorter heartbeat timeout.
> IIUC, it will make the JobManager aware of a lost
> TaskManager faster. However, it seems that only the standalone cluster
> could benefit from this. For Yarn and
> native Kubernetes deployments, the Flink ResourceManager should get the
> TaskManager lost event in a very short time.
>
> * About 8 seconds, 3s for Yarn NM -> Yarn RM, 5s for Yarn RM -> Flink RM
> * Less than 1 second, Flink RM has a watch for all the TaskManager pods
>
> Secondly, I am not very confident about decreasing the timeout to 15s. I
> quickly checked the TaskManager GC logs
> from the past week of our internal Flink workloads and found more than 100
> Full GC pauses of around 10 seconds, though none longer than 15s.
> We are using CMS GC for the old generation.
>
>
> Best,
> Yang
>
> Till Rohrmann  wrote on Sat, Jul 17, 2021 at 1:05 AM:
>
>> Hi everyone,
>>
>> Since Flink 1.5 we have the same heartbeat timeout and interval default
>> values that are defined as heartbeat.timeout: 50s and heartbeat.interval:
>> 10s. These values were mainly chosen to compensate for lengthy GC pauses
>> and blocking operations that were executed in the main threads of Flink's
>> components. Since then, there were quite some advancements wrt the JVM's
>> GCs and we also got rid of a lot of blocking calls that were executed in
>> the main thread. Moreover, a long heartbeat.timeout causes long recovery
>> times in case of a TaskManager loss because the system can only properly
>> recover after the dead TaskManager has been removed from the scheduler.
>> Hence, I wanted to propose to change the timeout and interval to:
>>
>> heartbeat.timeout: 15s
>> heartbeat.interval: 3s
>>
>> Since there is no perfect solution that fits all use cases, I would really
>> like to hear from you what you think about it and how you configure these
>> heartbeat options. Based on your experience we might actually come up with
>> better default values that allow us to be resilient but also to detect
>> failed components fast. FLIP-185 can be found here [1].
>>
>> [1] https://cwiki.apache.org/confluence/x/GAoBCw
>>
>> Cheers,
>> Till
>>
>


Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-19 Thread Till Rohrmann
Yun Gao wrote:
>>
>> > Hi Till, Piotr
>>
>> >
>>
>> > Very thanks for the comments!
>>
>> >
>>
>> >> 1) Does endOfInput entail sending of the MAX_WATERMARK?
>>
>> > I also agree with Piotr that currently they are independent mechanisms,
>> and they are basically the same
>>
>> > for the event time.
>>
>> >
>>
>> > For more details, first there are some difference among the three
>> scenarios regarding the finish:
>>
>> > For normal finish and stop-with-savepoint --drain, the job would not be
>> expected to be restarted,
>>
>> > and for stop-with-savepoint the job would be expected restart later.
>>
>> >
>>
>> > Then for finish / stop-with-savepoint --drain, currently Flink would
>> emit MAX_WATERMARK before the
>>
>> > EndOfPartition. Besides, as we have discussed before [1], endOfInput /
>> finish() should also only be called
>>
>> > for finish / stop-with-savepoint --drain. Thus currently they always
>> occurs at the same time. After the change,
>>
>> > we could emit MAX_WATERMARK before endOfInput event for the finish /
>> stop-with-savepoint --drain cases.
>>
>> >
>>
>> >> 2) StreamOperator.finish says to flush all buffered events. Would a
>>
>> >> WindowOperator close all windows and emit the results upon calling
>>
>> >> finish, for example?
>>
>> > As discussed above for stop-with-savepoint, we would always keep the
>> window as is, and restore them after restart.
>>
>> > Then for the finish / stop-with-savepoint --drain, I think perhaps it
>> depends on the Triggers. For
>>
>> > event-time triggers / process time triggers, it would be reasonable to
>> flush all the windows since logically
>>
>> > the time would always elapse and the window would always get triggered
>> in a logical future. But for triggers
>>
>> > like CountTrigger, no matter how much time pass logically, the windows
>> would not trigger, thus we may not
>>
>> > flush these windows. If there are requirements we may provide
>> additional triggers.
>>
>> >
>>
>> >> It's a bit messy and I'm not sure if this should be strengthened out?
>> Each one of those has a little bit different semantic/meaning,
>>
>> >> but at the same time they are very similar. For single input operators
>> `endInput()` and `finish()` are actually the very same thing.
>>
>> > Currently MAX_WATERMARK / endInput / finish indeed always happen at the
>> same time, and for single input operators `endInput()` and `finish()`
>>
>> > are indeed the same thing. During the last discussion we ever mentioned
>> this issue and at then we thought that we might deprecate `endInput()`
>>
>> > in the future, then we would only have endInput(int input) and
>> finish().
>>
>> >
>>
>> > Best,
>>
>> > Yun
>>
>> >
>>
>> >
>>
>> > [1] https://issues.apache.org/jira/browse/FLINK-21132
>>
>> >
>>
>> >
>>
>> >
>>
>> > --
>>
>> > From:Piotr Nowojski
>>
>> > Send Time:2021 Jul. 16 (Fri.) 13:48
>>
>> > To:dev
>>
>> > Cc:Yun Gao
>>
>> > Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks
>> Finished
>>
>> >
>>
>> > Hi Till,
>>
>> >
>>
>> >> 1) Does endOfInput entail sending of the MAX_WATERMARK?
>>
>> >>
>>
>> >> 2) StreamOperator.finish says to flush all buffered events. Would a
>> WindowOperator close all windows and emit the results upon calling
>>
>> >> finish, for example?
>>
>> > 1) currently they are independent but parallel mechanisms. With event
>> time, they are basically the same.
>>
>> > 2) it probably should for the sake of processing time windows.
>>
>> >
>>
>> > Here you are touching the bit of the current design that I like the
>> least. We basically have now three different ways of conveying very similar
>> things:
>>
>> > a) sending `MAX_WATERMARK`, used by event time WindowOperator (what
>> about processing time?)
>>
>> > b) endInput(), used for example by AsyncWaitOperator to flush it's
>> internal state
>>

Re: Introduction email

2021-07-19 Thread Till Rohrmann
Hi Srini,

Welcome to the Flink community :-) Great to hear what you are planning to
do with Flink at LinkedIn. I think sharing this is very motivational for
the community and also gives context for what you are focusing on. Looking
forward to working with you and improving Flink.

Cheers,
Till

On Fri, Jul 16, 2021 at 8:36 PM Srinivasulu Punuru 
wrote:

> Hi Flink Devs,
>
> I am Srini, and I work on the stream processing team at LinkedIn. LinkedIn is
> taking a big bet on Apache Flink and migrating all the existing streaming
> SQL apps to Flink. You might have seen mails from some of our team members
> over the past few months. Thanks a lot for your support!
>
> I just wanted to say hi to everyone before I take up some of the starter
> Jiras and start contributing.
>
> Thanks Again! Looking forward to collaboration :)
>
> Here are some of the quick notes about our Flink scenarios.
>
>1. We will be using Flink SQL just for stream processing applications.
>2. Most of our current SQL apps are stateless, but stateful SQL
>capabilities are one of the reasons we are migrating to Flink. SQL state
>management is an area of interest.
>3. We also have customers asking for batch and streaming convergence, so
>SQL-based batch <-> streaming convergence or engine portability of SQL
> apps
>is an area of interest.
>4. We are initially on prem. But LinkedIn as a whole is betting on
>Cloud. So taking advantage of some of the cloud capabilities, like
>storage-compute disaggregation and elastic compute (for auto-scaling),
> would
>be interesting for Flink.
>5. We also provide a managed streaming SQL service, i.e., we manage the
>SQL jobs for our developers. So reliability, operability and quick
> recovery
>are critical as well :).
>
> Thanks,
> Srini.
>


[DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-16 Thread Till Rohrmann
Hi everyone,

Since Flink 1.5 we have the same heartbeat timeout and interval default
values that are defined as heartbeat.timeout: 50s and heartbeat.interval:
10s. These values were mainly chosen to compensate for lengthy GC pauses
and blocking operations that were executed in the main threads of Flink's
components. Since then, there were quite some advancements wrt the JVM's
GCs and we also got rid of a lot of blocking calls that were executed in
the main thread. Moreover, a long heartbeat.timeout causes long recovery
times in case of a TaskManager loss because the system can only properly
recover after the dead TaskManager has been removed from the scheduler.
Hence, I wanted to propose to change the timeout and interval to:

heartbeat.timeout: 15s
heartbeat.interval: 3s

Since there is no perfect solution that fits all use cases, I would really
like to hear from you what you think about it and how you configure these
heartbeat options. Based on your experience we might actually come up with
better default values that allow us to be resilient but also to detect
failed components fast. FLIP-185 can be found here [1].

[1] https://cwiki.apache.org/confluence/x/GAoBCw

Cheers,
Till
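
For reference, a sketch of the proposed values set through the configuration
API, assuming the millisecond-valued Long options of the 1.13-era
HeartbeatManagerOptions (the flink-conf.yaml equivalents would be
heartbeat.interval: 3000 and heartbeat.timeout: 15000):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.HeartbeatManagerOptions;

class HeartbeatDefaultsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Proposed defaults: detect failed components faster while still
        // tolerating moderate GC pauses (values are milliseconds).
        conf.set(HeartbeatManagerOptions.HEARTBEAT_INTERVAL, 3_000L);
        conf.set(HeartbeatManagerOptions.HEARTBEAT_TIMEOUT, 15_000L);
        System.out.println(conf);
    }
}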


Re: [DISCUSS] FLIP-171: Async Sink

2021-07-16 Thread Till Rohrmann
Sure, thanks for the pointers.

Cheers,
Till

On Fri, Jul 16, 2021 at 6:19 PM Hausmann, Steffen 
wrote:

> Hi Till,
>
> You are right, I’ve left out some implementation details, which have
> actually changed a couple of times as part of the ongoing discussion. You
> can find our current prototype here [1] and a sample implementation of the
> KPL free Kinesis sink here [2].
>
> I plan to update the FLIP. But I think it would make sense to wait
> until the implementation has stabilized enough before we update the FLIP to
> its final state.
>
> Does that make sense?
>
> Cheers, Steffen
>
> [1]
> https://github.com/sthm/flink/tree/flip-171-177/flink-connectors/flink-connector-base/src/main/java/org/apache/flink/connector/base/sink
> [2]
> https://github.com/sthm/flink/blob/flip-171-177/flink-connectors/flink-connector-kinesis-171/src/main/java/software/amazon/flink/connectors/AmazonKinesisDataStreamSink.java
>
> From: Till Rohrmann 
> Date: Friday, 16. July 2021 at 18:10
> To: Piotr Nowojski 
> Cc: Steffen Hausmann , "dev@flink.apache.org" <
> dev@flink.apache.org>, Arvid Heise 
> Subject: RE: [EXTERNAL] [DISCUSS] FLIP-171: Async Sink
>
>
> Hi Steffen,
>
> I've taken another look at the FLIP and I stumbled across a couple of
> inconsistencies. I think it is mainly because of the lacking code. For
> example, it is not fully clear to me based on the current FLIP how we
> ensure that there are no in-flight requests when
> AsyncSinkWriter.snapshotState is called. Also the concrete implementation
> of the AsyncSinkCommitter could be helpful for understanding how the
> AsyncSinkWriter works in the end. Do you plan to update the FLIP
> accordingly?
>
> Cheers,
> Till
>
> On Wed, Jun 30, 2021 at 8:36 AM Piotr Nowojski  wrote:
> Thanks for addressing this issue :)
>
> Best, Piotrek
>
> On Tue, Jun 29, 2021 at 17:58, Hausmann, Steffen  wrote:
> Hey Poitr,
>
> I've just adapted the FLIP and changed the signature for the
> `submitRequestEntries` method:
>
> protected abstract void submitRequestEntries(List
> requestEntries, ResultFuture requestResult);
>
> In addition, we are likely to use an AtomicLong to track the number of
> outstanding requests, as you have proposed in 2b). I've already indicated
> this in the FLIP, but it's not fully fleshed out. But as you have said,
> that seems to be an implementation detail and the important part is the
> change of the `submitRequestEntries` signature.
>
> Thanks for your feedback!
>
> Cheers, Steffen
>
>
> On 25.06.21, 17:05, "Hausmann, Steffen"  wrote:
>
>
> Hi Piotr,
>
> I’m happy to take your guidance on this. I need to think through your
> proposals and I’ll follow-up on Monday with some more context so that we
> can close the discussion on these details. But for now, I’ll close the vote.
>
> Thanks, Steffen
>
> From: Piotr Nowojski
> Date: Friday, 25. June 2021 at 14:48
> To: Till Rohrmann
> Cc: Steffen Hausmann, dev@flink.apache.org, Arvid Heise
> Subject: RE: [EXTERNAL] [DISCUSS] FLIP-171: Async Sink
>
>
>
> Hey,
>
> I've just synced with Arvid about a couple more remarks from my
> side and he shared my concerns.
>
> 1. I would very strongly recommend ditching `CompletableFuture<?>`
> from the `protected abstract CompletableFuture<?>
> submitRequestEntries(List<RequestEntryT> requestEntries);` in favor of
> something like the
> `org.apache.flink.streaming.api.functions.async.ResultFuture` interface.
> `CompletableFuture` would partially make the threading model of the
> `AsyncSinkWriter` part of the public API and it would tie our hands.
> Regardless of how `CompletableFuture` is used, it imposes a performance
> overhead because of the synchronisation/volatile accesses inside it. On the
> other hand, something like:
>
> protected abstract void submitRequestEntries(List<RequestEntryT>
> requestEntries, ResultFuture<?> requestResult);

Re: [DISCUSS] FLIP-171: Async Sink

2021-07-16 Thread Till Rohrmann
Hi Steffen,

I've taken another look at the FLIP and I stumbled across a couple of
inconsistencies. I think it is mainly because of the lacking code. For
example, it is not fully clear to me based on the current FLIP how we
ensure that there are no in-flight requests when
AsyncSinkWriter.snapshotState is called. Also the concrete implementation
of the AsyncSinkCommitter could be helpful for understanding how the
AsyncSinkWriter works in the end. Do you plan to update the FLIP
accordingly?

Cheers,
Till

On Wed, Jun 30, 2021 at 8:36 AM Piotr Nowojski  wrote:

> Thanks for addressing this issue :)
>
> Best, Piotrek
>
> On Tue, 29 Jun 2021 at 17:58, Hausmann, Steffen wrote:
>
>> Hey Poitr,
>>
>> I've just adapted the FLIP and changed the signature for the
>> `submitRequestEntries` method:
>>
>> protected abstract void submitRequestEntries(List<RequestEntryT>
>> requestEntries, ResultFuture<?> requestResult);
>>
>> In addition, we are likely to use an AtomicLong to track the number of
>> outstanding requests, as you have proposed in 2b). I've already indicated
>> this in the FLIP, but it's not fully fleshed out. But as you have said,
>> that seems to be an implementation detail and the important part is the
>> change of the `submitRequestEntries` signature.
>>
>> Thanks for your feedback!
>>
>> Cheers, Steffen
>>
>>
>> On 25.06.21, 17:05, "Hausmann, Steffen" 
>> wrote:
>>
>>
>> Hi Piotr,
>>
>> I’m happy to take your guidance on this. I need to think through your
>> proposals and I’ll follow-up on Monday with some more context so that we
>> can close the discussion on these details. But for now, I’ll close the vote.
>>
>> Thanks, Steffen
>>
>> From: Piotr Nowojski 
>> Date: Friday, 25. June 2021 at 14:48
>> To: Till Rohrmann 
>> Cc: Steffen Hausmann, dev@flink.apache.org, Arvid Heise
>> Subject: RE: [EXTERNAL] [DISCUSS] FLIP-171: Async Sink
>>
>>
>>
>> Hey,
>>
>> I've just synced with Arvid about a couple more remarks from my
>> side and he shared my concerns.
>>
>> 1. I would very strongly recommend ditching `CompletableFuture<?>`
>> from the `protected abstract CompletableFuture<?>
>> submitRequestEntries(List<RequestEntryT> requestEntries);` in favor of
>> something like the
>> `org.apache.flink.streaming.api.functions.async.ResultFuture` interface.
>> `CompletableFuture` would partially make the threading model of the
>> `AsyncSinkWriter` part of the public API and it would tie our hands.
>> Regardless of how `CompletableFuture` is used, it imposes a performance
>> overhead because of the synchronisation/volatile accesses inside it. On the
>> other hand, something like:
>>
>> protected abstract void submitRequestEntries(List<RequestEntryT>
>> requestEntries, ResultFuture<?> requestResult);
>>
>> Would allow us to implement the threading model as we wish.
>> `ResultFuture` could be backed via `CompletableFuture` underneath, but
>> it could also be something more efficient.  I will explain what I have in
>> mind in a second.
>>
>> 2. It looks to me that proposed `AsyncSinkWriter` Internals are not
>> very efficient and maybe the threading model hasn't been thought through?
>> Especially private fields:
>>
>> private final BlockingDeque<RequestEntryT> bufferedRequestEntries;
>> private BlockingDeque<CompletableFuture<?>> inFlightRequests;
>>
>> are a bit strange to me. Why do we need two separate thread safe
>> collections? Why do we need a `BlockingDeque` of `CompletableFuture`s?
>> If we are already using a fully synchronised collection, there should be no
>> need for another layer of thread safe `CompletableFuture`.
>>
>> As I understand, the threading model of the `AsyncSinkWriter` is very
>> similar to that of the `AsyncWaitOperator`, with very similar requirements
>> for inducing backpressure. How I would see it implemented is for example:
>>
>> a) Having a single lock, that would encompass the whole
>> `AsyncSinkWriter#flush()` method. `flush()` would be called from the task
>> thread (mailbox). To induce back
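
To make the contract under discussion concrete, here is a minimal, illustrative
Java sketch. All names are hypothetical (this is not the FLIP-171
implementation); it only shows the two ideas from the thread: a callback-style
submitRequestEntries() instead of a returned CompletableFuture, and a single
AtomicLong tracking in-flight requests.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ConcurrentLinkedDeque;
    import java.util.concurrent.atomic.AtomicLong;

    abstract class AsyncSinkWriterSketch<InputT, RequestEntryT> {

        /** Callback in the spirit of Flink's ResultFuture interface. */
        public interface ResultFuture<T> {
            /** Called once per request; failed entries are handed back for retry. */
            void complete(List<T> failedEntries);
        }

        private final ConcurrentLinkedDeque<RequestEntryT> bufferedRequestEntries =
                new ConcurrentLinkedDeque<>();
        private final AtomicLong inFlightRequests = new AtomicLong();
        private static final int MAX_BATCH_SIZE = 500;

        protected abstract RequestEntryT toRequestEntry(InputT element);

        /** Non-blocking: implementations fire the async client call and return. */
        protected abstract void submitRequestEntries(
                List<RequestEntryT> requestEntries,
                ResultFuture<RequestEntryT> requestResult);

        public void write(InputT element) {
            bufferedRequestEntries.add(toRequestEntry(element));
            if (bufferedRequestEntries.size() >= MAX_BATCH_SIZE) {
                flush();
            }
        }

        private void flush() {
            List<RequestEntryT> batch = new ArrayList<>(MAX_BATCH_SIZE);
            RequestEntryT entry;
            while (batch.size() < MAX_BATCH_SIZE
                    && (entry = bufferedRequestEntries.poll()) != null) {
                batch.add(entry);
            }
            if (batch.isEmpty()) {
                return;
            }
            inFlightRequests.incrementAndGet();
            submitRequestEntries(batch, failedEntries -> {
                // Re-queue failed entries for a later retry, then mark the
                // request as completed.
                bufferedRequestEntries.addAll(failedEntries);
                inFlightRequests.decrementAndGet();
            });
        }

        /** Simplified: a real writer would yield to the mailbox instead of sleeping. */
        public void snapshotState() throws InterruptedException {
            flush();
            while (inFlightRequests.get() > 0 || !bufferedRequestEntries.isEmpty()) {
                flush();
                Thread.sleep(1);
            }
        }
    }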

Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-16 Thread Till Rohrmann
nding the last mail, I think as a
> whole we are on the same page for the
> 1. For normal stop-with-savepoint, we do not call finish() and do not emit
> MAX_WATERMARK
> 2. For normal finish / stop-with-savepoint --drain, we would call finish()
> and emit MAX_WATERMARK
>
> > But then there is the question, how do we signal the operator that the
> next checkpoint is supposed to stop the operator
> > (how will the operator's lifecycle look in this case)? Maybe we simply
> don't tell the operator and handle this situation on the
> > StreamTask level.
>
> Logically I think in this case the UDF does not need to know that the next
> checkpoint is supposed to stop the operator since the final
> checkpoint in this case is no different from ordinary checkpoints.
>
> > So I guess the question is will finish() advance the time to the end or
> is this a separate mechanism (e.g. explicit watermarks).
>
> I tend to have an explicit MAX_WATERMARK since it makes watermark
> processing unified with the normal cases and makes the meaning of
> each event explicit. But this might be a personal preference and both
> methods would work.
>
> Very sorry for not making the whole thing clear in the FLIP again. If
> there are no other concerns, I'll update the FLIP with the above conclusions
> to make this part precise.
>
>
> Best,
> Yun
>
> --
> From:Till Rohrmann 
> Send Time:2021 Jul. 16 (Fri.) 16:00
> To:dev 
> Cc:Yun Gao 
> Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks
> Finished
>
> I think we should try to sort this out because it might affect how and
> when finish() will be called (or in general how the operator lifecycle
> looks like).
>
> To give an example let's take a look at the stop-with-savepoint w/ and w/o
> --drain:
>
> 1) stop-with-savepoint w/o --drain: Conceptually what we would like to do
> is to stop processing w/o completing any windows or waiting for the
> AsyncOperator to finish its operations. All these unfinished operations
> should become part of the final checkpoint so that we can resume from it at
> a later point. Depending on what finish() does (flush unfinished windows or
> not), this method must or must not be called. Assuming that finish()
> flushes unfinished windows/waits for uncompleted async operations, we
> clearly shouldn't call it. But then there is the question, how do we
> signal the operator that the next checkpoint is supposed to stop the
> operator (how will the operator's lifecycle look in this case)? Maybe we
> simply don't tell the operator and handle this situation on the StreamTask
> level. If finish() does not flush unfinished windows, then it shouldn't be
> a problem.
>
> 2) stop-with-savepoint w/ --drain: Here we want to complete all pending
> operations and flush out all results because we don't intend to resume the
> job. Conceptually, we tell the system that we have reached MAX_WATERMARK.
> If finish() is defined so that it implicitly advances the watermark to
> MAX_WATERMARK, then there is no problem. If finish() does not have this
> semantic, then we need to send the MAX_WATERMARK before sending the
> endOfInput event to a downstream task. In fact, stop-with-savepoint /w
> --drain shouldn't be a lot different from a bounded source that reaches its
> end. It would also send MAX_WATERMARK and then signal the endOfInput event
> (note that endOfInput is decoupled from the event time here).
>
> So I guess the question is will finish() advance the time to the end or is
> this a separate mechanism (e.g. explicit watermarks).
>
> Concerning how to handle processing time, I am a bit unsure tbh. I can see
> arguments for completing processing time windows/firing processing time
> timers when calling stop-with-savepoint w/ --drain. On the other hand, I
> could also see that people want to define actions based on the wall clock
> time that are independent of the stream state and, thus, would want to
> ignore them if the Flink application is stopped before reaching this time.
>
> Cheers,
> Till
>
> On Fri, Jul 16, 2021 at 7:48 AM Piotr Nowojski 
> wrote:
> Hi Till,
>
> > 1) Does endOfInput entail sending of the MAX_WATERMARK?
> >
> > 2) StreamOperator.finish says to flush all buffered events. Would a
> > WindowOperator close all windows and emit the results upon calling
> > finish, for example?
>
> 1) currently they are independent but parallel mechanisms. With event time,
> they are basically the same.
> 2) it probably should for the sake of processing time windows.
>
> Here you are touching the bit of the current design that I l

Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-16 Thread Till Rohrmann
ing how to handle processing time, I am a bit unsure tbh. I can see
> arguments for completing processing time windows/firing processing time
> timers when calling stop-with-savepoint w/ --drain. On the other hand, I
> could also see that people want to define actions based on the wall clock
> time that are independent of the stream state and, thus, would want to
> ignore them if the Flink application is stopped before reaching this time.
>
> Cheers,
> Till
>
> On Fri, Jul 16, 2021 at 7:48 AM Piotr Nowojski 
> wrote:
> Hi Till,
>
> > 1) Does endOfInput entail sending of the MAX_WATERMARK?
> >
> > 2) StreamOperator.finish says to flush all buffered events. Would a
> > WindowOperator close all windows and emit the results upon calling
> > finish, for example?
>
> 1) currently they are independent but parallel mechanisms. With event time,
> they are basically the same.
> 2) it probably should for the sake of processing time windows.
>
> Here you are touching the bit of the current design that I like the least.
> We basically have now three different ways of conveying very similar
> things:
> a) sending `MAX_WATERMARK`, used by event time WindowOperator (what about
> processing time?)
> b) endInput(), used for example by AsyncWaitOperator to flush its internal
> state
> c) finish(), used for example by ContinuousFileReaderOperator
>
> It's a bit messy and I'm not sure if this should be straightened out? Each
> one of those has a little bit different semantic/meaning, but at the same
> time they are very similar. For single input operators `endInput()` and
> `finish()` are actually the very same thing.
>
> Piotrek
>
> On Thu, 15 Jul 2021 at 16:47, Till Rohrmann wrote:
>
> > Thanks for updating the FLIP. Based on the new section about
> > stop-with-savepoint [--drain] I got two other questions:
> >
> > 1) Does endOfInput entail sending of the MAX_WATERMARK?
> >
> > 2) StreamOperator.finish says to flush all buffered events. Would a
> > WindowOperator close all windows and emit the results upon calling
> > finish, for example?
> >
> > Cheers,
> > Till
> >
> > On Thu, Jul 15, 2021 at 10:15 AM Till Rohrmann 
> > wrote:
> >
> > > Thanks a lot for your answers and clarifications Yun.
> > >
> > > 1+2) Agreed, this can be a future improvement if this becomes a
> problem.
> > >
> > > 3) Great, this will help a lot with understanding the FLIP.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Jul 14, 2021 at 5:41 PM Yun Gao 
> > > wrote:
> > >
> > >> Hi Till,
> > >>
> > >> Thanks a lot for the review and comments!
> > >>
> > >> 1) First, I think we could in fact do the computation outside of the
> > >> main thread; the current implementation does it there mainly because
> > >> the computation is in general fast and we initially wanted to have a
> > >> simplified first version.
> > >>
> > >> The main requirement here is to have a consistent view of the state of
> > >> the tasks, otherwise, for example with A -> B: if A is running when we
> > >> check whether we need to trigger A, we will mark A as having to be
> > >> triggered, but if A reaches finished by the time we check B, we will
> > >> also mark B as having to be triggered. Then B would receive both the
> > >> RPC trigger and the checkpoint barrier, which would break some
> > >> assumptions on the task side and complicate the implementation.
> > >>
> > >> But to cope with this issue, we could in fact first take a snapshot of
> > >> the tasks' state and then do the computation; neither of the two steps
> > >> needs to be in the main thread.
> > >>
> > >> 2) For the computation logic, currently we in fact benefit a lot from
> > >> some shortcuts on all-to-all edges and on job vertices with all tasks
> > >> running; these shortcuts can do checks on the job vertex level first
> > >> and skip some job vertices as a whole. With this optimization we have
> > >> an O(V) algorithm, and the current running time of the worst case for
> > >> a job with 320,000 tasks is less than 100ms. For daily graph sizes the
> > >> time would be further reduced linearly.
> > >>
> > >> If we do the computation based on the last triggered tasks, we may not
> > >> easily encode this information

Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-16 Thread Till Rohrmann
I think we should try to sort this out because it might affect how and when
finish() will be called (or in general how the operator lifecycle looks
like).

To give an example let's take a look at the stop-with-savepoint w/ and w/o
--drain:

1) stop-with-savepoint w/o --drain: Conceptually what we would like to do
is to stop processing w/o completing any windows or waiting for the
AsyncOperator to finish its operations. All these unfinished operations
should become part of the final checkpoint so that we can resume from it at
a later point. Depending on what finish() does (flush unfinished windows or
not), this method must or must not be called. Assuming that finish()
flushes unfinished windows/waits for uncompleted async operations, we
clearly shouldn't call it. But then there is the question, how do we
signal the operator that the next checkpoint is supposed to stop the
operator (how will the operator's lifecycle look in this case)? Maybe we
simply don't tell the operator and handle this situation on the StreamTask
level. If finish() does not flush unfinished windows, then it shouldn't be
a problem.

2) stop-with-savepoint w/ --drain: Here we want to complete all pending
operations and flush out all results because we don't intend to resume the
job. Conceptually, we tell the system that we have reached MAX_WATERMARK.
If finish() is defined so that it implicitly advances the watermark to
MAX_WATERMARK, then there is no problem. If finish() does not have this
semantic, then we need to send the MAX_WATERMARK before sending the
endOfInput event to a downstream task. In fact, stop-with-savepoint /w
--drain shouldn't be a lot different from a bounded source that reaches its
end. It would also send MAX_WATERMARK and then signal the endOfInput event
(note that endOfInput is decoupled from the event time here).

So I guess the question is will finish() advance the time to the end or is
this a separate mechanism (e.g. explicit watermarks).

Concerning how to handle processing time, I am a bit unsure tbh. I can see
arguments for completing processing time windows/firing processing time
timers when calling stop-with-savepoint w/ --drain. On the other hand, I
could also see that people want to define actions based on the wall clock
time that are independent of the stream state and, thus, would want to
ignore them if the Flink application is stopped before reaching this time.

Cheers,
Till
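
As a pseudocode sketch of the two shutdown paths described above (names are
hypothetical, not Flink's actual task code; Watermark.MAX_WATERMARK is the
existing constant):

    void stopTask(boolean drain) throws Exception {
        if (drain) {
            // stop-with-savepoint --drain / bounded source reaching its end:
            output.emitWatermark(Watermark.MAX_WATERMARK); // complete event-time windows
            operator.finish();                             // flush buffered/async results
            output.emitEndOfInput();                       // decoupled from event time
        }
        // Without --drain we skip finish() and MAX_WATERMARK entirely; any
        // unfinished windows/async operations simply become part of the
        // final checkpoint so the job can be resumed later.
        takeFinalSavepoint();
    }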

On Fri, Jul 16, 2021 at 7:48 AM Piotr Nowojski  wrote:

> Hi Till,
>
> > 1) Does endOfInput entail sending of the MAX_WATERMARK?
> >
> > 2) StreamOperator.finish says to flush all buffered events. Would a
> > WindowOperator close all windows and emit the results upon calling
> > finish, for example?
>
> 1) currently they are independent but parallel mechanisms. With event time,
> they are basically the same.
> 2) it probably should for the sake of processing time windows.
>
> Here you are touching the bit of the current design that I like the least.
> We basically have now three different ways of conveying very similar
> things:
> a) sending `MAX_WATERMARK`, used by event time WindowOperator (what about
> processing time?)
> b) endInput(), used for example by AsyncWaitOperator to flush its internal
> state
> c) finish(), used for example by ContinuousFileReaderOperator
>
> It's a bit messy and I'm not sure if this should be straightened out? Each
> one of those has a little bit different semantic/meaning, but at the same
> time they are very similar. For single input operators `endInput()` and
> `finish()` are actually the very same thing.
>
> Piotrek
>
> On Thu, 15 Jul 2021 at 16:47, Till Rohrmann wrote:
>
> > Thanks for updating the FLIP. Based on the new section about
> > stop-with-savepoint [--drain] I got two other questions:
> >
> > 1) Does endOfInput entail sending of the MAX_WATERMARK?
> >
> > 2) StreamOperator.finish says to flush all buffered events. Would a
> > WindowOperator close all windows and emit the results upon calling
> > finish, for example?
> >
> > Cheers,
> > Till
> >
> > On Thu, Jul 15, 2021 at 10:15 AM Till Rohrmann 
> > wrote:
> >
> > > Thanks a lot for your answers and clarifications Yun.
> > >
> > > 1+2) Agreed, this can be a future improvement if this becomes a
> problem.
> > >
> > > 3) Great, this will help a lot with understanding the FLIP.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Jul 14, 2021 at 5:41 PM Yun Gao 
> > > wrote:
> > >
> > >> Hi Till,
> > >>
> > >> Very thanks for the review and comments!
> > >>
> > >> 1) First I think in fact we could be able to do the computation

Re: [DISCUSS] FLIP-183: Dynamic buffer size adjustment

2021-07-16 Thread Till Rohrmann
I think this is a good idea. +1 for this approach. Are you going to update the
FLIP accordingly?

Cheers,
Till

On Thu, Jul 15, 2021 at 9:33 PM Steven Wu  wrote:

> I really like the new idea.
>
> On Thu, Jul 15, 2021 at 11:51 AM Piotr Nowojski 
> wrote:
>
> > Hi Till,
> >
> > >  I assume that buffer sizes are only
> > > changed for newly assigned buffers/credits, right? Otherwise, the data
> > > could already be on the wire and then it wouldn't fit on the receiver
> > side.
> > > Or do we have a back channel mechanism to tell the sender that a part
> of
> > a
> > > buffer needs to be resent once more capacity is available?
> >
> > Initially our implementation proposal was intending to implement the
> first
> > option. Buffer size would be attached to a credit message, so first
> > received would need to allocate a buffer with the updated size, send the
> > credit upstream, and sender would be allowed to only send as much data as
> > in the credit. So there would be no way and no problem with changing
> buffer
> > sizes while something is "on the wire".
> >
> > However Anton suggested an even simpler idea to me today. There is
> actually
> > no problem with receivers supporting all buffer sizes up to the maximum
> > allowed size (current configured memory segment size). Thus new buffer
> size
> > can be treated as a recommendation by the sender. We can announce a new
> > buffer size, and the sender will start capping the newly requested buffer
> > to that size, but we can still send already filled buffers in chunks with
> > any size, as long as it's below max memory segment size. In this way we
> can
> > leave any already filled in buffers on the sender side untouched and we
> do
> > not need to partition/slice them before sending them down, making at
> least
> > the initial version even simpler. This way we also do not need to
> > differentiate that different credits have different sizes. We just
> announce
> > a single value "recommended/requested buffer size".
> >
> > Piotrek
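
A minimal sketch of this simpler variant (hypothetical names): the announced
size acts only as a cap for newly requested buffers on the sender side, while
receivers keep accepting anything up to the configured maximum segment size.

    // Sender side: already-filled buffers are shipped untouched; only new
    // buffer requests are capped to the latest announced size.
    BufferBuilder requestNewBufferBuilder() {
        int cap = Math.min(latestAnnouncedBufferSize, maxMemorySegmentSize);
        return bufferPool.requestBufferBuilder(cap);
    }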
> >
> > On Thu, 15 Jul 2021 at 17:27, Till Rohrmann wrote:
> >
> > > Hi everyone,
> > >
> > > Thanks a lot for creating this FLIP Anton and Piotr. I think it looks
> > like
> > > a very promising solution for speeding up our checkpoints and being
> able
> > to
> > > create them more reliably.
> > >
> > > Following up on Steven's question: I assume that buffer sizes are only
> > > changed for newly assigned buffers/credits, right? Otherwise, the data
> > > could already be on the wire and then it wouldn't fit on the receiver
> > side.
> > > Or do we have a back channel mechanism to tell the sender that a part
> of
> > a
> > > buffer needs to be resent once more capacity is available?
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Jul 14, 2021 at 11:16 AM Piotr Nowojski 
> > > wrote:
> > >
> > > > Hi Steven,
> > > >
> > > > As downstream/upstream nodes are decoupled, if downstream nodes adjust
> > > > their buffer size first, there will be a lag until this updated buffer
> > > > size information reaches the upstream node. It is a problem, but it has
> > > > a quite simple solution that we described in the FLIP document:
> > > >
> > > > > Sending the buffer of the right size.
> > > > > It is not enough to know just the number of available buffers
> > (credits)
> > > > for the downstream because the size of these buffers can be
> different.
> > > > > So we are proposing to resolve this problem in the following way:
> If
> > > the
> > > > downstream buffer size is changed then the upstream should send
> > > > > the buffer of the size not greater than the new one regardless of
> how
> > > big
> > > > the current buffer on the upstream. (pollBuffer should receive
> > > > > parameters like bufferSize and return buffer not greater than it)
> > > >
> > > > So apart from adding buffer size information to the `AddCredit`
> > message,
> > > we
> > > > will need to support a case where upstream subpartition has already
> > > > produced a buffer with older size (for example 32KB), while the next
> > > credit
> > > > arrives with an allowance for a smaller size (16KB). I

[jira] [Created] (FLINK-23403) Decrease default values for heartbeat timeout and interval

2021-07-15 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23403:
-

 Summary: Decrease default values for heartbeat timeout and interval
 Key: FLINK-23403
 URL: https://issues.apache.org/jira/browse/FLINK-23403
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration, Runtime / Coordination
Affects Versions: 1.14.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann
 Fix For: 1.14.0


In order to speed up failure detection I suggest to decrease the default values 
for the heartbeat timeout and interval from 50s/10s to 15s/3s.





Re: [DISCUSS] FLIP-183: Dynamic buffer size adjustment

2021-07-15 Thread Till Rohrmann
Hi everyone,

Thanks a lot for creating this FLIP Anton and Piotr. I think it looks like
a very promising solution for speeding up our checkpoints and being able to
create them more reliably.

Following up on Steven's question: I assume that buffer sizes are only
changed for newly assigned buffers/credits, right? Otherwise, the data
could already be on the wire and then it wouldn't fit on the receiver side.
Or do we have a back channel mechanism to tell the sender that a part of a
buffer needs to be resent once more capacity is available?

Cheers,
Till

On Wed, Jul 14, 2021 at 11:16 AM Piotr Nowojski 
wrote:

> Hi Steven,
>
> As downstream/upstream nodes are decoupled, if downstream nodes adjust
> their buffer size first, there will be a lag until this updated buffer
> size information reaches the upstream node. It is a problem, but it has a
> quite simple solution that we described in the FLIP document:
>
> > Sending the buffer of the right size.
> > It is not enough to know just the number of available buffers (credits)
> for the downstream because the size of these buffers can be different.
> > So we are proposing to resolve this problem in the following way: If the
> downstream buffer size is changed then the upstream should send
> > the buffer of the size not greater than the new one regardless of how big
> the current buffer on the upstream. (pollBuffer should receive
> > parameters like bufferSize and return buffer not greater than it)
>
> So apart from adding buffer size information to the `AddCredit` message, we
> will need to support a case where upstream subpartition has already
> produced a buffer with older size (for example 32KB), while the next credit
> arrives with an allowance for a smaller size (16KB). In that case, we are
> only allowed to send a portion of the data from this buffer that fits into
> the new updated buffer size, and keep announcing the remaining part as
> available backlog.
>
> Best,
> Piotrek
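
A sketch of the slicing behaviour described above, with hypothetical names
(pollBuffer taking the allowed size follows the FLIP's suggestion; the exact
slicing calls are assumptions, not Flink's real Buffer API):

    // Upstream subpartition: never hand out more than the latest credit
    // allows; an oversized filled buffer is sent in slices and the remainder
    // stays in the queue, announced as backlog.
    Buffer pollBuffer(int allowedBufferSize) {
        Buffer head = buffers.peek();
        if (head == null) {
            return null;
        }
        if (head.readableBytes() <= allowedBufferSize) {
            return buffers.poll(); // fits into the downstream buffer as-is
        }
        Buffer slice = head.readOnlySlice(0, allowedBufferSize);
        head.setReaderIndex(head.getReaderIndex() + allowedBufferSize);
        return slice;
    }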
>
>
> On Wed, 14 Jul 2021 at 08:33, Steven Wu wrote:
>
> >- The subtask observes the changes in the throughput and changes the
> >buffer size during the whole life period of the task.
> >- The subtask sends buffer size and number of available buffers to the
> >upstream to the corresponding subpartition.
> >- Upstream changes the buffer size corresponding to the received
> >information.
> >- Upstream sends the data and number of filled buffers to the
> downstream
> >
> >
> > Will the above steps of buffer size adjustment cause problems with
> > credit-based flow control (mainly for downsizing), since the downstream
> > adjusts down first?
> >
> > Here is the quote from the blog[1]
> > "Credit-based flow control makes sure that whatever is “on the wire” will
> > have capacity at the receiver to handle. "
> >
> >
> > [1]
> >
> >
> https://flink.apache.org/2019/06/05/flink-network-stack.html#credit-based-flow-control
> >
> >
> > On Tue, Jul 13, 2021 at 7:34 PM Yingjie Cao 
> > wrote:
> >
> > > Hi,
> > >
> > > Thanks for driving this, I think it is really helpful for jobs
> suffering
> > > from backpressure.
> > >
> > > Best,
> > > Yingjie
> > >
> > > On Fri, Jul 9, 2021 at 10:59 PM, Anton Kalashnikov wrote:
> > >
> > > > Hey!
> > > >
> > > > There is a wish to decrease the amount of in-flight data, which can
> > > > improve aligned checkpoint time (fewer in-flight data to process before
> > > > the checkpoint can complete) and improve the behaviour and performance
> > > > of unaligned checkpoints (fewer in-flight data that need to be
> > > > persisted in every unaligned checkpoint). The main idea is not to keep
> > > > as much in-flight data as memory allows, but to keep only the amount of
> > > > data that can predictably be handled within a configured amount of time
> > > > (e.g. we keep data which can be processed in 1 sec). This can be
> > > > achieved by calculating the effective throughput and then changing the
> > > > buffer size based on this throughput. More details about the proposal
> > > > can be found here [1].
> > > >
> > > > What are you thoughts about it?
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-183%3A+Dynamic+buffer+size+adjustment
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Anton Kalashnikov
> > > >
> > > >
> > > >
> > >
> >
>


Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-15 Thread Till Rohrmann
Thanks for updating the FLIP. Based on the new section about
stop-with-savepoint [--drain] I got two other questions:

1) Does endOfInput entail sending of the MAX_WATERMARK?

2) StreamOperator.finish says to flush all buffered events. Would a
WindowOperator close all windows and emit the results upon calling
finish, for example?

Cheers,
Till

On Thu, Jul 15, 2021 at 10:15 AM Till Rohrmann  wrote:

> Thanks a lot for your answers and clarifications Yun.
>
> 1+2) Agreed, this can be a future improvement if this becomes a problem.
>
> 3) Great, this will help a lot with understanding the FLIP.
>
> Cheers,
> Till
>
> On Wed, Jul 14, 2021 at 5:41 PM Yun Gao 
> wrote:
>
>> Hi Till,
>>
>> Thanks a lot for the review and comments!
>>
>> 1) First, I think we could in fact do the computation outside of the main
>> thread; the current implementation does it there mainly because the
>> computation is in general fast and we initially wanted to have a
>> simplified first version.
>>
>> The main requirement here is to have a consistent view of the state of the
>> tasks, otherwise, for example with A -> B: if A is running when we check
>> whether we need to trigger A, we will mark A as having to be triggered,
>> but if A reaches finished by the time we check B, we will also mark B as
>> having to be triggered. Then B would receive both the RPC trigger and the
>> checkpoint barrier, which would break some assumptions on the task side
>> and complicate the implementation.
>>
>> But to cope with this issue, we could in fact first take a snapshot of the
>> tasks' state and then do the computation; neither of the two steps needs
>> to be in the main thread.
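
A sketch of that two-step idea (hypothetical names): freeze the task states on
the main thread, then run the actual plan computation on a separate executor.

    CompletableFuture<CheckpointPlan> calculateCheckpointPlan() {
        // cheap, consistent snapshot taken in the main thread
        Map<ExecutionAttemptID, ExecutionState> frozen = snapshotTaskStates();
        // the graph traversal itself runs off the main thread
        return CompletableFuture.supplyAsync(() -> computePlan(frozen), ioExecutor);
    }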
>>
>> 2) For the computation logic, currently we in fact benefit a lot from
>> some shortcuts on all-to-all edges and on job vertices with all tasks
>> running; these shortcuts can do checks on the job vertex level first and
>> skip some job vertices as a whole. With this optimization we have an O(V)
>> algorithm, and the current running time of the worst case for a job with
>> 320,000 tasks is less than 100ms. For daily graph sizes the time would be
>> further reduced linearly.
>>
>> If we do the computation based on the last triggered tasks, we may not
>> easily encode this information into the shortcuts on the job vertex level.
>> And since the time seems to be short, perhaps it is enough to do the
>> re-computation from scratch, considering the tradeoff between performance
>> and complexity?
>>
>> 3) We are going to emit the EndOfInput event exactly after the finish()
>> method and before the last
>> snapshotState() method so that we could shut down the whole topology with
>> a single final checkpoint.
>> Very sorry for not including enough details for this part and I'll
>> complement the FLIP with the details on
>> the process of the final checkpoint / savepoint.
>>
>> Best,
>> Yun
>>
>>
>>
>> --
>> From:Till Rohrmann 
>> Send Time:2021 Jul. 14 (Wed.) 22:05
>> To:dev 
>> Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks
>> Finished
>>
>> Hi everyone,
>>
>> I am a bit late to the voting party but let me ask three questions:
>>
>> 1) Why do we execute the trigger plan computation in the main thread if we
>> cannot guarantee that all tasks are still running when triggering the
>> checkpoint? Couldn't we do the computation in a different thread in order
>> to relieve the main thread a bit.
>>
>> 2) The implementation of the DefaultCheckpointPlanCalculator seems to go
>> over the whole topology for every calculation. Wouldn't it be more
>> efficient to maintain the set of current tasks to trigger and check
>> whether
>> anything has changed and if so check the succeeding tasks until we have
>> found the current checkpoint trigger frontier?
>>
>> 3) When are we going to send the endOfInput events to a downstream task?
>> If
>> this happens after we call finish on the upstream operator but before
>> snapshotState then it would be possible to shut down the whole topology
>> with a single final checkpoint. I think this part could benefit from a bit
>> more detailed description in the FLIP.
>>
>> Cheers,
>> Till
>>
>> On Fri, Jul 2, 2021 at 8:36 AM Yun Gao 
>> wrote:
>>
>> > Hi there,
>> >
>> > Since the voting time of FLIP-147[1] has passed, I'm closing the vote
>> now.
>> >
>> > There were seven +1 votes ( 6 / 7 are bindings) and no -1 votes:
>> >
>> > - Dawid Wysakowicz (binding)
>> > - Piotr Nowojski(binding)
>> > - Jiangang Liu (binding)
>> > - Arvid Heise (binding)
>> > - Jing Zhang (binding)
>> > - Leonard Xu (non-binding)
>> > - Guowei Ma (binding)
>> >
>> > Thus I'm happy to announce that the update to the FLIP-147 is accepted.
>> >
>> > Very thanks everyone!
>> >
>> > Best,
>> > Yun
>> >
>> > [1]  https://cwiki.apache.org/confluence/x/mw-ZCQ
>>
>>


Re: [VOTE] Release 1.12.5, release candidate #1

2021-07-15 Thread Till Rohrmann
Great, thanks a lot Jingsong!

Cheers,
Till

On Thu, Jul 15, 2021 at 5:00 AM Jingsong Li  wrote:

> Hi Till and Flinkers working for the FLINK-23233,
>
> Thanks for the effort!
>
> RC2 is on the way.
>
> Best,
> Jingsong
>
> On Tue, Jul 13, 2021 at 8:35 PM Till Rohrmann 
> wrote:
>
> > Hi everyone,
> >
> > FLINK-23233 has been merged. We can continue with the release process.
> >
> > Cheers,
> > Till
> >
> > On Wed, Jul 7, 2021 at 1:29 PM Jingsong Li 
> wrote:
> >
> > > Hi all,
> > >
> > > Thanks Xiongtong, Leonard, Yang, JING for the voting. Thanks Till for
> the
> > > information.
> > >
> > > +1 for canceling this RC. That should also give us the chance to fix
> > > FLINK-23233.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Wed, Jul 7, 2021 at 5:29 PM Till Rohrmann 
> > wrote:
> > >
> > > > Hi folks, are we sure that FLINK-23233 [1] does not affect the 1.12
> > > release
> > > > branch? I think this problem was introduced with FLINK-21996 [2]
> which
> > is
> > > > part of release-1.12. Hence, we might either fix this problem and
> > cancel
> > > > this RC or we need to create a fast 1.12.6 afterwards.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-23233
> > > > [2] https://issues.apache.org/jira/browse/FLINK-21996
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Wed, Jul 7, 2021 at 7:43 AM JING ZHANG 
> > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > 1. built from source code flink-1.12.5-src.tgz
> > > > > <
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.5-rc1/flink-1.12.5-src.tgz
> > > > > >
> > > > > succeeded
> > > > > 2. Started a local Flink cluster, ran the WordCount example, WebUI
> > > looks
> > > > > good,  no suspicious output/log
> > > > > 3. started cluster and run some e2e sql queries using SQL Client,
> > query
> > > > > result is expected.
> > > > > 4. Repeat Step 2 and 3 with flink-1.12.5-bin-scala_2.11.tgz
> > > > > <
> > > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.5-rc1/flink-1.12.5-bin-scala_2.11.tgz
> > > > > >
> > > > >
> > > > > Best,
> > > > > JING ZHANG
> > > > >
> > > > On Wed, Jul 7, 2021 at 11:36 AM, Yang Wang wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > - verified checksums & signatures
> > > > > > - start a session cluster and verify HA data cleanup once job
> > reached
> > > > to
> > > > > > globally terminal state, FLINK-20695
> > > > > > - start a local standalone cluster, check the webUI good and
> JM/TM
> > > > > > without suspicious logs
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Yang
> > > > > >
> > > > > On Wed, Jul 7, 2021 at 10:52 AM, Leonard Xu wrote:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > - verified signatures and hashsums
> > > > > > > - built from source code with scala 2.11 succeeded
> > > > > > > - checked all denpendency artifacts are 1.12.5
> > > > > > > - started a cluster, ran a wordcount job, the result is
> expected
> > > > > > > - started SQL Client, ran a simple query, the result is
> expected
> > > > > > > - reviewed the web PR, left one minor name comment
> > > > > > >
> > > > > > > Best,
> > > > > > > Leonard
> > > > > > >
> > > > > > > On Jul 6, 2021, at 10:02, Xintong Song wrote:
> > > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > - verified checksums & signatures
> > > > > > > > - built from sources
> > > > > > > > - run example jobs with standalone and native k8s deployments

Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-15 Thread Till Rohrmann
Thanks a lot for your answers and clarifications Yun.

1+2) Agreed, this can be a future improvement if this becomes a problem.

3) Great, this will help a lot with understanding the FLIP.

Cheers,
Till

On Wed, Jul 14, 2021 at 5:41 PM Yun Gao 
wrote:

> Hi Till,
>
> Thanks a lot for the review and comments!
>
> 1) First, I think we could in fact do the computation outside of the main
> thread; the current implementation does it there mainly because the
> computation is in general fast and we initially wanted to have a
> simplified first version.
>
> The main requirement here is to have a consistent view of the state of the
> tasks, otherwise, for example with A -> B: if A is running when we check
> whether we need to trigger A, we will mark A as having to be triggered, but
> if A reaches finished by the time we check B, we will also mark B as having
> to be triggered. Then B would receive both the RPC trigger and the
> checkpoint barrier, which would break some assumptions on the task side and
> complicate the implementation.
>
> But to cope with this issue, we could in fact first take a snapshot of the
> tasks' state and then do the computation; neither of the two steps needs to
> be in the main thread.
>
> 2) For the computation logic, currently we in fact benefit a lot from some
> shortcuts on all-to-all edges and on job vertices with all tasks running;
> these shortcuts can do checks on the job vertex level first and skip some
> job vertices as a whole. With this optimization we have an O(V) algorithm,
> and the current running time of the worst case for a job with 320,000 tasks
> is less than 100ms. For daily graph sizes the time would be further reduced
> linearly.
>
> If we do the computation based on the last triggered tasks, we may not
> easily encode this information into the shortcuts on the job vertex level.
> And since the time seems to be short, perhaps it is enough to do the
> re-computation from scratch, considering the tradeoff between performance
> and complexity?
>
> 3) We are going to emit the EndOfInput event exactly after the finish()
> method and before the last
> snapshotState() method so that we could shut down the whole topology with
> a single final checkpoint.
> Very sorry for not including enough details for this part and I'll
> complement the FLIP with the details on
> the process of the final checkpoint / savepoint.
>
> Best,
> Yun
>
>
>
> --
> From:Till Rohrmann 
> Send Time:2021 Jul. 14 (Wed.) 22:05
> To:dev 
> Subject:Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks
> Finished
>
> Hi everyone,
>
> I am a bit late to the voting party but let me ask three questions:
>
> 1) Why do we execute the trigger plan computation in the main thread if we
> cannot guarantee that all tasks are still running when triggering the
> checkpoint? Couldn't we do the computation in a different thread in order
> to relieve the main thread a bit.
>
> 2) The implementation of the DefaultCheckpointPlanCalculator seems to go
> over the whole topology for every calculation. Wouldn't it be more
> efficient to maintain the set of current tasks to trigger and check whether
> anything has changed and if so check the succeeding tasks until we have
> found the current checkpoint trigger frontier?
>
> 3) When are we going to send the endOfInput events to a downstream task? If
> this happens after we call finish on the upstream operator but before
> snapshotState then it would be possible to shut down the whole topology
> with a single final checkpoint. I think this part could benefit from a bit
> more detailed description in the FLIP.
>
> Cheers,
> Till
>
> On Fri, Jul 2, 2021 at 8:36 AM Yun Gao 
> wrote:
>
> > Hi there,
> >
> > Since the voting time of FLIP-147[1] has passed, I'm closing the vote
> now.
> >
> > There were seven +1 votes ( 6 / 7 are bindings) and no -1 votes:
> >
> > - Dawid Wysakowicz (binding)
> > - Piotr Nowojski(binding)
> > - Jiangang Liu (binding)
> > - Arvid Heise (binding)
> > - Jing Zhang (binding)
> > - Leonard Xu (non-binding)
> > - Guowei Ma (binding)
> >
> > Thus I'm happy to announce that the update to the FLIP-147 is accepted.
> >
> > Very thanks everyone!
> >
> > Best,
> > Yun
> >
> > [1]  https://cwiki.apache.org/confluence/x/mw-ZCQ
>
>


Re: [RESULT][VOTE] FLIP-147: Support Checkpoint After Tasks Finished

2021-07-14 Thread Till Rohrmann
Hi everyone,

I am a bit late to the voting party but let me ask three questions:

1) Why do we execute the trigger plan computation in the main thread if we
cannot guarantee that all tasks are still running when triggering the
checkpoint? Couldn't we do the computation in a different thread in order
to relieve the main thread a bit.

2) The implementation of the DefaultCheckpointPlanCalculator seems to go
over the whole topology for every calculation. Wouldn't it be more
efficient to maintain the set of current tasks to trigger and check whether
anything has changed and if so check the succeeding tasks until we have
found the current checkpoint trigger frontier?

3) When are we going to send the endOfInput events to a downstream task? If
this happens after we call finish on the upstream operator but before
snapshotState then it would be possible to shut down the whole topology
with a single final checkpoint. I think this part could benefit from a bit
more detailed description in the FLIP.

Cheers,
Till

On Fri, Jul 2, 2021 at 8:36 AM Yun Gao  wrote:

> Hi there,
>
> Since the voting time of FLIP-147[1] has passed, I'm closing the vote now.
>
> There were seven +1 votes ( 6 / 7 are bindings) and no -1 votes:
>
> - Dawid Wysakowicz (binding)
> - Piotr Nowojski(binding)
> - Jiangang Liu (binding)
> - Arvid Heise (binding)
> - Jing Zhang (binding)
> - Leonard Xu (non-binding)
> - Guowei Ma (binding)
>
> Thus I'm happy to announce that the update to the FLIP-147 is accepted.
>
> Very thanks everyone!
>
> Best,
> Yun
>
> [1]  https://cwiki.apache.org/confluence/x/mw-ZCQ


Re: [DISCUSS] Address deprecation warnings when upgrading dependencies

2021-07-14 Thread Till Rohrmann
I think this suggestion makes a lot of sense, Stephan. +1 for fixing
deprecation warnings when bumping/changing dependencies. I actually found
myself recently, whenever touching a test class, replacing JUnit's
assertThat with Hamcrest's version, which felt quite tedious.
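
For example, the mechanical fix in this case is a one-line import swap (JUnit
4.13 deprecates its assertThat alias in favor of Hamcrest's own):

    // import static org.junit.Assert.assertThat;          // deprecated alias
    import static org.hamcrest.MatcherAssert.assertThat;   // Hamcrest original
    import static org.hamcrest.Matchers.is;

    // call sites stay identical:
    // assertThat(actual, is(expected));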

Cheers,
Till

On Tue, Jul 13, 2021 at 6:15 PM Stephan Ewen  wrote:

> Hi all!
>
> I would like to propose that we make it a project standard that when
> upgrading a dependency, deprecation issues arising from that need to be
> fixed in the same step. If the new dependency version deprecates a method
> in favor of another method, all usages in the code need to be replaced
> together with the upgrade.
>
> We are accumulating deprecated API uses over time, and it floods logs and
> IDEs with deprecation warnings. I find this is a problem, because the
> irrelevant warnings more and more drown out the actually relevant warnings.
> And arguably, the deprecation warning isn't fully irrelevant, it can cause
> problems in the future when the method is actually removed.
> We need the general principle that a change leaves the codebase in at least
> as good shape as before, otherwise things accumulate over time and the
> overall quality goes down.
>
> The concrete example that motivated this for me is the JUnit dependency
> upgrade. Pretty much every test I looked at recently is quite yellow (due
> to junit Matchers.assertThat being deprecated in the new JUnit version).
> This is easily fixed (even a string replace and spotless:apply goes a long
> way), so I would suggest we try and do these things in one step in the
> future.
>
> Curious what other committers think about this suggestion.
>
> Best,
> Stephan
>


Re: [DISCUSS] Releasing Flink 1.11.4

2021-07-13 Thread Till Rohrmann
Hi Godfrey,

Are you continuing with the 1.11.4 release process?

Cheers,
Till

On Tue, Jul 6, 2021 at 1:15 PM Chesnay Schepler  wrote:

> Since 1.11.4 is about releasing the commits we already have merged
> between 1.11.3 and 1.13.0, I would suggest to not add additional fixes.
>
> On 06/07/2021 12:47, Matthias Pohl wrote:
> > Hi Godfrey,
> > Thanks for volunteering to be the release manager for 1.11.4. FLINK-21445
> > [1] has a backport PR for 1.11.4 [2] prepared. I wouldn't label it as a
> > blocker but it would be nice to have it included in 1.11.4 considering
> that
> > it's quite unlikely to have another 1.11.5 release. Right now, AzureCI is
> > running as a final step. I'm CC'ing Chesnay because he would be in charge
> > of merging the PR.
> >
> > Matthias
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-21445
> > [2] https://github.com/apache/flink/pull/16387
> >
> > On Wed, Jun 30, 2021 at 2:15 PM godfrey he  wrote:
> >
> >> Hi devs,
> >>
> >> As discussed in [1], I would like to start a discussion for releasing
> Flink
> >> 1.11.4.
> >>
> >> I would like to volunteer as the release manager for 1.11.4, and will
> start
> >> the release process on the next Wednesday (July 7th).
> >>
> >> There are 75 issues that have been closed or resolved [2],
> >> and no blocker issues left [3] so far.
> >>
> >> If any issues need to be marked as blocker for 1.11.4, please let me
> know
> >> in this thread!
> >>
> >> Best,
> >> Godfrey
> >>
> >>
> >> [1]
> >>
> >>
> https://lists.apache.org/thread.html/r40a541027c6a04519f37c61f2a6f3dabdb821b3760cda9cc6ebe6ce9%40%3Cdev.flink.apache.org%3E
> >> [2]
> >>
> >>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.11.4%20AND%20status%20in%20(Closed%2C%20Resolved)
> >> [3]
> >>
> >>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.11.4%20AND%20status%20not%20in%20(Closed%2C%20Resolved)%20ORDER%20BY%20priority%20DESC
> >>
>
>


Re: [VOTE] Release 1.13.2, release candidate #1

2021-07-13 Thread Till Rohrmann
Hi everyone,

FLINK-23233 has been merged. We can continue with the release process.

Cheers,
Till

On Wed, Jul 7, 2021 at 2:00 PM Yun Tang  wrote:

> Since FLINK-23233 raises concerns for many people, and this release candidate
> #1 has not received enough binding +1 votes from PMCs, I think we could cancel
> this RC and wait until that problem is resolved.
>
> Best
> Yun Tang
> ____
> From: Till Rohrmann 
> Sent: Wednesday, July 7, 2021 17:21
> To: dev 
> Cc: Xintong Song 
> Subject: Re: [VOTE] Release 1.13.2, release candidate #1
>
> I think FLINK-23233 sounds like a very serious problem to me. We either
> continue releasing 1.13.2 and do an immediate 1.13.3 afterwards or we
> cancel this RC and fix the problem and then resume the 1.13.2 release.
>
> Cheers,
> Till
>
> On Wed, Jul 7, 2021 at 5:49 AM JING ZHANG  wrote:
>
> > Thanks Xintong for bringing the alarm, +1 on Xintong's proposal to hold
> the
> > vote until deciding what to do with FLINK-23233.
> >
> > BTW, I made a typo in the previous voting. I mean +1 (non-binding)
> instead
> > of +1 (binding). I'm very sorry for the confusion.
> >
> > +1 (non-binding).
> >
> > 1. Reviewed website pull request
> > 2. Built from source code flink-1.13.2-src.tgz
> > <
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.13.2-rc1/flink-1.13.2-src.tgz
> > >
> >  succeeded
> > 3. Started a local Flink cluster, ran the WordCount example, WebUI looks
> > good,  no suspicious output/log
> > 4. Started cluster and run some e2e sql queries using SQL Client, query
> > result is expected.
> > 5. Repeat Step 3 and 4 with flink-1.13.2-bin-scala_2.11.tgz
> > <
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.13.2-rc1/flink-1.13.2-bin-scala_2.11.tgz
> > >
> >
> > Best regards,
> > JING ZHANG
> >
> > On Wed, Jul 7, 2021 at 11:09 AM, Xintong Song wrote:
> >
> > > Hi everyone,
> > >
> > > We find a bug that may cause data loss in a rare condition, which IMHO
> > is a
> > > release blocker. Please see FLINK-23233 [1] for details.
> > >
> > > Given that this bug is not newly introduced but already exists in
> 1.13.0
> > &
> > > 1.13.1 releases, I would not replace my +1 with a -1 immediately.
> > Instead,
> > > I would like to ask the release manager for holding this vote until
> > > deciding what to do with FLINK-23233.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-23233
> > >
> > > On Wed, Jul 7, 2021 at 12:09 AM JING ZHANG 
> wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > 1. Reviewed website pull request
> > > > 2. Built from source code flink-1.13.2-src.tgz
> > > > <
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.13.2-rc1/flink-1.13.2-src.tgz
> > > > >
> > > > succeeded
> > > > 3. Started a local Flink cluster, ran the WordCount example, WebUI
> > looks
> > > > good,  no suspicious output/log
> > > > 4. Started cluster and run some e2e sql queries using SQL Client,
> query
> > > > result is expected.
> > > > 5. Repeat Step 3 and 4 with flink-1.13.2-bin-scala_2.11.tgz
> > > > <
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.13.2-rc1/flink-1.13.2-bin-scala_2.11.tgz
> > > > >
> > > >
> > > > Best regards,
> > > > JING ZHANG
> > > >
> > > >
> > > > On Mon, Jul 5, 2021 at 6:06 PM, Zakelly Lan wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - built from sources
> > > > > - run streaming job of wordcount
> > > > > - web-ui looks good
> > > > > - checkpoint and restore looks good
> > > > >
> > > > > Best,
> > > > > Zakelly
> > > > >
> > > > > On Mon, Jul 5, 2021 at 2:40 PM Jingsong Li
> > > > wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > - Verified checksums and signatures
> > > > > > - Built from sources
> > > > > > - run table example jobs
> > > > > > - web-ui looks good
> > > > > > - sql-client looks good
>

Re: [VOTE] Release 1.12.5, release candidate #1

2021-07-13 Thread Till Rohrmann
Hi everyone,

FLINK-23233 has been merged. We can continue with the release process.

Cheers,
Till

On Wed, Jul 7, 2021 at 1:29 PM Jingsong Li  wrote:

> Hi all,
>
> Thanks Xiongtong, Leonard, Yang, JING for the voting. Thanks Till for the
> information.
>
> +1 for canceling this RC. That should also give us the chance to fix
> FLINK-23233.
>
> Best,
> Jingsong
>
> On Wed, Jul 7, 2021 at 5:29 PM Till Rohrmann  wrote:
>
> > Hi folks, are we sure that FLINK-23233 [1] does not affect the 1.12
> release
> > branch? I think this problem was introduced with FLINK-21996 [2] which is
> > part of release-1.12. Hence, we might either fix this problem and cancel
> > this RC or we need to create a fast 1.12.6 afterwards.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-23233
> > [2] https://issues.apache.org/jira/browse/FLINK-21996
> >
> > Cheers,
> > Till
> >
> > On Wed, Jul 7, 2021 at 7:43 AM JING ZHANG  wrote:
> >
> > > +1 (non-binding)
> > >
> > > 1. built from source code flink-1.12.5-src.tgz
> > > <
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.5-rc1/flink-1.12.5-src.tgz
> > > >
> > > succeeded
> > > 2. Started a local Flink cluster, ran the WordCount example, WebUI
> looks
> > > good,  no suspicious output/log
> > > 3. started cluster and run some e2e sql queries using SQL Client, query
> > > result is expected.
> > > 4. Repeat Step 2 and 3 with flink-1.12.5-bin-scala_2.11.tgz
> > > <
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.5-rc1/flink-1.12.5-bin-scala_2.11.tgz
> > > >
> > >
> > > Best,
> > > JING ZHANG
> > >
> > > On Wed, Jul 7, 2021 at 11:36 AM, Yang Wang wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > - verified checksums & signatures
> > > > - start a session cluster and verify HA data cleanup once job reached
> > to
> > > > globally terminal state, FLINK-20695
> > > > - start a local standalone cluster, check the webUI good and JM/TM
> > > > without suspicious logs
> > > >
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > On Wed, Jul 7, 2021 at 10:52 AM, Leonard Xu wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - verified signatures and hashsums
> > > > > - built from source code with scala 2.11 succeeded
> > > > > - checked all denpendency artifacts are 1.12.5
> > > > > - started a cluster, ran a wordcount job, the result is expected
> > > > > - started SQL Client, ran a simple query, the result is expected
> > > > > - reviewed the web PR, left one minor name comment
> > > > >
> > > > > Best,
> > > > > Leonard
> > > > >
> > > > > > On Jul 6, 2021, at 10:02, Xintong Song wrote:
> > > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > - verified checksums & signatures
> > > > > > - built from sources
> > > > > > - run example jobs with standalone and native k8s deployments
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, Jul 5, 2021 at 11:18 AM Jingsong Li
> > > > > wrote:
> > > > > >
> > > > > >> Hi everyone,
> > > > > >>
> > > > > >> Please review and vote on the release candidate #1 for the
> version
> > > > > 1.12.5,
> > > > > >> as follows:
> > > > > >> [ ] +1, Approve the release
> > > > > >> [ ] -1, Do not approve the release (please provide specific
> > > comments)
> > > > > >>
> > > > > >> The complete staging area is available for your review, which
> > > > includes:
> > > > > >> * JIRA release notes [1],
> > > > > >> * the official Apache source release and binary convenience
> > releases
> > > > to
> > > > > be
> > > > > >> deployed to dist.apache.org [2], which are signed with the key
> > with
> > > > > >> fingerprint FBB83C0A4FFB9CA8 [3],
> > > > > >> * all artifacts to be deployed to the Maven Central Repository
> > [4],
> > > > > >> * source code tag "release-1.12.5-rc1" [5],
> > > > > >> * website pull request listing the new release and adding
> > > announcement
> > > > > blog
> > > > > >> post [6].
> > > > > >>
> > > > > >> The vote will be open for at least 72 hours. It is adopted by
> > > majority
> > > > > >> approval, with at least 3 PMC affirmative votes.
> > > > > >>
> > > > > >> Best,
> > > > > >> Jingsong Lee
> > > > > >>
> > > > > >> [1]
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12350166
> > > > > >> [2]
> > https://dist.apache.org/repos/dist/dev/flink/flink-1.12.5-rc1/
> > > > > >> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > > > >> [4]
> > > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1430
> > > > > >> [5]
> > https://github.com/apache/flink/releases/tag/release-1.12.5-rc1
> > > > > >> [6] https://github.com/apache/flink-web/pull/455
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Best, Jingsong Lee
>


Re: [VOTE] FLIP-181: Custom netty HTTP request inbound/outbound handlers

2021-07-12 Thread Till Rohrmann
Thanks for starting the vote Marton.

I have two comments:

* I would suggest that the interfaces return Optional or at
least carry a @Nullable annotation in order to make the contract explicit
(see the sketch below).
* The test plan should contain tests for the general infrastructure, which
should live in Flink. We should test that the factories are loaded and that
the handlers are set up in the correct order.
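
A minimal sketch of what an Optional-returning contract could look like (the
factory name and parameter below are illustrative assumptions, not the FLIP's
final API):

```java
import java.util.Optional;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandler;

/** Illustrative only: a factory that may or may not contribute a handler. */
public interface InboundChannelHandlerFactory {

    /**
     * Returns Optional.empty() when, based on the configuration, no extra
     * handler should be installed into the netty pipeline.
     */
    Optional<ChannelHandler> createHandler(Configuration configuration);
}
```

An explicit Optional makes the "no handler" case visible at every call site,
which is what the first comment above is after.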

I consider these two changes to the original FLIP small. I give my +1
(binding) conditionally, under the assumption that the comments will be
addressed.

Cheers,
Till

On Mon, Jul 12, 2021 at 10:15 AM Konstantin Knauf  wrote:

> +1 (binding)
>
> Assuming that we continue to vote in this thread for now.
>
> Thank you for your patience!
>
> On Mon, Jul 12, 2021 at 8:56 AM Chesnay Schepler 
> wrote:
>
> > The vote has not reached the required number of votes to be considered
> > successful.
> >
> > As outlined in the bylaws
> > <
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=120731026#FlinkBylaws-Actions
> >
> >
> > FLIP votes require 3 binding +1 votes (i.e., from committers).
> >
> > On 10/07/2021 16:13, Márton Balassi wrote:
> > > Hi team,
> > >
> > > Thank you for your input, I am closing this vote as successful.
> > > Austin: thank you, I have added the experimental annotation explicitly
> to
> > > the FLIP.
> > >
> > > On Tue, Jul 6, 2021 at 5:17 PM Gabor Somogyi <
> gabor.g.somo...@gmail.com>
> > > wrote:
> > >
> > >> +1 (non-binding)
> > >> The @Experimental annotation is really missing, Marton could you add
> it
> > >> please?
> > >>
> > >>
> > >> On Tue, Jul 6, 2021 at 5:04 PM Austin Cawley-Edwards <
> > >> austin.caw...@gmail.com> wrote:
> > >>
> > >>> Hi Márton,
> > >>>
> > >>> The FLIP looks generally good to me, though could we add the
> > >>> `@Experimental` annotation to the proposed interfaces so it is in
> sync
> > >> with
> > >>> what was agreed in the discussion thread?
> > >>>
> > >>> Thanks,
> > >>> Austin
> > >>>
> > >>> On Tue, Jul 6, 2021 at 9:40 AM Gyula Fóra  wrote:
> > >>>
> >  +1 from my side
> > 
> >  This is a good addition that will open many possibilities in the
> > future
> > >>> and
> >  solve some immediate issues with the current Kerberos integration.
> > 
> >  Gyula
> > 
> >  On Tue, Jul 6, 2021 at 2:50 PM Márton Balassi <
> > >> balassi.mar...@gmail.com>
> >  wrote:
> > 
> > > Hi everyone, I would like to start a vote on FLIP-181 [1] which was
> > > discussed in this thread [2]. The vote will be open for at least 72
> > >>> hours
> > > until July 9th unless there is an objection or not enough votes.
> > >
> > > [1] https://cwiki.apache.org/confluence/x/CAUBCw
> > > [2]
> > >
> > >
> > >>
> >
> https://lists.apache.org/thread.html/r53b6b8931b6248a849855dad27b1a431e55cdd48ca055910e8f015a8%40%3Cdev.flink.apache.org%3E
> >
> >
> >
>
> --
>
> Konstantin Knauf
>
> https://twitter.com/snntrable
>
> https://github.com/knaufk
>


Re: Job Recovery Time on TM Lost

2021-07-09 Thread Till Rohrmann
text.java:243)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
>> at
>> org.apache.flink.shaded.netty4.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:331)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
>> at
>> org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>> at java.lang.Thread.run(Thread.java:748)
>> ```
>> 1. It's a bit inconvenient to debug such an exception because it doesn't
>> report the exact container id. Right now we have to look for `
>> xenon-prod-001-20200316-data-slave-prod-0a020f7a.ec2.pin220.com/10.2.15.122:33539`
>> <http://xenon-prod-001-20200316-data-slave-prod-0a020f7a.ec2.pin220.com/10.2.15.122:33539>
>> in the JobManager log to find that.
>> 2. The task manager log doesn't show anything suspicious. Also, no major
>> GC. So it might imply a flaky connection in this case.
>> 3. Is there any short term workaround we can try? any config tuning?
>> Also, what's the long term solution?
>>
>> Best
>> Lu
>>
>>
>>
>>
>> On Tue, Jul 6, 2021 at 11:45 PM 刘建刚  wrote:
>>
>>> It is really helpful to find the lost container quickly. In our internal
>>> Flink version, we optimize this via the tasks' reports and the jobmaster's
>>> probes. When a task fails because of the connection, it reports to the
>>> jobmaster. The jobmaster will try to confirm the liveness of the
>>> unconnected taskmanager a configurable number of times. If the jobmaster
>>> finds the taskmanager unconnected or dead, it releases the taskmanager.
>>> This works for most cases. For an unstable environment, the config needs
>>> adjustment.
>>>
>>> On Tue, Jul 6, 2021 at 8:41 PM Gen Luo  wrote:
>>>
>>>> Yes, I have noticed the PR and commented there with some consideration
>>>> about the new option. We can discuss further there.
>>>>
>>>> On Tue, Jul 6, 2021 at 6:04 PM Till Rohrmann 
>>>> wrote:
>>>>
>>>> > This is actually a very good point Gen. There might not be a lot to
>>>> gain
>>>> > for us by implementing a fancy algorithm for figuring out whether a
>>>> TM is
>>>> > dead or not based on failed heartbeat RPCs from the JM if the TM <> TM
>>>> > communication does not tolerate failures and directly fails the
>>>> affected
>>>> > tasks. This assumes that the JM and TM run in the same environment.
>>>> >
>>>> > One simple approach could be to make the number of failed heartbeat
>>>> RPCs
>>>> > until a target is marked as unreachable configurable because what
>>>> > represents a good enough criterion in one user's environment might
>>>> produce
>>>> > too many false-positives in somebody else's environment. Or even
>>>> simpler,
>>>> > one could say that one can disable reacting to a failed heartbeat RPC
>>>> as it
>>>> > is currently the case.
>>>> >
>>>> > We currently have a discussion about this on this PR [1]. Maybe you
>>>> wanna
>>>> > join the discussion there and share your insights.
>>>> >
>>>> > [1] https://github.com/apache/flink/pull/16357
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> > On Tu

Re: [ANNOUNCE] New Apache Flink Committer - Yuan Mei

2021-07-07 Thread Till Rohrmann
Congratulations Yuan.

Cheers,
Till

On Wed, Jul 7, 2021 at 7:22 PM Yu Li  wrote:

> Hi all,
>
> On behalf of the PMC, I’m very happy to announce Yuan Mei as a new Flink
> committer.
>
> Yuan has been an active contributor for more than two years, with code
> contributions on multiple components including kafka connectors,
> checkpointing, state backends, etc. Besides, she has been actively involved
> in community activities such as helping manage releases, discussing
> questions on dev@list, supporting users and giving talks at conferences.
>
> Please join me in congratulating Yuan for becoming a Flink committer!
>
> Cheers,
> Yu
>


Re: [DISCUSS] Lifecycle of ShuffleMaster and its Relationship with JobMaster and PartitionTracker

2021-07-07 Thread Till Rohrmann
One quick comment: When developing the ShuffleService abstraction we also
thought that different jobs might want to use different ShuffleServices
depending on their workload (e.g. batch vs. streaming workload). So
ideally, the chosen solution here can also support this use case eventually.

Cheers,
Till

On Wed, Jul 7, 2021 at 12:50 PM Guowei Ma  wrote:

> Hi,
> Thanks Yingjie for initiating this discussion. As I understand it, the
> document [1] mainly discusses two issues:
> 1. ShuffleMaster should be at the cluster level instead of the job level
> 2. ShuffleMaster should notify PartitionTracker that some data has been
> lost
>
> Relatively speaking, I think the second problem is more serious: for
> external or remote batch shuffle services, when the machine storing
> shuffled data goes offline, the PartitionTracker needs to be notified in
> time to avoid repeated failures of the job. Therefore, when shuffle data
> is lost due to a machine error, the ShuffleMaster should notify the
> PartitionTracker promptly. This requires giving the ShuffleMaster a
> per-job handle such as a JobShuffleContext.
>
> So how do we pass the JobShuffleContext to the ShuffleMaster? There are two
> options:
> 1. After the ShuffleMaster is created, pass the JobShuffleContext to it,
> e.g. ShuffleMaster::register(JobShuffleContext)
> 2. Pass the JobShuffleContext to the factory when creating the
> ShuffleMaster, e.g. ShuffleServiceFactory.createShuffleMaster(JobShuffleContext).
>
> Which one to choose is actually related to issue 1: if the ShuffleMaster is
> a cluster-level service, option 1 is the natural choice; otherwise, option
> 2. I think the ShuffleMaster should be at the cluster level, for example
> because we don't need to maintain a ShuffleMaster for each job in a session
> cluster; in addition, this ShuffleMaster should also be used by the RM's
> PartitionTracker in the future. Therefore, I think option 1 is more
> appropriate (see the sketch below).
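>
> As a rough illustration of option 1 (a sketch only; the exact shape of
> these interfaces is my assumption based on this discussion, not the final
> API):
>
> ```java
> import org.apache.flink.api.common.JobID;
>
> /** Sketch: a cluster-level ShuffleMaster that jobs register with. */
> interface ClusterLevelShuffleMaster extends AutoCloseable {
>
>     /** Called once when the cluster-level service is brought up. */
>     void start() throws Exception;
>
>     /** Option 1: a job hands its context to the already-created master. */
>     void register(JobShuffleContext context);
>
>     /** Called when the job reaches a globally terminal state. */
>     void unregister(JobID jobId);
> }
>
> /** Sketch: the per-job handle the master uses to report lost shuffle data. */
> interface JobShuffleContext {
>
>     JobID getJobId();
>
>     /** Lets the master tell the job's PartitionTracker about lost data. */
>     void notifyPartitionDataLost(String partitionId);
> }
> ```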
>
> To sum up, we may give priority to solving problem 2, while keeping in mind
> that the ShuffleMaster should be a cluster-level component. Therefore, I
> think we could ignore the ShuffleMasterContext at the beginning; at the
> same time, JobShuffleContext::getConfiguration/listPartitions should not be
> needed.
>
> [1]
>
> https://docs.google.com/document/d/1_cHoapNbx_fJ7ZNraSqw4ZK1hMRiWWJDITuSZrdMDDs/edit
>
> Best,
> Guowei
>
>
> On Fri, Jun 11, 2021 at 4:15 PM Yingjie Cao 
> wrote:
>
> > Hi devs,
> >
> > I'd like to start a discussion about "Lifecycle of ShuffleMaster and its
> > Relationship with JobMaster and PartitionTracker". (These are things we
> > found when moving our external shuffle to the pluggable shuffle service
> > framework.)
> >
> > The mail client may fail to display the right format. If so, please refer
> > to this document:
> >
> >
> https://docs.google.com/document/d/1_cHoapNbx_fJ7ZNraSqw4ZK1hMRiWWJDITuSZrdMDDs/edit?usp=sharing
> > .
> > Lifecycle of ShuffleMaster
> >
> > Currently, the lifecycle of ShuffleMaster seems unclear. The
> > ShuffleServiceFactory is loaded for each JobMaster instance and then
> > ShuffleServiceFactory#createShuffleMaster is called to create a
> > ShuffleMaster instance. However, the default NettyShuffleServiceFactory
> > always returns the same ShuffleMaster singleton instance for all jobs.
> > Based on the current implementation, the lifecycle of ShuffleMaster seems
> > open and depends on the shuffle plugins themselves. At the TM side,
> > however, the ShuffleEnvironment is part of the TaskManagerServices, whose
> > lifecycle is decoupled from jobs and thus behaves more like a service.
> > This means there is also an inconsistency between the TM side and the JM
> > side.
> >
> > From my understanding, the reason for this is that the pluggable shuffle
> > framework is not completely finished yet. For example, there is a
> > follow-up umbrella ticket (FLINK-19551) for the pluggable shuffle service
> > framework, and among its subtasks there is one (FLINK-12731) which aims
> > to load the shuffle plugin with the PluginManager. I think this can solve
> > the issue mentioned above. After the corresponding factory is loaded by
> > the PluginManager, all ShuffleMaster instances can be stored in a map
> > indexed by the corresponding factory class name and shared by all jobs.
> > After that, the ShuffleMaster becomes a cluster-level service, which is
> > consistent with the ShuffleEnvironment at the TM side.
> >
> > As a summary, we propose to first finish FLINK-12731
> >  and make the shuffle
> > service a real cluster-level service. Furthermore, we propose to add two
> > lifecycle methods to the ShuffleMaster interface, start and close,
> > responsible for initialization (for example, contacting the external
> > system) and
> > graceful shutdown (fo

Re: [ANNOUNCE] New PMC member: Guowei Ma

2021-07-07 Thread Till Rohrmann
Congratulations, Guowei!

Cheers,
Till

On Wed, Jul 7, 2021 at 9:41 AM Roman Khachatryan  wrote:

> Congratulations!
>
> Regards,
> Roman
>
> On Wed, Jul 7, 2021 at 8:24 AM Rui Li  wrote:
> >
> > Congratulations Guowei!
> >
> > On Wed, Jul 7, 2021 at 1:01 PM Benchao Li  wrote:
> >
> > > Congratulations!
> > >
> > > On Wed, Jul 7, 2021 at 12:46 PM Dian Fu  wrote:
> > >
> > > > Congratulations, Guowei!
> > > >
> > > > Regards,
> > > > Dian
> > > >
> > > > > On Jul 7, 2021, at 10:37 AM, Yun Gao  wrote:
> > > > >
> > > > > Congratulations Guowei!
> > > > >
> > > > >
> > > > > Best,
> > > > > Yun
> > > > >
> > > > >
> > > > > --
> > > > > Sender:JING ZHANG
> > > > > Date:2021/07/07 10:33:51
> > > > > Recipient:dev
> > > > > Theme:Re: [ANNOUNCE] New PMC member: Guowei Ma
> > > > >
> > > > > Congratulations,  Guowei Ma!
> > > > >
> > > > > Best regards,
> > > > > JING ZHANG
> > > > >
> > > > > On Wed, Jul 7, 2021 at 10:30 AM Zakelly Lan  wrote:
> > > > >
> > > > >> Congratulations, Guowei!
> > > > >>
> > > > >> Best,
> > > > >> Zakelly
> > > > >>
> > > > >> On Wed, Jul 7, 2021 at 10:24 AM tison 
> wrote:
> > > > >>
> > > > >>> Congrats! NB.
> > > > >>>
> > > > >>> Best,
> > > > >>> tison.
> > > > >>>
> > > > >>>
> > > > >>> On Wed, Jul 7, 2021 at 10:20 AM Jark Wu  wrote:
> > > > >>>
> > > >  Congratulations Guowei!
> > > > 
> > > >  Best,
> > > >  Jark
> > > > 
> > > >  On Wed, 7 Jul 2021 at 09:54, XING JIN 
> > > > wrote:
> > > > 
> > > > > Congratulations, Guowei~ !
> > > > >
> > > > > Best,
> > > > > Jin
> > > > >
> > > > > On Wed, Jul 7, 2021 at 9:37 AM Xintong Song  wrote:
> > > > >
> > > > >> Congratulations, Guowei~!
> > > > >>
> > > > >> Thank you~
> > > > >>
> > > > >> Xintong Song
> > > > >>
> > > > >>
> > > > >>
> > > > >> On Wed, Jul 7, 2021 at 9:31 AM Qingsheng Ren <
> renqs...@gmail.com>
> > > >  wrote:
> > > > >>
> > > > >>> Congratulations Guowei!
> > > > >>>
> > > > >>> --
> > > > >>> Best Regards,
> > > > >>>
> > > > >>> Qingsheng Ren
> > > > >>> Email: renqs...@gmail.com
> > > > >>> On Jul 7, 2021, 09:30 +0800, Leonard Xu wrote:
> > > >  Congratulations! Guowei Ma
> > > > 
> > > >  Best,
> > > >  Leonard
> > > > 
> > > > > On Jul 6, 2021, at 21:56, Kurt Young 
> wrote:
> > > > >
> > > > > Hi all!
> > > > >
> > > > > I'm very happy to announce that Guowei Ma has joined the
> > > > >> Flink
> > > >  PMC!
> > > > >
> > > > > Congratulations and welcome Guowei!
> > > > >
> > > > > Best,
> > > > > Kurt
> > > > 
> > > > >>>
> > > > >>
> > > > >
> > > > 
> > > > >>>
> > > > >>
> > > > >
> > > >
> > > >
> > >
> > > --
> > >
> > > Best,
> > > Benchao Li
> > >
> >
> >
> > --
> > Best regards!
> > Rui Li
>


Re: [ANNOUNCE] New Apache Flink Committer - Yang Wang

2021-07-07 Thread Till Rohrmann
Congratulations, Yang!

Cheers,
Till

On Wed, Jul 7, 2021 at 9:41 AM Roman Khachatryan  wrote:

> Congrats!
>
> Regards,
> Roman
>
>
> On Wed, Jul 7, 2021 at 8:28 AM Qingsheng Ren  wrote:
> >
> > Congratulations Yang!
> >
> > --
> > Best Regards,
> >
> > Qingsheng Ren
> > Email: renqs...@gmail.com
> > On Jul 7, 2021, 2:26 PM +0800, Rui Li , wrote:
> > > Congratulations Yang ~
> > >
> > > On Wed, Jul 7, 2021 at 1:01 PM Benchao Li 
> wrote:
> > >
> > > > Congratulations!
> > > >
> > > > On Wed, Jul 7, 2021 at 12:54 PM Peter Huang  wrote:
> > > >
> > > > > Congratulations, Yang.
> > > > >
> > > > > Best Regards
> > > > > Peter Huang
> > > > >
> > > > > On Tue, Jul 6, 2021 at 9:48 PM Dian Fu 
> wrote:
> > > > >
> > > > > > Congratulations, Yang,
> > > > > >
> > > > > > Regards,
> > > > > > Dian
> > > > > >
> > > > > > > On Jul 7, 2021, at 10:46 AM, Jary Zhen  wrote:
> > > > > > >
> > > > > > > Congratulations, Yang Wang.
> > > > > > >
> > > > > > > Best
> > > > > > > Jary
> > > > > > >
> > > > > > > On Wed, Jul 7, 2021 at 10:38 AM Yun Gao  wrote:
> > > > > > >
> > > > > > > > Congratulations Yang!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Yun
> > > > > > > >
> > > > > > > >
> > > > > > > >
> --
> > > > > > > > Sender:Jark Wu
> > > > > > > > Date:2021/07/07 10:20:27
> > > > > > > > Recipient:dev
> > > > > > > > Cc:Yang Wang; <
> wangyang0...@apache.org>
> > > > > > > > Theme:Re: [ANNOUNCE] New Apache Flink Committer - Yang Wang
> > > > > > > >
> > > > > > > > Congratulations Yang Wang!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jark
> > > > > > > >
> > > > > > > > On Wed, 7 Jul 2021 at 10:09, Xintong Song <
> tonysong...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > On behalf of the PMC, I'm very happy to announce Yang Wang
> as a new
> > > > > > Flink
> > > > > > > > > committer.
> > > > > > > > >
> > > > > > > > > Yang has been a very active contributor for more than two
> years,
> > > > > mainly
> > > > > > > > > focusing on Flink's deployment components. He's a main
> contributor
> > > > > and
> > > > > > > > > maintainer of Flink's native Kubernetes deployment and
> native
> > > > > > Kubernetes
> > > > > > > > > HA. He's also very active on the mailing lists,
> participating in
> > > > > > > > > discussions and helping with user questions.
> > > > > > > > >
> > > > > > > > > Please join me in congratulating Yang Wang for becoming a
> Flink
> > > > > > > > committer!
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > Best,
> > > > Benchao Li
> > > >
> > >
> > >
> > > --
> > > Best regards!
> > > Rui Li
>


Re: [VOTE] Release 1.12.5, release candidate #1

2021-07-07 Thread Till Rohrmann
Hi folks, are we sure that FLINK-23233 [1] does not affect the 1.12 release
branch? I think this problem was introduced with FLINK-21996 [2], which is
part of release-1.12. Hence, we should either fix this problem and cancel
this RC, or create a fast-follow 1.12.6 afterwards.

[1] https://issues.apache.org/jira/browse/FLINK-23233
[2] https://issues.apache.org/jira/browse/FLINK-21996

Cheers,
Till

On Wed, Jul 7, 2021 at 7:43 AM JING ZHANG  wrote:

> +1 (non-binding)
>
> 1. built from source code flink-1.12.5-src.tgz
> <
> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.5-rc1/flink-1.12.5-src.tgz
> >
> succeeded
> 2. Started a local Flink cluster, ran the WordCount example, WebUI looks
> good,  no suspicious output/log
> 3. started cluster and run some e2e sql queries using SQL Client, query
> result is expected.
> 4. Repeat Step 2 and 3 with flink-1.12.5-bin-scala_2.11.tgz
> <
> https://dist.apache.org/repos/dist/dev/flink/flink-1.12.5-rc1/flink-1.12.5-bin-scala_2.11.tgz
> >
>
> Best,
> JING ZHANG
>
> On Wed, Jul 7, 2021 at 11:36 AM Yang Wang  wrote:
>
> > +1 (non-binding)
> >
> > - verified checksums & signatures
> > - start a session cluster and verify HA data cleanup once the job
> > reached a globally terminal state, FLINK-20695
> > - start a local standalone cluster, check that the webUI looks good and
> > JM/TM logs show nothing suspicious
> >
> >
> > Best,
> > Yang
> >
> > On Wed, Jul 7, 2021 at 10:52 AM Leonard Xu  wrote:
> >
> > > +1 (non-binding)
> > >
> > > - verified signatures and hashsums
> > > - built from source code with scala 2.11 succeeded
> > > - checked all dependency artifacts are 1.12.5
> > > - started a cluster, ran a wordcount job, the result is expected
> > > - started SQL Client, ran a simple query, the result is expected
> > > - reviewed the web PR, left one minor name comment
> > >
> > > Best,
> > > Leonard
> > >
> > > > On Jul 6, 2021, at 10:02, Xintong Song  wrote:
> > > >
> > > > +1 (binding)
> > > >
> > > > - verified checksums & signatures
> > > > - built from sources
> > > > - run example jobs with standalone and native k8s deployments
> > > >
> > > > Thank you~
> > > >
> > > > Xintong Song
> > > >
> > > >
> > > >
> > > > On Mon, Jul 5, 2021 at 11:18 AM Jingsong Li 
> > > wrote:
> > > >
> > > >> Hi everyone,
> > > >>
> > > >> Please review and vote on the release candidate #1 for the version
> > > 1.12.5,
> > > >> as follows:
> > > >> [ ] +1, Approve the release
> > > >> [ ] -1, Do not approve the release (please provide specific
> comments)
> > > >>
> > > >> The complete staging area is available for your review, which
> > includes:
> > > >> * JIRA release notes [1],
> > > >> * the official Apache source release and binary convenience releases
> > to
> > > be
> > > >> deployed to dist.apache.org [2], which are signed with the key with
> > > >> fingerprint FBB83C0A4FFB9CA8 [3],
> > > >> * all artifacts to be deployed to the Maven Central Repository [4],
> > > >> * source code tag "release-1.12.5-rc1" [5],
> > > >> * website pull request listing the new release and adding
> announcement
> > > blog
> > > >> post [6].
> > > >>
> > > >> The vote will be open for at least 72 hours. It is adopted by
> majority
> > > >> approval, with at least 3 PMC affirmative votes.
> > > >>
> > > >> Best,
> > > >> Jingsong Lee
> > > >>
> > > >> [1]
> > > >>
> > > >>
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12350166
> > > >> [2] https://dist.apache.org/repos/dist/dev/flink/flink-1.12.5-rc1/
> > > >> [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > >> [4]
> > > https://repository.apache.org/content/repositories/orgapacheflink-1430
> > > >> [5] https://github.com/apache/flink/releases/tag/release-1.12.5-rc1
> > > >> [6] https://github.com/apache/flink-web/pull/455
> > > >>
> > >
> > >
> >
>


Re: [VOTE] Release 1.13.2, release candidate #1

2021-07-07 Thread Till Rohrmann
FLINK-23233 sounds like a very serious problem to me. We should either
continue releasing 1.13.2 and do an immediate 1.13.3 afterwards, or cancel
this RC, fix the problem, and then resume the 1.13.2 release.

Cheers,
Till

On Wed, Jul 7, 2021 at 5:49 AM JING ZHANG  wrote:

> Thanks Xintong for raising the alarm, +1 on Xintong's proposal to hold the
> vote until deciding what to do with FLINK-23233.
>
> BTW, I made a typo in the previous voting. I mean +1 (non-binding) instead
> of +1 (binding). I'm very sorry for the confusion.
>
> +1 (non-binding).
>
> 1. Reviewed website pull request
> 2. Built from source code flink-1.13.2-src.tgz
> <
> https://dist.apache.org/repos/dist/dev/flink/flink-1.13.2-rc1/flink-1.13.2-src.tgz
> >
>  succeeded
> 3. Started a local Flink cluster, ran the WordCount example, WebUI looks
> good,  no suspicious output/log
> 4. Started cluster and run some e2e sql queries using SQL Client, query
> result is expected.
> 5. Repeat Step 3 and 4 with flink-1.13.2-bin-scala_2.11.tgz
> <
> https://dist.apache.org/repos/dist/dev/flink/flink-1.13.2-rc1/flink-1.13.2-bin-scala_2.11.tgz
> >
>
> Best regards,
> JING ZHANG
>
> On Wed, Jul 7, 2021 at 11:09 AM Xintong Song  wrote:
>
> > Hi everyone,
> >
> > We find a bug that may cause data loss in a rare condition, which IMHO
> is a
> > release blocker. Please see FLINK-23233 [1] for details.
> >
> > Given that this bug is not newly introduced but already exists in 1.13.0
> &
> > 1.13.1 releases, I would not replace my +1 with a -1 immediately.
> Instead,
> > I would like to ask the release manager for holding this vote until
> > deciding what to do with FLINK-23233.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-23233
> >
> > On Wed, Jul 7, 2021 at 12:09 AM JING ZHANG  wrote:
> >
> > > +1 (binding)
> > >
> > > 1. Reviewed website pull request
> > > 2. Built from source code flink-1.13.2-src.tgz
> > > <
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.13.2-rc1/flink-1.13.2-src.tgz
> > > >
> > > succeeded
> > > 3. Started a local Flink cluster, ran the WordCount example, WebUI
> looks
> > > good,  no suspicious output/log
> > > 4. Started cluster and run some e2e sql queries using SQL Client, query
> > > result is expected.
> > > 5. Repeat Step 3 and 4 with flink-1.13.2-bin-scala_2.11.tgz
> > > <
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-1.13.2-rc1/flink-1.13.2-bin-scala_2.11.tgz
> > > >
> > >
> > > Best regards,
> > > JING ZHANG
> > >
> > >
> > > On Mon, Jul 5, 2021 at 6:06 PM Zakelly Lan  wrote:
> > >
> > > > +1 (non-binding)
> > > >
> > > > - built from sources
> > > > - run streaming job of wordcount
> > > > - web-ui looks good
> > > > - checkpoint and restore looks good
> > > >
> > > > Best,
> > > > Zakelly
> > > >
> > > > On Mon, Jul 5, 2021 at 2:40 PM Jingsong Li 
> > > wrote:
> > > >
> > > > > +1 (non-binding)
> > > > >
> > > > > - Verified checksums and signatures
> > > > > - Built from sources
> > > > > - run table example jobs
> > > > > - web-ui looks good
> > > > > - sql-client looks good
> > > > >
> > > > > I think we should update the unresolved JIRAs in [1] to 1.13.3.
> > > > >
> > > > > And we should check resolved JIRAs in [2], commits of some are not
> in
> > > the
> > > > > 1.13.2. We should exclude them. For example FLINK-23196 FLINK-23166
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.13.2%20AND%20status%20not%20in%20(Closed%2C%20Resolved)%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC
> > > > > [2]
> > > > >
> > > > >
> > > >
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12350218&styleName=&projectId=12315522
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Mon, Jul 5, 2021 at 2:27 PM Xingbo Huang 
> > > wrote:
> > > > >
> > > > > > +1 (non-binding)
> > > > > >
> > > > > > - Verified checksums and signatures
> > > > > > - Built from sources
> > > > > > - Verified Python wheel package contents
> > > > > > - Pip install Python wheel package in Mac
> > > > > > - Run Python UDF job in Python shell
> > > > > >
> > > > > > Best,
> > > > > > Xingbo
> > > > > >
> > > > > > On Mon, Jul 5, 2021 at 11:17 AM Yangze Guo  wrote:
> > > > > >
> > > > > > > +1 (non-binding)
> > > > > > >
> > > > > > > - built from sources
> > > > > > > - run example jobs with standalone and yarn.
> > > > > > > - check TaskManager's rest API from the JM master and its
> > standby,
> > > > > > > everything looks good
> > > > > > >
> > > > > > > Best,
> > > > > > > Yangze Guo
> > > > > > >
> > > > > > > On Mon, Jul 5, 2021 at 10:10 AM Xintong Song <
> > > tonysong...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > - verified checksums & signatures
> > > > > > > > - built from sources
> > > > > > > > - run example jobs with standalone and native k8s (with
> custom
> > > > image)
> > > > > > > > 

[jira] [Created] (FLINK-23297) Upgrade Protobuf to 3.17.3

2021-07-07 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23297:
-

 Summary: Upgrade Protobuf to 3.17.3
 Key: FLINK-23297
 URL: https://issues.apache.org/jira/browse/FLINK-23297
 Project: Flink
  Issue Type: Sub-task
  Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.12.4, 1.13.1, 1.14.0
Reporter: Till Rohrmann


In order to support compilation with ARM (e.g. Apple M1 chip), we need to bump 
our Protobuf dependency to version 3.17.3.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Job Recovery Time on TM Lost

2021-07-06 Thread Till Rohrmann
This is actually a very good point Gen. There might not be a lot to gain
for us by implementing a fancy algorithm for figuring out whether a TM is
dead or not based on failed heartbeat RPCs from the JM if the TM <> TM
communication does not tolerate failures and directly fails the affected
tasks. This assumes that the JM and TM run in the same environment.

One simple approach could be to make the number of failed heartbeat RPCs
until a target is marked as unreachable configurable because what
represents a good enough criterion in one user's environment might produce
too many false-positives in somebody else's environment. Or even simpler,
one could say that one can disable reacting to a failed heartbeat RPC as it
is currently the case.
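
To make this concrete, here is a minimal sketch of such a configurable
threshold (class and method names are made up for illustration; this is not
Flink's actual heartbeat code):

```java
/** Sketch: count consecutive failed heartbeat RPCs before marking a target unreachable. */
final class FailedRpcThreshold {

    private final int maxConsecutiveFailures; // would come from a config option
    private int consecutiveFailures;

    FailedRpcThreshold(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Called when a heartbeat RPC completes exceptionally; true means "mark unreachable". */
    synchronized boolean reportFailure() {
        consecutiveFailures++;
        return consecutiveFailures >= maxConsecutiveFailures;
    }

    /** Called when a heartbeat RPC succeeds. */
    synchronized void reportSuccess() {
        consecutiveFailures = 0;
    }
}
```

Setting the threshold to Integer.MAX_VALUE would effectively disable the
reaction to failed RPCs, which matches the opt-out described above.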

We currently have a discussion about this on this PR [1]. Maybe you wanna
join the discussion there and share your insights.

[1] https://github.com/apache/flink/pull/16357

Cheers,
Till

On Tue, Jul 6, 2021 at 4:37 AM Gen Luo  wrote:

> I know that there are retry strategies for akka rpc frameworks. I was just
> considering that, since the environment is shared by JM and TMs, and the
> connections among TMs (using netty) are flaky in unstable environments,
> which will also cause the job failure, is it necessary to build a
> strongly guaranteed connection between JM and TMs, or it could be as flaky
> as the connections among TMs?
>
> As far as I know, connections among TMs will just fail on their first
> connection loss, so behaving like this in JM just means "as flaky as
> connections among TMs". In a stable environment it's good enough, but in an
> unstable environment, it indeed increases the instability. IMO, though a
> single connection loss is not reliable, a double check should be good
> enough. But since I'm not experienced with an unstable environment, I can't
> tell whether that's also enough for it.
>
> On Mon, Jul 5, 2021 at 5:59 PM Till Rohrmann  wrote:
>
>> I think for RPC communication there are retry strategies used by the
>> underlying Akka ActorSystem. So a RpcEndpoint can reconnect to a remote
>> ActorSystem and resume communication. Moreover, there are also
>> reconciliation protocols in place which reconcile the states between the
>> components because of potentially lost RPC messages. So the main question
>> would be whether a single connection loss is good enough for triggering the
>> timeout or whether we want a more elaborate mechanism to reason about the
>> availability of the remote system (e.g. a couple of lost heartbeat
>> messages).
>>
>> Cheers,
>> Till
>>
>> On Mon, Jul 5, 2021 at 10:00 AM Gen Luo  wrote:
>>
>>> As far as I know, a TM will report connection failure once its connected
>>> TM is lost. I suppose JM can believe the report and fail the tasks in the
>>> lost TM if it also encounters a connection failure.
>>>
>>> Of course, it won't work if the lost TM is standalone. But I suppose we
>>> can use the same strategy as the connected scenario. That is, consider it
>>> possibly lost on the first connection loss, and fail it if double check
>>> also fails. The major difference is the senders of the probes are the same
>>> one rather than two different roles, so the results may tend to be the same.
>>>
>>> On the other hand, the fact also means that the jobs can be fragile in
>>> an unstable environment, no matter whether the failover is triggered by TM
>>> or JM. So maybe it's not that worthy to introduce extra configurations for
>>> fault tolerance of heartbeat, unless we also introduce some retry
>>> strategies for netty connections.
>>>
>>>
>>> On Fri, Jul 2, 2021 at 9:34 PM Till Rohrmann 
>>> wrote:
>>>
>>>> Could you share the full logs with us for the second experiment, Lu? I
>>>> cannot tell from the top of my head why it should take 30s unless you have
>>>> configured a restart delay of 30s.
>>>>
>>>> Let's discuss FLINK-23216 on the JIRA ticket, Gen.
>>>>
>>>> I've now implemented FLINK-23209 [1] but it somehow has the problem
>>>> that in a flakey environment you might not want to mark a TaskExecutor dead
>>>> on the first connection loss. Maybe this is something we need to make
>>>> configurable (e.g. introducing a threshold which admittedly is similar to
>>>> the heartbeat timeout) so that the user can configure it for her
>>>> environment. On the upside, if you mark the TaskExecutor dead on the first
>>>> connection loss (assuming you have a stable network environment), then it

Re: Job Recovery Time on TM Lost

2021-07-05 Thread Till Rohrmann
I think for RPC communication there are retry strategies used by the
underlying Akka ActorSystem. So a RpcEndpoint can reconnect to a remote
ActorSystem and resume communication. Moreover, there are also
reconciliation protocols in place which reconcile the states between the
components because of potentially lost RPC messages. So the main question
would be whether a single connection loss is good enough for triggering the
timeout or whether we want a more elaborate mechanism to reason about the
availability of the remote system (e.g. a couple of lost heartbeat
messages).

Cheers,
Till

On Mon, Jul 5, 2021 at 10:00 AM Gen Luo  wrote:

> As far as I know, a TM will report connection failure once its connected
> TM is lost. I suppose JM can believe the report and fail the tasks in the
> lost TM if it also encounters a connection failure.
>
> Of course, it won't work if the lost TM is standalone. But I suppose we
> can use the same strategy as the connected scenario. That is, consider it
> possibly lost on the first connection loss, and fail it if double check
> also fails. The major difference is the senders of the probes are the same
> one rather than two different roles, so the results may tend to be the same.
>
> On the other hand, the fact also means that the jobs can be fragile in an
> unstable environment, no matter whether the failover is triggered by TM or
> JM. So maybe it's not that worthy to introduce extra configurations for
> fault tolerance of heartbeat, unless we also introduce some retry
> strategies for netty connections.
>
>
> On Fri, Jul 2, 2021 at 9:34 PM Till Rohrmann  wrote:
>
>> Could you share the full logs with us for the second experiment, Lu? I
>> cannot tell from the top of my head why it should take 30s unless you have
>> configured a restart delay of 30s.
>>
>> Let's discuss FLINK-23216 on the JIRA ticket, Gen.
>>
>> I've now implemented FLINK-23209 [1] but it somehow has the problem that
>> in a flakey environment you might not want to mark a TaskExecutor dead on
>> the first connection loss. Maybe this is something we need to make
>> configurable (e.g. introducing a threshold which admittedly is similar to
>> the heartbeat timeout) so that the user can configure it for her
>> environment. On the upside, if you mark the TaskExecutor dead on the first
>> connection loss (assuming you have a stable network environment), then it
>> can now detect lost TaskExecutors as fast as the heartbeat interval.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-23209
>>
>> Cheers,
>> Till
>>
>> On Fri, Jul 2, 2021 at 9:33 AM Gen Luo  wrote:
>>
>>> Thanks for sharing, Till and Yang.
>>>
>>> @Lu
>>> Sorry but I don't know how to explain the new test with the log. Let's
>>> wait for others' reply.
>>>
>>> @Till
>>> It would be nice if JIRAs could be fixed. Thanks again for proposing
>>> them.
>>>
>>> In addition, I was tracking an issue that RM keeps allocating and
>>> freeing slots after a TM lost until its heartbeat timeout, when I found the
>>> recovery costing as long as heartbeat timeout. That should be a minor bug
>>> introduced by declarative resource management. I have created a JIRA about
>>> the problem [1] and  we can discuss it there if necessary.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-23216
>>>
>>> Lu Niu  于2021年7月2日周五 上午3:13写道:
>>>
>>>> Another side question, Shall we add metric to cover the complete
>>>> restarting time (phase 1 + phase 2)? Current metric jm.restartingTime only
>>>> covers phase 1. Thanks!
>>>>
>>>> Best
>>>> Lu
>>>>
>>>> On Thu, Jul 1, 2021 at 12:09 PM Lu Niu  wrote:
>>>>
>>>>> Thanks TIll and Yang for help! Also Thanks Till for a quick fix!
>>>>>
>>>>> I did another test yesterday. In this test, I intentionally throw
>>>>> exception from the source operator:
>>>>> ```
>>>>> if (runtimeContext.getIndexOfThisSubtask() == 1
>>>>> && errorFrenquecyInMin > 0
>>>>> && System.currentTimeMillis() - lastStartTime >=
>>>>> errorFrenquecyInMin * 60 * 1000) {
>>>>>   lastStartTime = System.currentTimeMillis();
>>>>>   throw new RuntimeException(
>>>>>   "Trigger expected exception at: " + lastStartTime);
>>>>> }
>>>>> ```
>>>>> In this case, I found phase 1 st

Re: Job Recovery Time on TM Lost

2021-07-02 Thread Till Rohrmann
Could you share the full logs with us for the second experiment, Lu? I
cannot tell from the top of my head why it should take 30s unless you have
configured a restart delay of 30s.
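
For reference, a fixed-delay restart strategy is exactly what would produce
such a gap. A user-code sketch (the values are illustrative only):

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartDelayExample {

    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // A 30s delay between restart attempts shows up as ~30s in the
        // RESTARTING phase, independent of how fast failures are detected.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(30, TimeUnit.SECONDS)));
    }
}
```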

Let's discuss FLINK-23216 on the JIRA ticket, Gen.

I've now implemented FLINK-23209 [1] but it somehow has the problem that in
a flakey environment you might not want to mark a TaskExecutor dead on the
first connection loss. Maybe this is something we need to make configurable
(e.g. introducing a threshold which admittedly is similar to the heartbeat
timeout) so that the user can configure it for her environment. On the
upside, if you mark the TaskExecutor dead on the first connection loss
(assuming you have a stable network environment), then it can now detect
lost TaskExecutors as fast as the heartbeat interval.

[1] https://issues.apache.org/jira/browse/FLINK-23209

Cheers,
Till

On Fri, Jul 2, 2021 at 9:33 AM Gen Luo  wrote:

> Thanks for sharing, Till and Yang.
>
> @Lu
> Sorry but I don't know how to explain the new test with the log. Let's
> wait for others' reply.
>
> @Till
> It would be nice if JIRAs could be fixed. Thanks again for proposing them.
>
> In addition, I was tracking an issue that RM keeps allocating and freeing
> slots after a TM lost until its heartbeat timeout, when I found the
> recovery costing as long as heartbeat timeout. That should be a minor bug
> introduced by declarative resource management. I have created a JIRA about
> the problem [1] and  we can discuss it there if necessary.
>
> [1] https://issues.apache.org/jira/browse/FLINK-23216
>
> On Fri, Jul 2, 2021 at 3:13 AM Lu Niu  wrote:
>
>> Another side question, Shall we add metric to cover the complete
>> restarting time (phase 1 + phase 2)? Current metric jm.restartingTime only
>> covers phase 1. Thanks!
>>
>> Best
>> Lu
>>
>> On Thu, Jul 1, 2021 at 12:09 PM Lu Niu  wrote:
>>
>>> Thanks TIll and Yang for help! Also Thanks Till for a quick fix!
>>>
>>> I did another test yesterday. In this test, I intentionally throw
>>> exception from the source operator:
>>> ```
>>> if (runtimeContext.getIndexOfThisSubtask() == 1
>>> && errorFrenquecyInMin > 0
>>> && System.currentTimeMillis() - lastStartTime >=
>>> errorFrenquecyInMin * 60 * 1000) {
>>>   lastStartTime = System.currentTimeMillis();
>>>   throw new RuntimeException(
>>>   "Trigger expected exception at: " + lastStartTime);
>>> }
>>> ```
>>> In this case, I found phase 1 still takes about 30s and Phase 2 dropped
>>> to 1s (because no need for container allocation). Why phase 1 still takes
>>> 30s even though no TM is lost?
>>>
>>> Related logs:
>>> ```
>>> 2021-06-30 00:55:07,463 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph- Source:
>>> USER_MATERIALIZED_EVENT_SIGNAL-user_context-frontend_event_source -> ...
>>> java.lang.RuntimeException: Trigger expected exception at: 1625014507446
>>> 2021-06-30 00:55:07,509 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
>>> NRTG_USER_MATERIALIZED_EVENT_SIGNAL_user_context_staging
>>> (35c95ee7141334845cfd49b642fa9f98) switched from state RUNNING to
>>> RESTARTING.
>>> 2021-06-30 00:55:37,596 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph- Job
>>> NRTG_USER_MATERIALIZED_EVENT_SIGNAL_user_context_staging
>>> (35c95ee7141334845cfd49b642fa9f98) switched from state RESTARTING to
>>> RUNNING.
>>> 2021-06-30 00:55:38,678 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph(time when
>>> all tasks switch from CREATED to RUNNING)
>>> ```
>>> Best
>>> Lu
>>>
>>>
>>> On Thu, Jul 1, 2021 at 12:06 PM Lu Niu  wrote:
>>>
>>>> Thanks TIll and Yang for help! Also Thanks Till for a quick fix!
>>>>
>>>> I did another test yesterday. In this test, I intentionally throw
>>>> exception from the source operator:
>>>> ```
>>>> if (runtimeContext.getIndexOfThisSubtask() == 1
>>>> && errorFrenquecyInMin > 0
>>>> && System.currentTimeMillis() - lastStartTime >=
>>>> errorFrenquecyInMin * 60 * 1000) {
>>>>   lastStartTime = System.currentTimeMillis();
>>>>   throw new RuntimeException(
>>>>   "Trigger expected exception at: " + lastStartTime);
>>>> }
>>>> ```
>>>

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Till Rohrmann
A quick addition, I think with FLINK-23202 it should now also be possible
to improve the heartbeat mechanism in the general case. We can leverage the
unreachability exception thrown if a remote target is no longer reachable
to mark a heartbeat target as no longer reachable [1]. This can then be
considered as if the heartbeat timeout has been triggered. That way we
should detect lost TaskExecutors as fast as our heartbeat interval is.

[1] https://issues.apache.org/jira/browse/FLINK-23209

Cheers,
Till

On Thu, Jul 1, 2021 at 1:46 PM Yang Wang  wrote:

> Since you are deploying Flink workloads on Yarn, the Flink ResourceManager
> should get the container
> completion event after the heartbeat of Yarn NM->Yarn RM->Flink RM, which
> is 8 seconds by default.
> And Flink ResourceManager will release the dead TaskManager container once
> received the completion event.
> As a result, Flink will not deploy tasks onto the dead TaskManagers.
>
>
> I think most of the time cost in Phase 1 might be cancelling the tasks on
> the dead TaskManagers.
>
>
> Best,
> Yang
>
>
> On Thu, Jul 1, 2021 at 4:49 PM Till Rohrmann  wrote:
>
>> The analysis of Gen is correct. Flink currently uses its heartbeat as the
>> primary means to detect dead TaskManagers. This means that Flink will take
>> at least `heartbeat.timeout` time before the system recovers. Even if the
>> cancellation happens fast (e.g. by having configured a low
>> akka.ask.timeout), then Flink will still try to deploy tasks onto the dead
>> TaskManager until it is marked as dead and its slots are released (unless
>> the ResourceManager does not get a signal from the underlying resource
>> management system that a container/pod has died). One way to improve the
>> situation is to introduce logic which can react to a ConnectionException
>> and then black lists or releases a TaskManager, for example. This is
>> currently not implemented in Flink, though.
>>
>> Concerning the cancellation operation: Flink currently does not listen to
>> the dead letters of Akka. This means that the `akka.ask.timeout` is the
>> primary means to fail the future result of an RPC which could not be sent.
>> This is also an improvement we should add to Flink's RpcService. I've
>> created a JIRA issue for this problem [1].
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-23202
>>
>> Cheers,
>> Till
>>
>> On Wed, Jun 30, 2021 at 6:33 PM Lu Niu  wrote:
>>
>>> Thanks Gen! cc flink-dev to collect more inputs.
>>>
>>> Best
>>> Lu
>>>
>>> On Wed, Jun 30, 2021 at 12:55 AM Gen Luo  wrote:
>>>
>>>> I'm also wondering here.
>>>>
>>>> In my opinion, it's because the JM can not confirm whether the TM is
>>>> lost or it's a temporary network trouble and will recover soon, since I can
>>>> see in the log that akka has got a Connection refused but JM still sends a
>>>> heartbeat request to the lost TM until it reaches heartbeat timeout. But
>>>> I'm not sure if it's indeed designed like this.
>>>>
>>>> I would really appreciate it if anyone who knows more details could
>>>> answer. Thanks.
>>>>
>>>


[jira] [Created] (FLINK-23209) Timeout heartbeat if the heartbeat target is no longer reachable

2021-07-01 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23209:
-

 Summary: Timeout heartbeat if the heartbeat target is no longer 
reachable
 Key: FLINK-23209
 URL: https://issues.apache.org/jira/browse/FLINK-23209
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.12.4, 1.13.1
Reporter: Till Rohrmann
 Fix For: 1.14.0, 1.12.5, 1.13.2


With FLINK-23202 it should now be possible to see when a remote RPC endpoint is 
no longer reachable. This can be used by the {{HeartbeatManager}} to mark a 
heartbeat target as no longer reachable. That way, it is possible for Flink to 
react faster to losses of components w/o having to wait for the heartbeat 
timeout to expire. This will result in faster recoveries (e.g. if a 
{{TaskExecutor}} dies).
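
A sketch of the intended behavior (illustrative names, not the final implementation):

```java
/** Sketch: treat "target unreachable" exactly like an expired heartbeat timeout. */
interface HeartbeatTargetMonitor {

    /** Normal path: no heartbeat received within the configured timeout. */
    void onHeartbeatTimeout();

    /** Fast path: a heartbeat RPC failed because the target is unreachable. */
    default void onTargetUnreachable() {
        // Reuse the timeout handling so recovery can start immediately.
        onHeartbeatTimeout();
    }
}
```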



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Job Recovery Time on TM Lost

2021-07-01 Thread Till Rohrmann
The analysis of Gen is correct. Flink currently uses its heartbeat as the
primary means to detect dead TaskManagers. This means that Flink will take
at least `heartbeat.timeout` time before the system recovers. Even if the
cancellation happens fast (e.g. by having configured a low
akka.ask.timeout), then Flink will still try to deploy tasks onto the dead
TaskManager until it is marked as dead and its slots are released (unless
the ResourceManager does not get a signal from the underlying resource
management system that a container/pod has died). One way to improve the
situation is to introduce logic which can react to a ConnectionException
and then blacklists or releases a TaskManager, for example. This is
currently not implemented in Flink, though.
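
As a rough sketch of such logic (hypothetical, since nothing like this exists
in Flink today):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch: track TaskManagers that recently caused connection errors. */
final class TaskManagerBlacklist {

    private final Set<String> suspected = ConcurrentHashMap.newKeySet();

    /** Hypothetical hook, e.g. called when a task fails with a connection error. */
    void reportConnectionError(String taskManagerId) {
        suspected.add(taskManagerId);
    }

    /** A scheduler could skip slots on suspected TaskManagers when deploying. */
    boolean isSuspected(String taskManagerId) {
        return suspected.contains(taskManagerId);
    }

    /** Called once the TaskManager is reachable again. */
    void clear(String taskManagerId) {
        suspected.remove(taskManagerId);
    }
}
```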

Concerning the cancellation operation: Flink currently does not listen to
the dead letters of Akka. This means that the `akka.ask.timeout` is the
primary means to fail the future result of an RPC which could not be sent.
This is also an improvement we should add to Flink's RpcService. I've
created a JIRA issue for this problem [1].

[1] https://issues.apache.org/jira/browse/FLINK-23202

Cheers,
Till

On Wed, Jun 30, 2021 at 6:33 PM Lu Niu  wrote:

> Thanks Gen! cc flink-dev to collect more inputs.
>
> Best
> Lu
>
> On Wed, Jun 30, 2021 at 12:55 AM Gen Luo  wrote:
>
>> I'm also wondering here.
>>
>> In my opinion, it's because the JM can not confirm whether the TM is lost
>> or it's a temporary network trouble and will recover soon, since I can see
>> in the log that akka has got a Connection refused but JM still sends a
>> heartbeat request to the lost TM until it reaches heartbeat timeout. But
>> I'm not sure if it's indeed designed like this.
>>
>> I would really appreciate it if anyone who knows more details could
>> answer. Thanks.
>>
>


[jira] [Created] (FLINK-23202) RpcService should fail result futures if messages could not be sent

2021-07-01 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-23202:
-

 Summary: RpcService should fail result futures if messages could 
not be sent
 Key: FLINK-23202
 URL: https://issues.apache.org/jira/browse/FLINK-23202
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Coordination
Affects Versions: 1.12.4, 1.13.1
Reporter: Till Rohrmann
 Fix For: 1.14.0


The {{RpcService}} should fail result futures if messages could not be sent. 
This would speed up the failure detection mechanism because it would not rely 
on the timeout. One way to achieve this could be to listen to the dead letters 
and then send a {{Failure}} message back to the sender.
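
A sketch of the dead-letter idea in plain Akka (classic Java API; wiring it
into Flink's RpcService and the exact failure type are assumptions):

```java
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.DeadLetter;
import akka.actor.Props;
import akka.actor.Status;

/** Sketch: answer dead letters with a Status.Failure so pending asks fail fast. */
public class DeadLetterListener extends AbstractActor {

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(DeadLetter.class, deadLetter ->
                        // Fail the original ask immediately instead of
                        // letting it run into the ask timeout.
                        deadLetter.sender().tell(
                                new Status.Failure(new RuntimeException(
                                        "Could not deliver message to " + deadLetter.recipient())),
                                ActorRef.noSender()))
                .build();
    }

    public static void install(ActorSystem system) {
        ActorRef listener = system.actorOf(Props.create(DeadLetterListener.class));
        system.eventStream().subscribe(listener, DeadLetter.class);
    }
}
```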



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [DISCUSS] Feedback Collection Jira Bot

2021-06-30 Thread Till Rohrmann
 In my
> > experience,
> > > > > these
> > > > > > > > tickets are usually not closed by anyone but the bot.
> > > > > > >
> > > > > > > I agree there are such tickets, but I don't see how this is
> > > > addressing
> > > > > my
> > > > > > > concerns. There are also tickets that just shouldn't be closed
> > as I
> > > > > > > described above. Why do you think that duplicating tickets and
> > > losing
> > > > > > > discussions/knowledge is a good solution?
> > > > > > >
> > > > > > > I would like to avoid having to constantly fight against the
> bot.
> > > > It's
> > > > > > > already responsible for the majority of my daily emails, with
> > quite
> > > > > > little
> > > > > > > benefit for me personally. I initially thought that after some
> > > period
> > > > > of
> > > > > > > time it will settle down, but now I'm afraid it won't happen.
> Can
> > > we
> > > > > add
> > > > > > > some label to mark tickets to be ignored by the jira-bot?
> > > > > > >
> > > > > > > Best,
> > > > > > > Piotrek
> > > > > > >
> > > > > > > On Wed, Jun 23, 2021 at 9:40 AM Chesnay Schepler 
> > > > > > wrote:
> > > > > > >
> > > > > > > > I would like it to not unassign people if a PR is open. These
> > > > > > > > are usually blocked by the reviewer, not the assignee, and
> > > > > > > > having the assignees additionally have to update JIRA
> > > > > > > > periodically is a bit like rubbing salt into the wound.
> > > > > > > > like rubbing salt into the wound.
> > > > > > > >
> > > > > > > > On 6/23/2021 7:52 AM, Konstantin Knauf wrote:
> > > > > > > > > Hi everyone,
> > > > > > > > >
> > > > > > > > > I was hoping for more feedback from other committers, but
> > seems
> > > > > like
> > > > > > > this
> > > > > > > > > is not happening, so here's my proposal for immediate
> > changes:
> > > > > > > > >
> > > > > > > > > * Ignore tickets with a fixVersion for all rules but the
> > > > > > > > > stale-unassigned rule.
> > > > > > > > >
> > > > > > > > > * We change the time intervals as follows, accepting
> reality
> > a
> > > > bit
> > > > > > more
> > > > > > > > ;)
> > > > > > > > >
> > > > > > > > >  * stale-assigned only after 30 days (instead of 14
> days)
> > > > > > > > >  * stale-critical only after 14 days (instead of 7
> days)
> > > > > > > > >  * stale-major only after 60 days (instead of 30 days)
> > > > > > > > >
> > > > > > > > > Unless there are -1s, I'd implement the changes Monday next
> > > week.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Konstantin
> > > > > > > > >
> > > > > > > > > On Thu, Jun 17, 2021 at 2:17 PM Piotr Nowojski <
> > > > > pnowoj...@apache.org
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Hi,
> > > > > > > > >>
> > > > > > > > >> I also think that the bot is a bit too aggressive/too
> quick
> > > with
> > > > > > > > assigning
> > > > > > > > >> stale issues/deprioritizing them, but that's not that big
> > of a
> > > > > deal
> > > > > > > for
> > > > > > > > me.
> > > > > > > > >>
> > > > > > > > >> What bothers me much more is that it's closing minor
> issues

Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-30 Thread Till Rohrmann
I think you are right, Tison, that docs are a special case and only
require flink-docs to pass. What I am wondering is how much of a problem
this will be (assuming we have decent build stability). The more
exceptions we add, the harder it will be to properly follow the guidelines.
Maybe we can observe how many docs PRs get delayed/not merged because of
this and then revisit this discussion if needed.

Cheers,
Till

On Wed, Jun 30, 2021 at 8:30 AM tison  wrote:

> Hi,
>
> There are a number of PRs modifying docs only, but we still require all
> tests to pass for them.
>
> It is a good proposal that we avoid merging PRs with "unrelated" failures,
> but can we improve the case where the contributor only works on docs?
>
> For example, based on the file change set, run doc tests only.
>
> Best,
> tison.
>
>
> > On Wed, Jun 30, 2021 at 2:17 PM godfrey he  wrote:
>
> > +1 for the proposal. Thanks Xintong!
> >
> > Best,
> > Godfrey
> >
> >
> >
> > > On Wed, Jun 30, 2021 at 11:36 AM Rui Li  wrote:
> >
> > > Thanks Xintong. +1 to the proposal.
> > >
> > > On Tue, Jun 29, 2021 at 11:05 AM 刘建刚 
> wrote:
> > >
> > > > +1 for the proposal. Since the test time is long and environment may
> > > vary,
> > > > unstable tests are really annoying for developers. The solution is
> > > welcome.
> > > >
> > > > Best
> > > > liujiangang
> > > >
> > > > > On Tue, Jun 29, 2021 at 10:31 AM Jingsong Li  wrote:
> > > >
> > > > > +1 Thanks Xintong for the update!
> > > > >
> > > > > Best,
> > > > > Jingsong
> > > > >
> > > > > On Mon, Jun 28, 2021 at 6:44 PM Till Rohrmann <
> trohrm...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > +1, thanks for updating the guidelines Xintong!
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Mon, Jun 28, 2021 at 11:49 AM Yangze Guo 
> > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > Thanks Xintong for drafting this doc.
> > > > > > >
> > > > > > > Best,
> > > > > > > Yangze Guo
> > > > > > >
> > > > > > > On Mon, Jun 28, 2021 at 5:42 PM JING ZHANG <
> beyond1...@gmail.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > Thanks Xintong for giving detailed documentation.
> > > > > > > >
> > > > > > > > The best practice for handling test failure is very detailed,
> > > it's
> > > > a
> > > > > > good
> > > > > > > > guidelines document with clear action steps.
> > > > > > > >
> > > > > > > > +1 to Xintong's proposal.
> > > > > > > >
> > > > > > > > > On Mon, Jun 28, 2021 at 4:07 PM Xintong Song  wrote:
> > > > > > > >
> > > > > > > > > Thanks all for the discussion.
> > > > > > > > >
> > > > > > > > > Based on the opinions so far, I've drafted the new
> guidelines
> > > > [1],
> > > > > > as a
> > > > > > > > > potential replacement of the original wiki page [2].
> > > > > > > > >
> > > > > > > > > Hopefully this draft has covered the most opinions
> discussed
> > > and
> > > > > > > consensus
> > > > > > > > > made in this discussion thread.
> > > > > > > > >
> > > > > > > > > Looking forward to your feedback.
> > > > > > > > >
> > > > > > > > > Thank you~
> > > > > > > > >
> > > > > > > > > Xintong Song
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1uUbxbgbGErBXtmEjhwVhBWG3i6nhQ0LXs96OlntEYnU/edit?usp=sharing
> > > > > > >

Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-28 Thread Till Rohrmann
+1, thanks for updating the guidelines Xintong!

Cheers,
Till

On Mon, Jun 28, 2021 at 11:49 AM Yangze Guo  wrote:

> +1
>
> Thanks Xintong for drafting this doc.
>
> Best,
> Yangze Guo
>
> On Mon, Jun 28, 2021 at 5:42 PM JING ZHANG  wrote:
> >
> > Thanks Xintong for giving detailed documentation.
> >
> > The best practice for handling test failure is very detailed, it's a good
> > guidelines document with clear action steps.
> >
> > +1 to Xintong's proposal.
> >
> > On Mon, Jun 28, 2021 at 4:07 PM Xintong Song  wrote:
> >
> > > Thanks all for the discussion.
> > >
> > > Based on the opinions so far, I've drafted the new guidelines [1], as a
> > > potential replacement of the original wiki page [2].
> > >
> > > Hopefully this draft has covered the most opinions discussed and
> consensus
> > > made in this discussion thread.
> > >
> > > Looking forward to your feedback.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > > [1]
> > >
> > >
> https://docs.google.com/document/d/1uUbxbgbGErBXtmEjhwVhBWG3i6nhQ0LXs96OlntEYnU/edit?usp=sharing
> > >
> > > [2]
> > >
> https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests
> > >
> > >
> > >
> > > On Fri, Jun 25, 2021 at 10:40 PM Piotr Nowojski 
> > > wrote:
> > >
> > > > Thanks for the clarification Till. +1 for what you have written.
> > > >
> > > > Piotrek
> > > >
> > > > On Fri, Jun 25, 2021 at 4:00 PM Till Rohrmann 
> > > wrote:
> > > >
> > > > > One quick note for clarification. I don't have anything against
> builds
> > > > > running on your personal Azure account and this is not what I
> > > understood
> > > > > under "local environment". For me "local environment" means that
> > > someone
> > > > > runs the test locally on his machine and then says that the
> > > > > tests have passed locally.
> > > > >
> > > > > I do agree that there might be a conflict of interests if a PR
> author
> > > > > disables tests. Here I would argue that we don't have malignant
> > > > committers
> > > > > which means that every committer will probably first check the
> > > respective
> > > > > ticket for how often the test failed. Then I guess the next step
> would
> > > be
> > > > > to discuss on the ticket whether to disable it or not. And finally,
> > > after
> > > > > reaching a consensus, it will be disabled. If we see someone
> abusing
> > > this
> > > > > policy, then we can still think about how to guard against it. But,
> > > > > honestly, I have very rarely seen such a case. I am also ok to
> pull in
> > > > the
> > > > > release manager to make the final call if this resolves concerns.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Fri, Jun 25, 2021 at 9:07 AM Piotr Nowojski <
> pnowoj...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > +1 for the general idea, however I have concerns about a couple
> of
> > > > > details.
> > > > > >
> > > > > > > I would first try to not introduce the exception for local
> builds.
> > > > > > > It makes it quite hard for others to verify the build and to
> make
> > > > sure
> > > > > > that the right things were executed.
> > > > > >
> > > > > > I would counter Till's proposal to ignore local green builds. If
> > > > > committer
> > > > > > is merging and closing a PR, with official azure failure, but
> there
> > > > was a
> > > > > > green build before or in local azure it's IMO enough to leave the
> > > > > message:
> > > > > >
> > > > > > > Latest build failure is a known issue: FLINK-12345
> > > > > > > Green local build: URL
> > > > > >
> > > > > > This should address Till's concern about verification.
> > > > > >
> > > > > > On the other hand I have concerns about disabling tests.* It
> > > shouldn't
> > >

Re: Flink 1.14. Bi-weekly 2021-06-22

2021-06-28 Thread Till Rohrmann
Thanks a lot for the update, Joe. This is very helpful!

Cheers,
Till

On Mon, Jun 28, 2021 at 10:10 AM Xintong Song  wrote:

> Thanks for the update, Joe.
>
> Thank you~
>
> Xintong Song
>
>
> On Mon, Jun 28, 2021 at 3:54 PM Johannes Moser 
> wrote:
>
> > Hello,
> >
> > Last Tuesday was our second bi-weekly.
> >
> > You can read up the outcome in the confluence wiki page [1].
> >
> > *Feature freeze date*
> > As we didn't come to a clear agreement, we will keep the anticipated
> > feature freeze date
> > as it is, in early August.
> >
> > *Build stability*
> > The good thing: we decreased the number of issues, the not so good thing:
> > only by ten.
> > We as a community need to put further effort into this.
> >
> > *Dependencies*
> > We'd like to ask all contributors to have a look at the components they
> > are heavily involved with to see if any dependencies require updating.
> > There were some issues recently with passing the security scans on the
> > side of some users. In the future this should somehow be a default at
> > the beginning of every release cycle.
> >
> > *Criteria for merging PRs*
> > We want to avoid merging PRs with unrelated CI failures. We are quite
> > aware that we
> > need to raise the importance of the Docker caching issue.
> >
> > What can you do to make the Flink 1.14 release a good one:
> > * Identify and update outdated dependencies
> > * Get rid of test instabilities
> > * Don't merge PRs including unrelated CI failures
> >
> > Best,
> > Joe
> >
> >
> > [1] https://cwiki.apache.org/confluence/display/FLINK/1.14+Release
>


Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-25 Thread Till Rohrmann
One quick note for clarification. I don't have anything against builds
running on your personal Azure account and this is not what I understood
under "local environment". For me "local environment" means that someone
runs the test locally on his machine and then says that the
tests have passed locally.

I do agree that there might be a conflict of interests if a PR author
disables tests. Here I would argue that we don't have malicious committers
which means that every committer will probably first check the respective
ticket for how often the test failed. Then I guess the next step would be
to discuss on the ticket whether to disable it or not. And finally, after
reaching a consensus, it will be disabled. If we see someone abusing this
policy, then we can still think about how to guard against it. But,
honestly, I have very rarely seen such a case. I am also ok to pull in the
release manager to make the final call if this resolves concerns.

Cheers,
Till

On Fri, Jun 25, 2021 at 9:07 AM Piotr Nowojski  wrote:

> +1 for the general idea; however, I have concerns about a couple of details.
>
> > I would first try to not introduce the exception for local builds.
> > It makes it quite hard for others to verify the build and to make sure
> that the right things were executed.
>
> I would counter Till's proposal to ignore local green builds. If a committer
> is merging and closing a PR with an official Azure failure, but there was a
> green build before or in local Azure, it's IMO enough to leave the message:
>
> > Latest build failure is a known issue: FLINK-12345
> > Green local build: URL
>
> This should address Till's concern about verification.
>
> On the other hand, I have concerns about disabling tests. *It shouldn't be
> the PR author/committer that's disabling a test on his own, as that's a
> conflict of interests.* I have, however, no problems with disabling test
> instabilities that were marked as "blockers" though, that should work
> pretty well. But the important thing here is to correctly judge bumping
> priorities of test instabilities based on their frequency and current
> general health of the system. I believe that release managers should be
> playing a big role here in deciding on the guidelines of what should be a
> priority of certain test instabilities.
>
> What I mean by that is two example scenarios:
> 1. if we have a handful of very frequently failing tests and a handful of
> very rarely failing tests (like one reported failure and no other
> occurrence in many months, and let's even say that the failure looks like
> infrastructure/network timeout), we should focus on the frequently failing
> ones, and probably we are safe to ignore for the time being the rare issues
> - at least until we deal with the most pressing ones.
> 2. If we have tons of rarely failing test instabilities, we should probably
> start addressing them as blockers.
>
> I'm using my own conscience and my best judgement when I'm
> bumping/decreasing priorities of test instabilities (and bugs), but as
> individual committer I don't have the full picture. As I wrote above, I
> think release managers are in a much better position to keep adjusting
> those kind of guidelines.
>
> Best, Piotrek
>
> On Fri, Jun 25, 2021 at 8:10 AM, Yu Li wrote:
>
> > +1 for Xintong's proposal.
> >
> > For me, resolving problems directly (fixing the infrastructure issue,
> > disabling unstable tests and creating blocker JIRAs to track the fix and
> > re-enable them asap, etc.) is (in most cases) better than working around
> > them (verify locally, manually check and judge the failure as
> "unrelated",
> > etc.), and I believe the proposal could help us pushing those more "real"
> > solutions forward.
> >
> > Best Regards,
> > Yu
> >
> >
> > On Fri, 25 Jun 2021 at 10:58, Yangze Guo  wrote:
> >
> > > Creating a blocker issue for the manually disabled tests sounds good to
> > me.
> > >
> > > Minor: I'm still a bit worried that commits merged before we fix
> > > the unstable tests can also break those tests. Instead of letting the
> > > assigners keep an eye on all potentially related commits, they can
> > > maintain a branch that is periodically synced with the master branch
> > > while enabling the unstable test, so that they can catch breaking
> > > changes asap.
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Thu, Jun 24, 2021 at 9:52 PM Till Rohrmann 
> > > wrote:
> > > >
> > > > I like the idea of creating a blocker issue for 

Re: [DISCUSS] Releasing Flink 1.12.5

2021-06-24 Thread Till Rohrmann
Thanks for volunteering as our release manager, Jingsong. From my
perspective, we are good to go for the 1.12.5 release.

Cheers,
Till

On Thu, Jun 24, 2021 at 11:53 AM Jingsong Li  wrote:

> Dear devs,
>
> As discussed in [1], I would volunteer as the release manager for 1.12.5
> and kick off the release process next Wednesday (June 30th).
>
> I went through 1.12.5 issues [2], all blockers have been fixed.
>
> If you think there is any blocker in 1.12.5, please reply to this thread at
> any time.
>
> What do you think?
>
> [1]
>
> https://lists.apache.org/thread.html/r40a541027c6a04519f37c61f2a6f3dabdb821b3760cda9cc6ebe6ce9%40%3Cdev.flink.apache.org%3E
> [2]
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20fixVersion%20%3D%201.12.5%20ORDER%20BY%20priority%20DESC
>
> Best,
> Jingsong
>


Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-24 Thread Till Rohrmann
r looking for assignees for
> such
> > > issues. If a case is still not fixed soonish, even with all these
> > efforts,
> > > I'm not sure how setting a deadline can help this.
> > >
> > > 4. If we disable the respective tests temporarily, we also need a
> > mechanism
> > > > to ensure the issue would be continued to be investigated in the
> > future.
> > > >
> > >
> > > +1. As mentioned above, we may consider disabling such tests iff
> someone
> > is
> > > actively working on it.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Wed, Jun 23, 2021 at 9:56 PM JING ZHANG 
> wrote:
> > >
> > > > Hi Xintong,
> > > > +1 to the proposal.
> > > > In order to better comply with the rule, it is necessary to describe
> > > > the best practice when encountering a test failure that seems unrelated
> > > > to the current commits.
> > > > How can we avoid merging PRs with test failures without blocking code
> > > > merging for a long time?
> > > > I tried to think about the possible steps, and found there are some
> > > > detailed problems that need to be discussed a step further:
> > > > 1. Report the test failures in the JIRA.
> > > > 2. Set a deadline for finding the root cause and fixing the failure for
> > > > the newly created JIRA, because we cannot block other commit merges for
> > > > a long time.
> > > > When is a reasonable deadline here?
> > > > 3. What to do if the JIRA has not made significant progress when
> > > > the deadline is reached?
> > > > There are several situations as follows, maybe different cases
> need
> > > > different approaches.
> > > > 1. the JIRA is non-assigned yet
> > > > 2. not found the root cause yet
> > > > 3. not found a good solution, but already found the root cause
> > > > 4. found a solution, but it needs more time to be done.
> > > > 4. If we disable the respective tests temporarily, we also need a
> > > mechanism
> > > > to ensure the issue would be continued to be investigated in the
> > future.
> > > >
> > > > Best regards,
> > > > JING ZHANG
> > > >
> > > > Stephan Ewen  于2021年6月23日周三 下午8:16写道:
> > > >
> > > > > +1 to Xintong's proposal
> > > > >
> > > > > On Wed, Jun 23, 2021 at 1:53 PM Till Rohrmann <
> trohrm...@apache.org>
> > > > > wrote:
> > > > >
> > > > > > I would first try to not introduce the exception for local
> builds.
> > It
> > > > > makes
> > > > > > it quite hard for others to verify the build and to make sure
> that
> > > the
> > > > > > right things were executed. If we see that this becomes an issue
> > then
> > > > we
> > > > > > can revisit this idea.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Wed, Jun 23, 2021 at 4:19 AM Yangze Guo 
> > > wrote:
> > > > > >
> > > > > > > +1 for appending this to community guidelines for merging PRs.
> > > > > > >
> > > > > > > @Till Rohrmann
> > > > > > > I agree that with this approach unstable tests will not block
> > other
> > > > > > > commit merges. However, it might be hard to prevent merging
> > commits
> > > > > > > that are related to those tests and should have been passed
> them.
> > > > It's
> > > > > > > true that this judgment can be made by the committers, but no
> one
> > > can
> > > > > > > ensure the judgment is always precise and so that we have this
> > > > > > > discussion thread.
> > > > > > >
> > > > > > > Regarding the unstable tests, how about adding another
> exception:
> > > > > > > committers verify it in their local environment and comment in
> > such
> > > > > > > cases?
> > > > > > >
> > > > > > > Bes

Re: Change in accumulators semantics with jobClient

2021-06-23 Thread Till Rohrmann
Yes, it should be part of the release notes where this change was
introduced. I'll take a look at your PR. Thanks a lot Etienne.

Cheers,
Till

On Wed, Jun 23, 2021 at 12:29 PM Etienne Chauchot 
wrote:

> Hi Till,
>
> Of course I can update the release notes.
>
> Question is: this change is quite old (January), it is already available
> in all the maintained releases: 1.11, 1.12, 1.13.
>
> I think I should update the release notes for all these versions no ?
>
> In case you agree, I took the liberty to update all these release notes
> in a PR: https://github.com/apache/flink/pull/16256
>
> Cheers,
>
> Etienne
>
> On 21/06/2021 11:39, Till Rohrmann wrote:
> > Thanks for bringing this to the dev ML Etienne. Could you maybe update
> the
> > release notes for Flink 1.13 [1] to include this change? That way it
> might
> > be a bit more prominent. I think the change needs to go into the
> > release-1.13 and master branch.
> >
> > [1]
> >
> https://github.com/apache/flink/blob/master/docs/content/release-notes/flink-1.13.md
> >
> > Cheers,
> > Till
> >
> >
> > On Fri, Jun 18, 2021 at 2:45 PM Etienne Chauchot 
> > wrote:
> >
> >> Hi all,
> >>
> >> I did a fix some time ago regarding accumulators:
> >> the/JobClient.getAccumulators()/ was infinitely  blocking in local
> >> environment for a streaming job (1). The change (2) consisted of giving
> >> the current accumulators value for the running job. And when fixing this
> >> in the PR, it appeared that I had to change the accumulators semantics
> >> with /JobClient/ and I just realized that I forgot to bring this back to
> >> the ML:
> >>
> >> Previously /JobClient/ assumed that getAccumulator() was called on a
> >> bounded pipeline and that the user wanted to acquire the *final
> >> accumulator values* after the job is finished.
> >>
> >> But now it returns the *current value of accumulators* immediately to be
> >> compatible with unbounded pipelines.
> >>
> >> If it is run on a bounded pipeline, then to get the final accumulator
> >> values after the job is finished, one needs to call
> >>
> >>
> /getJobExecutionResult().thenApply(JobExecutionResult::getAllAccumulatorResults)/
> >>
> >> (1): https://issues.apache.org/jira/browse/FLINK-18685
> >>
> >> (2): https://github.com/apache/flink/pull/14558#
> >>
> >>
> >> Cheers,
> >>
> >> Etienne
> >>
> >>
>
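
To make the two access patterns above concrete, here is a minimal Java sketch.
The surrounding class and the job-submission step are illustrative, and the
exact JobClient signatures vary slightly across the releases mentioned (older
versions take a ClassLoader argument):

    import java.util.Map;
    import org.apache.flink.api.common.JobExecutionResult;
    import org.apache.flink.core.execution.JobClient;

    public class AccumulatorAccess {
        // 'jobClient' is assumed to come from StreamExecutionEnvironment#executeAsync().
        static void printAccumulators(JobClient jobClient) throws Exception {
            // Current values, returned immediately even for an unbounded job:
            Map<String, Object> current = jobClient.getAccumulators().get();
            System.out.println("current: " + current);

            // Final values of a bounded job, available once the job has finished:
            Map<String, Object> finals = jobClient
                    .getJobExecutionResult()
                    .thenApply(JobExecutionResult::getAllAccumulatorResults)
                    .get();
            System.out.println("final: " + finals);
        }
    }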


Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-23 Thread Till Rohrmann
I would first try to not introduce the exception for local builds. It makes
it quite hard for others to verify the build and to make sure that the
right things were executed. If we see that this becomes an issue then we
can revisit this idea.

Cheers,
Till

On Wed, Jun 23, 2021 at 4:19 AM Yangze Guo  wrote:

> +1 for appending this to community guidelines for merging PRs.
>
> @Till Rohrmann
> I agree that with this approach unstable tests will not block other
> commit merges. However, it might be hard to prevent merging commits
> that are related to those tests and should have passed them. It's
> true that this judgment can be made by the committers, but no one can
> ensure the judgment is always precise, which is why we have this
> discussion thread.
>
> Regarding the unstable tests, how about adding another exception:
> committers verify it in their local environment and comment in such
> cases?
>
> Best,
> Yangze Guo
>
> On Tue, Jun 22, 2021 at 8:23 PM 刘建刚  wrote:
> >
> > It is a good principle to run all tests successfully with any change.
> This
> > means a lot for project's stability and development. I am big +1 for this
> > proposal.
> >
> > Best
> > liujiangang
> >
> > > On Tue, Jun 22, 2021 at 6:36 PM, Till Rohrmann wrote:
> >
> > > One way to address the problem of regularly failing tests that block
> > > merging of PRs is to disable the respective tests for the time being.
> Of
> > > course, the failing test then needs to be fixed. But at least that way
> we
> > > would not block everyone from making progress.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Tue, Jun 22, 2021 at 12:00 PM Arvid Heise  wrote:
> > >
> > > > I think this is overall a good idea. So +1 from my side.
> > > > However, I'd like to put a higher priority on infrastructure then, in
> > > > particular docker image/artifact caches.
> > > >
> > > > On Tue, Jun 22, 2021 at 11:50 AM Till Rohrmann  >
> > > > wrote:
> > > >
> > > > > Thanks for bringing this topic to our attention Xintong. I think
> your
> > > > > proposal makes a lot of sense and we should follow it. It will
> give us
> > > > > confidence that our changes are working and it might be a good
> > > incentive
> > > > to
> > > > > quickly fix build instabilities. Hence, +1.
> > > > >
> > > > > Cheers,
> > > > > Till
> > > > >
> > > > > On Tue, Jun 22, 2021 at 11:12 AM Xintong Song <
> tonysong...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > In the past a couple of weeks, I've observed several times that
> PRs
> > > are
> > > > > > merged without a green light from the CI tests, where failure
> cases
> > > are
> > > > > > considered *unrelated*. This may not always cause problems, but
> would
> > > > > > increase the chance of breaking our code base. In fact, it has
> > > occurred
> > > > > to
> > > > > > me twice in the past few weeks that I had to revert a commit
> which
> > > > breaks
> > > > > > the master branch due to this.
> > > > > >
> > > > > > I think it would be nicer to enforce a stricter rule, that no PRs
> > > > should
> > > > > be
> > > > > > merged without passing CI.
> > > > > >
> > > > > > The problems of merging PRs with "unrelated" test failures are:
> > > > > > - It's not always straightforward to tell whether a test
> failures are
> > > > > > related or not.
> > > > > > - It prevents subsequent test cases from being executed, which
> may
> > > fail
> > > > > > relating to the PR changes.
> > > > > >
> > > > > > To make things easier for the committers, the following
> exceptions
> > > > might
> > > > > be
> > > > > > considered acceptable.
> > > > > > - The PR has passed CI in the contributor's personal workspace.
> > > Please
> > > > > post
> > > > > > the link in such cases.
> > > > > > - The CI tests have been triggered multiple times, on the same
> > > commit,
> > > > > and
> > > > > > each stage has at least passed for once. Please also comment in
> such
> > > > > cases.
> > > > > >
> > > > > > If we all agree on this, I'd update the community guidelines for
> > > > merging
> > > > > > PRs wrt. this proposal. [1]
> > > > > >
> > > > > > Please let me know what do you think.
> > > > > >
> > > > > > Thank you~
> > > > > >
> > > > > > Xintong Song
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests
> > > > > >
> > > > >
> > > >
> > >
>


Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-22 Thread Till Rohrmann
One way to address the problem of regularly failing tests that block
merging of PRs is to disable the respective tests for the time being. Of
course, the failing test then needs to be fixed. But at least that way we
would not block everyone from making progress.

Cheers,
Till
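
For illustration, temporarily disabling a test is a one-line change plus a
tracking ticket. A JUnit 4 sketch of what that looks like (the class name and
ticket number below are placeholders, not a real test or issue):

    import org.junit.Ignore;
    import org.junit.Test;

    public class SomeUnstableITCase {

        // Disabled for now; fixing and re-enabling is tracked in the
        // (hypothetical) blocker ticket named in the annotation.
        @Ignore("FLINK-99999: unstable on CI, see ticket before re-enabling")
        @Test
        public void testOccasionallyFailingBehavior() {
            // test body left unchanged
        }
    }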

On Tue, Jun 22, 2021 at 12:00 PM Arvid Heise  wrote:

> I think this is overall a good idea. So +1 from my side.
> However, I'd like to put a higher priority on infrastructure then, in
> particular docker image/artifact caches.
>
> On Tue, Jun 22, 2021 at 11:50 AM Till Rohrmann 
> wrote:
>
> > Thanks for bringing this topic to our attention Xintong. I think your
> > proposal makes a lot of sense and we should follow it. It will give us
> > confidence that our changes are working and it might be a good incentive
> to
> > quickly fix build instabilities. Hence, +1.
> >
> > Cheers,
> > Till
> >
> > On Tue, Jun 22, 2021 at 11:12 AM Xintong Song 
> > wrote:
> >
> > > Hi everyone,
> > >
> > > In the past a couple of weeks, I've observed several times that PRs are
> > > merged without a green light from the CI tests, where failure cases are
> > > considered *unrelated*. This may not always cause problems, but would
> > > increase the chance of breaking our code base. In fact, it has occurred
> > to
> > > me twice in the past few weeks that I had to revert a commit which
> breaks
> > > the master branch due to this.
> > >
> > > I think it would be nicer to enforce a stricter rule, that no PRs
> should
> > be
> > > merged without passing CI.
> > >
> > > The problems of merging PRs with "unrelated" test failures are:
> > > - It's not always straightforward to tell whether a test failures are
> > > related or not.
> > > - It prevents subsequent test cases from being executed, which may fail
> > > relating to the PR changes.
> > >
> > > To make things easier for the committers, the following exceptions
> might
> > be
> > > considered acceptable.
> > > - The PR has passed CI in the contributor's personal workspace. Please
> > post
> > > the link in such cases.
> > > - The CI tests have been triggered multiple times, on the same commit,
> > and
> > > each stage has at least passed for once. Please also comment in such
> > cases.
> > >
> > > If we all agree on this, I'd update the community guidelines for
> merging
> > > PRs wrt. this proposal. [1]
> > >
> > > Please let me know what do you think.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > > [1]
> > >
> https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests
> > >
> >
>


Re: [DISCUSS] Dashboard/HistoryServer authentication

2021-06-22 Thread Till Rohrmann
 nor is it a bad idea, but is not one that will make
> Flink
> > > more compatible with the modern solutions to this problem.
> > >
> > > Best,
> > > Austin
> > >
> > > On Mon, Jun 21, 2021 at 2:18 PM Márton Balassi <
> balassi.mar...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi team,
> > > >
> > > > Thank you for your input. Based on this discussion I agree with G
> that
> > > > selecting and standardizing on a specific strong authentication
> > mechanism
> > > > is more challenging than the whole rest of the scope of this
> > > authentication
> > > > story. :-) I suggest that G and I go back to the drawing board and
> come
> > > up
> > > > with an API that can support multiple authentication mechanisms, and
> we
> > > > would only merge said API to Flink. Specific implementations of it
> can
> > be
> > > > maintained outside of the project. This way we tackle the main
> > challenge
> > > in
> > > > a truly minimal way.
> > > >
> > > > Best,
> > > > Marton
> > > >
> > > > On Mon, Jun 21, 2021 at 4:18 PM Gabor Somogyi <
> > gabor.g.somo...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > We see that adding any kind of specific authentication raises more
> > > > > questions than answers.
> > > > > What if a generic API were added without any real
> > > > > authentication logic?
> > > > > That way every provider could add its own protocol implementation as an
> > > > > additional jar.
> > > > >
> > > > > BR,
> > > > > G
> > > > >
> > > > >
> > > > > On Thu, Jun 17, 2021 at 7:53 PM Austin Cawley-Edwards <
> > > > > austin.caw...@gmail.com> wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> Sorry to be joining the conversation late. I'm also on the side of
> > > > >> Konstantin, generally, in that this seems to not be a core goal of
> > > Flink
> > > > >> as
> > > > >> a project and adds a maintenance burden.
> > > > >>
> > > > >> Would another con of Kerberos be that it is likely a fading project
> > > > >> in terms of network security? (serious question, please correct me
> > > > >> if there is reason to believe it is gaining adoption)
> > > > >>
> > > > >> The point about Kerberos being independent of infrastructure is a
> > good
> > > > one
> > > > >> but is something that is also solved by modern sidecar proxies +
> > > service
> > > > >> meshes that can run across Kubernetes and bare-metal. These
> > solutions
> > > > also
> > > > >> handle certificate provisioning, rotation, etc. in addition to
> > > > >> higher-level
> > > > >> authorization policies. Some examples of projects with this
> > "universal
> > > > >> infrastructure support" are Kuma[1] (CNCF Sandbox, I'm a
> maintainer)
> > > and
> > > > >> Istio[2] (Google).
> > > > >>
> > > > >> Wondering out loud: has anyone tried to run Flink on top of
> > cilium[3],
> > > > >> which also provides zero-trust networking at the kernel level
> > without
> > > > >> needing to instrument applications? This currently only runs on
> > > > Kubernetes
> > > > >> on Linux, so that's a major limitation, but solves many of the
> > request
> > > > >> forging concerns at all levels.
> > > > >>
> > > > >> Thanks,
> > > > >> Austin
> > > > >>
> > > > >> [1]: https://kuma.io/docs/1.1.6/quickstart/universal/
> > > > >> [2]: https://istio.io/latest/docs/setup/install/virtual-machine/
> > > > >> [3]: https://cilium.io/
> > > > >>
> > > > >> On Thu, Jun 17, 2021 at 1:50 PM Till Rohrmann <
> trohrm...@apache.org
> > >
> > > > >> wrote:
> > > > >>
> > > > >> > I left some comments in the Googl
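
The "generic API without any real authentication logic" idea floated above
could take the shape of a small service-provider interface that provider jars
implement and that is discovered at runtime. The sketch below is purely
hypothetical; names and method shapes are illustrative, not an agreed design:

    import java.util.Map;
    import java.util.ServiceLoader;

    // Hypothetical SPI: a provider jar implements this interface and registers
    // it via META-INF/services, so Flink itself would ship no concrete
    // authentication logic at all.
    interface RestAuthenticationProvider {
        // Returns true if the request headers carry valid credentials.
        boolean authenticate(Map<String, String> httpHeaders);
    }

    final class AuthProviderLookup {
        // Discovers whatever provider implementations are on the classpath.
        static Iterable<RestAuthenticationProvider> discover() {
            return ServiceLoader.load(RestAuthenticationProvider.class);
        }
    }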

Re: [DISCUSS] Releasing Flink 1.11.4, 1.12.5, 1.13.2

2021-06-22 Thread Till Rohrmann
Thanks a lot, Yun, Godfrey, and Jingsong for being our release managers!
Creating these bug-fix releases will be super helpful for our users.

Cheers,
Till

On Tue, Jun 22, 2021 at 9:59 AM Piotr Nowojski  wrote:

> Thanks for volunteering.
>
> A quick update about FLINK-23011. It turned out to be only an issue in the
> master, so we don't need to block bug fix releases on this issue.
>
> Best,
> Piotrek
>
> wt., 22 cze 2021 o 05:20 Xintong Song  napisał(a):
>
> > Thanks Dawid for starting the discussion, and thanks Yun, Godfrey and
> > Jingsong for volunteering as release managers.
> >
> >
> > +1 for the releases, and +1 for the release managers.
> >
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Tue, Jun 22, 2021 at 10:15 AM Jingsong Li 
> > wrote:
> >
> > > +1 to the release.
> > >
> > > Thanks Dawid for driving this discussion.
> > >
> > > I am willing to volunteer as the release manager too.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Tue, Jun 22, 2021 at 9:58 AM godfrey he 
> wrote:
> > >
> > > > Thanks for driving this, Dawid. +1 to the release.
> > > > I would also like to volunteer as the release manager for these
> > versions.
> > > >
> > > >
> > > > Best
> > > > Godfrey
> > > >
> > > > Seth Wiesman  于2021年6月22日周二 上午8:39写道:
> > > >
> > > > > +1 to the release.
> > > > >
> > > > > It would be great if we could get FLINK-23073 into 1.13.2. There's
> > > > already
> > > > > an open PR and it unblocks upgrading the table API walkthrough in
> > > > > apache/flink-playgrounds to 1.13.
> > > > >
> > > > > Seth
> > > > >
> > > > > On Mon, Jun 21, 2021 at 6:28 AM Yun Tang  wrote:
> > > > >
> > > > > > Hi Dawid,
> > > > > >
> > > > > > Thanks for driving this discussion, I am willing to volunteer as
> > the
> > > > > > release manager for these versions.
> > > > > >
> > > > > >
> > > > > > Best
> > > > > > Yun Tang
> > > > > > 
> > > > > > From: Konstantin Knauf 
> > > > > > Sent: Friday, June 18, 2021 22:35
> > > > > > To: dev 
> > > > > > Subject: Re: [DISCUSS] Releasing Flink 1.11.4, 1.12.5, 1.13.2
> > > > > >
> > > > > > Hi Dawid,
> > > > > >
> > > > > > Thank you for starting the discussion. I'd like to add
> > > > > > https://issues.apache.org/jira/browse/FLINK-23025 to the list
> for
> > > > Flink
> > > > > > 1.13.2.
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Konstantin
> > > > > >
> > > > > > On Fri, Jun 18, 2021 at 3:26 PM Dawid Wysakowicz <
> > > > dwysakow...@apache.org
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi devs,
> > > > > > >
> > > > > > > Quite recently we pushed, in our opinion, quite an important
> > fix[1]
> > > > for
> > > > > > > unaligned checkpoints which disables UC for broadcast
> > partitioning.
> > > > > > > Without the fix there might be some broadcast state corruption.
> > > > > > > Therefore we think it would be beneficial to release it
> soonish.
> > > What
> > > > > do
> > > > > > > you think? Do you have other issues in mind you'd like to have
> > > > included
> > > > > > > in these versions.
> > > > > > >
> > > > > > > Would someone be willing to volunteer to help with the releases
> > as
> > > a
> > > > > > > release manager? I guess there is a couple of spots to fill in
> > here
> > > > ;)
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Dawid
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-22815
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Konstantin Knauf
> > > > > >
> > > > > > https://twitter.com/snntrable
> > > > > >
> > > > > > https://github.com/knaufk
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
>


Re: [DISCUSS] Do not merge PRs with "unrelated" test failures.

2021-06-22 Thread Till Rohrmann
Thanks for bringing this topic to our attention Xintong. I think your
proposal makes a lot of sense and we should follow it. It will give us
confidence that our changes are working and it might be a good incentive to
quickly fix build instabilities. Hence, +1.

Cheers,
Till

On Tue, Jun 22, 2021 at 11:12 AM Xintong Song  wrote:

> Hi everyone,
>
> In the past couple of weeks, I've observed several times that PRs are
> merged without a green light from the CI tests, where failure cases are
> considered *unrelated*. This may not always cause problems, but would
> increase the chance of breaking our code base. In fact, it has occurred to
> me twice in the past few weeks that I had to revert a commit which breaks
> the master branch due to this.
>
> I think it would be nicer to enforce a stricter rule, that no PRs should be
> merged without passing CI.
>
> The problems of merging PRs with "unrelated" test failures are:
> - It's not always straightforward to tell whether test failures are
> related or not.
> - It prevents subsequent test cases from being executed, and those may fail
> due to the PR changes.
>
> To make things easier for the committers, the following exceptions might be
> considered acceptable.
> - The PR has passed CI in the contributor's personal workspace. Please post
> the link in such cases.
> - The CI tests have been triggered multiple times on the same commit, and
> each stage has passed at least once. Please also comment in such cases.
>
> If we all agree on this, I'd update the community guidelines for merging
> PRs wrt. this proposal. [1]
>
> Please let me know what you think.
>
> Thank you~
>
> Xintong Song
>
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/Merging+Pull+Requests
>


Re: [DISCUSS] Drop Mesos in 1.14

2021-06-22 Thread Till Rohrmann
I'd be ok with dropping support for Mesos if it helps us to clear our
dependencies in the flink-runtime module. If we do it, then we should
probably update our documentation with a pointer to the latest Flink
version that supports Mesos in case users strictly need Mesos.

Cheers,
Till

On Tue, Jun 22, 2021 at 10:29 AM Chesnay Schepler 
wrote:

> Last week I spent some time looking into making flink-runtime scala
> free, which effectively means to move the Akka-reliant classes to
> another module, and load that module along with Akka and all of its
> dependencies (including Scala) through a separate classloader.
>
> This would finally decouple the Scala versions required by the runtime
> and API, and would allow us to upgrade Akka as we'd no longer be limited
> to Scala 2.11. It would rid the classpath of a few dependencies, and
> remove the need for scala suffixes on quite a few modules.
>
> However, our Mesos support has unfortunately a hard dependency on Akka,
> which naturally does not play well with the goal of isolating Akka in
> its own ClassLoader.
>
> To solve this issue I was thinking of simply dropping flink-mesos in
> 1.14 (it was deprecated in 1.13).
>
> Truth be told, I picked this option because it is the easiest to do. We
> _could_ probably make things work somehow (likely by shipping a second
> Akka version just for flink-mesos), but it doesn't seem worth the hassle
> and would void some of the benefits. So far we kept flink-mesos around,
> despite not really developing it further, because it didn't hurt to still
> have it in Flink, but this has now changed.
>
> Please tell me what you think.
>
>
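
A rough Java sketch of the isolation mechanism being described: the
Akka-reliant classes move into their own jars and are only reached through a
dedicated class loader, while the main classpath keeps only Akka-free
interfaces. Everything here illustrates the mechanism, not the actual module
layout:

    import java.net.URL;
    import java.net.URLClassLoader;

    final class IsolatedRpcLoading {
        // Loads the Akka-based implementation through a dedicated class loader,
        // so Akka and its Scala dependencies never appear on the main classpath.
        // 'parent' only exposes the shared, Akka-free interfaces.
        static Object loadRpcService(URL[] akkaModuleJars, ClassLoader parent,
                                     String implClassName) throws Exception {
            ClassLoader akkaLoader = new URLClassLoader(akkaModuleJars, parent);
            return Class.forName(implClassName, true, akkaLoader)
                    .getDeclaredConstructor()
                    .newInstance();
        }
    }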


Re: [DISCUSS] FLIP-171: Async Sink

2021-06-22 Thread Till Rohrmann
Adding the InterruptException to the write method would make it explicit
that the write call can block but must react to interruptions (e.g. when
Flink wants to cancel the operation). I think this makes the contract a bit
clearer.

I think starting simple and then extending the API as we see the need is a
good idea.

Cheers,
Till
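
A minimal sketch of that contract: a write that blocks on a bounded in-flight
queue (which is how backpressure reaches the task thread) but reacts to
interruption by letting the exception bubble up. Class and field names are
illustrative, not the actual FLIP-143/FLIP-171 API:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class QueueingAsyncWriter<T> {
        // Bounded buffer of elements waiting for an async request slot.
        private final BlockingQueue<T> inFlight = new ArrayBlockingQueue<>(1_000);

        // Blocks when the buffer is full; an interrupt (e.g. on cancellation)
        // surfaces as InterruptedException instead of being swallowed.
        public void write(T element) throws InterruptedException {
            inFlight.put(element);
        }
    }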

On Tue, Jun 22, 2021 at 11:20 AM Hausmann, Steffen 
wrote:

> Hey,
>
> Agreed on starting with a blocking `write`. I've adapted the FLIP
> accordingly.
>
> For now I've chosen to add the `InterruptedException` to the `write`
> method signature, as I don't fully understand the implications of
> swallowing the exception. Depending on the details of the code that is
> calling the write method, it may cause event loss. But this seems more of
> an implementation detail, that we can revisit once we are actually
> implementing the sink.
>
> Unless there are additional comments, does it make sense to start the
> voting process in the next day or two?
>
> Cheers, Steffen
>
>
> On 21.06.21, 14:51, "Piotr Nowojski"  wrote:
>
> CAUTION: This email originated from outside of the organization. Do
> not click links or open attachments unless you can confirm the sender and
> know the content is safe.
>
>
>
> Hi,
>
> Thanks Steffen for the explanations. I think it makes sense to me.
>
> Re Arvid/Steffen:
>
> - Keep in mind that even if we choose to provide a non-blocking API
> using
> the `isAvailable()`/`getAvailableFuture()` method, we would still need
> to
> support blocking inside the sinks. For example at the very least,
> emitting
> many records at once (`flatMap`) or firing timers are scenarios when
> output
> availability would be ignored at the moment by the runtime. Also I
> would
> imagine writing very large (like 1GB) records would be blocking on
> something as well.
> - Secondly, exposing availability to the API level might not be that
> easy/trivial. The availability pattern as defined in
> `AvailabilityProvider`
> class is quite complicated and not that easy to implement by a user.
>
> Given both of those, combined with the lack of a clear motivation for adding
> `AvailabilityProvider` to the sinks/operators/functions, I would vote on
> just starting with blocking `write` calls. This can always be extended in
> the future with availability if needed/motivated properly.
>
> That would be aligned with either Arvid's option 1 or 2. I don't know
> what the best practices with `InterruptedException` are, but I'm always
> afraid of it, so I would feel personally safer with option 2.
>
> I'm not sure what problem option 3 is helping to solve? Adding
> `wakeUp()`
> would sound strange to me.
>
> Best,
> Piotrek
>
> pon., 21 cze 2021 o 12:15 Arvid Heise  napisał(a):
>
> > Hi Piotr,
> >
> > to pick up this discussion thread again:
> > - This FLIP is about providing some base implementation for FLIP-143
> sinks
> > that make adding new implementations easier, similar to the
> > SourceReaderBase.
> > - The whole availability topic will most likely be a separate FLIP.
> The
> > basic issue just popped up here because we currently have no way to
> signal
> > backpressure in sinks except by blocking `write`. This feels quite
> natural
> > in sinks with sync communication but quite unnatural in async sinks.
> >
> > Now we have a couple of options. In all cases, we would have some WIP
> > limit on the number of records/requests being able to be processed in
> > parallel asynchronously (similar to asyncIO).
> > 1. We use some blocking queue in `write`, then we need to handle
> > interruptions. In the easiest case, we extend `write` to throw the
> > `InterruptedException`, which is a small API change.
> > 2. We use a blocking queue, but handle interrupts and
> swallow/translate
> > them. No API change.
> > Both solutions block the task thread, so any RPC message / unaligned
> > checkpoint would be processed only after the backpressure is
> temporarily
> > lifted. That's similar to the discussions that you linked.
> Cancellation may
> > also be a tad harder on 2.
> > 3. We could also add some `wakeUp` to the `SinkWriter` similar to
> > `SplitFetcher` [1]. Basically, you use a normal queue with a
> completeable
> > future on which you block. Wakeup would be a clean way to complete
> it next
> > to the natural completion through finished requests.
> > 4. We add availability to the sink. However, this API change also
> requires
> > that we allow operators to be available so it may be a bigger change
> with
> > undesired side-effects. On the other hand, we could also use the same
> > mechanism for asyncIO.
> >
> > For users of FLIP-171, none of the options are exposed. So we could
> also
> > start with a simple solution (add `InterruptedException`) and later
> try to
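
For reference, the availability pattern referred to in this thread boils down
to gating further input on a future. A stripped-down illustration (the real
AvailabilityProvider helper also handles the re-arming races this sketch
glosses over):

    import java.util.concurrent.CompletableFuture;

    public class AvailabilityGate {
        // Completed future == available; a fresh, incomplete future == blocked.
        private volatile CompletableFuture<Void> available =
                CompletableFuture.completedFuture(null);

        // The caller checks (or waits on) this before pushing more records.
        public CompletableFuture<Void> getAvailableFuture() {
            return available;
        }

        void block()   { available = new CompletableFuture<>(); }  // in-flight limit hit
        void unblock() { available.complete(null); }                // a request finished
    }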

Re: Re: [DISCUSS] FLIP-169: DataStream API for Fine-Grained Resource Requirements

2021-06-21 Thread Till Rohrmann
I would be more in favor of what Zhu Zhu proposed to throw an exception
with a meaningful and understandable explanation that also includes how to
resolve this problem. I do understand the reasoning behind automatically
switching the edge types in order to make things easier to use but a) this
can also be confusing if the user does not expect this to happen and b) it
can add some complexity which makes other feature development harder in the
future because users might rely on it. An example of such a case I stumbled
upon rather recently is that we adjust the maximum parallelism wrt the
given savepoint if it has not been explicitly configured. On paper this
sounds like a good usability improvement; however, for the
AdaptiveScheduler it posed quite an annoying complexity. If, instead, we said
that we fail the job submission if the max parallelism does not equal the
max parallelism of the savepoint, it would have been a lot easier.

Cheers,
Till

On Mon, Jun 21, 2021 at 9:36 AM Yangze Guo  wrote:

> Thanks, I append it to the known limitations of this FLIP.
>
> Best,
> Yangze Guo
>
> On Mon, Jun 21, 2021 at 3:20 PM Zhu Zhu  wrote:
> >
> > Thanks for the quick response Yangze.
> > The proposal sounds good to me.
> >
> > Thanks,
> > Zhu
> >
> > On Mon, Jun 21, 2021 at 3:01 PM, Yangze Guo wrote:
> >>
> >> Thanks for the comments, Zhu!
> >>
> >> Yes, it is a known limitation for fine-grained resource management. We
> >> also have filed this issue in FLINK-20865 when we proposed FLIP-156.
> >>
> >> As a first step, I agree that we can mark batch jobs with PIPELINED
> >> edges as an invalid case for this feature. However, just throwing an
> >> exception, in that case, might confuse users who do not understand the
> >> concept of pipeline region. Maybe we can force all the edges in this
> >> scenario to BLOCKING in compiling stage and well document it. So that,
> >> common users will not be interrupted while the expert users can
> >> understand the cost of that usage and make their decision. WDYT?
> >>
> >> Best,
> >> Yangze Guo
> >>
> >> On Mon, Jun 21, 2021 at 2:24 PM Zhu Zhu  wrote:
> >> >
> >> > Thanks for proposing this @Yangze Guo and sorry for joining the
> discussion so late.
> >> > The proposal generally looks good to me. But I find one problem that
> batch job with PIPELINED edges might hang if enabling fine-grained
> resources. see "Resource Deadlocks could still happen in certain Cases"
> section in
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-119+Pipelined+Region+Scheduling
> >> > However, this problem may happen only in batch cases with PIPELINED
> edges, because
> >> > 1. streaming jobs would always require all resource requirements to
> be fulfilled at the same time.
> >> > 2. batch jobs without PIPELINED edges consist of multiple single
> vertex regions and thus each slot can be individually used and returned
> >> > So maybe in the first step, let's mark batch jobs with PIPELINED
> edges as an invalid case for fine-grained resources and throw exception for
> it in early compiling stage?
> >> >
> >> > Thanks,
> >> > Zhu
> >> >
> >> > On Tue, Jun 15, 2021 at 4:57 PM, Yangze Guo wrote:
> >> >>
> >> >> Thanks for the supplement, Arvid and Yun. I've annotated these two
> >> >> points in the FLIP.
> >> >> The vote is now started in [1].
> >> >>
> >> >> [1]
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/VOTE-FLIP-169-DataStream-API-for-Fine-Grained-Resource-Requirements-td51381.html
> >> >>
> >> >> Best,
> >> >> Yangze Guo
> >> >>
> >> >> On Fri, Jun 11, 2021 at 2:50 PM Yun Gao 
> wrote:
> >> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Many thanks @Yangze for bringing up this discussion. Overall +1 for
> >> >> > exposing the fine-grained resource requirements in the DataStream
> API.
> >> >> >
> >> >> > One similar issue, as Arvid has pointed out, is that users may also
> >> >> > create different SlotSharingGroup objects with different names but with
> >> >> > different resources. We might need to do some check internally. But we
> >> >> > could also leave that for the development of the actual PR.
> >> >> >
> >> >> > Best,
> >> >> > Yun
> >> >> >
> >> >> >
> >> >> >
> >> >> >  --Original Mail --
> >> >> > Sender:Arvid Heise 
> >> >> > Send Date:Thu Jun 10 15:33:37 2021
> >> >> > Recipients:dev 
> >> >> > Subject:Re: [DISCUSS] FLIP-169: DataStream API for Fine-Grained
> Resource Requirements
> >> >> > Hi Yangze,
> >> >> >
> >> >> >
> >> >> >
> >> >> > Thanks for incorporating the ideas and sorry for missing the
> builder part.
> >> >> >
> >> >> > My main idea is that SlotSharingGroup is immutable, such that the
> user
> >> >> >
> >> >> > doesn't do:
> >> >> >
> >> >> >
> >> >> >
> >> >> > ssg = new SlotSharingGroup();
> >> >> >
> >> >> > ssg.setCpus(2);
> >> >> >
> >> >> > operator1.slotSharingGroup(ssg);
> >> >> >
> >> >> > ssg.setCpus(4);
> >> >> >
> >> >> > operator2.slotSharingGroup(ssg);
> >> >> >
> >> >> >
> >> >> >
> >> >> > and wonders why both operators have th
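
The builder idea argued for above makes each group immutable once built, so
the mutate-after-assignment pitfall from the quoted example cannot occur. A
hypothetical sketch of such usage; the API as eventually finalized in FLIP-169
may differ in package, names, and available setters:

    // import org.apache.flink.api.common.operators.SlotSharingGroup;
    // (the final package and setter names may differ from this sketch)

    // Two distinct resource profiles require two distinct, immutable groups.
    SlotSharingGroup small = SlotSharingGroup.newBuilder("group-small")
            .setCpuCores(2.0)
            .setTaskHeapMemoryMB(512)
            .build();
    SlotSharingGroup large = SlotSharingGroup.newBuilder("group-large")
            .setCpuCores(4.0)
            .setTaskHeapMemoryMB(512)
            .build();

    // 'operator1'/'operator2' stand for DataStream operators from some pipeline.
    operator1.slotSharingGroup(small);
    operator2.slotSharingGroup(large);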

Re: [DISCUSS] Releasing Flink 1.11.4, 1.12.5, 1.13.2

2021-06-21 Thread Till Rohrmann
Thanks for starting this discussion Dawid. We have collected a couple of
fixes for the different releases:

#fixes:
1.11.4: 72
1.12.5: 35
1.13.2: 49

which in my opinion warrants new bugfix releases. Note that we intend to do
another 1.11 release because of the seriousness of the FLINK-22815 [1]
which can lead to silent data loss.

I think that FLINK-23025 [2] might be nice to include but I wouldn't block
the release on it.

@pnowojski  do you have an ETA for FLINK-23011 [3]? I
do agree that this would be nice to fix but on the other hand, this issue
is in Flink since the introduction of FLIP-27. Moreover, it does not affect
Flink 1.11.x.

[1] https://issues.apache.org/jira/browse/FLINK-22815
[2] https://issues.apache.org/jira/browse/FLINK-23025
[3] https://issues.apache.org/jira/browse/FLINK-23011

Cheers,
Till
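
As context for FLINK-22815: the fix disables unaligned checkpoints only for
broadcast partitioning, but until a release containing it is deployed, one
blunt job-level mitigation is to switch unaligned checkpoints off entirely via
the regular CheckpointConfig API (whether that trade-off is acceptable depends
on the job):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class DisableUnalignedCheckpoints {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000); // checkpoint every 60 seconds
            // Blunt mitigation: no unaligned checkpoints anywhere in the job.
            env.getCheckpointConfig().enableUnalignedCheckpoints(false);
        }
    }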

On Fri, Jun 18, 2021 at 5:15 PM Piotr Nowojski  wrote:

> Hi,
>
> Thanks for bringing this up. I think before releasing 1.12.x/1.13.x/1.14.x,
> it would be good to decide what to do with FLINK-23011 [1] and if there is
> a relatively easy fix, I would wait for it before releasing.
>
> Best,
> Piotrek
>
> [1] with https://issues.apache.org/jira/browse/FLINK-23011
>
> pt., 18 cze 2021 o 16:35 Konstantin Knauf  napisał(a):
>
> > Hi Dawid,
> >
> > Thank you for starting the discussion. I'd like to add
> > https://issues.apache.org/jira/browse/FLINK-23025 to the list for Flink
> > 1.13.2.
> >
> > Cheers,
> >
> > Konstantin
> >
> > On Fri, Jun 18, 2021 at 3:26 PM Dawid Wysakowicz  >
> > wrote:
> >
> > > Hi devs,
> > >
> > > Quite recently we pushed, in our opinion, quite an important fix[1] for
> > > unaligned checkpoints which disables UC for broadcast partitioning.
> > > Without the fix there might be some broadcast state corruption.
> > > Therefore we think it would be beneficial to release it soonish. What
> do
> > > you think? Do you have other issues in mind you'd like to have included
> > > in these versions.
> > >
> > > Would someone be willing to volunteer to help with the releases as a
> > > release manager? I guess there is a couple of spots to fill in here ;)
> > >
> > > Best,
> > >
> > > Dawid
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-22815
> > >
> > >
> > >
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
> >
>


Re: Change in accumulators semantics with jobClient

2021-06-21 Thread Till Rohrmann
Thanks for bringing this to the dev ML Etienne. Could you maybe update the
release notes for Flink 1.13 [1] to include this change? That way it might
be a bit more prominent. I think the change needs to go into the
release-1.13 and master branch.

[1]
https://github.com/apache/flink/blob/master/docs/content/release-notes/flink-1.13.md

Cheers,
Till


On Fri, Jun 18, 2021 at 2:45 PM Etienne Chauchot 
wrote:

> Hi all,
>
> I did a fix some time ago regarding accumulators:
> the /JobClient.getAccumulators()/ call was blocking indefinitely in a local
> environment for a streaming job (1). The change (2) consisted of returning
> the current accumulator values for the running job. And when fixing this
> in the PR, it appeared that I had to change the accumulators semantics
> with /JobClient/ and I just realized that I forgot to bring this back to
> the ML:
>
> Previously /JobClient/ assumed that getAccumulators() was called on a
> bounded pipeline and that the user wanted to acquire the *final
> accumulator values* after the job is finished.
>
> But now it returns the *current value of accumulators* immediately to be
> compatible with unbounded pipelines.
>
> If it is run on a bounded pipeline, then to get the final accumulator
> values after the job is finished, one needs to call
>
> /getJobExecutionResult().thenApply(JobExecutionResult::getAllAccumulatorResults)/
>
> (1): https://issues.apache.org/jira/browse/FLINK-18685
>
> (2): https://github.com/apache/flink/pull/14558#
>
>
> Cheers,
>
> Etienne
>
>


Re: [DISCUSS] Dashboard/HistoryServer authentication

2021-06-17 Thread Till Rohrmann
I left some comments in the Google document. It would be great if
someone from the community with security experience could also take a look
at it. Maybe you have an opinion on the topic, Eron.

Cheers,
Till

On Thu, Jun 17, 2021 at 6:57 PM Till Rohrmann  wrote:

> Hi Gabor,
>
> I haven't found time to look into the updated FLIP yet. I'll try to do it
> asap.
>
> Cheers,
> Till
>
> On Wed, Jun 16, 2021 at 9:35 PM Konstantin Knauf 
> wrote:
>
>> Hi Gabor,
>>
>> > However representing Kerberos as completely new feature is not true
>> because
>> it's already in since Flink makes authentication at least with HDFS and
>> Hbase through Kerberos.
>>
>> True, that is one way to look at it, but there are differences, too:
>> Control Plane vs Data Plane, Core vs Connectors.
>>
>> > Adding OIDC or OAuth2 has the exact same concerns what you've guys just
>> raised. Why exactly these? If you think this would be beneficial we can
>> discuss it in detail
>>
>> That's exactly my point. Once we start adding authx support, we will
>> sooner or later discuss other options besides Kerberos, too. A user who
>> would like to use OAuth can not easily use Kerberos, right?
>> That is one of the reasons I am skeptical about adding initial authx
>> support.
>>
>> > Related authorization you've mentioned it can be complicated over time.
>> Can
>> you show us an example? We've knowledge with couple of open source
>> components
>> but authorization was never a horror complex story. I personally have the
>> most experience with Spark which I think is quite simple and stable. Users
>> can be viewers/admins
>> and jobs started by others can't be modified. If you can share an example
>> over-complication we can discuss on facts.
>>
>> Authorization is a new aspect that needs to be considered for every
>> addition to the REST API. In the future users might ask for additional
>> roles (e.g. an editor), user-defined roles and you've already mentioned
>> job-level permissions yourself. And keep in mind that there might also be
>> larger additions in the future like the flink-sql-gateway. Contributions
>> like this become more expensive the more aspects we need to consider.
>>
>> In general, I believe, it is important that the community focuses its
>> efforts where we can generate the most value to the user and - personally -
>> I don't think there is much to gain by extending Flink's scope in that
>> direction. Of course, this is not black and white and there are other valid
>> opinions.
>>
>> Thanks,
>>
>> Konstantin
>>
>> On Wed, Jun 16, 2021 at 7:38 PM Gabor Somogyi 
>> wrote:
>>
>>> Hi Konstantin,
>>>
>>> Thanks for the response. Regarding new feature introduction, in the case
>>> of Basic auth I tend to agree; anything else can be chosen.
>>>
>>> However, representing Kerberos as a completely new feature is not accurate,
>>> because it's already in: Flink authenticates at least with HDFS and
>>> HBase through Kerberos.
>>> The main problem with the actual Kerberos implementation is that it
>>> contains several bugs and is only partially implemented. Following your
>>> suggestion, can we agree that we
>>> skip the Basic auth implementation and finish an already started Kerberos
>>> story by adding History Server and Job Dashboard authentication?
>>>
>>> Adding OIDC or OAuth2 raises the exact same concerns you guys have just
>>> raised. Why exactly these? If you think this would be beneficial we can
>>> discuss it in detail
>>> but as a side story it would be good to finish a halfway done Kerberos
>>> story.
>>>
>>> Regarding authorization, you've mentioned it can become complicated over
>>> time. Can you show us an example? We have experience with a couple of
>>> open source components, but authorization was never a horribly complex
>>> story. I personally have the
>>> most experience with Spark which I think is quite simple and stable.
>>> Users
>>> can be viewers/admins
>>> and jobs started by others can't be modified. If you can share an example
>>> over-complication we can discuss on facts.
>>>
>>> Thank you in advance!
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Wed, Jun 16, 2021 at 5:42 PM Konstantin Knauf 
>>> wrote:
>>>
>>> > Hi everyone,
>>> >

<    1   2   3   4   5   6   7   8   9   10   >