Re: [DISCUSS] FLIP-194: Introduce the JobResultStore
Hi Till,

We thought that breaking interfaces, specifically HighAvailabilityServices and RunningJobsRegistry, was acceptable in this instance because:

- Neither of these interfaces is marked @Public, so they carry no guarantees about being public and stable.
- As far as we are aware, we currently have no users with custom HighAvailabilityServices implementations.
- The interface was already broken in 1.14 with the changes to CheckpointRecoveryFactory, and will likely be changed again in 1.15 due to further changes in that factory.

Given that, we thought changes to the interface would not be disruptive. Perhaps it could be annotated as @Internal - I'm not sure exactly what guarantees we try to give for the stability of the HighAvailabilityServices interface.

Kind regards,
Mika

On 26.11.2021 18:28, Till Rohrmann wrote:
Thanks for creating this FLIP Matthias, Mika and David. I think the JobResultStore is an important piece for fixing Flink's last high-availability problem (afaik). Once we have this piece in place, users no longer risk re-executing a successfully completed job.

I have one comment concerning breaking interfaces: if we don't want to break interfaces, we could keep the HighAvailabilityServices.getRunningJobsRegistry() method and add a default implementation for HighAvailabilityServices.getJobResultStore(). We could then deprecate the former method and remove it in the subsequent release (1.16).

Apart from that, +1 for the FLIP.

Cheers,
Till

On Wed, Nov 17, 2021 at 6:05 PM David Morávek wrote:
Hi everyone,

Matthias, Mika and I want to start a discussion about the introduction of a new Flink component, the *JobResultStore*. The main motivation is to address the shortcomings of the *RunningJobsRegistry* and supersede it with the new component. These shortcomings were first described in FLINK-11813 [1].

This change should improve the overall stability of the JobManager's components and address race conditions in some of the failover scenarios during the job cleanup lifecycle. It should also help ensure that Flink doesn't leave any uncleaned resources behind.

We've prepared FLIP-194 [2], which outlines the design and reasoning behind this new component.

[1] https://issues.apache.org/jira/browse/FLINK-11813
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435

We're looking forward to your feedback ;)

Best,
Matthias, Mika and David

Mika Naylor
https://autophagy.io
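Till's suggestion for keeping the change non-breaking (retain the old getter, give the new one a default implementation, deprecate and remove later) is a general interface-evolution pattern. The following is an illustrative Python sketch of that pattern only; Flink's HighAvailabilityServices is a Java interface, and the stub classes here are stand-ins, not real Flink types:

```python
from abc import ABC, abstractmethod


class RunningJobsRegistry:
    """Stand-in for the legacy registry slated for deprecation."""


class JobResultStore:
    """Stand-in for the new component that supersedes RunningJobsRegistry."""


class HighAvailabilityServices(ABC):
    @abstractmethod
    def get_running_jobs_registry(self) -> RunningJobsRegistry:
        """Deprecated: kept for one more release so existing
        implementations continue to work unchanged."""

    def get_job_result_store(self) -> JobResultStore:
        # Default implementation: pre-existing custom HA services inherit
        # this method without any code change, so adding it is non-breaking.
        return JobResultStore()


class CustomHaServices(HighAvailabilityServices):
    # A hypothetical user implementation that only knows the old method.
    def get_running_jobs_registry(self) -> RunningJobsRegistry:
        return RunningJobsRegistry()
```

With this shape, existing implementors compile against the new interface untouched, and the abstract old method can be removed once the deprecation window (1.16 in Till's proposal) closes.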
Re: [ANNOUNCE] New Apache Flink Committer - Matthias Pohl
Congratulations Matthias!

Best,
Mika

On 02.12.2021 17:27, David Morávek wrote:
Congrats Matthias, well deserved ;)

Best,
D.

On Thu, Dec 2, 2021 at 5:17 PM Dawid Wysakowicz wrote:
Congratulations Matthias! Really well deserved!

Best,
Dawid

On 02/12/2021 16:53, Nicolaus Weidner wrote:
> Congrats Matthias, well deserved!
>
> Best,
> Nico
>
> On Thu, Dec 2, 2021 at 4:48 PM Fabian Paul wrote:
>
>> Congrats and well deserved.
>>
>> Best,
>> Fabian
>>
>> On Thu, Dec 2, 2021 at 4:42 PM Ingo Bürk wrote:
>>
>>> Congrats, Matthias!
>>>
>>> On Thu, Dec 2, 2021 at 4:28 PM Till Rohrmann wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> On behalf of the PMC, I'm very happy to announce Matthias Pohl as a new Flink committer.
>>>>
>>>> Matthias has worked on Flink since August last year. He helped review a ton of PRs. He worked on a variety of things but most notably the tracking and reporting of concurrent exceptions, fixing HA bugs, and deprecating and removing our Mesos support. He actively reports issues helping Flink to improve and he is actively engaged in Flink's MLs.
>>>>
>>>> Please join me in congratulating Matthias for becoming a Flink committer!
>>>>
>>>> Cheers,
>>>> Till
Re: [ANNOUNCE] New Apache Flink Committer - Ingo Bürk
Congratulations, Ingo!

Best,
Mika

On 02.12.2021 17:37, Matthias Pohl wrote:
Congratulations, Ingo! Good job! :)

On Thu, Dec 2, 2021 at 5:28 PM David Morávek wrote:
Congrats Ingo, well deserved ;)

Best,
D.

On Thu, Dec 2, 2021 at 5:17 PM Dawid Wysakowicz wrote:
> Congratulations Ingo! Happy to have you onboard as a committer!
>
> Best,
> Dawid
>
> On 02/12/2021 17:14, Francesco Guardiani wrote:
> > Congrats Ingo!
> >
> > On Thu, Dec 2, 2021 at 4:58 PM Nicolaus Weidner <
> > nicolaus.weid...@ververica.com> wrote:
> >
> >> Congrats Ingo!
> >> The PMC probably realized that it's simply too much work to review and
> >> merge all your PRs, so now you can/have to do part of that work yourself
> >> ;-)
> >>
> >> Best,
> >> Nico
> >>
> >> On Thu, Dec 2, 2021 at 4:50 PM Fabian Paul wrote:
> >>
> >>> Thanks for always pushing Ingo. Congratulations!
> >>>
> >>> Best,
> >>> Fabian
> >>>
> >>> On Thu, Dec 2, 2021 at 4:24 PM Till Rohrmann wrote:
> >>>
> >>>> Hi everyone,
> >>>>
> >>>> On behalf of the PMC, I'm very happy to announce Ingo Bürk as a new Flink committer.
> >>>>
> >>>> Ingo started contributing to Flink at the beginning of this year. He worked mostly on SQL components. He has authored many PRs and helped review a lot of other PRs in this area. He actively reported issues and helped our users on the MLs. His most notable contributions were Support SQL 2016 JSON functions in Flink SQL (FLIP-90), Register sources/sinks in Table API (FLIP-129) and various other contributions in the SQL area. Moreover, he is one of the few people in our community who actually understands Flink's frontend.
> >>>>
> >>>> Please join me in congratulating Ingo for becoming a Flink committer!
> >>>>
> >>>> Cheers,
> >>>> Till
[jira] [Created] (FLINK-23620) Introduce proper YAML parsing to Flink's configuration
Mika Naylor created FLINK-23620:
---

Summary: Introduce proper YAML parsing to Flink's configuration
Key: FLINK-23620
URL: https://issues.apache.org/jira/browse/FLINK-23620
Project: Flink
Issue Type: Improvement
Components: Runtime / Configuration
Reporter: Mika Naylor

At the moment, the YAML parsing for Flink's configuration file (`conf/flink-conf.yaml`) is pretty basic. It only supports flat key-value pairs, such as:

```
a.b.c: a value
a.b.d: another value
```

It also accepts some invalid YAML syntax, such as:

```
a: b: value
```

Introducing proper YAML parsing to the configuration component would let Flink users use features such as nested keys:

```
a:
  b:
    c: a value
    d: another value
```

It would also make it easier to integrate configuration tools/languages that compile to YAML, such as Dhall.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
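The correspondence between nested YAML and Flink's flat dotted keys can be sketched with a small flattening function. This is an illustrative Python sketch, not Flink code; it assumes the YAML has already been parsed into a nested dict (e.g. by a YAML library), and the `flatten` helper is hypothetical:

```python
def flatten(config: dict, prefix: str = "") -> dict:
    """Flatten a nested mapping into Flink-style dotted keys."""
    flat = {}
    for key, value in config.items():
        dotted = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            # Recurse into nested mappings, extending the dotted prefix.
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


# The nested document from the ticket...
nested = {"a": {"b": {"c": "a value", "d": "another value"}}}
# ...maps onto the flat keys Flink's configuration uses today:
assert flatten(nested) == {"a.b.c": "a value", "a.b.d": "another value"}
```

This mapping is what would let a properly parsed nested configuration remain backwards-compatible with the existing flat key scheme.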
[jira] [Created] (FLINK-23948) Python Client Executable priority incorrect for PYFLINK_CLIENT_EXECUTABLE environment variable
Mika Naylor created FLINK-23948:
---

Summary: Python Client Executable priority incorrect for PYFLINK_CLIENT_EXECUTABLE environment variable
Key: FLINK-23948
URL: https://issues.apache.org/jira/browse/FLINK-23948
Project: Flink
Issue Type: Bug
Components: API / Python
Reporter: Mika Naylor
Fix For: 1.14.0

The [documentation|https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/python/python_config/#python-client-executable] for configuring the Python client executable states:

{quote}
The priority is as following:
1. the command line option "-pyclientexec";
2. the environment variable PYFLINK_CLIENT_EXECUTABLE;
3. the configuration 'python.client.executable' defined in flink-conf.yaml
{quote}

I set {{python.client.executable}} to point to Python 3.6, and submitted a job that contained Python 3.8 syntax. Running the job normally results in a syntax error as expected, and the {{pyclientexec}} and {{pyClientExecutable}} CLI flags let me override this setting and point to Python 3.8. However, setting {{PYFLINK_CLIENT_EXECUTABLE}} *did not override the {{python.client.executable}} setting*:

{code:bash}
export PYFLINK_CLIENT_EXECUTABLE=/usr/bin/python3.8
./bin/flink run --python examples/python/table/batch/python38_test.py
{code}

still used Python 3.6 as the Python client interpreter.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
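The documented precedence (CLI option, then environment variable, then flink-conf.yaml) can be expressed as a simple resolution function. This is a hedged sketch of the intended behaviour the ticket describes, not pyflink's actual code; the function name and signature are illustrative:

```python
import os
from typing import Mapping, Optional


def resolve_python_client_executable(
    cli_option: Optional[str],
    config: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Resolve the Python client interpreter per the documented priority:
    1. the -pyclientexec command line option
    2. the PYFLINK_CLIENT_EXECUTABLE environment variable
    3. python.client.executable from flink-conf.yaml
    """
    env = os.environ if env is None else env
    if cli_option:
        return cli_option
    if env.get("PYFLINK_CLIENT_EXECUTABLE"):
        return env["PYFLINK_CLIENT_EXECUTABLE"]
    return config.get("python.client.executable")


conf = {"python.client.executable": "/usr/bin/python3.6"}
# Per the docs, the environment variable should win over flink-conf.yaml;
# the bug is that the observed behaviour falls through to the config value.
assert resolve_python_client_executable(
    None, conf, env={"PYFLINK_CLIENT_EXECUTABLE": "/usr/bin/python3.8"}
) == "/usr/bin/python3.8"
```

Under the buggy behaviour reported above, the second branch is effectively skipped, so the resolution returns "/usr/bin/python3.6" even when the environment variable is set.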
[jira] [Created] (FLINK-26468) Test default binding to localhost
Mika Naylor created FLINK-26468:
---

Summary: Test default binding to localhost
Key: FLINK-26468
URL: https://issues.apache.org/jira/browse/FLINK-26468
Project: Flink
Issue Type: Improvement
Components: Runtime / Configuration
Affects Versions: 1.15.0
Reporter: Mika Naylor
Fix For: 1.15.0

Change introduced in: https://issues.apache.org/jira/browse/FLINK-24474

For security reasons, we have bound the REST and RPC endpoints (for the JobManagers and TaskManagers) to the loopback address (localhost/127.0.0.1) to prevent clusters from being accidentally exposed to the outside world. The relevant settings are:
* jobmanager.bind-host
* taskmanager.bind-host
* rest.bind-address

Some suggestions to test:
* Test that spinning up a Flink cluster with the default flink-conf.yaml works correctly locally with different setups (1 TaskManager, several TaskManagers, default parallelism, > 1 parallelism). Test that the JobManagers and TaskManagers can communicate, and that the REST endpoint is accessible locally. Test that the REST/RPC endpoints are not accessible from outside the local machine.
* Test that removing the binding configuration for the above-mentioned settings means that the cluster binds to 0.0.0.0 and is accessible to the outside world (this may also involve changing rest.address, jobmanager.rpc.address and taskmanager.rpc.address).
* Test that default Flink setups with Docker behave correctly.
* Test that default Flink setups behave correctly with other resource providers (Kubernetes native, etc.).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
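The property under test comes down to the bind address passed to the listening socket: binding to the loopback address keeps the endpoint reachable only from the local machine, while binding to 0.0.0.0 exposes it on all interfaces. A minimal Python sketch of that distinction (not Flink code; the helper is illustrative):

```python
import socket


def bind_ephemeral(bind_host: str) -> socket.socket:
    """Bind a TCP listener the way jobmanager.bind-host / rest.bind-address
    would, on an OS-assigned ephemeral port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((bind_host, 0))  # port 0: let the OS pick a free port
    sock.listen(1)
    return sock


# Default (secure) behaviour: only reachable via the loopback interface.
local = bind_ephemeral("127.0.0.1")
assert local.getsockname()[0] == "127.0.0.1"
local.close()

# With the bind options removed: the listener accepts connections
# arriving on any interface, i.e. it is exposed beyond the local machine.
exposed = bind_ephemeral("0.0.0.0")
assert exposed.getsockname()[0] == "0.0.0.0"
exposed.close()
```

An end-to-end test would additionally attempt a connection from a second host (or network namespace) and expect it to be refused in the loopback case.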
[jira] [Created] (FLINK-26772) Kubernetes Native in HA Application Mode does not retry resource cleanup
Mika Naylor created FLINK-26772:
---

Summary: Kubernetes Native in HA Application Mode does not retry resource cleanup
Key: FLINK-26772
URL: https://issues.apache.org/jira/browse/FLINK-26772
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Mika Naylor

I set up a scenario in which a Kubernetes native cluster running in Application Mode used an s3 bucket for its high availability storage directory, with the hadoop plugin. The credentials the cluster used give it permission to write to the bucket, but not delete, so cleaning up the blob/jobgraph will fail. I expected that when trying to clean up the HA resources, it would retry the cleanup. I even configured this explicitly:

{{cleanup-strategy: fixed-delay
cleanup-strategy.fixed-delay.attempts: 100
cleanup-strategy.fixed-delay.delay: 10 s}}

However, the behaviour I observed is that the blob and jobgraph cleanup is only attempted once. After this failure, I observe in the logs:

{{2022-03-21 09:34:40,634 INFO org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application completed SUCCESSFULLY
2022-03-21 09:34:40,635 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status SUCCEEDED. Diagnostics null.}}

After which, the cluster receives a SIGTERM and exits.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
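For reference, the semantics the fixed-delay configuration above promises can be sketched as a simple retry loop. This is an illustration of the expected behaviour only, not Flink's cleanup implementation; the function and its parameters are hypothetical stand-ins for cleanup-strategy.fixed-delay.attempts and .delay:

```python
import time
from typing import Callable


def cleanup_with_fixed_delay(
    cleanup: Callable[[], None], attempts: int, delay_seconds: float
) -> bool:
    """Retry `cleanup` up to `attempts` times, sleeping `delay_seconds`
    between failed attempts. Returns True once cleanup succeeds."""
    for attempt in range(1, attempts + 1):
        try:
            cleanup()
            return True
        except Exception:
            if attempt < attempts:
                time.sleep(delay_seconds)
    return False


calls = []

def failing_cleanup():
    # Simulates the ticket's scenario: the S3 credentials allow writes
    # but not deletes, so every delete attempt fails.
    calls.append(1)
    raise PermissionError("s3: delete not permitted")


# With the configuration in the ticket, all configured attempts should
# be made before giving up; the reported bug is that only one is.
assert cleanup_with_fixed_delay(failing_cleanup, attempts=3, delay_seconds=0) is False
assert len(calls) == 3
```

The observed log output suggests the entrypoint shuts down after the first failed attempt instead of iterating like the loop above.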