Re: [DISCUSS] FLIP-194: Introduce the JobResultStore

2021-11-30 Thread Mika Naylor

Hi Till,

We thought that breaking interfaces, specifically
HighAvailabilityServices and RunningJobsRegistry, was acceptable in this
instance because:

- Neither of these interfaces is marked @Public, so they carry no
  guarantees of being public or stable.
- As far as we are aware, we currently have no users with custom
  HighAvailabilityServices implementations.
- The interface was already broken in 1.14 with the changes to
  CheckpointRecoveryFactory, and will likely be changed again in 1.15
  due to further changes in that factory.

Given that, we thought changes to the interface would not be disruptive.
Perhaps it could be annotated as @Internal - I'm not sure exactly what
guarantees we try to give for the stability of the
HighAvailabilityServices interface.

Kind regards,
Mika

On 26.11.2021 18:28, Till Rohrmann wrote:

Thanks for creating this FLIP, Matthias, Mika and David.

I think the JobResultStore is an important piece for fixing Flink's last
high-availability problem (afaik). Once we have this piece in place, users
no longer risk re-executing a successfully completed job.

I have one comment concerning breaking interfaces:

If we don't want to break interfaces, then we could keep the
HighAvailabilityServices.getRunningJobsRegistry() method and add a default
implementation for HighAvailabilityServices.getJobResultStore(). We could
then deprecate the former method and remove it in the subsequent release
(1.16).
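
For illustration, a minimal sketch of this non-breaking variant (the names
and the default behaviour here are assumptions based on the FLIP discussion,
not the final FLIP-194 API):

```java
// Sketch only: marker interfaces stand in for Flink's real types.
interface RunningJobsRegistry {}

interface JobResultStore {}

public interface HighAvailabilityServices {

    /**
     * Kept for one more release so existing implementations keep compiling.
     *
     * @deprecated Use {@link #getJobResultStore()}; planned for removal in 1.16.
     */
    @Deprecated
    RunningJobsRegistry getRunningJobsRegistry();

    /**
     * New accessor with a default implementation, so adding the method
     * does not break existing custom HighAvailabilityServices.
     */
    default JobResultStore getJobResultStore() {
        throw new UnsupportedOperationException(
                "This HighAvailabilityServices implementation does not provide a JobResultStore.");
    }
}
```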

Apart from that, +1 for the FLIP.

Cheers,
Till

On Wed, Nov 17, 2021 at 6:05 PM David Morávek  wrote:


Hi everyone,

Matthias, Mika and I want to start a discussion about the introduction of a new
Flink component, the *JobResultStore*.

The main motivation is to address shortcomings of the *RunningJobsRegistry*
and supersede it with the new component. These shortcomings were first
described in FLINK-11813 [1].

This change should improve the overall stability of the JobManager's
components and address race conditions in some of the failover
scenarios during the job cleanup lifecycle.

It should also help to ensure that Flink doesn't leave any uncleaned
resources behind.

We've prepared FLIP-194 [2], which outlines the design and reasoning
behind this new component.

[1] https://issues.apache.org/jira/browse/FLINK-11813
[2]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435

We're looking forward to your feedback ;)

Best,
Matthias, Mika and David



Mika Naylor
https://autophagy.io


Re: [ANNOUNCE] New Apache Flink Committer - Matthias Pohl

2021-12-02 Thread Mika Naylor

Congratulations Matthias!

Best,
Mika

On 02.12.2021 17:27, David Morávek wrote:

Congrats Matthias, well deserved ;)

Best,
D.

On Thu, Dec 2, 2021 at 5:17 PM Dawid Wysakowicz wrote:


Congratulations Matthias! Really well deserved!

Best,

Dawid

On 02/12/2021 16:53, Nicolaus Weidner wrote:
> Congrats Matthias, well deserved!
>
> Best,
> Nico
>
> On Thu, Dec 2, 2021 at 4:48 PM Fabian Paul  wrote:
>
>> Congrats and well deserved.
>>
>> Best,
>> Fabian
>>
>> On Thu, Dec 2, 2021 at 4:42 PM Ingo Bürk  wrote:
>>> Congrats, Matthias!
>>>
>>> On Thu, Dec 2, 2021 at 4:28 PM Till Rohrmann wrote:

 Hi everyone,

 On behalf of the PMC, I'm very happy to announce Matthias Pohl as a new
 Flink committer.

 Matthias has worked on Flink since August last year. He helped review a ton
 of PRs. He worked on a variety of things, most notably the tracking and
 reporting of concurrent exceptions, fixing HA bugs, and deprecating and
 removing our Mesos support. He actively reports issues, helping Flink to
 improve, and he is actively engaged in Flink's MLs.

 Please join me in congratulating Matthias for becoming a Flink committer!

 Cheers,
 Till





Re: [ANNOUNCE] New Apache Flink Committer - Ingo Bürk

2021-12-02 Thread Mika Naylor

Congratulations, Ingo!

Best,
Mika

On 02.12.2021 17:37, Matthias Pohl wrote:

Congratulations, Ingo! Good job! :)

On Thu, Dec 2, 2021 at 5:28 PM David Morávek  wrote:


Congrats Ingo, well deserved ;)

Best,
D.

On Thu, Dec 2, 2021 at 5:17 PM Dawid Wysakowicz wrote:

> Congratulations Ingo! Happy to have you onboard as a committer!
>
> Best,
>
> Dawid
>
> On 02/12/2021 17:14, Francesco Guardiani wrote:
> > Congrats Ingo!
> >
> > On Thu, Dec 2, 2021 at 4:58 PM Nicolaus Weidner <nicolaus.weid...@ververica.com> wrote:
> >
> >> Congrats Ingo!
> >> The PMC probably realized that it's simply too much work to review and
> >> merge all your PRs, so now you can/have to do part of that work yourself
> >> ;-)
> >>
> >> Best,
> >> Nico
> >>
> >> On Thu, Dec 2, 2021 at 4:50 PM Fabian Paul  wrote:
> >>
> >>> Thanks for always pushing Ingo. Congratulations!
> >>>
> >>> Best,
> >>> Fabian
> >>>
> >>> On Thu, Dec 2, 2021 at 4:24 PM Till Rohrmann wrote:
>
>  Hi everyone,
>
>  On behalf of the PMC, I'm very happy to announce Ingo Bürk as a new Flink
>  committer.
>
>  Ingo started contributing to Flink at the beginning of this year. He
>  worked mostly on SQL components. He has authored many PRs and helped
>  review a lot of other PRs in this area. He actively reported issues and
>  helped our users on the MLs. His most notable contributions were Support
>  SQL 2016 JSON functions in Flink SQL (FLIP-90), Register sources/sinks in
>  Table API (FLIP-129), and various other contributions in the SQL area.
>  Moreover, he is one of the few people in our community who actually
>  understands Flink's frontend.
>
>  Please join me in congratulating Ingo for becoming a Flink committer!
>
>  Cheers,
>  Till
>
>



[jira] [Created] (FLINK-23620) Introduce proper YAML parsing to Flink's configuration

2021-08-04 Thread Mika Naylor (Jira)
Mika Naylor created FLINK-23620:
---

 Summary: Introduce proper YAML parsing to Flink's configuration
 Key: FLINK-23620
 URL: https://issues.apache.org/jira/browse/FLINK-23620
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Reporter: Mika Naylor


At the moment, the YAML parsing for Flink's configuration file 
(`conf/flink-conf.yaml`) is pretty basic. It only supports simple key-value 
pairs, such as:

```
a.b.c: a value
a.b.d: another value
```

It also accepts some invalid YAML syntax, such as:

```
a: b: value
```

Introducing proper YAML parsing to the configuration component would let Flink 
users use features such as nested keys, such as:

```
a:
  b:
    c: a value
    d: another value
```

It would also make it easier to integrate configuration tools/languages that 
compile to YAML, such as Dhall.
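
For illustration, a minimal sketch of how nested YAML could be flattened into 
Flink's existing dotted keys, assuming a real parser such as SnakeYAML (the 
class and method names below are illustrative, not part of this proposal):

```java
import java.util.HashMap;
import java.util.Map;
import org.yaml.snakeyaml.Yaml;

public class YamlFlattener {

    /** Parses a YAML document and flattens nested mappings into dotted keys. */
    public static Map<String, Object> flatten(String yamlSource) {
        Map<String, Object> root = new Yaml().load(yamlSource);
        Map<String, Object> flat = new HashMap<>();
        flattenInto("", root, flat);
        return flat;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> node, Map<String, Object> out) {
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            String key = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            if (entry.getValue() instanceof Map) {
                // Recurse into nested mappings, extending the dotted prefix.
                flattenInto(key, (Map<String, Object>) entry.getValue(), out);
            } else {
                out.put(key, entry.getValue());
            }
        }
    }

    public static void main(String[] args) {
        // Prints {a.b.c=a value, a.b.d=another value}
        System.out.println(flatten("a:\n  b:\n    c: a value\n    d: another value\n"));
    }
}
```

With something like this, the nested example above becomes equivalent to the 
flat `a.b.c` / `a.b.d` keys Flink already understands.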



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-23948) Python Client Executable priority incorrect for PYFLINK_CLIENT_EXECUTABLE environment variable

2021-08-24 Thread Mika Naylor (Jira)
Mika Naylor created FLINK-23948:
---

 Summary: Python Client Executable priority incorrect for 
PYFLINK_CLIENT_EXECUTABLE environment variable
 Key: FLINK-23948
 URL: https://issues.apache.org/jira/browse/FLINK-23948
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Reporter: Mika Naylor
 Fix For: 1.14.0


The 
[documentation|https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/python/python_config/#python-client-executable]
 for configuring the python client executable states: 
{quote}The priority is as following:
1. the command line option "-pyclientexec";
2. the environment variable PYFLINK_CLIENT_EXECUTABLE;
3. the configuration 'python.client.executable' defined in flink-conf.yaml
{quote}
I set {{python.client.executable}} to point to Python 3.6, and submitted a job 
that contained Python 3.8 syntax. Running the job normally results in a syntax 
error as expected, and the {{pyclientexec}} and {{pyClientExecutable}} CLI 
flags let me override this setting and point to Python 3.8. However, setting 
the {{PYFLINK_CLIENT_EXECUTABLE}} environment variable *did not override the 
{{python.client.executable}} setting*:

{code:bash}
export PYFLINK_CLIENT_EXECUTABLE=/usr/bin/python3.8
./bin/flink run --python examples/python/table/batch/python38_test.py
{code}

The job still used Python 3.6 as the Python client interpreter.
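
For reference, the documented resolution order amounts to something like the 
following (an illustrative sketch only; the class and parameter names are 
hypothetical, not Flink's actual implementation):

{code:java}
import java.util.Optional;

public class PythonClientExecutableResolver {

    /**
     * Documented priority: 1) the pyclientexec CLI option,
     * 2) the PYFLINK_CLIENT_EXECUTABLE environment variable,
     * 3) python.client.executable from flink-conf.yaml.
     */
    public static String resolve(
            Optional<String> cliOption,
            Optional<String> envVariable,
            Optional<String> flinkConfValue) {
        return cliOption
                .or(() -> envVariable)
                .or(() -> flinkConfValue)
                .orElseThrow(() -> new IllegalStateException(
                        "No Python client executable configured"));
    }
}
{code}

The bug is that step 2 appears to be skipped, so the flink-conf.yaml value 
wins over the environment variable.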



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (FLINK-26468) Test default binding to localhost

2022-03-03 Thread Mika Naylor (Jira)
Mika Naylor created FLINK-26468:
---

 Summary: Test default binding to localhost
 Key: FLINK-26468
 URL: https://issues.apache.org/jira/browse/FLINK-26468
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / Configuration
Affects Versions: 1.15.0
Reporter: Mika Naylor
 Fix For: 1.15.0


Change introduced in: https://issues.apache.org/jira/browse/FLINK-24474

For security reasons, we have bound the REST and RPC endpoints (for the 
JobManagers and TaskManagers) to the loopback address (localhost/127.0.0.1) to 
prevent clusters from being accidentally exposed to the outside world.

The affected settings are:
* jobmanager.bind-host
* taskmanager.bind-host
* rest.bind-address
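
For illustration, the defaults under test would look roughly like this in 
flink-conf.yaml (values assumed from the description above):

{code:yaml}
# Default loopback bindings introduced by FLINK-24474. Removing these
# lines (possibly together with rest.address, jobmanager.rpc.address and
# taskmanager.rpc.address) should make the cluster bind to 0.0.0.0.
jobmanager.bind-host: localhost
taskmanager.bind-host: localhost
rest.bind-address: localhost
{code}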

Some suggestions to test:
* Test that spinning up a Flink cluster with the default flink-conf.yaml works 
correctly locally with different setups (1 TaskManager, several TaskManagers, 
default parallelism, > 1 parallelism). Test that the JobManagers and 
TaskManagers can communicate, and that the REST endpoint is accessible locally. 
Test that the REST/RPC endpoints are not accessible outside of the local 
machine.
* Test that removing the binding configuration for the above-mentioned 
settings means that the cluster binds to 0.0.0.0 and is accessible to the 
outside world (this may involve also changing rest.address, 
jobmanager.rpc.address and taskmanager.rpc.address).
* Test that default Flink setups with Docker behave correctly.
* Test that default Flink setups behave correctly with other resource providers 
(Kubernetes native, etc.).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-26772) Kubernetes Native in HA Application Mode does not retry resource cleanup

2022-03-21 Thread Mika Naylor (Jira)
Mika Naylor created FLINK-26772:
---

 Summary: Kubernetes Native in HA Application Mode does not retry 
resource cleanup
 Key: FLINK-26772
 URL: https://issues.apache.org/jira/browse/FLINK-26772
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Mika Naylor


I set up a scenario in which a k8s native cluster running in Application Mode 
used an s3 bucket for its high availability storage directory, with the hadoop 
plugin. The credentials the cluster used give it permission to write to the 
bucket, but not to delete, so cleaning up the blob/jobgraph will fail.

I expected that, when cleaning up the HA resources failed, Flink would retry 
the cleanup. I even configured this explicitly:

{code:yaml}
cleanup-strategy: fixed-delay
cleanup-strategy.fixed-delay.attempts: 100
cleanup-strategy.fixed-delay.delay: 10 s
{code}

However, the behaviour I observed is that the blob and jobgraph cleanup is only 
attempted once. After this failure, I observe the following in the logs:

{code}
2022-03-21 09:34:40,634 INFO  org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap [] - Application completed SUCCESSFULLY
2022-03-21 09:34:40,635 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Shutting KubernetesApplicationClusterEntrypoint down with application status SUCCEEDED. Diagnostics null.
{code}

After this, the cluster receives a SIGTERM and exits.
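
For reference, what I expected from {{cleanup-strategy: fixed-delay}} amounts 
to roughly the following (an illustrative sketch, not Flink's actual cleanup 
code):

{code:java}
import java.time.Duration;
import java.util.concurrent.Callable;

public class FixedDelayRetry {

    /**
     * Retries a cleanup task up to maxAttempts times, sleeping for the
     * configured delay between attempts, mirroring
     * cleanup-strategy.fixed-delay.attempts / .delay.
     */
    public static void runWithRetry(Callable<Void> cleanup, int maxAttempts, Duration delay)
            throws Exception {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                cleanup.call();
                return;
            } catch (Exception e) {
                if (attempt == maxAttempts) {
                    throw e; // give up only after the configured number of attempts
                }
                Thread.sleep(delay.toMillis());
            }
        }
    }
}
{code}

Instead, the cleanup gives up after the first failure.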



--
This message was sent by Atlassian Jira
(v8.20.1#820001)