Re: [DISCUSS] Manual savepoint triggering in flink-kubernetes-operator

2024-03-20 Thread Yang Wang
Using a separate CR for managing savepoints is really a good idea.
With that, managing savepoints will be easier and we will not leak
unusable savepoints on the object storage.


Best,
Yang

On Wed, Mar 13, 2024 at 4:40 AM Gyula Fóra  wrote:

> That would be great, Mate! If you could draw up a FLIP for this, that
> would be nice, as this is a rather large change that will have a
> significant impact on existing users.
>
> If possible it would be good to provide some backward compatibility /
> transition period while we preserve the current content of the status so
> it's easy to migrate to the new savepoint CRs.
>
> Cheers,
> Gyula
>
> On Tue, Mar 12, 2024 at 9:22 PM Mate Czagany  wrote:
>
> > Hi,
> >
> > I really like this idea as well; I think it would be a great improvement
> > compared to how manual savepoints currently work, and it suits Kubernetes
> > workflows a lot better.
> >
> > If there are no objections, I can investigate it during the next few
> > weeks and see how this could be implemented in the current code.
> >
> > Cheers,
> > Mate
> >
> > Gyula Fóra wrote (on Tue, Mar 12, 2024, 16:01):
> >
> > > That's definitely a good improvement, Robert, and we should add it at
> > > some point. When this was implemented, we went with the current,
> > > simpler / more lightweight approach.
> > > However if anyone is interested in working on this / contributing this
> > > improvement I would personally support it.
> > >
> > > Gyula
> > >
> > > On Tue, Mar 12, 2024 at 3:53 PM Robert Metzger 
> > > wrote:
> > >
> > > > Have you guys considered making savepoints a first-class citizen in
> > > > the Kubernetes operator?
> > > > E.g. to trigger a savepoint, you create a "FlinkSavepoint" CR; the K8s
> > > > operator picks up that resource and keeps trying to create a savepoint
> > > > indefinitely until the savepoint has been successfully created. We
> > > > report the savepoint status and location in the "status" field.
> > > >
> > > > We could even add an (optional) finalizer to delete the physical
> > > > savepoint from the savepoint storage once the "FlinkSavepoint" CR has
> > > > been deleted.
> > > > Optionally, the savepoint spec could contain a field like "retain
> > > > physical savepoint" that controls the delete behavior.
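For illustration, the proposed CR could be modeled roughly along these lines. This is a hypothetical sketch of the idea only — the class, field, and state names here are invented for the example, not the operator's actual API:

```java
// Hypothetical model of the proposed "FlinkSavepoint" custom resource and
// the finalizer's delete decision. All names are illustrative assumptions.
import java.util.Optional;

public class FlinkSavepointSketch {
    enum State { PENDING, IN_PROGRESS, COMPLETED, FAILED }

    static class Spec {
        final String jobRef;                    // which Flink job to savepoint
        final boolean retainPhysicalSavepoint;  // proposed "retain" flag
        Spec(String jobRef, boolean retainPhysicalSavepoint) {
            this.jobRef = jobRef;
            this.retainPhysicalSavepoint = retainPhysicalSavepoint;
        }
    }

    static class Status {
        State state = State.PENDING;
        Optional<String> location = Optional.empty(); // reported on success
    }

    // Finalizer decision: remove the physical savepoint only when the CR is
    // deleted and the user did not ask to retain it.
    static boolean shouldDeletePhysicalSavepoint(Spec spec, boolean crDeleted) {
        return crDeleted && !spec.retainPhysicalSavepoint;
    }

    public static void main(String[] args) {
        Spec retained = new Spec("my-flink-job", true);
        Spec ephemeral = new Spec("my-flink-job", false);
        System.out.println(shouldDeletePhysicalSavepoint(retained, true));  // false
        System.out.println(shouldDeletePhysicalSavepoint(ephemeral, true)); // true
    }
}
```

Under this sketch, the optional finalizer Robert describes would call the delete check when the CR is removed, so a retained savepoint survives CR deletion.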
> > > >
> > > >
> > > > On Thu, Mar 3, 2022 at 4:02 AM Yang Wang 
> > wrote:
> > > >
> > > > > I agree that we could start with the annotation approach and
> > > > > collect the feedback at the same time.
> > > > >
> > > > > Best,
> > > > > Yang
> > > > >
> > > > > Őrhidi Mátyás wrote on Wed, Mar 2, 2022 at 20:06:
> > > > >
> > > > > > Thank you for your feedback!
> > > > > >
> > > > > > The annotation on the
> > > > > >
> > > > > > @ControllerConfiguration(generationAwareEventProcessing = false)
> > > > > > FlinkDeploymentController
> > > > > >
> > > > > > already enables the event triggering based on metadata changes.
> > > > > > It was set earlier to support some failure scenarios. (It can be
> > > > > > used, for example, to manually re-enable the reconcile loop when
> > > > > > it got stuck in an error phase.)
> > > > > >
> > > > > > I will go ahead and propose a PR using annotations then.
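As a rough illustration of the annotation-based triggering discussed here, a reconciler could compare a trigger annotation on the CR against the last value recorded in the status and fire a savepoint when they differ. The annotation key and class names below are assumptions for the sketch, not the operator's real API:

```java
// Hypothetical sketch of annotation-driven savepoint triggering. The
// annotation key and all names are illustrative, not the operator's API.
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class SavepointAnnotationSketch {
    // Assumed annotation name for the example.
    static final String TRIGGER_ANNOTATION = "flink.apache.org/savepoint-trigger";

    /** Returns true when a new savepoint should be triggered. */
    static boolean shouldTriggerSavepoint(Map<String, String> annotations, String lastTriggerInStatus) {
        String requested = annotations.get(TRIGGER_ANNOTATION);
        // Trigger only when the annotation is present and differs from the
        // last value the reconciler already handled.
        return requested != null && !Objects.equals(requested, lastTriggerInStatus);
    }

    public static void main(String[] args) {
        Map<String, String> meta = new HashMap<>();
        meta.put(TRIGGER_ANNOTATION, "2");
        System.out.println(shouldTriggerSavepoint(meta, "1")); // new value -> trigger
        System.out.println(shouldTriggerSavepoint(meta, "2")); // already handled -> skip
    }
}
```

With `generationAwareEventProcessing = false`, metadata-only changes such as this annotation update reach the reconciler, which is what makes the comparison above possible.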
> > > > > >
> > > > > > Cheers,
> > > > > > Matyas
> > > > > >
> > > > > > On Wed, Mar 2, 2022 at 12:47 PM Yang Wang  >
> > > > wrote:
> > > > > >
> > > > > > > I also like the annotation approach since it is more natural.
> > > > > > > But I am not sure whether the metadata change will trigger an
> > > > > > > event in java-operator-sdk.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Yang
> > > > > > >
> > > > > > > Gyula Fóra wrote on Wed, Mar 2, 2022 at 16:29:
> 

Re: [VOTE] FLIP-402: Extend ZooKeeper Curator configurations

2024-03-20 Thread Yang Wang
+1 (binding) since ZK HA is still widely used.


Best,
Yang

On Thu, Mar 14, 2024 at 6:27 PM Matthias Pohl
 wrote:

> Nothing to add from my side. Thanks, Alex.
>
> +1 (binding)
>
> On Thu, Mar 7, 2024 at 4:09 PM Alex Nitavsky 
> wrote:
>
> > Hi everyone,
> >
> > I'd like to start a vote on FLIP-402 [1]. It introduces new configuration
> > options for Apache Flink's ZooKeeper integration for high availability by
> > reflecting existing Apache Curator configuration options. It has been
> > discussed in this thread [2].
> >
> > The vote will be open for at least 72 hours
> > (until March 10th, 18:00 GMT) unless there is an objection or
> > insufficient votes.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-402%3A+Extend+ZooKeeper+Curator+configurations
> > [2] https://lists.apache.org/thread/gqgs2jlq6bmg211gqtgdn8q5hp5v9l1z
> >
> > Thanks
> > Alex
> >
>


Re: Frequent Flink JM restarts due to Kube API server errors.

2024-02-05 Thread Yang Wang
This might be related to FLINK-28481, which is a bug in the fabric8io k8s
client.

[1]. https://issues.apache.org/jira/browse/FLINK-28481

Best,
Yang

On Tue, Feb 6, 2024 at 12:30 PM Lavkesh Lahngir  wrote:

> Hi Matthias, I was wondering if there are any timeout or heartbeat
> configurations available for Kubernetes HA.
>
> Thanks.
>
> On Mon, 5 Feb 2024 at 8:58 PM, Matthias Pohl wrote:
>
> > That's stated in the Jira issue. I didn't have the time to investigate it
> > further.
> >
> > On Mon, Feb 5, 2024 at 1:55 PM Lavkesh Lahngir 
> wrote:
> >
> > > Hi Matthias,
> > > Thanks for the suggestion. Do we know which part of code caused this
> > issue
> > > and how it was fixed?
> > >
> > > Thanks!
> > >
> > > On Mon, 5 Feb 2024 at 18:06, Matthias Pohl wrote:
> > >
> > > > Hi Lavkesh,
> > > > FLINK-33998 [1] sounds quite similar to what you describe.
> > > >
> > > > The solution was to upgrade to Flink version 1.14.6. I didn't have
> the
> > > > capacity to look into the details considering that the mentioned
> Flink
> > > > version 1.14 is not officially supported by the community anymore
> and a
> > > fix
> > > > seems to have been provided with a newer version.
> > > >
> > > > Matthias
> > > >
> > > > [1] https://issues.apache.org/jira/browse/FLINK-33998
> > > >
> > > > On Mon, Feb 5, 2024 at 6:18 AM Lavkesh Lahngir 
> > > wrote:
> > > >
> > > > > Hi, a few more details:
> > > > > We are running GKE version 1.27.7-gke.1121002
> > > > > and are using Flink version 1.14.3.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > On Mon, 5 Feb 2024 at 12:05, Lavkesh Lahngir 
> > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > We run a Flink operator on GKE, deploying one Flink job per job
> > > > manager.
> > > > > > We utilize
> > > > > > org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> > > > > > for high availability. The JobManager employs config maps for
> > > > > > checkpointing and leader election. If, at any point, the Kube API
> > > > > > server returns an error (5xx or 4xx), the JM pod is restarted.
> > > > > > This occurrence is sporadic, happening every 1-2 days for some
> > > > > > jobs among the 400 running in the same cluster, each with its
> > > > > > JobManager pod.
> > > > > >
> > > > > > What might be causing these errors from the Kube API server? One
> > > > > > possibility is that when the JM writes the config map and attempts
> > > > > > to retrieve it immediately after, it could result in a 404 error.
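If the read-after-write theory holds, one generic mitigation is to retry the read with a short backoff instead of failing fast. The sketch below is purely illustrative of that pattern; it is not how Flink's Kubernetes HA services actually handle ConfigMap reads:

```java
// Generic read-after-write retry sketch: retry a read that may transiently
// return "not found" right after a write. Illustrative only, not Flink code.
import java.util.Optional;
import java.util.function.Supplier;

public class ReadAfterWriteRetrySketch {
    /** Retries the read a few times with a fixed backoff before giving up. */
    static <T> Optional<T> readWithRetry(Supplier<Optional<T>> read, int attempts, long backoffMillis) {
        for (int i = 0; i < attempts; i++) {
            Optional<T> result = read.get();
            if (result.isPresent()) {
                return result;
            }
            try {
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Simulate a config map that only becomes visible on the third read.
        int[] calls = {0};
        Optional<String> value = readWithRetry(
                () -> ++calls[0] >= 3 ? Optional.of("leader-address") : Optional.<String>empty(),
                5, 1L);
        System.out.println(value.orElse("not found")); // prints "leader-address"
    }
}
```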
> > > > > > Are there any configurations to increase heartbeat or timeouts
> > > > > > that might be causing temporary disconnections from the Kube API
> > > > > > server?
> > > > > >
> > > > > > Thank you!
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-403: High Availability Services for OLAP Scenarios

2024-01-16 Thread Yang Wang
I am completely in favor of splitting the LeaderServices and
PersistenceServices, while sharing the same concern that MaterialProvider is
not very easy to understand. It just feels like we do the separation, but not
thoroughly.

If you have a clear plan for the subsequent improvements, I am fine with
focusing only on the OLAP requirements in FLIP-403.


Best,
Yang

On Wed, Jan 17, 2024 at 11:40 AM Yangze Guo  wrote:

> Thanks for the comments, Zhu.
>
> > Did you look into which part takes most of the time? Jar uploading, Jar
> > downloading, JobInformation shipping, TDD shipping, or others?
>
> In our scenario, the key factor should be the JobInformation shipping,
> as the jobs are completed within 1 second. This can have a big impact
> on the QPS.
>
> > If these objects are large, e.g. a hundreds megabytes connector jar,
> > will ship it hundreds of times (if parallelism > 100) from JMs to TMs be
> > a blocker of performance and stability, compared letting the DFS help
> > with the shipping... I'm fine to use a void blobService in OLAP scenarios
> > *by default* if it works better in most cases.
>
> Thanks for the input. Currently, in our scenario, the connector jars
> are pre-deployed on the JM and TM, and each job submission only
> includes the serialized JobGraph. However, if there are custom
> connectors and UDFs involved in the future, I believe choosing the
> appropriate blob strategy will indeed require a further analysis. So,
> +1 for providing users with the option to switch between blob
> services. high-availability.blob-store.enabled sounds good from my
> side. We can set it to false if it is not manually configured and if
> high-availability.job-recovery.enabled is set to false.
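The defaulting rule described above (disable the blob store only when the user left it unconfigured and job recovery is disabled) boils down to a one-line decision. The option names follow this thread and may differ in the final FLIP:

```java
// Sketch of the defaulting rule discussed in the thread:
//   high-availability.blob-store.enabled  (unset => Optional.empty())
//   high-availability.job-recovery.enabled
// An explicit user setting always wins; otherwise the blob store follows
// the job-recovery setting. Option names are assumptions from the thread.
import java.util.Optional;

public class BlobStoreDefaultSketch {
    static boolean blobStoreEnabled(Optional<Boolean> userSetting, boolean jobRecoveryEnabled) {
        return userSetting.orElse(jobRecoveryEnabled);
    }

    public static void main(String[] args) {
        System.out.println(blobStoreEnabled(Optional.empty(), false));   // false: OLAP default
        System.out.println(blobStoreEnabled(Optional.empty(), true));    // true: recovery needs blobs
        System.out.println(blobStoreEnabled(Optional.of(true), false));  // true: explicit choice wins
    }
}
```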
>
> If there are no further comments, I will adjust the FLIP based on
> these discussions and then initiate a vote.
>
> Best,
> Yangze Guo
>
> On Mon, Jan 15, 2024 at 5:55 PM Zhu Zhu  wrote:
> >
> > Correction:
> > I'm fine to use a void blobService in OLAP scenarios if it works better
> > in most cases.  -> I'm fine to use a void blobService in OLAP scenarios
> > *by default* if it works better in most cases.
> >
> >
> >
> > Zhu Zhu wrote on Mon, Jan 15, 2024 at 17:51:
> >
> > > @Yangze
> > >
> > > > (with 128 parallelism WordCount jobs), disabling BlobStore resulted
> > > > in a 100% increase in QPS
> > >
> > > Did you look into which part takes most of the time? Jar uploading, Jar
> > > downloading, JobInformation shipping, TDD shipping, or others?
> > >
> > > If these objects are large, e.g. a hundreds-of-megabytes connector
> > > jar, will shipping it hundreds of times (if parallelism > 100) from JMs
> > > to TMs be a blocker for performance and stability, compared to letting
> > > the DFS help with the shipping? If yes, we should not force it to use a
> > > void blobService. Maybe an option should be given to users to switch
> > > between blobServices?
> > >
> > > I'm fine to use a void blobService in OLAP scenarios if it works better
> > > in most cases. However, it is a bit weird that we disable blobs if
> > > `enable-job-recovery=false`. Conceptually, they should be unrelated.
> > >
> > > > As Matthias mentioned, each component still needs to write its RPC
> > > > address, so this part of the writing may be unavoidable.
> > >
> > > Thanks Matthias for the inputs.
> > > However, even in non-HA mode, the task manager can connect to the
> > > JobMaster. Therefore, I guess it's not necessary to store JM addresses
> > > externally. I noticed `HighAvailabilityServices#getJobManagerLeaderRetriever`
> > > accepts a parameter `defaultJobManagerAddress`. So maybe it's not needed
> > > for TMs to find out the addresses of JMs via external services?
> > >
> > > > focus on the discussion of HA functionality in the OLAP scenario in
> > > > FLIP-403 and exclude the refactoring from the scope of this FLIP
> > >
> > > It sounds good to me.
> > > Actually, the concept of separating leader election and persistence
> > > looks great to me at first glance. But the shared MaterialProvider
> > > makes it more complicated than I had expected.
> > >
> > > Thanks,
> > > Zhu
> > >
> > > Yangze Guo wrote on Thu, Jan 11, 2024 at 14:53:
> > >
> > >> Thanks for the comments, Zhu and Matthias.
> > >>
> > >> @Zhu Zhu
> > >>
> > >> > How about disabling the checkpoint to avoid the cost? I know the
> > >> > cost is there even if we disable the checkpoint at the moment. But I
> > >> > think it can be fixed.
> > >> > If HA is disabled, the jobmanager needs to directly participate in
> > >> > all blob shipping work which may result in a hot-spot.
> > >>
> > >> Currently, there are several persistence services that have specific
> > >> implementations based on the HA mode:
> > >> - JobGraphStore and JobResultStore: These are related to job recovery
> > >> and can cause significant redundant I/O in OLAP scenarios, impacting
> > >> performance. It may be necessary to configure them as in-memory stores
> > >> for OLAP.
> > >> - CompletedCheckpointStore: As @Zhu Zhu mentioned, we can avoid this
> > >> 

Re: [DISCUSS] Hadoop 2 vs Hadoop 3 usage

2024-01-15 Thread Yang Wang
I can share some metrics from Alibaba Cloud EMR clusters.
The ratio of Hadoop 2 vs. Hadoop 3 is 1:3.


Best,
Yang

On Thu, Dec 28, 2023 at 8:16 PM Martijn Visser 
wrote:

> Hi all,
>
> I want to get some insights on how many users are still using Hadoop 2
> vs how many users are using Hadoop 3. Flink currently requires a
> minimum version of Hadoop 2.10.2 for certain features, but also
> extensively uses Hadoop 3 (e.g., for the file system implementations).
>
> Hadoop 2 has a large number of direct and indirect vulnerabilities
> [1]. Most of them can only be resolved by dropping support for Hadoop
> 2 and upgrading to a Hadoop 3 version. This thread is primarily to get
> more insights if Hadoop 2 is still commonly used, or if we can
> actually discuss dropping support for Hadoop 2 in Flink.
>
> Best regards,
>
> Martijn
>
> [1]
> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/2.10.2
>


Re: Re: [VOTE] Accept Flink CDC into Apache Flink

2024-01-10 Thread Yang Wang
+1 (binding)


Best,
Yang

On Thu, Jan 11, 2024 at 9:53 AM liu ron  wrote:

> +1 non-binding
>
> Best
> Ron
>
> Matthias Pohl wrote on Wed, Jan 10, 2024 at 23:05:
>
> > +1 (binding)
> >
> > On Wed, Jan 10, 2024 at 3:35 PM ConradJam  wrote:
> >
> > > +1 non-binding
> > >
> > > Dawid Wysakowicz wrote on Wed, Jan 10, 2024 at 21:06:
> > >
> > > > +1 (binding)
> > > > Best,
> > > > Dawid
> > > >
> > > > On Wed, 10 Jan 2024 at 11:54, Piotr Nowojski 
> > > wrote:
> > > >
> > > > > +1 (binding)
> > > > >
> > > > > On Wed, Jan 10, 2024 at 11:25, Martijn Visser wrote:
> > > > >
> > > > > > +1 (binding)
> > > > > >
> > > > > > On Wed, Jan 10, 2024 at 4:43 AM Xingbo Huang  >
> > > > wrote:
> > > > > > >
> > > > > > > +1 (binding)
> > > > > > >
> > > > > > > Best,
> > > > > > > Xingbo
> > > > > > >
> > > > > > Dian Fu wrote on Wed, Jan 10, 2024 at 11:35:
> > > > > > >
> > > > > > > > +1 (binding)
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Dian
> > > > > > > >
> > > > > > > > On Wed, Jan 10, 2024 at 5:09 AM Sharath <
> dsaishar...@gmail.com
> > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > +1 (non-binding)
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Sharath
> > > > > > > > >
> > > > > > > > > On Tue, Jan 9, 2024 at 1:02 PM Venkata Sanath Muppalla <
> > > > > > > > sanath...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > +1 (non-binding)
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Sanath
> > > > > > > > > >
> > > > > > > > > > On Tue, Jan 9, 2024 at 11:16 AM Peter Huang <
> > > > > > > > huangzhenqiu0...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Best Regards
> > > > > > > > > > > Peter Huang
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Jan 9, 2024 at 5:26 AM Jane Chan <
> > > > > qingyue@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Jane
> > > > > > > > > > > >
> > > > > > > > > > > > On Tue, Jan 9, 2024 at 8:41 PM Lijie Wang <
> > > > > > > > wangdachui9...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Lijie
> > > > > > > > > > > > >
> > > > > > > > > > > > > Jiabao Sun wrote on Tue, Jan 9, 2024 at 19:28:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > Jiabao
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 2024/01/09 09:58:04 xiangyu feng wrote:
> > > > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > > > Xiangyu Feng
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Danny Cranmer wrote on Tue, Jan 9, 2024 at 17:50:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > +1 (binding)
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Danny
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Tue, Jan 9, 2024 at 9:31 AM Feng Jin <
> > > > > > ji...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > Feng Jin
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Tue, Jan 9, 2024 at 5:29 PM Yuxin Tan <
> > > > > > > > ta...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > Yuxin
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Márton Balassi wrote on Tue, Jan 9, 2024 at 17:25:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > +1 (binding)
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Tue, Jan 9, 2024 at 10:15 AM Leonard
> > Xu
> > > <
> > > > > > > > > > > xb...@gmail.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > +1(binding)
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > > > Leonard
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > On Jan 9, 2024 at 5:08 PM, Yangze Guo wrote:
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > +1 (non-binding)
> > > > > > > > > > > 

[jira] [Created] (FLINK-33155) Flink ResourceManager continuously fails to start TM container on YARN when Kerberos enabled

2023-09-25 Thread Yang Wang (Jira)
Yang Wang created FLINK-33155:
-

 Summary: Flink ResourceManager continuously fails to start TM 
container on YARN when Kerberos enabled
 Key: FLINK-33155
 URL: https://issues.apache.org/jira/browse/FLINK-33155
 Project: Flink
  Issue Type: Bug
Reporter: Yang Wang


When Kerberos is enabled (with a keytab) and after one day (when the container
token has expired), Flink fails to create the TaskManager container on YARN
due to the following exception.

 
{code:java}
2023-09-25 16:48:50,030 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker container_1695106898104_0003_01_69 is terminated. Diagnostics: Container container_1695106898104_0003_01_69 was invalid. Diagnostics: [2023-09-25 16:48:45.710] token (token for hadoop: HDFS_DELEGATION_TOKEN owner=, renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, sequenceNumber=12, masterKeyId=3) can't be found in cache
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for hadoop: HDFS_DELEGATION_TOKEN owner=hadoop/master-1-1.c-5ee7bdc598b6e1cc.cn-beijing.emr.aliyuncs@emr.c-5ee7bdc598b6e1cc.com, renewer=, realUser=, issueDate=1695196431487, maxDate=1695801231487, sequenceNumber=12, masterKeyId=3) can't be found in cache
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1545)
    at org.apache.hadoop.ipc.Client.call(Client.java:1491)
    at org.apache.hadoop.ipc.Client.call(Client.java:1388)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
    at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:907)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:431)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
    at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
    at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1666)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1576)
    at org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1573)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1588)
    at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:269)
    at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:67)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:414)
    at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:411)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:411)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.doDownloadCall(ContainerLocalizer.java:243)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:236)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer$FSDownloadWrapper.call(ContainerLocalizer.java:224)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750) {code}
The root cause might be that we are reading the delegation token from the JM
local file [1]. It will expire after one day. When the old TaskManager
container crashes and ResourceManager tries to create a new one

Re: [DISSCUSS] Kubernetes Operator Flink Version Support Policy

2023-09-14 Thread Yang Wang
Since users could always use an old Flink Kubernetes Operator version
along with old Flink versions, I am totally in favor of this proposal to
reduce the maintenance burden.

Best,
Yang

Biao Geng wrote on Wed, Sep 6, 2023 at 18:15:

> +1 for the proposal.
>
> Best,
> Biao Geng
>
> Gyula Fóra wrote on Wed, Sep 6, 2023 at 16:10:
>
> > @Zhanghao Chen:
> >
> > I am not completely sure at this point what this will mean for 2.0 simply
> > because I am also not sure what that will mean for the operator as well
> :)
> > I think this will depend on the compatibility guarantees we can provide
> > across Flink major versions in general. We have to look into that and
> > tackle the question there independently.
> >
> > Gyula
> >
> > On Tue, Sep 5, 2023 at 6:12 PM Maximilian Michels 
> wrote:
> >
> > > +1 Sounds good! Four releases give a decent amount of time to migrate
> > > to the next Flink version.
> > >
> > > On Tue, Sep 5, 2023 at 5:33 PM Őrhidi Mátyás 
> > > wrote:
> > > >
> > > > +1
> > > >
> > > > On Tue, Sep 5, 2023 at 8:03 AM Thomas Weise  wrote:
> > > >
> > > > > +1, thanks for the proposal
> > > > >
> > > > > On Tue, Sep 5, 2023 at 8:13 AM Gyula Fóra 
> > > wrote:
> > > > >
> > > > > > Hi All!
> > > > > >
> > > > > > @Maximilian Michels  has raised the question of
> > > Flink
> > > > > > version support in the operator before the last release. I would
> > > like to
> > > > > > open this discussion publicly so we can finalize this before the
> > next
> > > > > > release.
> > > > > >
> > > > > > Background:
> > > > > > Currently the Flink Operator supports all Flink versions since
> > > > > > Flink 1.13.
> > > > > > While this is great for the users, it introduces a lot of
> > > > > > backward-compatibility-related code in the operator logic and
> > > > > > also adds considerable time to the CI. We should strike a
> > > > > > reasonable balance here that allows us to move forward and
> > > > > > eliminate some of this tech debt.
> > > > > >
> > > > > > In the current model it is also impossible to support all
> > > > > > features for all Flink versions, which leads to some confusion
> > > > > > over time.
> > > > > >
> > > > > > Proposal:
> > > > > > Since it's a key feature of the Kubernetes operator to support
> > > > > > several versions at the same time, I propose to support the last
> > > > > > 4 stable Flink minor versions. Currently this would mean
> > > > > > supporting Flink 1.14-1.17 (and dropping 1.13 support). When
> > > > > > Flink 1.18 is released we would drop 1.14 support, and so on.
> > > > > > Given the Flink release cadence this means about 2 years of
> > > > > > support for each Flink version.
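The proposed rolling window can be stated as a tiny function. This is purely an illustration of the policy, not operator code:

```java
// Illustration of the proposed policy: the operator supports the latest
// four stable Flink minor versions; releasing 1.(N+1) drops 1.(N-3).
import java.util.ArrayList;
import java.util.List;

public class OperatorSupportPolicySketch {
    static List<String> supportedVersions(int latestMinor) {
        List<String> supported = new ArrayList<>();
        for (int minor = latestMinor - 3; minor <= latestMinor; minor++) {
            supported.add("1." + minor);
        }
        return supported;
    }

    public static void main(String[] args) {
        System.out.println(supportedVersions(17)); // [1.14, 1.15, 1.16, 1.17]
        System.out.println(supportedVersions(18)); // [1.15, 1.16, 1.17, 1.18]
    }
}
```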
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Cheers,
> > > > > > Gyula
> > > > > >
> > > > >
> > >
> >
>


Re: [VOTE] FLIP-355: Add parent dir of files to classpath using yarn.provided.lib.dirs

2023-09-13 Thread Yang Wang
+1 (binding)

Best,
Yang

Becket Qin wrote on Thu, Sep 14, 2023 at 11:01:

> +1 (binding)
>
> Thanks for the FLIP, Archit.
>
> Cheers,
>
> Jiangjie (Becket) Qin
>
>
> On Thu, Sep 14, 2023 at 10:31 AM Dong Lin  wrote:
>
> > Thanks Archit for the FLIP.
> >
> > +1 (binding)
> >
> > Regards,
> > Dong
> >
> > On Thu, Sep 14, 2023 at 1:47 AM Archit Goyal
>  > >
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Thanks for reviewing the FLIP-355 Add parent dir of files to classpath
> > > using yarn.provided.lib.dirs :
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP+355%3A+Add+parent+dir+of+files+to+classpath+using+yarn.provided.lib.dirs
> > >
> > > Following is the discussion thread :
> > > https://lists.apache.org/thread/gv0ro4jsq4o206wg5gz9z5cww15qkvb9
> > >
> > > I'd like to start a vote for it. The vote will be open for at least 72
> > > hours (until September 15, 12:00AM GMT) unless there is an objection or
> > an
> > > insufficient number of votes.
> > >
> > > Thanks,
> > > Archit Goyal
> > >
> >
>


Re: [Discuss] FLIP-355: Add parent dir of files to classpath using yarn.provided.lib.dirs

2023-08-23 Thread Yang Wang
+1 for this FLIP

Maybe a FLIP is overkill for this enhancement.


Best,
Yang

Venkatakrishnan Sowrirajan wrote on Wed, Aug 23, 2023 at 01:44:

> Thanks for the FLIP, Archit.
>
> This is definitely quite a useful addition to *yarn.provided.lib.dirs*. +1.
>
> IMO, except for the fact that *yarn.provided.lib.dirs* (platform-specific
> jars can be cached) takes only directories whereas *yarn.ship-files* (user
> files) takes both files and dirs, the overall logic for constructing the
> classpath in both cases should be roughly the same.
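A minimal sketch of the classpath-construction rule under discussion: directories go on the classpath as-is, while plain files contribute their parent directory. This is a simplification for illustration; the real logic lives in Flink's YARN utilities:

```java
// Illustrative sketch only: for each provided path, a directory is added to
// the classpath directly, while a plain file contributes its parent dir.
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ProvidedLibClasspathSketch {
    static Set<String> classpathEntries(List<Path> providedPaths, Set<Path> knownDirs) {
        Set<String> entries = new LinkedHashSet<>();
        for (Path p : providedPaths) {
            if (knownDirs.contains(p)) {
                entries.add(p.toString());             // directory: add as-is
            } else {
                entries.add(p.getParent().toString()); // file: add its parent dir
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        Path libDir = Path.of("/flink/provided/lib");
        Path extraFile = Path.of("/flink/provided/conf/log4j.properties");
        Set<String> cp = classpathEntries(List.of(libDir, extraFile), Set.of(libDir));
        // On a POSIX filesystem this yields the lib dir plus the conf dir.
        System.out.println(cp);
    }
}
```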
>
> Referencing the PR (https://github.com/apache/flink/pull/23164) with the
> initial implementation you created as well here.
>
> Regards
> Venkata krishnan
>
>
> > On Tue, Aug 22, 2023 at 10:09 AM Archit Goyal wrote:
>
> > Hi all,
> >
> > Gentle ping if I can get a review on the FLIP.
> >
> > Thanks,
> > Archit Goyal
> >
> > From: Archit Goyal 
> > Date: Thursday, August 17, 2023 at 5:51 PM
> > To: dev@flink.apache.org 
> > Subject: [Discuss] FLIP-355: Add parent dir of files to classpath using
> > yarn.provided.lib.dirs
> > Hi All,
> >
> > I am opening this thread to discuss the proposal to add parent
> > directories of files to classpath when using yarn.provided.lib.dirs.
> > This is documented in FLIP-355 <
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP+355%3A+Add+parent+dir+of+files+to+classpath+using+yarn.provided.lib.dirs
> > >.
> >
> > This FLIP proposes enhancing YARN's classpath configuration to
> > include parent directories of files in yarn.provided.lib.dirs.
> >
> > Please feel free to reply to this email thread and share your opinions.
> >
> > Thanks,
> > Archit Goyal
> >
>


Re: [DISCUSS] FLIP-316: Introduce SQL Driver

2023-06-11 Thread Yang Wang
Sorry for the late reply. I am in favor of introducing such a built-in
resource localization mechanism based on the Flink FileSystem. Then
FLINK-28915 [1] could be the second step, which downloads the jars and
dependencies to the JobManager/TaskManager local directory before they
start working.

The first step could be done in another ticket in Flink. Or some external
Flink job management system could also take care of this.

[1]. https://issues.apache.org/jira/browse/FLINK-28915

Best,
Yang

Paul Lam wrote on Fri, Jun 9, 2023 at 17:39:

> Hi Mason,
>
> I get your point. I'm increasingly feeling the need to introduce a
> built-in file distribution mechanism for the flink-kubernetes module, just
> like Spark does with `spark.kubernetes.file.upload.path` [1].
>
> I’m assuming the workflow is as follows:
>
> - KubernetesClusterDescriptor uploads all local resources to a remote
>   storage via the Flink filesystem (skipped if the resources are already
>   remote).
> - KubernetesApplicationClusterEntrypoint downloads the resources
>   and puts them in the classpath during startup.
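The two steps above could be sketched as follows. The helper below is hypothetical and uses plain java.net.URI for brevity; the actual implementation would go through Flink's FileSystem abstraction and really copy the bytes:

```java
// Illustrative sketch of the client-side half of the proposed localization:
// rewrite local URIs to staged remote ones (uploading as it goes), and pass
// already-remote resources through untouched. All names are assumptions.
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class ResourceLocalizationSketch {
    static List<URI> uploadIfLocal(List<URI> resources, URI remoteBaseDir) {
        List<URI> rewritten = new ArrayList<>();
        for (URI resource : resources) {
            String scheme = resource.getScheme();
            if (scheme == null || scheme.equals("file")) {
                // ... actual byte copy to the remote storage elided ...
                String name = resource.getPath().substring(resource.getPath().lastIndexOf('/') + 1);
                rewritten.add(remoteBaseDir.resolve(name)); // local -> staged remote URI
            } else {
                rewritten.add(resource); // already remote: skip the upload
            }
        }
        return rewritten;
    }

    public static void main(String[] args) {
        List<URI> out = uploadIfLocal(
                List.of(URI.create("file:///tmp/udf.jar"),
                        URI.create("s3://bucket/deps/connector.jar")),
                URI.create("s3://bucket/staging/"));
        System.out.println(out); // [s3://bucket/staging/udf.jar, s3://bucket/deps/connector.jar]
    }
}
```

The entrypoint side would then be the mirror image: fetch each remote URI to a local directory and add it to the classpath before the user code starts.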
>
> I wouldn't mind splitting it into another FLIP to ensure that everything is
> done correctly.
>
> cc'ed @Yang to gather more opinions.
>
> [1]
> https://spark.apache.org/docs/latest/running-on-kubernetes.html#dependency-management
>
> Best,
> Paul Lam
>
> On Jun 8, 2023 at 12:15, Mason Chen wrote:
>
> Hi Paul,
>
> Thanks for your response!
>
> I agree that utilizing SQL Drivers in Java applications is equally
> important
>
> as employing them in SQL Gateway. WRT init containers, I think most
> users use them just as a workaround. For example, wget a jar from the
> maven repo.
>
> We could implement the functionality in SQL Driver in a more graceful
> way and the flink-supported filesystem approach seems to be a
> good choice.
>
>
> My main point is: can we solve the problem with a design agnostic of SQL
> and Stream API? I mentioned a use case where this ability is useful for
> Java or Stream API applications. Maybe this is even a non-goal to your FLIP
> since you are focusing on the driver entrypoint.
>
> Jark mentioned some optimizations:
>
> This allows SQLGateway to leverage some metadata caching and UDF JAR
> caching for better compiling performance.
>
> It would be great to see this even outside the SQLGateway (i.e. UDF JAR
> caching).
>
> Best,
> Mason
>
> On Wed, Jun 7, 2023 at 2:26 AM Shengkai Fang  wrote:
>
Hi Paul. Thanks for your update; it makes me understand the
design much better.
>
> But I still have some questions about the FLIP.
>
> For SQL Gateway, only DMLs need to be delegated to the SQL server
> Driver. I would think about the details and update the FLIP. Do you have
>
> some
>
> ideas already?
>
>
> If the application mode cannot support library mode, I think we should
> only execute INSERT INTO and UPDATE/DELETE statements in the application
> mode. AFAIK, we cannot support ANALYZE TABLE and CALL PROCEDURE
> statements. The ANALYZE TABLE syntax needs to register the statistics in the
> catalog after the job finishes, and the CALL PROCEDURE statement doesn't
> generate an ExecNodeGraph.
>
> * Introduce storage via option `sql-gateway.application.storage-dir`
>
> If we cannot support submitting the jars through web submission, +1 to
> introducing the options to upload the files. However, I think the uploader
> should be responsible for removing the uploaded jars. Can we remove the jars
> if the job is running or the gateway exits?
>
> * JobID is not available
>
> Can we use the rest client returned by ApplicationDeployer to query the job
> id? I am concerned that users don't know which job is related to the
> submitted SQL.
>
> * Do we need to introduce a new module named flink-table-sql-runner?
>
> It seems we need to introduce a new module. Will the new module be
> available in the distribution package? I agree with Jark that we don't need
> to introduce this for Table API users, since these users have their own main
> class. If we want to make it easier for users to use the k8s operator, I
> think we should modify the k8s operator repo. If we don't need to support SQL
> files, can we make this jar only visible in the sql-gateway, like we do in
> the planner loader? [1]
>
> [1]
>
>
> https://github.com/apache/flink/blob/master/flink-table/flink-table-planner-loader/src/main/java/org/apache/flink/table/planner/loader/PlannerModule.java#L95
>
> Best,
> Shengkai
>
>
>
>
>
>
>
>
Weihua Hu  wrote on Wed, Jun 7, 2023 at 10:52:
>
> Hi,
>
> Thanks for updating the FLIP.
>
> I have two cents on the distribution of SQLs and resources.
> 1. Should we support a common file distribution mechanism for k8s
> application mode?
>  I have seen some issues and requirements on the mailing list.
>  In our production environment, we implement the download command in the
> CliFrontend.
>  And automatically add an init container to the POD for file
>
> downloading.
>
> The advantage of this
>  is that we can use all Flink-supported file systems to store files.
>
>  This needs more discussion. I would 

Re: [VOTE] FLIP-312: Add Yarn ACLs to Flink Containers

2023-06-05 Thread Yang Wang
+1 (binding)

Best,
Yang

Becket Qin  wrote on Tue, Jun 6, 2023 at 10:35:

> +1 (binding)
>
> Thanks for driving the FLIP, Archit.
>
> Cheers,
>
> Jiangjie (Becket) Qin
>
> On Tue, Jun 6, 2023 at 4:33 AM Venkatakrishnan Sowrirajan <
> vsowr...@asu.edu>
> wrote:
>
> > Thanks for starting the vote on this one, Archit.
> >
> > +1 (non-binding)
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Mon, Jun 5, 2023 at 9:55 AM Archit Goyal  >
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Thanks for all the feedback for FLIP-312: Add Yarn ACLs to Flink
> > > Containers<
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-312%3A+Add+Yarn+ACLs+to+Flink+Containers
> > > >.
> > > Following is the discussion thread : Link<
> > >
> >
> https://lists.apache.org/thread/xj3ytkwj9lsl3hpjdb4n8pmy7lk3l8tv
> > > >
> > >
> > > I'd like to start a vote for it. The vote will be open for at least 72
> > > hours (until June 9th, 12:00AM GMT) unless there is an objection or an
> > > insufficient number of votes.
> > >
> > > Thanks,
> > > Archit Goyal
> > >
> >
>


Re: [DISCUSS] FLIP-312: Add Yarn ACLs to Flink Containers

2023-06-01 Thread Yang Wang
Thanks Archit Goyal for the explanation and updating the FLIP.

No more concerns from my part.
+1


Best,
Yang

Archit Goyal  wrote on Sat, May 27, 2023 at 05:19:

> Thanks Yang for review.
>
>
>   1.  FLIP-312 relies on Hadoop version 2.6.0 or later.
>   2.  I have updated the FLIP and made it more descriptive.
>   3.  ACLs apply to logs as well as permissions to kill the application.
> Also, in the PR we are setting ACLs for Task Manager
> (createTaskExecutorContext) as well as Job Manager (startAppMaster).
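
For context, YARN's application ACLs boil down to two access types — view and modify — carried in the container launch context, with the submitting user always allowed. The snippet below is a self-contained toy model of that check, not the Hadoop API:

```java
import java.util.Map;
import java.util.Set;

/** Toy model of YARN's view/modify application ACLs (not the Hadoop API). */
class AppAcls {
    enum Access { VIEW_APP, MODIFY_APP }

    private final Map<Access, Set<String>> acls;
    private final String owner;

    AppAcls(String owner, Map<Access, Set<String>> acls) {
        this.owner = owner;
        this.acls = acls;
    }

    /** The owner always has access; other users must be listed for that access type. */
    boolean isAllowed(String user, Access access) {
        if (user.equals(owner)) {
            return true;
        }
        return acls.getOrDefault(access, Set.of()).contains(user);
    }
}
```

Under this model, granting only VIEW_APP lets another user read logs through the history server without being able to kill the application.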
>
> Thanks,
> Archit Goyal
>
> From: Yang Wang 
> Date: Sunday, May 21, 2023 at 9:08 PM
> To: dev@flink.apache.org 
> Subject: Re: [DISCUSS] FLIP-312: Add Yarn ACLs to Flink Containers
> Thanks for creating this FLIP.
>
> This sounds like a useful feature to make Flink applications running on a
> YARN cluster more secure.
>
> However, I think the FLIP is still missing some important parts.
> 1. Which Hadoop versions does this FLIP rely on?
> 2. We need to describe a bit more about how the YARN ACLs work.
> 3. Do the ACLs only apply to the logs? How about the Flink JobManager UI?
>
> Best,
> Yang
>
Venkatakrishnan Sowrirajan  wrote on Sat, May 13, 2023 at 08:12:
>
> > Thanks for the FLIP, Archit.
> >
> > +1 from me as well. This would be very useful for us and others in the
> > community given the same issue was raised earlier as well.
> >
> > Regards
> > Venkata krishnan
> >
> >
> > On Fri, May 12, 2023 at 4:03 PM Becket Qin  wrote:
> >
> > > Thanks for the FLIP, Archit.
> > >
> > > The motivation sounds reasonable and it looks like a straightforward
> > > proposal. +1 from me.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Fri, May 12, 2023 at 1:30 AM Archit Goyal
> >  > > >
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I am opening this thread to discuss the proposal to support Yarn ACLs
> > to
> > > > Flink containers which has been documented in FLIP-312 <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-312%3A+Add+Yarn+ACLs+to+Flink+Containers
> > > > >.
> > > >
> > > > This FLIP proposes a Yarn application ACL mechanism for Flink
> > > > containers, to grant specific rights to users other than the
> > > > one running the Flink application job. This will restrict other users
> > > > in two ways:
> > > >
> > > >   *   view logs through the Resource Manager job history
> > > >   *   kill the application
> > > >
> > > > Please feel free to reply to this email thread and share your
> opinions.
> > > >
> > > > Thanks,
> > > > Archit Goyal
> > > >
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-312: Add Yarn ACLs to Flink Containers

2023-05-21 Thread Yang Wang
Thanks for creating this FLIP.

This sounds like a useful feature to make Flink applications running on a
YARN cluster more secure.

However, I think the FLIP is still missing some important parts.
1. Which Hadoop versions does this FLIP rely on?
2. We need to describe a bit more about how the YARN ACLs work.
3. Do the ACLs only apply to the logs? How about the Flink JobManager UI?

Best,
Yang

Venkatakrishnan Sowrirajan  wrote on Sat, May 13, 2023 at 08:12:

> Thanks for the FLIP, Archit.
>
> +1 from me as well. This would be very useful for us and others in the
> community given the same issue was raised earlier as well.
>
> Regards
> Venkata krishnan
>
>
> On Fri, May 12, 2023 at 4:03 PM Becket Qin  wrote:
>
> > Thanks for the FLIP, Archit.
> >
> > The motivation sounds reasonable and it looks like a straightforward
> > proposal. +1 from me.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Fri, May 12, 2023 at 1:30 AM Archit Goyal
>  > >
> > wrote:
> >
> > > Hi all,
> > >
> > > I am opening this thread to discuss the proposal to support Yarn ACLs
> to
> > > Flink containers which has been documented in FLIP-312 <
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-312%3A+Add+Yarn+ACLs+to+Flink+Containers
> > > >.
> > >
> > > This FLIP proposes a Yarn application ACL mechanism for Flink
> > > containers, to grant specific rights to users other than the
> > > one running the Flink application job. This will restrict other users
> > > in two ways:
> > >
> > >   *   view logs through the Resource Manager job history
> > >   *   kill the application
> > >
> > > Please feel free to reply to this email thread and share your opinions.
> > >
> > > Thanks,
> > > Archit Goyal
> > >
> > >
> >
>


Re: Flink Kubernetes Operator 1.4.0 release planning

2023-02-06 Thread Yang Wang
Thanks Gyula for driving the release again.

It is really exciting to see the auto-scaling coming out of the box.

Best,
Yang

Gyula Fóra  wrote on Mon, Feb 6, 2023 at 19:43:

> Hi Devs!
>
> Based on the previously agreed upon release schedule (
>
> https://cwiki.apache.org/confluence/display/FLINK/Release+Schedule+and+Planning
> )
> it is almost time for the 1.4.0 release.
>
> There are still a number of smaller but important PRs open for some
> critical fixes. I would like to merge those in the next 1-2 days and I
> suggest we make the release cut on Wednesday/Thursday.
>
> After that we should spend some time testing the release candidate and
> hopefully we can finalize the release next week!
>
> I volunteer as the release manager.
>
> Cheers,
> Gyula
>


Re: [VOTE] FLIP-285: Refactoring LeaderElection to make Flink support multi-component leader election out-of-the-box

2023-01-30 Thread Yang Wang
+1 (Binding)

Best,
Yang

ConradJam  wrote on Tue, Jan 31, 2023 at 12:09:

> +1 non-binding
>
> Matthias Pohl  wrote on Wed, Jan 25, 2023 at 17:34:
>
> > Hi everyone,
> > After the discussion thread [1] on FLIP-285 [2] didn't bring up any new
> > items, I want to start voting on FLIP-285. This FLIP will not only align
> > the leader election code base again through FLINK-26522 [3]. I also plan
> to
> > improve the test coverage for the leader election as part of this change
> > (covered in FLINK-30338 [4]).
> >
> > The vote will remain open until at least Jan 30th (at least 72 hours)
> > unless there are some objections or insufficient votes.
> >
> > Best,
> > Matthias
> >
> > [1] https://lists.apache.org/thread/qrl881wykob3jnmzsof5ho8b9fgkklpt
> > [2]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> > [3] https://issues.apache.org/jira/browse/FLINK-26522
> > [4] https://issues.apache.org/jira/browse/FLINK-30338
> >
> > --
> >
> > Matthias Pohl
> >
> > Software Engineer, Aiven
> >
> > matthias.p...@aiven.io 
> >
> > aiven.io
> >
> > Aiven Deutschland GmbH
> >
> > Immanuelkirchstraße 26, 10405 Berlin
> >
> > Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> >
> > Amtsgericht Charlottenburg, HRB 209739 B
> >
>
>
> --
> Best
>
> ConradJam
>


Re: [DISCUSS] FLIP-285: Refactoring the leader election code base in Flink

2023-01-24 Thread Yang Wang
Having the *start()* in *LeaderContender* interface and bringing back the
*LeaderElection* with some new methods make sense to me.

I have no more concerns now.


>- *LeaderContender*: The LeaderContender is integrated as usual except
>that it accesses the LeaderElection instead of the LeaderElectionService.
>It's going to call startLeaderElection(LeaderContender) where, previously,
>LeaderElectionService.start(LeaderContender) was called.
>
> nit: we call the *LeaderElection#startLeaderElection()*, not the
*LeaderElection#startLeaderElection(LeaderContender)*. Because we have
already set the leaderContender in the
*LeaderElection#register(LeaderContender)*.
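
For readers following along, the register -> startLeaderElection -> grant -> confirm handshake discussed above can be modeled with a toy example. These are not the actual Flink interfaces — just a minimal, single-threaded illustration of the call order, with made-up method bodies:

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the register -> startLeaderElection -> grant -> confirm flow. */
class LeaderElection {
    private final AtomicReference<String> confirmedSession = new AtomicReference<>();
    private LeaderContender contender;

    /** The contender is stored before any leadership can be granted. */
    void register(LeaderContender contender) {
        this.contender = contender;
    }

    /** No LeaderContender argument needed: it was already set in register(). */
    void startLeaderElection() {
        // a real driver would wait on the HA backend; here we grant at once
        contender.grantLeadership("session-1");
    }

    void confirmLeadership(String sessionId) {
        confirmedSession.set(sessionId);
    }

    String confirmedSession() {
        return confirmedSession.get();
    }
}

interface LeaderContender {
    void grantLeadership(String sessionId);
}
```

The nit above is visible in startLeaderElection(): because register() already stored the contender, the start call needs no argument.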


Best,
Yang


Chesnay Schepler  wrote on Mon, Jan 23, 2023 at 23:16:

> Thanks for updating the design. From my side this looks good.
>
> On 18/01/2023 17:59, Matthias Pohl wrote:
> > After another round of discussion, I came up with a (hopefully) final
> > proposal. The previously discussed approach was still not optimal because
> > the contender ID lived in the LeaderContender even though it is actually
> > LeaderElectionService-internal knowledge. Fixing that helped fix the
> > overall architecture. Additionally, it brought back the LeaderElection
> > interface with slightly different methods.
> >
> > I updated the "Code Cleanup: Merge MultipleComponentLeaderElectionService
> > into LeaderElectionService" section and moved the old proposal into the
> > section for rejected alternatives. Feel free to have another look at the
> > updated version [1].
> >
> > Matthias
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
> >
> > On Wed, Jan 18, 2023 at 1:40 PM Matthias Pohl 
> > wrote:
> >
> >> Thanks for participating in the discussion, Yang & Chesnay.
> LeaderElection
> >> interface extension gave me a headache as well. I added it initially
> >> because I thought it would be of more value. But essentially, it doesn't
> >> help but make the code harder to understand (as your questions
> rightfully
> >> point out). I agree that the FLIP is good enough without this
> extension. I
> >> moved it into the Rejected Alternatives section of the FLIP and would
> >> propose going ahead without it.
> >>
> >> I will answer your questions about the LeaderElection extension, anyway:
> >>
> >> BTW, if the *LeaderElectionService#register(return LeaderElection)* and
> >>> *LeaderElectionService#onGrantLeadership* are guarded by the same lock,
> then
> >>> we could ensure that the leaderElection in *LeaderContender* is always
> >>> non-null when it tries to confirm the leadership. And then we do not
> need
> >>> the
> >>> *LeaderContender#initializeLeaderElection*. Right?
> >>
> >> No, we still would need LeaderContender#initializeLeaderElection because
> >> the LeaderElectionService needs to be capable of setting the
> LeaderElection
> >> within the LeaderContender before triggering the process for granting
> the
> >> leadership. This all needs to happen within the
> >> LeaderElectionService#register(LeaderContender). It's independent of the
> lock.
> >>
> >> With the extension, how does the leader contender get access to the
> >>> LeaderElection? I would've assumed that LEService returns a
> LeaderElection
> >>> when register is called, but according to the diagram this method
> doesn't
> >>> return anything. Is that what initiateLeaderElection is doing?
> >>
> >> Correct. My initial plan was to make
> >> LeaderElectionService#register(LeaderContender) return the
> LeaderElection
> >> instance. That method could have been called within the LeaderContender.
> >> But this approach has the flaw that LeaderContender would be in charge
> >> within this control flow where, actually, we would want
> >> LeaderElectionService to be still in charge to trigger the process for
> >> granting the leadership. This required the
> >> LeaderContender.initializeLeaderElection(LeaderElection) method to be
> added
> >> to enable the LeaderElectionService to do the initialization. I added a
> >> comment to the corresponding class diagram to make this clearer.
> >>
> >> The DefaultLeaderElection will rely on package-private methods of the
> >>> DLEService to handle confirm/hasLeadership calls?
> >>
> >> Correct. I added the missing package-private methods to the class
> diagram
> >> in the FLIP to clear things up.
> >>
> >> On Wed, Jan 18, 2023 at 11:47 AM Chesnay Schepler 
> >> wrote:
> >>
> >>> There are a lot of good things in this, and until the Extension bit I'm
> >>> fully on board.
> >>>
> >>> With the extension, how does the leader contender get access to the
> >>> LeaderElection? I would've assumed that LEService returns a
> >>> LeaderElection when register is called, but according to the diagram
> >>> this method doesn't return anything. Is that what
> initiateLeaderElection
> >>> is doing?
> >>>
> >>> The DefaultLeaderElection will rely on package-private 

Re: [DISCUSS] FLIP-285: Refactoring the leader election code base in Flink

2023-01-17 Thread Yang Wang
Thanks Matthias for the detailed explanation.

For the HA backend data structure, you are right. Even though the different
components are running in the same JVM,
they have completely different connection info. But it is not urgent to
use a single ZNode to store multiple connection entries for now.

I lean towards not introducing the *LeaderElection* since it does not bring
many benefits.

BTW, if the *LeaderElectionService#register(return LeaderElection)* and
*LeaderElectionService#onGrantLeadership* are guarded by the same lock, then
we could ensure that the leaderElection in *LeaderContender* is always
non-null when it tries to confirm the leadership. And then we would not need
*LeaderContender#initializeLeaderElection*. Right?
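
A minimal sketch of that locking argument (the names are illustrative only, not the real Flink classes): because register() assigns the contender's LeaderElection under the same lock that onGrantLeadership() takes, a grant can never observe a contender without its LeaderElection.

```java
/**
 * Illustrative only: shows why guarding register() and onGrantLeadership()
 * with the same lock would make an explicit initializeLeaderElection
 * callback unnecessary.
 */
class SketchLeaderElectionService {
    private final Object lock = new Object();
    private SketchContender contender;

    void register(SketchContender c) {
        synchronized (lock) {
            c.leaderElection = new Object(); // stand-in for the real LeaderElection
            this.contender = c;
            // onGrantLeadership() blocks on the same lock, so it cannot run
            // between these two assignments
        }
    }

    void onGrantLeadership() {
        synchronized (lock) {
            if (contender != null) {
                contender.grant();
            }
        }
    }
}

class SketchContender {
    Object leaderElection;
    boolean electionWasSetOnGrant;

    void grant() {
        // with the shared lock, leaderElection is guaranteed non-null here
        electionWasSetOnGrant = (leaderElection != null);
    }
}
```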


Best,
Yang

Matthias Pohl  wrote on Tue, Jan 17, 2023 at 20:31:

> Thanks Yang for getting back to me.
>
> I checked the connection information that's stored in the HA backend once
> more. My previous proposal is based on a wrong assumption: The address we
> store is the RPC endpoint's address. That address should be unique per
> component which means that we shouldn't change how the HA backend stores
> the connection information. We don't store redundant information here. As a
> consequence, I had to reiterate over my proposal in FLIP-285. We're now
> relying on contender IDs that are going to be used internally for storing
> the connection information in the HA backend (analogously to how it's done
> in the MultipleComponentLeaderElectionService implementation right now).
>
> Additionally, I came to the conclusion that it's actually not really
> necessary to add the LeaderElection interface. Working with the contender
> IDs to identify the LeaderContender in the LeaderElectionService might be
> good enough. But I still kept the LeaderElection interface as an (optional)
> extension of FLIP-285 as it might improve testability a bit. I added some
> diagrams and descriptions to the FLIP hoping that this helps answer your
> questions, Yang.
>
> Best,
> Matthias
>
> On Mon, Jan 16, 2023 at 8:41 AM Yang Wang  wrote:
>
> > Thanks Matthias for updating the FLIP.
> >
> > Given that we do not touch the dedicated ConfigMap for the checkpoint,
> this
> > FLIP will not affect the LAST_STATE recovery for the Flink Kubernetes
> > Operator.
> >
> > # LeaderContender#initiateLeaderElection
> > Could we simply execute *leaderElectionService.register(contender)* just
> > in the same place as the current *leaderElectionService.start(contender)*,
> > not within the constructor?
> > Then we could ensure that the *onGrantLeadership* does not happen before
> > the contender is ready.
> >
> > # LeaderElection#confirmLeadership
> > Could you please share more about how *LeaderElection#confirmLeadership*
> > works? Will it directly call *LeaderElectionDriver#writeLeaderInformation*,
> > or will *LeaderElectionService* be the adapter?
> >
> > # LeaderRetrievalService
> > This might be out the scope of this FLIP :)
> > Do we need to introduce a corresponding
> > *MultipleComponentLeaderRetrievalService*, especially when we are using
> > only one ZNode/ConfigMap for storing all the leader information?
> > For Kubernetes HA, we have already use a shared ConfigMap watcher for all
> > the leader retrieval services.
> > However, for ZK HA, each leader retrieval service still has a dedicated
> > *TreeCache*.
> >
> >
> > Best,
> > Yang
> >
> >
> >
> > Matthias Pohl  wrote on Thu, Jan 12, 2023 at 22:07:
> >
> > > Thanks Yang Wang for sharing your view on this. Please find my
> responses
> > > below.
> > >
> > > # HA data format in the HA backend(e.g. ZK, K8s ConfigMap)
> > > > We have already changed the HA data format after introducing the
> > multiple
> > > > component leader election in FLINK-24038. For K8s HA,
> > > the number of ConfigMaps was reduced from 4 to 2. Since we only have one
> > leader
> > > > elector, the K8s APIServer load should also be reduced.
> > > > Why do we still need to change the format again? This just prevents
> the
> > > > LAST_STATE upgrade mode in Flink-Kubernetes-Operator
> > > > when the Flink version changed, even though it is a simple job and
> > state
> > > is
> > > > compatible.
> > > >
> > >
> > > The intention of this remark is that we could reduce the number of
> > > redundant records (the
> > ZooKeeperMultipleComponentLeaderElectionHaServices'
> > > JavaDoc [1] visualizes the redundancy quite well since each of these
> > > connection_info records would contain 

Re: Supplying jar stored at S3 to flink to run the job in kubernetes

2023-01-16 Thread Yang Wang
Do you build your own flink-kubernetes-operator image with the flink-s3-fs
plugin bundled[1]?

[1].
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.3/docs/custom-resource/overview/#flinksessionjob-spec-overview

Best,
Yang

Weihua Hu  wrote on Tue, Jan 17, 2023 at 10:47:

> Hi, Rahul
>
> User support and questions should be sent to the user mailing list (
> u...@flink.apache.org)
>
> You can resend the issue to the user mailing list with a detailed error
> log.
>
> Best,
> Weihua
>
>
> On Mon, Jan 16, 2023 at 11:18 PM rahul sahoo 
> wrote:
>
> > I have been following the examples mentioned here:
> > flink-kubernetes-operator_examples.
> > I'm testing this on the local minikube. I have deployed minio for s3 and
> > flink operator.
> >
> > I have my application jar in s3(using minio for this). I have deployed
> the
> > flink session deployment in minikube and want to submit the job as
> > mentioned in basic-session-deployment-and-job.yaml
> > <
> >
> https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-session-deployment-and-job.yaml
> > >
> >
> > I want to replace the `https://` to `s3a://` in this line
> > <
> >
> https://github.com/apache/flink-kubernetes-operator/blob/92034fa912f39f5c8bd57632295c7ca85801f33a/examples/basic-session-deployment-and-job.yaml#L43
> > >.
> > The final URL should look like
> > `s3a://local-bkt/flink-examples-streaming_2.12-1.16.0.jar`. I'm using
> > flink_2.12-1.16.0 with s3 plugin in docker image.
> >
> > Can anyone help me to solve this?
> >
> > Thank You,
> > Rahul Sahoo
> >
>


Re: [DISCUSS] FLIP-285: Refactoring the leader election code base in Flink

2023-01-15 Thread Yang Wang
Thanks Matthias for updating the FLIP.

Given that we do not touch the dedicated ConfigMap for the checkpoint, this
FLIP will not affect the LAST_STATE recovery for the Flink Kubernetes
Operator.

# LeaderContender#initiateLeaderElection
Could we simply execute *leaderElectionService.register(contender)* just in
the same place as the current *leaderElectionService.start(contender)*, not
within the constructor?
Then we could ensure that the *onGrantLeadership* does not happen before
the contender is ready.

# LeaderElection#confirmLeadership
Could you please share more about how *LeaderElection#confirmLeadership* works?
Will it directly call *LeaderElectionDriver#writeLeaderInformation*, or will
*LeaderElectionService* be the adapter?

# LeaderRetrievalService
This might be out the scope of this FLIP :)
Do we need to introduce a corresponding
*MultipleComponentLeaderRetrievalService*, especially when we are using
only one ZNode/ConfigMap for storing all the leader information?
For Kubernetes HA, we already use a shared ConfigMap watcher for all
the leader retrieval services.
However, for ZK HA, each leader retrieval service still has a dedicated
*TreeCache*.


Best,
Yang



Matthias Pohl  wrote on Thu, Jan 12, 2023 at 22:07:

> Thanks Yang Wang for sharing your view on this. Please find my responses
> below.
>
> # HA data format in the HA backend(e.g. ZK, K8s ConfigMap)
> > We have already changed the HA data format after introducing the multiple
> > component leader election in FLINK-24038. For K8s HA,
> > the number of ConfigMaps was reduced from 4 to 2. Since we only have one leader
> > elector, the K8s APIServer load should also be reduced.
> > Why do we still need to change the format again? This just prevents the
> > LAST_STATE upgrade mode in Flink-Kubernetes-Operator
> > when the Flink version changed, even though it is a simple job and state
> is
> > compatible.
> >
>
> The intention of this remark is that we could reduce the number of
> redundant records (the ZooKeeperMultipleComponentLeaderElectionHaServices'
> JavaDoc [1] visualizes the redundancy quite well since each of these
> connection_info records would contain the very same information). We're
> saving the same connection_info for each of the componentIds (e.g.
> resource_manager, dispatcher, ...) right now. My rationale was that we only
> need to save the connection info once per LeaderElectionDriver, i.e.
> LeaderElectionService. It's an implementation detail of the
> LeaderElectionService implementation to know what components it owns.
> Therefore, I suggested that we would have a unique ID per
> LeaderElectionService instance with a single connection_info that is used
> by all the components that are registered to that service. If we decide to
> have a separate LeaderElectionService for a specific component (e.g. the
> resource manager) in the future, this would end up having a separate
> ConfigMap in k8s or separate ZNode in ZooKeeper.
>
> I added these details to the FLIP [2]. That part, indeed, was quite poorly
> described there initially.
>
> I don't understand how the leader election affects the LAST_STATE changes
> in the Kubernetes Operator, though. We use a separate ConfigMap for the
> checkpoint data [3]. Can you elaborate a little bit more on your concern?
>
> [1]
>
> https://github.com/apache/flink/blob/8ddfd590ebba7fc727e79db41b82d3d40a02b56a/flink-runtime/src/main/java/org/apache/flink/runtime/highavailability/zookeeper/ZooKeeperMultipleComponentLeaderElectionHaServices.java#L47-L61
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box#ha-backend-data-schema
> [3]
>
> https://github.com/apache/flink/blob/2770acee1bc4a82a2f4223d4a4cd6073181dc840/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/highavailability/KubernetesMultipleComponentLeaderElectionHaServices.java#L163
>
>
> > # LeaderContender#initiateLeaderElection
> > I do not get your point why we need the *initiateLeaderElection* in
> > *LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership*
> > could be executed as soon as the registration happens.
> >
>
> That's the change I'm not really happy about. I'm still not sure whether I
> found the best solution here. The problem is the way the components are
> initialized. The initial plan was to call the
> LeaderElectionService.register(LeaderContender) from within the
> LeaderContender constructor which would return the LeaderElection instance
> that would be used as the adapter for the contender to confirm leadership.
> Therefore, the LeaderContender has to own the LeaderElection instance to be
> able to participate in the leader election handshake (i.e. grant leadershi

Re: [DISCUSS] FLIP-285: Refactoring the leader election code base in Flink

2023-01-12 Thread Yang Wang
Thanks Matthias for preparing this thorough FLIP, which has taken us
reviewing the multiple component leader election.

I am totally in favor of doing the code clean-up. The current
implementation does not have very good readability due to legacy
compatibility.
And I just have a few comments.

# HA data format in the HA backend(e.g. ZK, K8s ConfigMap)
We have already changed the HA data format after introducing the multiple
component leader election in FLINK-24038. For K8s HA,
the number of ConfigMaps was reduced from 4 to 2. Since we only have one leader
elector, the K8s APIServer load should also be reduced.
Why do we still need to change the format again? This just prevents the
LAST_STATE upgrade mode in Flink-Kubernetes-Operator
when the Flink version changes, even though it is a simple job and the state is
compatible.

# LeaderContender#initiateLeaderElection
I do not get your point why we need the *initiateLeaderElection* in
*LeaderContender*. AFAICS, the callback *onGrant/RevokeLeadership*
could be executed as soon as the registration happens.

# When to establish the HA backend connection
+1 for establishing the connection beforehand


Best,
Yang

Matthias Pohl  wrote on Thu, Jan 5, 2023 at 21:51:

> Hi everyone,
> I brought up FLINK-26522 [1] in the mailing list discussion about
> consolidating the HighAvailabilityServices interfaces [2], previously.
> There, it was concluded that the community still wants the ability to have
> per-component leader election and, therefore, keep the
> HighAvailabilityServices interface as is. I went back to work on
> FLINK-26522 [1] to figure out how we can simplify the current codebase
> keeping the decision in mind.
>
> I wanted to handle FLINK-26522 [1] as a follow-up cleanup task of
> FLINK-24038 [3]. But while working on it, I realized that even FLINK-24038
> [3] shouldn't have been handled without a FLIP. The per-process leader
> election which was introduced in FLINK-24038 [3] changed the ownership of
> certain components. This is actually a change that should have been
> discussed in the mailing list and deserved a FLIP. To overcome this
> shortcoming of FLINK-24038 [3], I decided to prepare FLIP-285 [4] to
> provide proper documentation of what happened in FLINK-24038 and what will
> be manifested with resolving its follow-up FLINK-26522 [1].
>
> Conceptually, this FLIP proposes moving away from Flink's support for
> single-contender LeaderElectionServices and introducing multi-contender
> support by disconnecting the HA-backend leader election lifecycle from the
> LeaderContender's lifecycle. This allows us to provide LeaderElection per
> component (as it was requested in [2]) but also enables us to utilize a
> single leader election for multiple components/contenders as well without
> the complexity of the code that was introduced by FLINK-24038 [3].
>
> I'm looking forward to your comments.
>
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-26522
> [2] https://lists.apache.org/thread/9oy2ml9s3j1v6r77h31sndhc3gw57cfm
> [3] https://issues.apache.org/jira/browse/FLINK-24038
> [4]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-285%3A+Refactoring+LeaderElection+to+make+Flink+support+multi-component+leader+election+out-of-the-box
>


Re: [ANNOUNCE] New Apache Flink Committer - Lincoln Lee

2023-01-12 Thread Yang Wang
Congratulations, Lincoln!

Best,
Yang

Lincoln Lee  wrote on Thu, Jan 12, 2023 at 12:13:

> Thank you all!
>
> I'm honored to join the committers and look forward to continue working
> with the community.
>
> Best,
> Lincoln Lee
>
>
> Shengkai Fang  wrote on Thu, Jan 12, 2023 at 09:55:
>
> > Congratulations, Lincoln!
> >
> > Best,
> > Shengkai
> >
> > liu ron  wrote on Thu, Jan 12, 2023 at 09:48:
> >
> > > Congratulations, Lincoln!
> > >
> > > Best
> > >
> > > Yu Li  wrote on Thu, Jan 12, 2023 at 09:22:
> > >
> > > > Congratulations, Lincoln!
> > > >
> > > > Best Regards,
> > > > Yu
> > > >
> > > >
> > > > On Wed, 11 Jan 2023 at 21:17, Martijn Visser <
> martijnvis...@apache.org
> > >
> > > > wrote:
> > > >
> > > > > Congratulations Lincoln, happy to have you on board!
> > > > >
> > > > > Best regards, Martijn
> > > > >
> > > > >
> > > > > On Wed, Jan 11, 2023 at 1:49 PM Dong Lin 
> > wrote:
> > > > >
> > > > > > Congratulations, Lincoln!
> > > > > >
> > > > > > Cheers,
> > > > > > Dong
> > > > > >
> > > > > > On Tue, Jan 10, 2023 at 11:52 AM Jark Wu 
> wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > On behalf of the PMC, I'm very happy to announce Lincoln Lee
> as a
> > > new
> > > > > > Flink
> > > > > > > committer.
> > > > > > >
> > > > > > > Lincoln Lee has been a long-term Flink contributor since 2017.
> He
> > > > > mainly
> > > > > > > works on Flink
> > > > > > > SQL parts and drives several important FLIPs, e.g., FLIP-232
> > (Retry
> > > > > Async
> > > > > > > I/O), FLIP-234 (
> > > > > > > Retryable Lookup Join), FLIP-260 (TableFunction Finish).
> Besides,
> > > he
> > > > > also
> > > > > > > contributed
> > > > > > > much to Streaming Semantics, including the non-determinism
> > problem
> > > > and
> > > > > > the
> > > > > > > message
> > > > > > > ordering problem.
> > > > > > >
> > > > > > > Please join me in congratulating Lincoln for becoming a Flink
> > > > > committer!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jark Wu
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] FLIP-271: Autoscaling

2022-11-24 Thread Yang Wang
+1 (binding)


Best,
Yang

Őrhidi Mátyás  wrote on Thu, Nov 24, 2022 at 12:04:

> +1 (binding)
>
> On Wed, Nov 23, 2022 at 11:46 AM Gyula Fóra  wrote:
>
> > +1 (binding)
> >
> > Gyula
> >
> > On Wed, Nov 23, 2022 at 5:25 PM Maximilian Michels 
> wrote:
> >
> > > Hi everyone,
> > >
> > > I'd like to start a vote for FLIP-271 [1] which we previously discussed
> > > on the dev mailing list [2].
> > >
> > > I'm planning to keep the vote open for at least until Tuesday, Nov 29.
> > >
> > > -Max
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-271%3A+Autoscaling
> > > [2] https://lists.apache.org/thread/pvfb3fw99mj8r1x8zzyxgvk4dcppwssz
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Matyas Orhidi

2022-11-23 Thread Yang Wang
Congrats Matyas!

Best,
Yang

Maximilian Michels  wrote on Wed, Nov 23, 2022 at 23:27:

> Congrats Matyas! Finally :)
>
> On Wed, Nov 23, 2022 at 3:32 PM Martijn Visser 
> wrote:
>
> > Congratulations and welcome!
> >
> > On Wed, Nov 23, 2022 at 2:25 AM yuxia 
> wrote:
> >
> > > Congrats Matyas!
> > >
> > > Best regards,
> > > Yuxia
> > >
> > > - Original Message -
> > > From: "Peter Huang" 
> > > To: "dev" 
> > > Sent: Wednesday, Nov 23, 2022, 8:24:00 AM
> > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Matyas Orhidi
> > >
> > > Congrats!
> > >
> > > On Tue, Nov 22, 2022 at 1:47 PM Konstantin Knauf 
> > > wrote:
> > >
> > > > Congrats!
> > > >
> > > > On Tue, Nov 22, 2022 at 11:46 AM, Péter Váry <
> > > > peter.vary.apa...@gmail.com>:
> > > >
> > > > > Congratulations Mátyás!
> > > > >
> > > > > On Tue, Nov 22, 2022, 06:40 Matthias Pohl  > > > .invalid>
> > > > > wrote:
> > > > >
> > > > > > Congratulations, Matyas :)
> > > > > >
> > > > > > On Tue, Nov 22, 2022 at 11:44 AM Xingbo Huang <
> hxbks...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Congrats Matyas!
> > > > > > >
> > > > > > > Best,
> > > > > > > Xingbo
> > > > > > >
> > > > > > > > Yanfei Lei  wrote on Tue, Nov 22, 2022 at 11:18:
> > > > > > >
> > > > > > > > Congrats Matyas! 
> > > > > > > >
> > > > > > > > > Zheng Yu Chen  wrote on Tue, Nov 22, 2022 at 11:15:
> > > > > > > >
> > > > > > > > > Congratulations ~ 
> > > > > > > > >
> > > > > > > > > > Márton Balassi  wrote on Mon, Nov 21, 2022
> at 22:18:
> > > > > > > > >
> > > > > > > > > > Hi everyone,
> > > > > > > > > >
> > > > > > > > > > On behalf of the PMC, I'm very happy to announce Matyas
> > > Orhidi
> > > > > as a
> > > > > > > new
> > > > > > > > > > Flink
> > > > > > > > > > committer.
> > > > > > > > > >
> > > > > > > > > > Matyas has over a decade of experience in the Big Data
> > > > ecosystem
> > > > > > and
> > > > > > > > has
> > > > > > > > > > been working with Flink full time for the past 3 years.
> In
> > > the
> > > > > open
> > > > > > > > > source
> > > > > > > > > > community he is one of the key driving members of the
> > > > Kubernetes
> > > > > > > > Operator
> > > > > > > > > > subproject. He implemented multiple key features in the
> > > > operator
> > > > > > > > > including
> > > > > > > > > > the metrics system and the ability to dynamically
> configure
> > > > > watched
> > > > > > > > > > namespaces. He enjoys spreading the word about Flink and
> > > > > regularly
> > > > > > > does
> > > > > > > > > so
> > > > > > > > > > via authoring blogposts and giving talks or interviews
> > > > > representing
> > > > > > > the
> > > > > > > > > > community.
> > > > > > > > > >
> > > > > > > > > > Please join me in congratulating Matyas for becoming a
> > Flink
> > > > > > > committer!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Marton
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Best
> > > > > > > > >
> > > > > > > > > ConradJam
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best,
> > > > > > > > Yanfei
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > https://twitter.com/snntrable
> > > > https://github.com/knaufk
> > > >
> > >
> >
>


Re: [DISCUSS] OLM Bundles for Flink Kubernetes Operator

2022-11-23 Thread Yang Wang
Improving the visibility of Flink Kubernetes Operator is great. And I agree
OLM could help with this.

I just hope this will not make the whole release process too complicated.
Of course, if we want to integrate the OLM into the official release, it
should be easy for users to test.

Best,
Yang

Gyula Fóra  wrote on Thu, Nov 24, 2022 at 00:29:

> Ted, Jim:
>
> When we create the RC bundle (jars, sources, helm chart) we execute the
> following steps:
>  1. Push the RC tag to git -> this will generate and publish an image with
> the RC git commit tag to ghcr.io
>  2. We bake into the helm chart the RC tag as the image tag
>  3. We create the source and helm bundle, then publish it
>
> In step 3 we also need to add the OLM bundle creation and we can bake in
> the same ghcr.io image tag.
>
> Gyula
>
> On Wed, Nov 23, 2022 at 7:13 AM Jim Busche  wrote:
>
> > I'm curious how the RC automation works now - is it fully automatic?  For
> > example, a RC Debian image gets created, something like:
> > ghcr.io/apache/flink-kubernetes-operator:95128bf and is pushed
> to
> > ghcr.io … then that's used in the rc helm chart?
> >
> > If that's all automated, then that rc candidate operator image value must
> > be stored as a variable, and could be utilized to build the OLM bundle as
> > well with the same rc operator image.  Then the bundle and catalog could
> be
> > pushed to ghcr.io for testing.
> >
> >
> >
> > If it's not automated, then in the manual steps, there could a few steps
> > added to set the rc operator image value prior to running the bundle
> > creation, then manually pushing the bundle and catalog to ghcr.io for
> > testing.
> >
> >
> > Thanks, Jim
> > --
> > James Busche | Sr. Software Engineer, Watson AI and Data Open Technology
> |
> > 408-460-0737 | jbus...@us.ibm.com
> >
> >
> >
> >
> >
> >
> >
> >
> > From: Hao t Chang 
> > Date: Tuesday, November 22, 2022 at 2:55 PM
> > To: dev@flink.apache.org 
> > Subject: [EXTERNAL] [DISCUSS] OLM Bundles for Flink Kubernetes Operator
> > Hi Gyula,
> >
> > Agree, we should include the bundle files and let the community inspect them
> in
> > the staging repo. In addition, people can do a few things to test the
> > bundle files.
> > 1. Run CI test suites (
> > https://github.com/tedhtchang/olm#run-ci-test-suits-before-creating-pr )
> > with the bundle files directly.
> > 2. Deploy the operator with OLM (requires the bundle image in a
> > registry)
> > 3. Test operator upgrades from the previous version with
> > OLM (requires both bundle and catalog images in a registry)
> >
> > For 2 and 3, it’s better to build the bundle and catalog images as part of
> > the staging. For example, during the next release (1.3.0-rc1), temporarily
> > push the 2 images to
> > ghcr.io/apache/flink-kubernetes-operator-bundle:1.3.0-rc1 and
> > ghcr.io/apache/flink-kubernetes-opeator-catalog:1.3.0-rc1. Then,
> > the community can test 2 and 3 easily with the following commands:
> > # Deploy the catalog src in default ns
> > cat <<EOF | kubectl apply -f -
> > apiVersion: operators.coreos.com/v1alpha1
> > kind: CatalogSource
> > metadata:
> >   name: olm-flink-operator-catalog
> >   namespace: default
> > spec:
> >   sourceType: grpc
> >   image: ghcr.io/apache/flink-kubernetes-opeator-catalog:1.3.0-rc1
> > EOF
> >
> > # Deploy operator from the catalog
> > cat <<EOF | kubectl apply -f -
> > apiVersion: operators.coreos.com/v1alpha2
> > kind: OperatorGroup
> > metadata:
> >   name: default-og
> >   namespace: default
> > spec:
> >   targetNamespaces:
> >   - default
> > ---
> > apiVersion: operators.coreos.com/v1alpha1
> > kind: Subscription
> > metadata:
> >   name: flink-kubernetes-operator
> >   namespace: default
> > spec:
> >   channel: alpha
> >   name: flink-kubernetes-operator
> >   source: olm-flink-operator-catalog
> >   sourceNamespace: default
> >   # For testing upgrade from previous version
> >   # installPlanApproval: Automatic # Manual
> >   # startingCSV: flink-kubernetes-operator.v1.2.0
> > EOF
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Members - Godfrey He, Xingbo Huang

2022-11-23 Thread Yang Wang
Congratulations, Godfrey and Xingbo!

Best,
Yang

Jing Ge  wrote on Thu, Nov 24, 2022 at 02:06:

> Congrats, Godfrey! Congrats, Xingbo!
>
> Best regards,
> Jing
>
> On Wed, Nov 23, 2022 at 6:11 PM Maximilian Michels  wrote:
>
> > Welcome aboard Godfrey and Xingbo!
> >
> > -Max
> >
> > On Wed, Nov 23, 2022 at 5:51 PM Yun Tang  wrote:
> >
> > > Congratulations, Godfrey and Xingbo!
> > >
> > > Best
> > > Yun Tang
> > > 
> > > From: Rui Fan <1996fan...@gmail.com>
> > > Sent: Wednesday, November 23, 2022 22:31
> > > To: dev@flink.apache.org 
> > > Subject: Re: [ANNOUNCE] New Apache Flink PMC Members - Godfrey He,
> Xingbo
> > > Huang
> > >
> > > Congratulations, well deserved!
> > >
> > > Rui Fan
> > >
> > > On Wed, Nov 23, 2022 at 9:53 PM Konstantin Knauf 
> > > wrote:
> > >
> > > > Congrats to both.
> > > >
> > > > On Wed, Nov 23, 2022 at 10:45 AM, yu zelin <
> > > yuzelin@gmail.com
> > > > >:
> > > >
> > > > > Congratulations,Godfrey and Xingbo!
> > > > >
> > > > > Best,
> > > > > Yu Zelin
> > > > > > On Nov 23, 2022, at 12:18, Dian Fu  wrote:
> > > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > On behalf of the Apache Flink PMC, I'm very happy to announce
> that
> > > > > Godfrey
> > > > > > He and Xingbo Huang have joined the Flink PMC!
> > > > > >
> > > > > > Godfrey He became a Flink committer in Sep 2020. His
> contributions
> > > are
> > > > > > mainly focused on the Flink table module, covering almost all
> > > important
> > > > > > parts such as Client(SQL Client, SQL gateway, JDBC driver, etc),
> > API,
> > > > SQL
> > > > > > parser, query optimization, query execution, etc. Especially in
> the
> > > > query
> > > > > > optimization part, he built the query optimization framework and
> > the
> > > > cost
> > > > > > model, improved the statistics information and made a lot of
> > > > > optimizations,
> > > > > > e.g. dynamic partition pruning, join hint, multiple input
> rewrite,
> > > etc.
> > > > > In
> > > > > > addition, he has done a lot of groundwork, such as refactoring
> the
> > > > > > ExecNode(which is the basis for further DAG optimizations) and
> SQL
> > > Plan
> > > > > > JSON serialization (which is a big step to support SQL job
> version
> > > > > > upgrade). Besides that, he's also helping the projects in other
> > ways,
> > > > > e.g.
> > > > > > managing releases, giving talks, etc.
> > > > > >
> > > > > > Xingbo Huang became a Flink committer in Feb 2021. His
> > contributions
> > > > are
> > > > > > mainly focused on the PyFlink module and he's the author of many
> > > > > important
> > > > > > features in PyFlink, e.g. Cython support, Python thread execution
> > > mode,
> > > > > > Python UDTF support, Python UDAF support in windowing, etc. He is
> > > also
> > > > > one
> > > > > > of the main contributors in the Flink ML project. Besides that,
> > he's
> > > > also
> > > > > > helping to manage releases, taking care of the build stability,
> > etc.
> > > > > >
> > > > > > Congratulations and welcome them as Apache Flink PMC!
> > > > > >
> > > > > > Regards,
> > > > > > Dian
> > > > >
> > > > >
> > > >
> > > > --
> > > > https://twitter.com/snntrable
> > > > https://github.com/knaufk
> > > >
> > >
> >
>


Re: [DISCUSS] FLIP-271: Autoscaling

2022-11-07 Thread Yang Wang
Thanks for the fruitful discussion and I am really excited to see that
auto-scaling is really happening for the Flink Kubernetes operator. It will
be a very important step to make long-running Flink jobs run more smoothly.

I just have some immature ideas and want to share them here.

# Resource Reservation

Since the current auto-scaling needs to fully redeploy the application, it
may fail to start due to lack of resources. I know the Kubernetes operator
could roll back to the old spec, but we would still waste a lot of time and
make things worse. I hope FLIP-250[1] (support for customized K8s
schedulers) could help in this case.

# Session cluster

Does auto-scaling have a plan to support jobs which are running in a
session cluster? It might be a different story since we could not use Flink
config options to override the job vertex parallelisms. Given that the
SessionJob is also a first-class citizen, we need to document the
limitation if it is not supported.

# Horizontal scaling vs. vertical scaling

IIUC, the current proposal does not mention vertical scaling. There might
be a chance that the memory/CPU of the TaskManager is not configured
properly, and this will cause unnecessary multiple scaling executions.

[1].
https://cwiki.apache.org/confluence/display/FLINK/FLIP-250%3A+Support+Customized+Kubernetes+Schedulers+Proposal

Best,

Yang

Maximilian Michels  wrote on Tue, Nov 8, 2022 at 00:31:

> Thanks for all the interest here and for the great remarks! Gyula
> already did a great job addressing the questions here. Let me try to
> add additional context:
>
> @Biao Geng:
>
> >1.  For source parallelisms, if the user configures a much larger value
> than normal, there should be very few pending records, though it is
> possible to optimize further. But IIUC, in the current algorithm, we will
> not take actions for this case as the backlog growth rate is almost zero.
> Is this understanding right?
>
> This is actually a corner case which we haven't exactly described in
> the FLIP yet. Sources are assumed to only be scaled according to the
> backlog but if there is zero backlog, we don't have a number to
> compute the parallelism. In this case we tune the source based on the
> utilization, just like we do for the other vertices. That could mean
> reducing the parallelism in case the source is not doing any work.
> Now, in case there is no backlog, we need to be careful that we don't
> bounce back to a higher parallelism afterwards.
>
> >2.  Compared with “scaling out”, “scaling in” is usually more dangerous
> as it is more likely to have a negative influence on the downstream jobs.
> The min/max load bounds should be useful. I am wondering if it is possible
> to have a different strategy for “scaling in” to make it more conservative.
> Or, more eagerly, allow custom autoscaling strategies (e.g., a time-based
> strategy).
>
> Gyula already mentioned the bounded scale-down. Additionally, we could
> add more conservative utilization targets for scale down. For example,
> if we targeted 60% utilization for scale-up, we might target 30%
> utilization for scale-down, essentially reducing the parallelism
> slower. Same as with the limited parallelism scale-down, in the worst
> case this will require multiple scale downs. Ideally, the metrics
> should be reliable enough such that we do not require such
> workarounds.
>
> @JunRui Lee:
>
> >In the document, I didn't find the definition of when to trigger
> autoScaling after some jobVertex reaches the threshold. If I missed it,
> please let me know.
>
> The triggering is supposed to work based on the number of metric
> reports to aggregate and the cool down time. Additionally, there are
> boundaries for the target rates such that we don't scale on tiny
> deviations of the rates. I agree that we want to prevent unnecessary
> scalings as much as possible. We'll expand on that.
>
> @Pedro Silva:
>
> >Have you considered making metrics collection getting triggered based on
> events rather than periodic checks?
>
> Ideally we want to continuously monitor the job to be able to find
> bottlenecks. Based on the metrics, we will decide whether to scale or
> not. However, if we find that the continuous monitoring is too costly,
> we might do it based on signals. Also, if there is some key-turn event
> that we must refresh our metrics for, that could also be interesting.
> A sudden spike in the backlog could warrant that.
>
> > Could the FLIP also be used to auto-scale based on state-level metrics
> at an operator level?
>
> It could but we don't want to modify the JobGraph which means we are
> bound to using task-level parallelism. Setting operator level
> parallelism would mean rebuilding the JobGraph which is a tricky thing
> to do. It would increase the solution space but also the complexity of
> finding a stable scaling configuration.
>
> @Zheng:
>
> >After the user opens (advice), it does not actually perform AutoScaling.
> It only outputs the notification form of tuning suggestions for the user's
> reference.
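The utilization-target arithmetic discussed in this thread (a target utilization for scale-up and a more conservative one for scale-down) can be sketched as follows. This is an illustrative sketch only — the function name, the default thresholds, and the exact formula are assumptions, not the actual FLIP-271 implementation.

```python
def target_parallelism(current_parallelism: int,
                       busy_ratio: float,
                       scale_up_target: float = 0.6,
                       scale_down_target: float = 0.3) -> int:
    """busy_ratio: fraction of time the vertex is busy, in [0, 1]."""
    if scale_down_target <= busy_ratio <= scale_up_target:
        return current_parallelism  # inside the band: no rescale
    # Outside the band: resize so the new utilization lands on the
    # scale-up target. Aiming the scale-down at the (higher) up-target
    # keeps us from bouncing straight back above it after the rescale.
    return max(1, round(current_parallelism * busy_ratio / scale_up_target))

print(target_parallelism(4, busy_ratio=0.9))   # overloaded -> 6
print(target_parallelism(10, busy_ratio=0.1))  # underloaded -> 2
print(target_parallelism(4, busy_ratio=0.5))   # inside the band -> 4
```

The asymmetric band is what makes scale-down conservative: a vertex must drop well below the scale-up target before its parallelism is reduced.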

Re: [DISCUSS] Repeatable cleanup of checkpoint data

2022-11-06 Thread Yang Wang
Thanks Matthias for continuously improving the clean-up process.

Given that we highly depend on the K8s APIServer for the HA implementation, I am
not in favor of storing too many entries in the ConfigMap,
nor of adding more update requests to the APIServer. So I lean towards
Proposal #2. It would work as if we reverted the current mark-deletion
in StateHandleStore and then introduced a completely new FileSystem-based
artifact clean-up mechanism.

When doing the failover, I suggest that the clean-up be processed
asynchronously. Otherwise, listing the completed checkpoints and deleting
the invalid ones will take too much time and slow down the recovery process.

Best,
Yang

Matthias Pohl  wrote on Thu, Oct 27, 2022 at 20:20:

> I would like to bring this topic up one more time. I put some more thought
> into it and created FLIP-270 [1] as a follow-up of FLIP-194 [2] with an
> updated version of what I summarized in my previous email. It would be
> interesting to get some additional perspectives on this; more specifically,
> the two included proposals about either just repurposing the
> CompletedCheckpointStore into a more generic CheckpointStore or refactoring
> the StateHandleStore interface moving all the cleanup logic from the
> CheckpointsCleaner and StateHandleStore into what's currently called
> CompletedCheckpointStore.
>
> Looking forward to feedback on that proposal.
>
> Best,
> Matthias
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-270%3A+Repeatable+Cleanup+of+Checkpoints
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore
>
> On Wed, Sep 28, 2022 at 4:07 PM Matthias Pohl 
> wrote:
>
> > Hi everyone,
> >
> > I’d like to start a discussion on repeatable cleanup of checkpoint data.
> > In FLIP-194 [1] we introduced repeatable cleanup of HA data along the
> > introduction of the JobResultStore component. The goal was to make Flink
> > being in charge of cleanup for the data it owns. The Flink cluster should
> > only shutdown gracefully after all its artifacts are removed. That way,
> one
> > would not miss abandoned artifacts accidentally.
> >
> > We forgot to cover one code path around cleaning up checkpoint data.
> > Currently, in case of an error (e.g. permission issues), checkpoints are
> > tried to be cleaned up in the CheckpointsCleaner and left like that if
> > that cleanup failed. A log message is printed. The user would be
> > responsible for cleaning up the data. This was discussed as part of the
> > release testing efforts for Flink 1.15 in FLINK-26388 [2].
> >
> > We could add repeatable cleanup in the CheckpointsCleaner. We would have
> > to make sure that all StateObject#discardState implementations are
> > idempotent. This is not necessarily the case right now (see FLINK-26606
> > [3]).
> >
> > Additionally, there is the problem of losing information about what
> > Checkpoints are subject to cleanup in case of JobManager failovers. These
> > Checkpoints are not stored as part of the HA data. Additionally,
> > PendingCheckpoints are not serialized in any way, either. None of these
> > artifacts are picked up again after a failover. I see the following
> options
> > here:
> >
> >-
> >
> >The purpose of CompletedCheckpointStore needs to be extended to become
> >a “general” CheckpointStore. It will store PendingCheckpoints and
> >CompletedCheckpoints that are marked for deletion. After a failover,
> >CheckpointsCleaner can pick up these instances again and continue with
> >the deletion process.
> >
> > The flaw of that approach is that we’re increasing the amount of data
> that
> > is stored in the underlying StateHandleStore. Additionally, we’re going
> > to have an increased number of accesses to the CompletedCheckpointStore.
> > These accesses need to happen in the main thread; more specifically,
> adding
> > PendingCheckpoints and marking Checkpoints for deletion.
> >
> >-
> >
> >We’re actually interested in cleaning up artifacts from the
> >FileSystem, i.e. the artifacts created by the StateHandleStore used
> >within the DefaultCompletedCheckpointStore containing the serialized
> >CompletedCheckpoint instance and the checkpoint’s folder containing
> >the actual operator states. We could adapt the
> CompletedCheckpointStore
> >in a way that any Checkpoint (including PendingCheckpoint) is
> >serialized and persisted on the FileSystem right away (which is
> currently
> >done within the StateHandleStore implementations when adding
> >CompletedCheckpoints to the underlying HA system). The corresponding
> >FileStateHandleObject (referring to that serialized
> CompletedCheckpoint)
> >that gets persisted to ZooKeeper/k8s ConfigMap in the end would be
> only
> >written if the CompletedCheckpoint is finalized and can be used. The
> >CheckpointsCleaner could recover any artifacts from the FileSystem and
> >cleanup anything that’s not listed in ZooKeeper/k8s 

Re: [DISCUSS ] add --jars to support users dependencies jars.

2022-10-27 Thread Yang Wang
Thanks Jacky Lau for starting this discussion.

I understand that you are trying to find a convenient way to specify
dependency jars along with the user jar. However,
let's try to narrow this down by differentiating the deployment modes.

# Standalone mode
Whether you are using the standalone mode on virtual machines or in a
Kubernetes cluster,
it is not very difficult to prepare the user jar and all the dependencies
under the $FLINK_HOME/usrlib directory.
After that, they will be loaded by the user classloader automatically.

# Yarn
We already have "--ship/-Dyarn.ship-files" to ship the dependency jars.

# Native K8s
Currently, only a local user jar in the image is supported, and
users could not specify dependency jars.
A feasible solution is using an init-container (via the pod template[1]) to
download the user jar and dependencies and then mount them to the usrlib
directory.


All in all, I am trying to understand your point about why we need "--jars" to
specify the dependency jars, and which deployment mode it would support.


Best,
Yang

[1].
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#pod-template
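The init-container approach mentioned above can be sketched roughly as below. The init-container image, the download URL, and the jar names are placeholders; "flink-main-container" is the reserved container name that the native K8s pod template mechanism merges the Flink settings into.

```shell
# Sketch: a pod template whose init container downloads dependency jars
# into an emptyDir volume mounted at /opt/flink/usrlib (hypothetical
# image/URL/jar names).
cat > pod-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pod-template-with-deps
spec:
  initContainers:
    - name: fetch-user-artifacts
      image: busybox:1.36
      command: ["sh", "-c",
        "wget -O /opt/flink/usrlib/my-deps.jar https://example.com/my-deps.jar"]
      volumeMounts:
        - name: usrlib
          mountPath: /opt/flink/usrlib
  containers:
    # "flink-main-container" is the reserved name the pod template merges on.
    - name: flink-main-container
      volumeMounts:
        - name: usrlib
          mountPath: /opt/flink/usrlib
  volumes:
    - name: usrlib
      emptyDir: {}
EOF

# Then the native K8s deployment would reference the template, e.g.:
# flink run-application --target kubernetes-application \
#   -Dkubernetes.pod-template-file=pod-template.yaml \
#   local:///opt/flink/usrlib/my-job.jar
```

With this, the jars land in usrlib before the main container starts, so they are picked up by the user classloader just like in standalone mode.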



Martijn Visser  wrote on Thu, Oct 27, 2022 at 14:49:

> Hi Jacky Lau,
>
> Since you've sent the email to multiple mailing lists, I've decided to
> reply to the one that you've sent to both the Dev and User ML.
>
> > but it is not possible for platform users to create fat jars to package
> all their dependencies into the final jar package
>
> Can you elaborate on why that's not possible?
>
> Best regards,
>
> Martijn
>
> On Thu, Oct 27, 2022 at 6:59 AM Jacky Lau  wrote:
>
> > Hi guys:
> >
> > I'd like to initiate a discussion about adding command-line arguments to
> > support user-dependent jar packages.
> >
> > Currently Flink supports the user's main jar through -jarfile. Without
> > setting this, the Flink client will treat the first command-line argument
> > that cannot be parsed as the user's main jar package. But it is not
> > possible for platform users to create fat jars to package all their
> > dependencies into the final jar package. In the meantime, the
> > configuration pipeline.jars is currently exposed, and this value is
> > overridden by command-line arguments such as -jarfile.
> >
> > And if the user is using both the command-line argument and the
> > pipeline.jars argument, this can make the user confused. In
> > addition, we should specify the priority "command line parameter > -D
> > dynamic parameter > flink-conf.yml configuration file parameter" in the
> > docs.
> >
>


Re: [VOTE] Release 1.16.0, release candidate #2

2022-10-25 Thread Yang Wang
+1(binding)

* Built from source
* Verified signature and checksum
* Build docker image with flink binary
* Submit/stop a streaming and batch job with Flink Kubernetes Operator and
everything works well
* Check the metrics and logs via ingress webUI


Best,
Yang
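The signature and checksum verification mentioned in these votes is typically done with standard tooling; the release file names below are illustrative, and the checksum round-trip is demonstrated on a locally created file so it can actually run.

```shell
# Typical release-verification commands (illustrative file names):
#   gpg --verify flink-1.16.0-src.tgz.asc flink-1.16.0-src.tgz
#   sha512sum -c flink-1.16.0-src.tgz.sha512
# Demonstrate the checksum round-trip on a locally created file:
echo "release artifact contents" > artifact.tgz
sha512sum artifact.tgz > artifact.tgz.sha512
sha512sum -c artifact.tgz.sha512   # prints "artifact.tgz: OK" on success
```

`sha512sum -c` exits non-zero if the artifact was modified after the checksum file was produced, which is what the verification relies on.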

Mason Chen  wrote on Tue, Oct 25, 2022 at 14:43:

> +1 (non-binding)
>
> * Built from source
> * Verified signature and checksum
> * Verified behavior/metrics/logs with internal stateful applications using
> the Kafka source/sink connectors on K8s
>
> Best,
> Mason
>
> On Mon, Oct 24, 2022 at 11:16 PM Leonard Xu  wrote:
>
> >
> > > BTW, the "Add New" button in the "Submit New Job" tab doesn't work in my
> local
> > > standalone cluster, is this as expected?
> >
> > I checked this case; it works well in my local env (macOS + Chrome). It
> > should be an issue with your env.
> >
> > Best,
> > Leonard Xu
> >
> >
> >
> >
> >
> >
> >
> > >> +1 (non-binding) for this candidate
> > >>
> > >>  *   Built from the source code.
> > >>  *   Ran batch wordcount jobs with slow nodes of different source
> types
> > on
> > >> the yarn cluster.
> > >>  *   The new source speculative execution works as expected, the
> result
> > is
> > >> expected, no suspicious log output.
> > >>  *   Slow nodes are successfully added to the blocklist and
> subsequently
> > >> removed as expected.
> > >>  *   Ran large parallelism batch jobs and performance does not
> degrade.
> > >>
> > >> Best,
> > >> JunRui
> > >>
> > >> yuxia  wrote on Tue, Oct 25, 2022 at 09:23:
> > >>
> > >>> +1 (non-binding)
> > >>> * Build from source
> > >>> * Use Flink Sql client create catalog/tables
> > >>> * Use Hive dialect to run some queries and insert statements
> > >>>
> > >>> Best regards,
> > >>> Yuxia
> > >>>
> > >>> - Original Message -
> > >>> From: "Teoh, Hong" 
> > >>> To: "dev" 
> > >>> Sent: Tuesday, Oct 25, 2022, 4:35:39 AM
> > >>> Subject: Re: [VOTE] Release 1.16.0, release candidate #2
> > >>>
> > >>> +1 (non-binding)
> > >>>
> > >>> * Hashes and Signatures look good
> > >>> * All required files on dist.apache.org
> > >>> * Tag is present in Github
> > >>> * Verified source archive does not contain any binary files
> > >>> * Source archive builds using maven
> > >>> * Deployed standalone session cluster and ran TopSpeedWindowing
> example
> > >> in
> > >>> streaming with checkpointing enabled. Looks ok
> > >>>
> > >>> Cheers,
> > >>> Hong
> > >>>
> > >>> On 24/10/2022, 16:06, "Gyula Fóra"  wrote:
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>+1 (binding)
> > >>>
> > >>>* Verified checksums/GPG signatures
> > >>>* Built from source
> > >>>* Tested with Kubernetes operator, including simple jobs,
> > >>> checkpointing etc.
> > >>>* Metrics, logs look good.
> > >>>
> > >>>Gyula
> > >>>
> > >>>On Mon, Oct 24, 2022 at 4:54 PM Matthias Pohl
> > >>> wrote:
> > >>>
> >  +1 (non-binding)
> > 
> >  * Downloaded artifacts
> >  * Verified checksums/GPG signatures
> >  * Compared checkout with provided sources
> >  * Verified pom file versions
> >  * Went over NOTICE file/pom files changes without finding anything
> >  suspicious
> >  * Build Flink from sources
> >  * Deployed standalone session cluster and ran WordCount example in
> > >>> batch
> >  and streaming: Nothing suspicious in log files found
> > 
> >  On Mon, Oct 24, 2022 at 3:51 PM Sergey Nuyanzin <
> > >> snuyan...@gmail.com
> > 
> >  wrote:
> > 
> > > +1 (non-binding)
> > >
> > > - checked hashes and signatures
> > > - built from sources
> > > - started cluster, ran different simple jobs
> > > - checked sql client
> > >
> > >
> > > On Mon, Oct 24, 2022 at 3:14 PM Leonard Xu 
> > >>> wrote:
> > >
> > >> +1 (non-binding)
> > >>
> > >> - verified signatures and hashsums
> > >> - built from source code succeeded
> > >> - checked all dependency artifacts are 1.16
> > >> - started a cluster, ran a wordcount job, the result is
> > >>> expected, no
> > >> suspicious log output
> > >> - started SQL Gateway, tested several rest APIs, the SQL query
> > >>> results
> > > are
> > >> expected
> > >>
> > >> Best,
> > >> Leonard Xu
> > >>
> > >>
> > >>> On Oct 24, 2022, at 8:49 PM, Xingbo Huang  wrote:
> > >>>
> > >>> +1 (non-binding)
> > >>>
> > >>> - verify signatures and checksums
> > >>> - no binaries found in source archive
> > >>> - build from source code
> > >>> - verify python wheel package contents
> > >>> - pip install apache-flink-libraries and apache-flink wheel
> > >>> packages
> > >>> - thread mode works as expected in Python DataStream API
> > >>> - the Python DataStream Window works as expected
> > >>> - the Python Sideoutput works as expected
> > >>>
> > >>> 

Re: [DISCUSS] Changing the minimal supported version of Hadoop to 2.10.2

2022-10-20 Thread Yang Wang
Given that we do not bundle any hadoop classes in the Flink binary, do you
mean simply bumping the hadoop version in the parent pom?
If so, why don't we use the latest stable hadoop version 3.3.4? It
seems that our cron build has verified that hadoop 3 could work.

Best,
Yang

David Morávek  wrote on Wed, Oct 19, 2022 at 16:29:

> +1; anything below 2.10.x seems to be EOL
>
> Best,
> D.
>
> On Mon, Oct 17, 2022 at 10:48 AM Márton Balassi 
> wrote:
>
> > Hi Martjin,
> >
> > +1 for 2.10.2. Do you expect to have bandwidth in the near term to
> > implement the bump?
> >
> > On Wed, Oct 5, 2022 at 5:00 PM Gabor Somogyi 
> > wrote:
> >
> > > Hi Martin,
> > >
> > > Thanks for bringing this up! Lately I was thinking about bumping the
> > hadoop
> > > version to at least 2.6.1 to clean up issues like this:
> > >
> > >
> >
> https://github.com/apache/flink/blob/8d05393f5bcc0a917b2dab3fe81a58acaccabf13/flink-filesystems/flink-hadoop-fs/src/main/java/org/apache/flink/runtime/util/HadoopUtils.java#L157-L159
> > >
> > > All in all +1 from my perspective.
> > >
> > > Just a question here. Are we stating the minimum Hadoop version for
> users
> > > somewhere in the docs, or do they need to find it out from source code like
> > > this?
> > >
> > >
> >
> https://github.com/apache/flink/blob/3a4c11371e6f2aacd641d86c1d5b4fd86435f802/tools/azure-pipelines/build-apache-repo.yml#L113
> > >
> > > BR,
> > > G
> > >
> > >
> > > On Wed, Oct 5, 2022 at 5:02 AM Martijn Visser <
> martijnvis...@apache.org>
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Little over a year ago a discussion thread was opened on changing the
> > > > minimal supported version of Hadoop and bringing that to 2.8.5. [1]
> In
> > > this
> > > > discussion thread, I would like to propose to bring that minimal
> > > supported
> > > > version of Hadoop to 2.10.2.
> > > >
> > > > Hadoop 2.8.5 is vulnerable for multiple CVEs which are classified as
> > > > Critical. [2] [3]. While Flink is not directly impacted by those, we
> do
> > > see
> > > > vulnerability scanners flag Flink as being vulnerable. We could
> easily
> > > > mitigate that by bumping the minimal supported version of Hadoop to
> > > 2.10.2.
> > > >
> > > > I'm looking forward to your opinions on this topic.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > > https://twitter.com/MartijnVisser82
> > > > https://github.com/MartijnVisser
> > > >
> > > > [1] https://lists.apache.org/thread/81fhnwfxomjhyy59f9bbofk9rxpdxjo5
> > > > [2] https://nvd.nist.gov/vuln/detail/CVE-2022-25168
> > > > [3] https://nvd.nist.gov/vuln/detail/CVE-2022-26612
> > > >
> > >
> >
>
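For context on the mechanics of such a bump: in the Flink build, the supported Hadoop version is driven by a Maven property in the parent pom, so the change itself is small. A sketch only — the exact property name `hadoop.version` is an assumption based on the build files linked above:

```xml
<!-- flink/pom.xml (sketch, assuming the property is named hadoop.version):
     bumping the minimal supported Hadoop version is a one-line change. -->
<properties>
  <hadoop.version>2.10.2</hadoop.version>
</properties>
```

CI can then still exercise newer Hadoop lines by overriding the property on the command line, e.g. `mvn clean verify -Dhadoop.version=3.3.4`, which is roughly what the cron build mentioned above does.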


[jira] [Created] (FLINK-29705) Document the least access with RBAC setting for native K8s integration

2022-10-20 Thread Yang Wang (Jira)
Yang Wang created FLINK-29705:
-

 Summary: Document the least access with RBAC setting for native 
K8s integration
 Key: FLINK-29705
 URL: https://issues.apache.org/jira/browse/FLINK-29705
 Project: Flink
  Issue Type: Improvement
  Components: Deployment / Kubernetes, Documentation
Reporter: Yang Wang


We should document the least-privilege RBAC settings[1]. The operator 
docs could be taken as a reference[2].

 

[1]. 
[https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#rbac]

[2]. 
[https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/rbac/]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
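As a starting point for such documentation, a least-privilege Role for the native K8s integration might look roughly like the following. This is a sketch only — the resource and verb lists are assumptions modeled on the operator RBAC docs referenced above, and would need to be verified against what the native integration actually calls:

```yaml
# Sketch: least-privilege Role for Flink's native K8s integration
# (resource and verb lists are illustrative assumptions, not the final set).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-native-k8s
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["create", "get", "list", "watch", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["create", "get", "update", "delete"]
```

The Role would then be bound to the service account that the Flink deployment uses.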


Re: [DISCUSS] Reference operator from Flink Kubernetes deployment docs

2022-10-13 Thread Yang Wang
+1 for increasing the visibility of flink-kubernetes-operator.

Best,
Yang

Thomas Weise  于2022年10月13日周四 07:49写道:

> +1
>
>
> On Wed, Oct 12, 2022 at 5:03 PM Martijn Visser 
> wrote:
>
> > +1 from my end to include the operator in the related Kubernetes sections
> > of the Flink docs
> >
> > On Wed, Oct 12, 2022 at 5:31 PM Chesnay Schepler 
> > wrote:
> >
> > > I don't see a reason for why we shouldn't at least mention the operator
> > > in the kubernetes docs.
> > >
> > > On 12/10/2022 16:25, Gyula Fóra wrote:
> > > > Hi Devs!
> > > >
> > > > I would like to start a discussion about referencing the Flink
> > Kubernetes
> > > > Operator directly from the Flink Kubernetes deployment documentation.
> > > >
> > > > Currently the Flink deployment/resource provider docs provide some
> > > > information for the Standalone and Native Kubernetes integration
> > without
> > > > any reference to the operator.
> > > >
> > > > I think we reached a point with the operator where we should provide
> a
> > > bit
> > > > more visibility and value to the users by directly proposing to use
> the
> > > > operator when considering Flink on Kubernetes. We should definitely
> > keep
> > > > the current docs but make the point that for most users the easiest
> way
> > > to
> > > > use Flink on Kubernetes is probably through the operator (where they
> > can
> > > > now benefit from both standalone and native integration under the
> > hood).
> > > > This should help us avoid cases where a new user completely misses
> the
> > > > existence of the operator when starting out based on the Flink docs.
> > > >
> > > > What do you think?
> > > >
> > > > Gyula
> > > >
> > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Danny Cranmer

2022-10-13 Thread Yang Wang
Congratulations Danny!

Best,
Yang

Hang Ruan  于2022年10月13日周四 10:58写道:

> Congratulations Danny!
>
> Best,
> Hang
>
> Yun Gao  于2022年10月13日周四 10:56写道:
>
> > Congratulations Danny!
> > Best,
> > Yun Gao
> > --
> > From:yuxia 
> > Send Time:2022 Oct. 12 (Wed.) 09:49
> > To:dev 
> > Subject:Re: [ANNOUNCE] New Apache Flink PMC Member - Danny Cranmer
> > Congratulations Danny!
> > Best regards,
> > Yuxia
> > - Original Message -
> > From: "Xingbo Huang" 
> > To: "dev" 
> > Sent: Wednesday, October 12, 2022 9:44:22 AM
> > Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Danny Cranmer
> > Congratulations Danny!
> > Best,
> > Xingbo
> > Sergey Nuyanzin  于2022年10月12日周三 01:26写道:
> > > Congratulations, Danny
> > >
> > > On Tue, Oct 11, 2022, 15:18 Lincoln Lee 
> wrote:
> > >
> > > > Congratulations Danny!
> > > >
> > > > Best,
> > > > Lincoln Lee
> > > >
> > > >
> > > > Congxian Qiu  于2022年10月11日周二 19:42写道:
> > > >
> > > > > Congratulations Danny!
> > > > >
> > > > > Best,
> > > > > Congxian
> > > > >
> > > > >
> > > > > Leonard Xu  于2022年10月11日周二 18:03写道:
> > > > >
> > > > > > Congratulations Danny!
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Leonard
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS][FLINK-29372] Add a suffix to keys that violate YAML spec

2022-09-21 Thread Yang Wang
This will make it possible to replace the current rough implementation[1]
with a standard YAML parser.
That would also let us avoid some unexpected behaviors[2].

+1

[1].
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/GlobalConfiguration.java#L179
[2]. https://issues.apache.org/jira/browse/FLINK-15358

Best,
Yang

Konstantin Knauf  于2022年9月22日周四 04:26写道:

> Makes sense to me. It is moving us in the right direction and makes it
> possible to drop these keys with Flink 2.0, if that ever happens :)
>
> Am Mi., 21. Sept. 2022 um 16:06 Uhr schrieb Chesnay Schepler <
> ches...@apache.org>:
>
> > Hi,
> >
> > we have a small number of options in Flink whose key is a prefix of
> > other keys.
> >
> > This violates the YAML spec: when you view the options as a tree, only
> > leaf nodes may carry values.
> >
> > While this is a minor issue from our side I think this can be quite
> > annoying for users, since it means you can't read or validate a Flink
> > config with standard yaml tools.
> >
> > I'd like to add a suffix to those keys to resolve this particular
> > problem, while still supporting the previous keys (as deprecated).
> >
> > AFAICT there aren't any risks here,
> > except if users have a search step for one of these options in
> > the default config of the Flink distribution;
> > however this seems unsafe in any case since the contents of the default
> > config may change.
> >
> > This would also bring us a step closer to our goal of using a compliant
> > YAML parser.
> >
>
>
> --
> https://twitter.com/snntrable
> https://github.com/knaufk
>
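The conflict can be made concrete by trying to fold the flat dotted keys into a nested YAML-style tree. Using `state.backend` as an assumed example of an affected option (it is both an option and a prefix of `state.backend.incremental`), a small stdlib-only sketch:

```python
def to_tree(flat):
    """Fold flat dotted config keys into a nested YAML-style tree,
    failing when one key is also a prefix of another key."""
    tree = {}
    for key, value in flat.items():
        node = tree
        parts = key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                # e.g. 'state.backend' was already set to a scalar value
                raise ValueError(f"{key!r}: prefix is already a scalar option")
        if isinstance(node.get(parts[-1]), dict):
            raise ValueError(f"{key!r} clashes with nested options under it")
        node[parts[-1]] = value
    return tree

# Non-conflicting keys fold into a well-formed tree:
print(to_tree({"state.backend.incremental": "true"}))

# But an option that is also a prefix of another option cannot be
# represented as a tree, which is what trips up standard YAML tooling:
try:
    to_tree({"state.backend": "rocksdb", "state.backend.incremental": "true"})
except ValueError as e:
    print("conflict:", e)
```

Adding a suffix to the shorter key (a hypothetical `state.backend.type`, say) removes the prefix collision, so the whole config becomes representable as a proper YAML tree.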


Re: Kubernetes Operator 1.2.0 release

2022-09-19 Thread Yang Wang
Thanks Gyula for managing the release.

+1 for the time schedule.


Best,
Yang



Őrhidi Mátyás  于2022年9月19日周一 22:28写道:

> Thanks Gyula!
>
> Sounds good! Happy to help as always.
>
> Cheers,
> Matyas
>
> On Mon, Sep 19, 2022 at 1:37 PM Gyula Fóra  wrote:
>
> > Hi Devs!
> >
> > The originally planned (October 1) release date for 1.2.0 is fast
> > approaching and we are already slightly behind schedule. There are a
> couple
> > outstanding bug tickets with 2 known blockers at the moment that should
> be
> > fixed in the next few days.
> >
> > As we are not aware of any larger critical issues or outstanding
> features I
> > propose the following adjusted release schedule:
> >
> >
> > *Feature Freeze: September 23*
> > *Release Branch Cut & RC1: September 28*
> >
> > Hopefully then we can finalize the release somewhere in the first week of
> > October :)
> >
> > I volunteer as the release manager.
> >
> > Cheers,
> > Gyula
> >
>


Re: Recommended way to Enable SSL Flink Kubernetes Operator

2022-09-14 Thread Yang Wang
I think you have already found the solution.

Pod template[1] is exactly what you want.

[1].
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#pod-template

Best,
Yang

Hao t Chang  于2022年9月13日周二 07:41写道:

> Hi Biao
> I think this modified basic-example FlinkDeployment should load the
> existing keystore, although I am not certain that re-using the webhook
> keystore is recommended.
>
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   name: basic-example
> spec:
>   image: flink:1.15
>   flinkVersion: v1_15
>   flinkConfiguration:
> taskmanager.numberOfTaskSlots: "2"
>   serviceAccount: flink
>   jobManager:
> resource:
>   memory: "2048m"
>   cpu: 1
>   taskManager:
> resource:
>   memory: "2048m"
>   cpu: 1
>   podTemplate:
> apiVersion: v1
> kind: Pod
> metadata:
>   name: pod-template
> spec:
>   containers:
>   - name: flink-main-container
> volumeMounts:
>   - mountPath: /certs
> name: keystore
>   volumes:
>   - name: keystore
> secret:
>   defaultMode: 420
>   items:
>   - key: keystore.p12
> path: keystore.p12
>   secretName: webhook-server-cert
>   job:
> jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
> parallelism: 2
> upgradeMode: stateless
>
> Verify with curl
> curl -v -k https://basic-example-rest:8081
> *   Trying 172.21.126.88:8081...
> * Connected to basic-example-rest (172.21.126.88) port 8081 (#0)
> * ALPN, offering h2
> * ALPN, offering http/1.1
> * successfully set certificate verify locations:
> *  CAfile: /etc/ssl/certs/ca-certificates.crt
> *  CApath: /etc/ssl/certs
> * TLSv1.3 (OUT), TLS handshake, Client hello (1):
> * TLSv1.3 (IN), TLS handshake, Server hello (2):
> * TLSv1.2 (IN), TLS handshake, Certificate (11):
> * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
> * TLSv1.2 (IN), TLS handshake, Server finished (14):
> * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
> * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
> * TLSv1.2 (OUT), TLS handshake, Finished (20):
> * TLSv1.2 (IN), TLS handshake, Finished (20):
> * SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
> * ALPN, server did not agree to a protocol
> * Server certificate:
> *  subject: CN=FlinkDeployment Validator
> *  start date: Sep 12 17:38:37 2022 GMT
> *  expire date: Dec 11 17:38:37 2022 GMT
> *  issuer: CN=FlinkDeployment Validator
> *  SSL certificate verify result: self signed certificate (18), continuing
> anyway.
> > GET / HTTP/1.1
> > Host: basic-example-rest:8081
> > User-Agent: curl/7.74.0
> > Accept: */*
>
> From: Hao t Chang 
> Date: Friday, September 9, 2022 at 11:10 AM
> To: dev@flink.apache.org 
> Subject: [EXTERNAL] Re: Recommended way to Enable SSL Flink Kubernetes
> Operator
> Hi Biao thanks for the quick reply.
> The helm chart uses a standard Deployment to mount the keystore onto the
> webhook container using volumes/volumeMounts for the operator, but it's not
> clear to me how to mount the keystore using the FlinkDeployment CRD[2] for
> a Flink application.
>
>
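One detail worth noting on top of the pod template shown earlier in this thread: mounting the keystore only puts the file into the container — for the REST endpoint to actually serve TLS, the Flink configuration must also point at it. A sketch of the relevant `flinkConfiguration` entries (the password values are placeholders, and the keys assume the standard `security.ssl.rest.*` options):

```yaml
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
    security.ssl.rest.enabled: "true"
    security.ssl.rest.keystore: /certs/keystore.p12
    security.ssl.rest.keystore-type: PKCS12
    security.ssl.rest.keystore-password: "<keystore-password>"
    security.ssl.rest.key-password: "<key-password>"
```

As noted in the thread, re-using the webhook's self-signed certificate can work for testing, but is probably not what you want for production.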


Re: [ANNOUNCE] New Apache Flink PMC Member - Martijn Visser

2022-09-12 Thread Yang Wang
Congrats, Martijn!

Best,
Yang

Lijie Wang  于2022年9月13日周二 10:10写道:

> Congratulations, Martijn!
>
> Best,
> Lijie
>
> yuxia  于2022年9月13日周二 09:52写道:
>
> > Congrats, Martijn!
> >
> > Best regards,
> > Yuxia
> >
> > - Original Message -
> > From: "Steven Wu" 
> > To: "dev" 
> > Sent: Tuesday, September 13, 2022 5:33:47 AM
> > Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Martijn Visser
> >
> > Congrats, Martijn!
> >
> > On Mon, Sep 12, 2022 at 1:49 PM Alexander Fedulov 
> > wrote:
> >
> > > Congrats, Martijn!
> > >
> > > On Mon, Sep 12, 2022 at 10:06 AM Jing Ge  wrote:
> > >
> > > > Congrats!
> > > >
> > > > On Mon, Sep 12, 2022 at 9:38 AM Daisy Tsang 
> > wrote:
> > > >
> > > > > Congrats!
> > > > >
> > > > > On Mon, Sep 12, 2022 at 9:32 AM Martijn Visser <
> > > martijnvis...@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Thank you all :)
> > > > > >
> > > > > > Op zo 11 sep. 2022 om 13:58 schreef Zheng Yu Chen <
> > > jam.gz...@gmail.com
> > > > >:
> > > > > >
> > > > > > > Congratulations, Martijn
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Timo Walther  于2022年9月9日周五 23:08写道:
> > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > I'm very happy to announce that Martijn Visser has joined the
> > > Flink
> > > > > > PMC!
> > > > > > > >
> > > > > > > > Martijn has helped the community in many different ways over
> > the
> > > > past
> > > > > > > > months. Externalizing the connectors from the Flink repo to
> > their
> > > > own
> > > > > > > > repository, continously updating dependencies, and performing
> > > other
> > > > > > > > project-wide refactorings. He is constantly coordinating
> > > > > contributions,
> > > > > > > > connecting stakeholders, finding committers for
> contributions,
> > > > > driving
> > > > > > > > release syncs, and helping in making the ASF a better place
> > (e.g.
> > > > by
> > > > > > > > using Matomo an ASF-compliant tracking solution for all
> > > projects).
> > > > > > > >
> > > > > > > > Congratulations and welcome, Martijn!
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Timo Walther
> > > > > > > > (On behalf of the Apache Flink PMC)
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Best
> > > > > > >
> > > > > > > ConradJam
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Caizhi Weng

2022-09-08 Thread Yang Wang
Congrats Caizhi

Best,
Yang

Jark Wu  于2022年9月9日周五 11:46写道:

> Congrats ChaiZhi!
>
> Cheers,
> Jark
>
> > 2022年9月9日 11:26,Lijie Wang  写道:
> >
> > Congratulations Caizhi
> >
> > Best,
> > Lijie
> >
> > Yuxin Tan  于2022年9月9日周五 10:19写道:
> >
> >> Caizhi, Congratulations!
> >>
> >> Best,
> >> Yuxin
> >>
> >>
> >> Jane Chan  于2022年9月9日周五 10:09写道:
> >>
> >>> Congrats Caizhi
> >>>
> >>> Best,
> >>> Jane
> >>>
> >>> On Fri, Sep 9, 2022 at 9:58 AM Xingbo Huang 
> wrote:
> >>>
>  Congratulations Caizhi
> 
>  Best,
>  Xingbo
> 
>  Jingsong Lee  于2022年9月9日周五 09:41写道:
> 
> > Hi everyone,
> >
> > On behalf of the PMC, I'm very happy to announce Caizhi Weng as a new
> > Flink committer.
> >
> > Caizhi has been contributing to the Flink project for more than 3
> > years, and has authored 150+ PRs. He is one of the key driving
> >> members
> > of the Flink Table Store subproject. He is responsible for the core
> > design of transaction committing. Expanded the Hive ecosystem of
> >> Flink
> > Table Store. He also works in Flink SQL, helps solve the problems of
> > ease of use and performance.
> >
> > Please join me in congratulating Caizhi for becoming a Flink
> >> committer!
> >
> > Best,
> > Jingsong
> >
> 
> >>>
> >>
>
>


Re: [DISCUSS] ARM support for Flink

2022-08-26 Thread Yang Wang
Thanks Bo for starting this discussion.

I think it is really useful to have CI for the ARM platform, but I am not
sure what the current situation is. TBH, I have not built an ARM image for
Flink.

Given that we have not finished the migration from Azure Pipelines to
GitHub Actions, I believe we still need to set up the self-hosted ARM
machines for Azure Pipelines.

Best,
Yang

bo zhaobo  于2022年8月25日周四 09:37写道:

> Hi Flink,
> Any idea about this? ;-)
>
> bo zhaobo  于2022年8月23日周二 19:28写道:
>
> > Hi Flinkers,
> >
> > In 2019, we raised a discussion in Flink about "ARM support for
> > Flink"[1]. We got a lot of help and support from the Flink community
> > about introducing an ARM CI system, named "OpenLab"[2], and finally we
> > created full-stack regression Flink tests on OpenLab ARM resources and
> > posted an email to the Flink mailing list with the test results every
> > day. We've been doing that for almost 2 years.
> >
> > But now, we are sorry to say that OpenLab has reached its EOL. We had to
> > close it last month. So, to keep the existing ARM CI working for the
> > Flink community and to help contributors verify their code on ARM, we
> > have decided that we *will donate some ARM resources (virtual machines)*
> > into the Flink community to make this happen.
> >
> > Also note that the existing Flink CI/CD has been moved to Azure
> > Pipelines and doesn't use GitHub Actions.
> >
> > What we can provide are *only* ARM resources (virtual machines), so we
> > think *the Flink community is the right party to decide how to use
> > them.* We only give several suggestions here:
> > 1. GitHub Actions self-hosted machines (integrated with our ARM resources)
> > 2. Azure Pipelines self-hosted machines (integrated with our ARM resources)
> > 3. Any ideas from Flinkers?
> >
> > If the community accepts our ARM resources and wants to integrate them
> > with the existing CI/CD in any way, please feel free to ping me about
> > the quota (CPU count, memory size, and so on) of the VMs we should donate.
> >
> > Thank you very much.
> >
> > BR
> >
> > Bo Zhao
> >
> > [1] https://www.mail-archive.com/dev@flink.apache.org/msg27054.html
> > [2] https://openlabtesting.org/
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Junhan Yang

2022-08-22 Thread Yang Wang
Congrats Junhan!


Best,
Yang

Matthias Pohl  于2022年8月22日周一 15:35写道:

> Congratulations & welcome! :-)
>
> Matthias
>
> On Sun, Aug 21, 2022 at 5:42 AM Yuan Mei  wrote:
>
> > Congratulations Junhan!
> >
> > Best,
> > Yuan
> >
> > On Sat, Aug 20, 2022 at 2:11 PM Danny Cranmer 
> > wrote:
> >
> > > Congratulations Junhan! Welcome to the team.
> > >
> > > On Sat, 20 Aug 2022, 03:01 yuxia,  wrote:
> > >
> > > > Congratulations, Junhan!
> > > >
> > > > Best regards,
> > > > Yuxia
> > > >
> > > > - Original Message -
> > > > From: "Aitozi" 
> > > > To: "dev" 
> > > > Sent: Saturday, August 20, 2022 12:18:29 AM
> > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Junhan Yang
> > > >
> > > > Congratulations, Junhan!
> > > > Best,
> > > > Aitozi
> > > >
> > > > Guowei Ma  于2022年8月19日周五 13:18写道:
> > > >
> > > > > Congratulations, Junhan!
> > > > > Best,
> > > > > Guowei
> > > > >
> > > > >
> > > > > On Fri, Aug 19, 2022 at 6:01 AM Jing Ge 
> wrote:
> > > > >
> > > > > > Congrats Junhan!
> > > > > >
> > > > > > Best regards,
> > > > > > Jing
> > > > > >
> > > > > > On Thu, Aug 18, 2022 at 12:05 PM Jark Wu 
> wrote:
> > > > > >
> > > > > > > Congrats and welcome Junhan!
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jark
> > > > > > >
> > > > > > > > 2022年8月18日 17:59,Timo Walther  写道:
> > > > > > > >
> > > > > > > > Congratulations and welcome to the committer team :-)
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Timo
> > > > > > > >
> > > > > > > > On 18.08.22 07:19, Lijie Wang wrote:
> > > > > > > >> Congratulations, Junhan!
> > > > > > > >> Best,
> > > > > > > >> Lijie
> > > > > > > >> Leonard Xu  于2022年8月18日周四 11:31写道:
> > > > > > > >>> Congratulations, Junhan!
> > > > > > > >>>
> > > > > > > >>> Best,
> > > > > > > >>>
> > > > > > >  2022年8月18日 上午11:27,Zhipeng Zhang  >
> > > 写道:
> > > > > > > 
> > > > > > >  Congratulations, Junhan!
> > > > > > > 
> > > > > > >  Xintong Song  于2022年8月18日周四 11:21写道:
> > > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > On behalf of the PMC, I'm very happy to announce Junhan
> > Yang
> > > > as a
> > > > > > new
> > > > > > > >>> Flink
> > > > > > > > committer.
> > > > > > > >
> > > > > > > > Junhan has been contributing to the Flink project for
> more
> > > > than 1
> > > > > > > year.
> > > > > > > >>> His
> > > > > > > > contributions are mostly identified in the web frontend,
> > > > > including
> > > > > > > > FLIP-241, FLIP-249 and various maintenance efforts of
> > Flink's
> > > > > > > frontend
> > > > > > > > frameworks.
> > > > > > > >
> > > > > > > > Please join me in congratulating Junhan for becoming a
> > Flink
> > > > > > > committer!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Xintong
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > >  --
> > > > > > >  best,
> > > > > > >  Zhipeng
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [ANNOUNCE] New Apache Flink Committer - Lijie Wang

2022-08-22 Thread Yang Wang
Congrats Lijie!

Best,
Yang

Matthias Pohl  于2022年8月22日周一 15:34写道:

> Congrats & welcome to the team! :-)
>
> Matthias
>
> On Sun, Aug 21, 2022 at 5:41 AM Yuan Mei  wrote:
>
> > Congratulations, Lijie!
> >
> > Best,
> > Yuan
> >
> > On Sat, Aug 20, 2022 at 2:12 PM Danny Cranmer 
> > wrote:
> >
> > > Congratulations Lijie! Welcome to the team.
> > >
> > > On Sat, 20 Aug 2022, 03:25 Yun Tang,  wrote:
> > >
> > > > Congratulations, Lijie!
> > > >
> > > >
> > > > Best
> > > > Yun Tang
> > > > 
> > > > From: Geng Biao 
> > > > Sent: Saturday, August 20, 2022 10:03
> > > > To: dev@flink.apache.org 
> > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Lijie Wang
> > > >
> > > > Congratulations  Lijie!
> > > > Best,
> > > > Biao Geng
> > > >
> > > > Get Outlook for iOS
> > > > 
> > > > From: yuxia 
> > > > Sent: Saturday, August 20, 2022 9:54:29 AM
> > > > To: dev 
> > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Lijie Wang
> > > >
> > > > Congrats Lijie!
> > > >
> > > > Best regards,
> > > > Yuxia
> > > >
> > > > - Original Message -
> > > > From: "Aitozi" 
> > > > To: "dev" 
> > > > Sent: Saturday, August 20, 2022 12:19:27 AM
> > > > Subject: Re: [ANNOUNCE] New Apache Flink Committer - Lijie Wang
> > > >
> > > > Congrats Lijie!
> > > >
> > > > Best regards,
> > > > Aitozi
> > > >
> > > > Jing Ge  于2022年8月19日周五 06:02写道:
> > > >
> > > > > Congrats Lijie!
> > > > >
> > > > > Best regards,
> > > > > Jing
> > > > >
> > > > > On Thu, Aug 18, 2022 at 8:40 AM Terry Wang 
> > wrote:
> > > > >
> > > > > > Congratulations, Lijie!
> > > > > >
> > > > > > On Thu, Aug 18, 2022 at 11:31 AM Leonard Xu 
> > > wrote:
> > > > > >
> > > > > > > Congratulations, Lijie!
> > > > > > >
> > > > > > > Best,
> > > > > > > Leonard
> > > > > > >
> > > > > > > > 2022年8月18日 上午11:26,Zhipeng Zhang 
> 写道:
> > > > > > > >
> > > > > > > > Congratulations, Lijie!
> > > > > > > >
> > > > > > > > Xintong Song  于2022年8月18日周四 11:23写道:
> > > > > > > >>
> > > > > > > >> Congratulations Lijie, and welcome~!
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >>
> > > > > > > >> Xintong
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Thu, Aug 18, 2022 at 11:12 AM Xingbo Huang <
> > > hxbks...@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >>> Congrats, Lijie
> > > > > > > >>>
> > > > > > > >>> Best,
> > > > > > > >>> Xingbo
> > > > > > > >>>
> > > > > > > >>> Lincoln Lee  于2022年8月18日周四
> 11:01写道:
> > > > > > > >>>
> > > > > > >  Congratulations, Lijie!
> > > > > > > 
> > > > > > >  Best,
> > > > > > >  Lincoln Lee
> > > > > > > 
> > > > > > > 
> > > > > > >  Benchao Li  于2022年8月18日周四 10:51写道:
> > > > > > > 
> > > > > > > > Congratulations Lijie!
> > > > > > > >
> > > > > > > > yanfei lei  于2022年8月18日周四 10:44写道:
> > > > > > > >
> > > > > > > >> Congratulations, Lijie!
> > > > > > > >>
> > > > > > > >> Best,
> > > > > > > >> Yanfei
> > > > > > > >>
> > > > > > > >> JunRui Lee  于2022年8月18日周四 10:35写道:
> > > > > > > >>
> > > > > > > >>> Congratulations, Lijie!
> > > > > > > >>>
> > > > > > > >>> Best,
> > > > > > > >>> JunRui
> > > > > > > >>>
> > > > > > > >>> Timo Walther  于2022年8月17日周三
> 19:30写道:
> > > > > > > >>>
> > > > > > >  Congratulations and welcome to the committer team :-)
> > > > > > > 
> > > > > > >  Regards,
> > > > > > >  Timo
> > > > > > > 
> > > > > > > 
> > > > > > >  On 17.08.22 12:50, Yuxin Tan wrote:
> > > > > > > > Congratulations, Lijie!
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Yuxin
> > > > > > > >
> > > > > > > >
> > > > > > > > Guowei Ma  于2022年8月17日周三
> > 18:42写道:
> > > > > > > >
> > > > > > > >> Congratulations, Lijie. Welcome on board~!
> > > > > > > >> Best,
> > > > > > > >> Guowei
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Wed, Aug 17, 2022 at 6:25 PM Zhu Zhu <
> > > zh...@apache.org
> > > > >
> > > > > > >  wrote:
> > > > > > > >>
> > > > > > > >>> Hi everyone,
> > > > > > > >>>
> > > > > > > >>> On behalf of the PMC, I'm very happy to announce
> > Lijie
> > > > Wang
> > > > > > > >>> as
> > > > > > > >>> a new Flink committer.
> > > > > > > >>>
> > > > > > > >>> Lijie has been contributing to Flink project for
> more
> > > > than
> > > > > 2
> > > > > > > > years.
> > > > > > > >>> He mainly works on the runtime/coordination part,
> > doing
> > > > > > > >>> feature
> > > > > > > >>> development, problem debugging and code reviews. He
> > has
> > > > > also
> > > > > > > >>> driven the work of FLIP-187(Adaptive Batch
> Scheduler)
> > > and
> > > > > > > >>> FLIP-224(Blocklist for Speculative Execution),
> 

Re: [VOTE] FLIP-250: Support Customized Kubernetes Schedulers Proposal

2022-07-24 Thread Yang Wang
+1 (binding)

Best,
Yang

bo zhaobo  于2022年7月25日周一 09:38写道:

> Hi all,
>
> Thank you very much for all feedback after the discussion in [2][3].
> Now I'd like to proceed with the vote for FLIP-250 [1], as no more
> objections
> or issues were raised in ML thread [2][3].
>
> The vote will be opened until July 28th earliest(at least 72 hours) unless
> there is an objection or
> insufficient votes.
>
> Thank you all.
>
> BR
>
> Bo Zhao
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-250%3A+Support+Customized+Kubernetes+Schedulers+Proposal
> [2] https://lists.apache.org/thread/pf8dvbvqf845wh0x63z68jmhh4pvsbow
> [3] https://lists.apache.org/thread/zbylkkc6jojrqwds7tt02k2t8nw62h26
>


Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.1.0 released

2022-07-24 Thread Yang Wang
Congrats! Thanks Gyula for driving this release, and thanks to all
contributors!


Best,
Yang

Gyula Fóra  于2022年7月25日周一 10:44写道:

> The Apache Flink community is very happy to announce the release of Apache
> Flink Kubernetes Operator 1.1.0.
>
> The Flink Kubernetes Operator allows users to manage their Apache Flink
> applications and their lifecycle through native k8s tooling like kubectl.
>
> Please check out the release blog post for an overview of the release:
>
> https://flink.apache.org/news/2022/07/25/release-kubernetes-operator-1.1.0.html
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Maven artifacts for Flink Kubernetes Operator can be found at:
>
> https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator
>
> Official Docker image for the Flink Kubernetes Operator can be found at:
> https://hub.docker.com/r/apache/flink-kubernetes-operator
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351723
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Regards,
> Gyula Fora
>


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.1.0, release candidate #1

2022-07-24 Thread Yang Wang
+1 (binding)

Successfully verified the following:

- Verify that the checksums and GPG files

- Verify that the source distributions do not contain any binaries

- Build binary and image from release source

- Verify the NOTICE and licenses in source release and the docker image

- Verify the helm chart values with correct appVersion and image tag

- Operator functionality manual testing

- Start a Flink Application job(both streaming and batch) with 1.15

- Verify the FlinkUI could be accessed via ingress

- No strange operator logs




Best,

Yang

Thomas Weise  于2022年7月24日周日 08:02写道:

> +1 (binding)
>
> * built from source archive
> * run examples
>
> Thanks,
> Thomas
>
> On Wed, Jul 20, 2022 at 5:48 AM Gyula Fóra  wrote:
> >
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version 1.1.0
> of
> > Apache Flink Kubernetes Operator,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Kubernetes Operator canonical source distribution (including the
> > Dockerfile), to be deployed to the release repository at dist.apache.org
> > b) Kubernetes Operator Helm Chart to be deployed to the release
> repository
> > at dist.apache.org
> > c) Maven artifacts to be deployed to the Maven Central Repository
> > d) Docker image to be pushed to dockerhub
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as
> follows,
> > for your review:
> > * All artifacts for a,b) can be found in the corresponding dev repository
> > at dist.apache.org [1]
> > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > * The docker image for d) is staged on github [3]
> >
> > All artifacts are signed with the key
> > 0B4A34ADDFFA2BB54EB720B221F06303B87DAFF1 [4]
> >
> > Other links for your review:
> > * JIRA release notes [5]
> > * source code tag "release-1.1.0-rc1" [6]
> > * PR to update the website Downloads page to include Kubernetes Operator
> > links [7]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours.
> > It is adopted by majority approval, with at least 3 PMC affirmative
> votes.
> >
> > **Note on Verification**
> >
> > You can follow the basic verification guide here[8].
> > Note that you don't need to verify everything yourself, but please make
> > note of what you have tested together with your +- vote.
> >
> > Thanks,
> > Gyula Fora
> >
> > [1]
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.1.0-rc1/
> > [2]
> https://repository.apache.org/content/repositories/orgapacheflink-1518/
> > [3] ghcr.io/apache/flink-kubernetes-operator:c9dec3f
> > [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [5]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351723
> > [6]
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.1.0-rc1
> > [7] https://github.com/apache/flink-web/pull/560
> > [8]
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
>


Re: [DISCUSS] FLIP-250: Support Customized Kubernetes Schedulers Proposal

2022-07-14 Thread Yang Wang
I think we could go over the customized scheduler plugin mechanism again
with YuniKorn in mind to make sure that it is common enough.
But the implementation could be deferred.

And maybe we could also ping Yikun Jiang, who has done similar things in
Spark.

For the e2e tests, I admit that they could be improved, but I am not sure
whether we really need the Java implementation instead.
This is out of the scope of this FLIP, so let's keep the discussion
under FLINK-20392.


Best,
Yang
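(A concrete picture of the Volcano setup discussed in this thread: even without a dedicated plugin mechanism, pods can already be pointed at a custom scheduler via the existing `kubernetes.pod-template-file` option. A sketch only — the PodGroup annotation name is Volcano-specific and given here as an assumption:)

```yaml
# pod-template.yaml, referenced from the Flink config via
#   kubernetes.pod-template-file: /path/to/pod-template.yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    # group all of the job's pods into one Volcano PodGroup so they are
    # scheduled gang-style (all or nothing), avoiding the resource
    # deadlock described later in this thread
    scheduling.k8s.io/group-name: my-flink-job
spec:
  schedulerName: volcano   # hand scheduling over to Volcano
```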

Martijn Visser  于2022年7月14日周四 15:28写道:

> Hi Bo,
>
> Thanks for the info! I think I see that you've already updated the FLIP to
> reflect how customized schedulers are beneficial for both batch and
> streaming jobs.
>
> The reason why I'm not too happy that we would only create a reference
> implementation for Volcano is that we don't know if the generic support for
> customized scheduler plugins will also work for others. We think it will,
> but since there would be no other implementation available, we are not
> sure. My concern is that when someone tries to add support for another
> scheduler, we notice that we actually made a mistake or should improve the
> generic support.
>
> Best regards,
>
> Martijn
>
>
>
> Op do 14 jul. 2022 om 05:30 schreef bo zhaobo  >:
>
> > Hi Martijn,
> >
> > Thank you for your comments. I will answer the questions one by one.
> >
> > ""
> > * Regarding the motivation, it mentions that the development trend is
> that
> > Flink supports both batch and stream processing. I think the vision and
> > trend is that we have unified batch- and stream processing. What I'm
> > missing is the vision on what's the impact for customized Kubernetes
> > schedulers on stream processing. Could there be some elaboration on that?
> > ""
> >
> > >>
> >
> > We very much agree with you on the trend that Flink supports both
> > batch and stream processing. In fact, using a customized K8s scheduler
> > is beneficial for streaming scenarios too, for example to avoid resource
> > deadlocks: if the remaining resources in the K8s cluster are only enough
> > for one job but we submit two, the default K8s scheduler lets both jobs
> > request resources at the same time and both hang, whereas a customized
> > scheduler such as Volcano will not schedule overcommitted pods when the
> > idle resources cannot fit all of a job's pods. So the benefits mentioned
> > in the FLIP are not only for batch jobs; all four scheduling
> > capabilities mentioned in the FLIP are also required for stream
> > processing. YARN has some of these scheduling features too, such as
> > priority scheduling and min/max resource constraints.
> >
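The resource deadlock described above can be modeled in a few lines. The following is a toy Python illustration of the difference between per-pod admission and all-or-nothing (gang) admission; it is not Volcano's actual API or algorithm, and the job sizes and cluster capacity are invented.

```python
def round_robin(jobs, capacity):
    """Default-scheduler model: pods are granted one at a time per job."""
    free = capacity
    held = {name: 0 for name, _ in jobs}
    changed = True
    while changed and free > 0:
        changed = False
        for name, need in jobs:
            if held[name] < need and free > 0:
                held[name] += 1
                free -= 1
                changed = True
    # A job with only part of its pods holds resources but cannot run.
    return {name: "RUNNING" if held[name] == need else "STUCK"
            for name, need in jobs}

def gang(jobs, capacity):
    """Gang model: a job is admitted only if all of its pods fit at once."""
    free, result = capacity, {}
    for name, need in jobs:
        if need <= free:
            free -= need
            result[name] = "RUNNING"
        else:
            result[name] = "PENDING"
    return result

jobs = [("job-a", 6), ("job-b", 6)]            # cluster only fits one job
print(round_robin(jobs, capacity=8))           # both partially scheduled, both stuck
print(gang(jobs, capacity=8))                  # one runs, the other waits
```

With per-pod admission both jobs end up holding half the cluster and neither can start; with gang admission one job runs to completion and frees its resources for the other.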
> > ""
> > * While the FLIP talks about customized schedulers, it focuses on
> Volcano.
> > Why is the choice made to only focus on Volcano and not on other
> schedulers
> > like Apache YuniKorn? Can we not also provide an implementation for
> > YuniKorn at the same time? Spark did the same with SPARK-36057 [1]
> > ""
> >
> > >>
> >
> > Let me make this more clear. The FLIP consists of two parts:
> > 1. Introducing a customized scheduler plugin mechanism into Flink's K8s
> > support. This part is a general mechanism.
> > 2. Introducing ONE reference implementation of a customized scheduler;
> > Volcano is just one of them, and if people are interested in other
> > schedulers, their integration can also be completed easily.
> >
> > ""
> > * We still have quite a lot of tech debt on testing for Kubernetes [2]. I
> > think that this FLIP would be a great improvement for Flink, but I am
> > worried that we will add more tech debt to those tests. Can we somehow
> > improve this situation?
> > ""
> >
> > >>
> >
> > Yeah, we will pay attention to the test problems related to Flink on
> > K8s, and we are happy to improve them. ;-)
> >
> > BR,
> >
> > Bo Zhao
> >
> > Martijn Visser  于2022年7月13日周三 15:19写道:
> >
> > > Hi all,
> > >
> > > Thanks for the FLIP. I have a couple of remarks/questions:
> > >
> > > * Regarding the motivation, it mentions that the development trend is
> > that
> > > Flink supports both batch and stream processing. I think the vision and
> > > trend is that we have unified batch- and stream processing. What I'm
> > > missing is the vision on what's the impact for customized Kubernetes
> > > schedulers on stream processing. Could there be some elaboration on
> that?
> > > * While the FLIP talks about customized schedulers, it focuses on
> > Volcano.
> > > Why is the choice made to only focus on Volcano and not on other
> > schedulers
> > > like Apache YuniKorn? Can we not also provide an implementation for
> > > YuniKorn at the same time? Spark did the same with SPARK-36057 [1]
> > > * We still have quite a lot of tech debt on testing for Kubernetes
> [2]. I
> > > think that this FLIP would be a great improvement for Flink, but I am
> > > worried that we will add more tech debt to those tests. Can we somehow
> > > improve this situation?

Re: [DISCUSS] Add new JobStatus fields to Flink Kubernetes Operator CRD

2022-07-12 Thread Yang Wang
Thanks for the explanation. Having only 1 API call in most cases makes
sense to me.

Could you please elaborate more on why we need the *plan* in the CR
status?


Best,
Yang
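For context on the "1 API call" point: the jobs overview endpoint Matyas references in his reply below already carries the end-time and duration fields. The sketch below parses a sample payload; the field names follow the Flink REST API docs linked in this thread, but the values are invented for illustration.

```python
# Sample shape of one /jobs/overview entry (values invented).
sample_overview = {
    "jobs": [
        {"jid": "abc123", "name": "demo", "state": "FINISHED",
         "start-time": 1657000000000, "end-time": 1657000600000,
         "duration": 600000},
    ]
}

def job_summaries(overview):
    # One listing call per reconcile loop yields endTime and duration
    # for every job, with no per-job /jobs/:jobid follow-up needed.
    return {j["jid"]: {"endTime": j["end-time"], "duration": j["duration"]}
            for j in overview["jobs"]}

print(job_summaries(sample_overview))
```

The jobPlan, by contrast, is only available from the per-job detail endpoint, which is what makes it the expensive part of the proposal.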

Gyula Fóra  于2022年7月12日周二 17:36写道:

> Hi Devs!
>
> I discussed with Daren offline, and I agree with him that technically we
> almost never need 2 API calls.
>
> I think it's fine to have a second API call once, directly after application
> submission (technically even this can be eliminated by always setting a
> fixed job id).
>
> +1 from me.
>
> Cheers,
> Gyula
>
>
> On Tue, Jul 12, 2022 at 11:32 AM WONG, DAREN  >
> wrote:
>
> > Hi Matyas,
> >
> > Thanks for the feedback, and yes I agree. An alternative approach would
> > instead be:
> >
> > - 2 API calls only when jobID is not available (i.e when submitting a new
> > application cluster, which is a one-off event).
> > - 1 API call when jobID is already available by directly calling
> > "/jobs/:jobid".
> >
> > With this approach, we can keep the API call to 1 in most cases.
> >
> > Regards,
> > Daren
> >
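The conditional flow Daren describes above can be sketched as follows. The two fetchers here are stubs standing in for the real REST calls (`/jobs/overview` and `/jobs/:jobid`); this is an illustration of the call-count logic, not operator code.

```python
def get_job_status(job_id, fetch_detail, fetch_overview):
    if job_id is not None:
        return fetch_detail(job_id)        # common case: 1 API call
    # one-off case (new application cluster, id unknown): 2 API calls
    jobs = fetch_overview()["jobs"]
    return fetch_detail(jobs[0]["jid"])

# Stub fetchers with invented payloads.
detail = lambda jid: {"jid": jid, "state": "RUNNING"}
overview = lambda: {"jobs": [{"jid": "abc"}]}

print(get_job_status("abc", detail, overview))   # direct detail lookup
print(get_job_status(None, detail, overview))    # list first, then detail
```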
> >
> > On 11/07/2022, 14:44, "Őrhidi Mátyás"  wrote:
> >
> > CAUTION: This email originated from outside of the organization. Do
> > not click links or open attachments unless you can confirm the sender and
> > know the content is safe.
> >
> >
> >
> > Hi Daren,
> >
> > At the moment the Operator fetches the job state via
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-overview
> > which contains the 'end-time' and 'duration' fields already. I feel
> > calling
> > the
> >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid
> > after the previous call for every job in every reconcile loop would
> be
> > too
> > expensive.
> >
> > Best,
> > Matyas
> >
> > On Mon, Jul 11, 2022 at 3:17 PM WONG, DAREN
> > 
> > wrote:
> >
> > > Hi everyone, I am Daren from AWS Kinesis Data Analytics (KDA) team.
> > I had
> > > a quick chat with Gyula as I propose to include a few additional
> > fields in
> > > the jobStatus CRD for Flink Kubernetes Operator such as:
> > >
> > > - endTime
> > > - duration
> > > - jobPlan
> > >
> > > Further details of each field can be found here<
> > >
> > https://github.com/darenwkt/flink/blob/release-1.15.0/flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/job/JobDetailsInfo.java
> > > >.
> > > Although the addition of these 3 fields stems from an internal
> > > requirement, I think they would be beneficial to others who use these
> > > fields in their applications as well. The list of fields above is not
> > > exhaustive, so do let me know if there are other fields that you would
> > > like to include together in this iteration cycle.
> > >
> > > JIRA: https://issues.apache.org/jira/browse/FLINK-28494
> > >
> >
> >
>


Re: [DISCUSS] Add new JobStatus fields to Flink Kubernetes Operator CRD

2022-07-11 Thread Yang Wang
I share Matyas's concern if we list the jobs first and then follow up with
get-job-detail requests.
It is a bit expensive, and I do not see the benefit of storing the jobPlan
in the CR status.

Best,
Yang


Őrhidi Mátyás  于2022年7月11日周一 21:43写道:

> Hi Daren,
>
> At the moment the Operator fetches the job state via
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-overview
> which contains the 'end-time' and 'duration' fields already. I feel calling
> the
>
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jobs-jobid
> after the previous call for every job in every reconcile loop would be too
> expensive.
>
> Best,
> Matyas
>
> On Mon, Jul 11, 2022 at 3:17 PM WONG, DAREN  >
> wrote:
>
> > Hi everyone, I am Daren from AWS Kinesis Data Analytics (KDA) team. I had
> > a quick chat with Gyula as I propose to include a few additional fields
> in
> > the jobStatus CRD for Flink Kubernetes Operator such as:
> >
> > - endTime
> > - duration
> > - jobPlan
> >
> > Further details of each field can be found here<
> >
> https://github.com/darenwkt/flink/blob/release-1.15.0/flink-runtime/src/main/java/org/apache/flink/runtime/rest/messages/job/JobDetailsInfo.java
> > >.
> > Although the addition of these 3 fields stems from an internal
> > requirement, I think they would be beneficial to others who use these
> > fields in their applications as well. The list of fields above is not
> > exhaustive, so do let me know if there are other fields that you would
> > like to include together in this iteration cycle.
> >
> > JIRA: https://issues.apache.org/jira/browse/FLINK-28494
> >
>


[jira] [Created] (FLINK-28481) Bump the fabric8 kubernetes-client to 5.12.3

2022-07-10 Thread Yang Wang (Jira)
Yang Wang created FLINK-28481:
-

 Summary: Bump the fabric8 kubernetes-client to 5.12.3
 Key: FLINK-28481
 URL: https://issues.apache.org/jira/browse/FLINK-28481
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Reporter: Yang Wang


The current fabric8 kubernetes-client (5.5.0) swallows the 
{{KubernetesClientException}}, and then the next renewal cannot work 
properly until the deadline is reached. This is a serious problem because a 
one-time failure to renew the leader annotation will cause the leadership to 
be lost.

 

Refer to the following ticket for more information.

https://github.com/fabric8io/kubernetes-client/issues/4246
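To make the failure mode concrete, here is an illustrative sketch in plain Python (not fabric8 code): if a single failed renewal is silently swallowed, the leader annotation is never refreshed again and leadership is lost when the lease deadline passes, whereas retrying the renewal survives a one-time transient error.

```python
import itertools

def renew_with_retry(renew, attempts=3):
    # Retry instead of swallowing: surface the error only after all
    # attempts fail, so one transient failure does not cost leadership.
    last_err = None
    for _ in range(attempts):
        try:
            return renew()
        except RuntimeError as err:  # stands in for KubernetesClientException
            last_err = err
    raise last_err

_calls = itertools.count()

def flaky_renew():
    # Fails exactly once (a one-time renewal failure), then succeeds.
    if next(_calls) == 0:
        raise RuntimeError("transient apiserver error")
    return "renewed"

print(renew_with_retry(flaky_renew))
```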



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Python Job Support for the Kubernetes Operator

2022-07-06 Thread Yang Wang
I think we could make SQL or Python job submission more convenient in the
future by introducing a *"pyScriptURL"* or *"SQLScriptURI"* field similar
to *"jarURI"*.


Best,
Yang

Thomas Weise  于2022年7月7日周四 01:06写道:

> Since SQL or Python are essentially just examples of how to use the
> operator vs. features of the operator itself, they should not affect
> the release schedule and can be added anytime, as examples to the
> operator or elsewhere.
>
> Thanks,
> Thomas
>
> On Wed, Jul 6, 2022 at 8:33 AM Gyula Fóra  wrote:
> >
> > Hi All!
> >
> > One thing we could do already now is to add a simple example on how to
> > execute Python jobs like java jobs (with the right main class, args etc).
> >
> > It would be similar to
> >
> https://github.com/apache/flink-kubernetes-operator/tree/main/examples/flink-sql-runner-example
> > but
> > slightly simpler as we don't need a maven module most likely.
> >
> > Unfortunately I cannot do it myself as @Geng Biao 
> pointed
> > out that Flink Python on M1 macbook is unsupported so cannot really test
> > this locally.
> >
> > Cheers,
> > Gyula
> >
> > On Wed, Jul 6, 2022 at 4:56 AM Dian Fu  wrote:
> >
> > > Thanks for the confirmation Matyas!
> > >
> > > On Tue, Jul 5, 2022 at 3:00 PM Őrhidi Mátyás 
> > > wrote:
> > >
> > > > Yes, this is the plan Dian. Appreciate your assistance!
> > > >
> > > > Best,
> > > > Matyas
> > > >
> > > > On Tue, Jul 5, 2022 at 8:55 AM Dian Fu 
> wrote:
> > > >
> > > >> Hi Matyas,
> > > >>
> > > >> According to the release schedule defined in [1], it seems that the
> > > >> feature freeze of v1.2 may occur at the beginning of September, is
> this
> > > >> correct? If this is the case, I think it should be reasonable to
> make
> > > it in
> > > >> v1.2 for Python support.
> > > >>
> > > >> Regards,
> > > >> Dian
> > > >>
> > > >> [1]
> > > >>
> > >
> https://cwiki.apache.org/confluence/display/FLINK/Release+Schedule+and+Planning
> > > >>
> > > >> On Tue, Jul 5, 2022 at 2:10 PM Őrhidi Mátyás <
> matyas.orh...@gmail.com>
> > > >> wrote:
> > > >>
> > > >>> Both sql and py support is requested frequently. I guess we should
> aim
> > > >>> to support both in v1.2.
> > > >>>
> > > >>> Matyas
> > > >>>
> > > >>> On Tue, Jul 5, 2022 at 6:26 AM Gyula Fóra 
> > > wrote:
> > > >>>
> > > >>>> Thank you for the info and help Dian :)
> > > >>>>
> > > >>>> Gyula
> > > >>>>
> > > >>>> On Tue, 5 Jul 2022 at 05:14, Yang Wang 
> wrote:
> > > >>>>
> > > >>>> > Thanks Dian for the confirmation and nice help.
> > > >>>> >
> > > >>>> > Best,
> > > >>>> > Yang
> > > >>>> >
> > > >>>> > Dian Fu  于2022年7月5日周二 09:27写道:
> > > >>>> >
> > > >>>> > > @Yang, Yes, you are right. Python jobs could be seen as
> special
> > > JAR
> > > >>>> jobs
> > > >>>> > > whose main class is always
> > > >>>> `org.apache.flink.client.python.PythonDriver`.
> > > >>>> > > What we could do in Flink K8s operator is to make it more
> > > >>>> convenient and
> > > >>>> > > handle properly for the different kinds of dependencies[1].
> > > >>>> > >
> > > >>>> > > @Gyula, I can help on this. I will find some time to
> investigate
> > > >>>> this in
> > > >>>> > > the following days and will let you know when there is any
> > > progress.
> > > >>>> > >
> > > >>>> > > Regards,
> > > >>>> > > Dian
> > > >>>> > >
> > > >>>> > > [1]
> > > >>>> > >
> > > >>>> >
> > > >>>>
> > >
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/dependency_management/
> > > >>

Re: [DISCUSS] FLIP-250: Support Customized Kubernetes Schedulers Proposal

2022-07-06 Thread Yang Wang
Thanks zhaobo for starting the discussion and preparing the FLIP.

The customized Kubernetes scheduler support will be very helpful for users
who still hesitate to migrate their Flink workloads from YARN to
Kubernetes.
By leveraging the abilities of a customized K8s scheduler, many advanced
scheduling features (e.g. priority scheduling, dynamic resource sharing,
etc.) could be
introduced to make streaming/batch jobs run more smoothly in a shared
K8s cluster.

Just a reminder that the flink-kubernetes-volcano-*.jar will be an optional
jar located in $FLINK_HOME/opt. Users who want to try the Volcano
scheduler need
to copy it to the plugins directory.


Best,
Yang

bo zhaobo  于2022年7月7日周四 09:16写道:

> Hi, all.
>
> I would like to raise a discussion on the Flink dev ML about supporting
> customized Kubernetes schedulers.
> Currently, Kubernetes is becoming more and more popular for Flink cluster
> deployment, and its abilities keep growing; in particular, it supports
> customized scheduling.
> Essentially, high-performance workloads need new scheduling policies to
> meet new requirements, yet the Flink native Kubernetes solution uses the
> Kubernetes default scheduler for all scenarios, and the default scheduling
> policy can be difficult to apply in some extreme cases. So
> we need to improve Flink on Kubernetes by coupling those customized
> Kubernetes schedulers with Flink native Kubernetes, providing a way for
> Flink
> administrators or users to run their Flink clusters on Kubernetes
> more flexibly.
>
> The proposal introduces a customized K8s scheduler plugin mechanism
> and a reference implementation, 'Volcano', in Flink. More details in [1].
>
> Looking forward to your feedback.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-250%3A+Support+Customized+Kubernetes+Schedulers+Proposal
>
> Thanks,
> BR
>


Re: Python Job Support for the Kubernetes Operator

2022-07-04 Thread Yang Wang
Thanks Dian for the confirmation and nice help.

Best,
Yang

Dian Fu  于2022年7月5日周二 09:27写道:

> @Yang, Yes, you are right. Python jobs could be seen as special JAR jobs
> whose main class is always `org.apache.flink.client.python.PythonDriver`.
> What we could do in Flink K8s operator is to make it more convenient and
> handle properly for the different kinds of dependencies[1].
>
> @Gyula, I can help on this. I will find some time to investigate this in
> the following days and will let you know when there is any progress.
>
> Regards,
> Dian
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/dependency_management/
>
> On Mon, Jul 4, 2022 at 11:52 AM Yang Wang  wrote:
>
>> AFAIK, the Python job could be considered a special case of a jar job.
>> The user jar is flink-python-*.jar, located in the opt directory.
>> The Python script is just an argument of this user jar. So I believe
>> users can already submit Python jobs via the Flink Kubernetes operator.
>> However, they need some manual operations, including specifying the
>> user jar, downloading the Python script via an init container, etc.
>>
>> What we could do in the Flink Kubernetes operator is to make the
>> submission more convenient by introducing a new field (e.g. pyScript).
>>
>> cc @Dian Fu   @biaoge...@gmail.com
>>  WDYT?
>>
>> Best,
>> Yang
>>
>> Gyula Fóra  于2022年7月4日周一 00:12写道:
>>
>>> Hi Devs!
>>>
>>> Would anyone with a good understanding of the Python execution layer be
>>> interested in working on adding Python job support for the Flink
>>> Kubernetes
>>> Operator?
>>>
>>> This is a feature request that comes up often (
>>> https://issues.apache.org/jira/browse/FLINK-28364) and it would be a
>>> great
>>> way to fill some missing feature gaps on the operator :)
>>>
>>> I am of course happy to help or work together with someone on this but I
>>> have zero experience with the Python API at this stage and don't want to
>>> miss some obvious requirements.
>>>
>>> Cheers,
>>> Gyula
>>>
>>


Re: Python Job Support for the Kubernetes Operator

2022-07-03 Thread Yang Wang
AFAIK, the Python job could be considered a special case of a jar job. The
user jar is flink-python-*.jar, located in the opt directory.
The Python script is just an argument of this user jar. So I believe
users can already submit Python jobs via the Flink Kubernetes operator.
However, they need some manual operations, including specifying the user
jar, downloading the Python script via an init container, etc.

What we could do in the Flink Kubernetes operator is to make the submission
more convenient by introducing a new field (e.g. pyScript).

cc @Dian Fu   @biaoge...@gmail.com
 WDYT?

Best,
Yang
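To illustrate what "a Python job is a special jar job" means in practice, here is a hand-written sketch of the equivalent job spec. The PythonDriver entry class is quoted from this thread; the jar name/version, the script path, and the exact spec field names (jarURI, entryClass, args) are placeholders written for illustration, not output of the operator.

```python
python_as_jar_job = {
    # Placeholder jar name/version; the real jar ships in $FLINK_HOME/opt.
    "jarURI": "local:///opt/flink/opt/flink-python_2.12-1.15.0.jar",
    # Main class named in this thread for running Python jobs.
    "entryClass": "org.apache.flink.client.python.PythonDriver",
    # The Python script is passed as an argument of the user jar;
    # "-py" is assumed here from the Flink CLI's Python options.
    "args": ["-py", "/opt/flink/usrlib/word_count.py"],
}

print(python_as_jar_job["entryClass"])
```

A dedicated pyScript field would let the operator fill in the jar and entry class itself, leaving the user to supply only the script.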

Gyula Fóra  于2022年7月4日周一 00:12写道:

> Hi Devs!
>
> Would anyone with a good understanding of the Python execution layer be
> interested in working on adding Python job support for the Flink Kubernetes
> Operator?
>
> This is a feature request that comes up often (
> https://issues.apache.org/jira/browse/FLINK-28364) and it would be a great
> way to fill some missing feature gaps on the operator :)
>
> I am of course happy to help or work together with someone on this but I
> have zero experience with the Python API at this stage and don't want to
> miss some obvious requirements.
>
> Cheers,
> Gyula
>


Re: [VOTE] FLIP-241: Completed Jobs Information Enhancement

2022-06-29 Thread Yang Wang
+1 (binding)

Best,
Yang

Zhu Zhu  于2022年6月29日周三 14:31写道:

> +1 (binding)
>
> Thanks,
> Zhu
>
> Xintong Song  于2022年6月23日周四 17:01写道:
> >
> > +1 (binding)
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Thu, Jun 23, 2022 at 1:49 PM Yangze Guo  wrote:
> >
> > > +1 (binding)
> > >
> > > Best,
> > > Yangze Guo
> > >
> > > On Thu, Jun 23, 2022 at 1:07 PM junhan yang 
> > > wrote:
> > > >
> > > > Hi everyone,
> > > >
> > > > Thanks for the feedbacks on the discussion thread[1]. I would like to
> > > start
> > > > a vote thread here for FLIP-241: Completed Jobs Information
> > > Enhancement[2].
> > > >
> > > > The vote will last for at least 72 hours unless there is an
> objection, I
> > > > will try to close it by *next Tuesday* if we receive sufficient votes
> > > until
> > > > then.
> > > >
> > > > Thank you again for your participation in this FLIP discussion.
> > > >
> > > > [1] https://lists.apache.org/thread/qycqmxbh37b5qzs72y110rp8457kkxkb
> > > > [2] https://cwiki.apache.org/confluence/x/dRD1D
> > > >
> > > > Best regards,
> > > > Junhan
> > >
>


Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.0.1 released

2022-06-28 Thread Yang Wang
Thanks Gyula for working on the first patch release for the Flink
Kubernetes Operator project.


Best,
Yang



Gyula Fóra  于2022年6月28日周二 00:22写道:

> The Apache Flink community is very happy to announce the release of Apache
> Flink Kubernetes Operator 1.0.1.
>
> The Flink Kubernetes Operator allows users to manage their Apache Flink
> applications and their lifecycle through native k8s tooling like kubectl.
> <
> https://flink.apache.org/news/2022/04/03/release-kubernetes-operator-0.1.0.html
> >
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Official Docker image for Flink Kubernetes Operator applications can be
> found at:
> https://hub.docker.com/r/apache/flink-kubernetes-operator
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351812
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Regards,
> Gyula Fora
>


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.1, release candidate #1

2022-06-26 Thread Yang Wang
+1 (binding)

Successfully verified the following:
- Verified the checksums and GPG signatures
- Verified that the source distribution does not contain any binaries
- Built the binary and image from the release source
- Verified the NOTICE and licenses in the source release and the docker image
- Verified the helm chart values with correct appVersion and image tag
  - The image tag is *b93417f1*, not *b93417f*, but this is trivial because
    they resolve to the same SHA.
- Manually tested operator functionality
  - Started a session FlinkDeployment and submitted a SessionJob CR
  - Started a Flink application job (both streaming and batch) with 1.15
  - Verified the Flink UI could be accessed via ingress
  - No strange operator logs


Best,
Yang
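For anyone following the verification guide, the first checklist item boils down to hashing the downloaded artifact and comparing it with the published checksum file. A minimal sketch (the bytes below stand in for the real release tarball and the real .sha512 content):

```python
import hashlib

def sha512_hex(data: bytes) -> str:
    # Apache releases publish a .sha512 file next to each artifact.
    return hashlib.sha512(data).hexdigest()

artifact = b"stand-in for the release source tarball"
published_checksum = sha512_hex(artifact)  # stand-in for the .sha512 file

if sha512_hex(artifact) == published_checksum:
    print("checksum OK")
```

GPG signature verification is a separate step (`gpg --verify`) against the KEYS file linked in the vote mail.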

Chenya Zhang  于2022年6月24日周五 22:31写道:

> +1 (non-binding)
>
> - Verified maven and helm chart versions for built from source
> - Verified helm chart points to correct docker image and deploys it by
> default
> - Verified helm installation and basic checkpointing, stateful examples
> with upgrades and manual savepoints
> - Verified online documents including Quick Start etc.
>
> Chenya
>
> On Fri, Jun 24, 2022 at 4:26 AM Őrhidi Mátyás 
> wrote:
>
> > +1
> > - Verified hashes/signatures
> > - Verified Helm chart, helm install
> > - Ran a few example jobs
> >
> > Matyas
> >
> >
> > On Fri, Jun 24, 2022 at 11:30 AM Gyula Fóra  wrote:
> >
> > > +1 (binding)
> > >
> > >  - Verified hashes/signatures, license headers for the release
> artifacts
> > >  - Verified Helm chart, helm install
> > >  - Ran a few example jobs
> > >  - Verified release notes
> > >
> > > Gyul
> > >
> > >
> > > On Thu, Jun 23, 2022 at 8:34 AM Gyula Fóra  wrote:
> > >
> > > > Hi Devs,
> > > >
> > > > Please review and vote on the release candidate #1 for the version
> > 1.0.1
> > > of
> > > > Apache Flink Kubernetes Operator,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > **Release Overview**
> > > >
> > > > As an overview, the release consists of the following:
> > > > a) Kubernetes Operator canonical source distribution (including the
> > > > Dockerfile), to be deployed to the release repository at
> > dist.apache.org
> > > > b) Kubernetes Operator Helm Chart to be deployed to the release
> > > repository
> > > > at dist.apache.org
> > > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > > d) Docker image to be pushed to dockerhub
> > > >
> > > > **Staging Areas to Review**
> > > >
> > > > The staging areas containing the above mentioned artifacts are as
> > > follows,
> > > > for your review:
> > > > * All artifacts for a,b) can be found in the corresponding dev
> > repository
> > > > at dist.apache.org [1]
> > > > * All artifacts for c) can be found at the Apache Nexus Repository
> [2]
> > > > * The docker image for d) is staged on github [7]
> > > >
> > > > All artifacts are signed with the key
> > > > 0B4A34ADDFFA2BB54EB720B221F06303B87DAFF1[3]
> > > >
> > > > Other links for your review:
> > > > * JIRA release notes [4]
> > > > * source code tag "release-1.0.1-rc1" [5]
> > > > * PR to update the website Downloads page to include Kubernetes
> > Operator
> > > > links [6]
> > > >
> > > > **Vote Duration**
> > > >
> > > > The voting time will run for at least 72 hours.
> > > > It is adopted by majority approval, with at least 3 PMC affirmative
> > > votes.
> > > >
> > > > **Note on Verification**
> > > >
> > > > You can follow the basic verification guide here[8].
> > > > Note that you don't need to verify everything yourself, but please
> make
> > > > note of what you have tested together with your +- vote.
> > > >
> > > > Thanks,
> > > > Gyula
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.1-rc1/
> > > > [2]
> > > >
> > https://repository.apache.org/content/repositories/orgapacheflink-1512/
> > > > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > > >
> > > > [4]
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351812
> > > > <
> > >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351812
> > > >
> > > > [5]
> > > >
> > >
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.0.1-rc1
> > > > [6] https://github.com/apache/flink-web/pull/555
> > > > [7] ghcr.io/apache/flink-kubernetes-operator:b93417f
> > > > [8]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> > > >
> > >
> >
>


Re: [DISCUSS] Release Kubernetes operator 1.0.1

2022-06-23 Thread Yang Wang
Thanks Gyula for preparing the first patch release for Flink Kubernetes
operator.

+1 for this.

Best,
Yang

Őrhidi Mátyás  于2022年6月22日周三 23:38写道:

> +1 for the patch release. Thanks Gyula!
>
> On Wed, Jun 22, 2022 at 5:35 PM Márton Balassi 
> wrote:
>
>> Hi team,
>>
>> +1 for having a 1.0.1 for the Kubernetes Operator.
>>
>> On Wed, Jun 22, 2022 at 4:23 PM Gyula Fóra  wrote:
>>
>> > Hi Devs!
>> >
>> > How do you feel about releasing the 1.0.1 patch release for the
>> Kubernetes
>> > operator?
>> >
>> > We have fixed a few annoying issues that many people tend to hit.
>> >
>> > Given that we are about halfway until the next minor release based on
>> the
>> > proposed schedule I think we could prepare a 1.0.1 RC1 in the next 1-2
>> days
>> > .
>> >
>> > I can volunteer to be the release manager.
>> >
>> > What do you think?
>> >
>> > Cheers,
>> > Gyula
>> >
>>
>


Re: [DISCUSS] Flink Kubernetes Operator release cadence proposal

2022-06-21 Thread Yang Wang
+1 for 2 month release cycles.

Since we have promised backward compatibility for the CRD, I think it is
also reasonable for us to maintain the latest two minor versions with patch
releases.

Given that we only have 5~6 weeks for feature development, we may need to
finalize the feature list as soon as possible in each release cycle.
Otherwise, we are at great risk of delays. If we are targeting the 1.1.0
release for Aug 1, it is time to determine which features we want to
include in this release.

I agree with Gyula that we need to continuously improve the test coverage,
especially the e2e tests. Users are very welcome to share their
production use cases,
and we could consider whether they can be covered by e2e tests. Benefiting
from this, we could ship stable releases more quickly and easily.



Best,
Yang

Gyula Fóra  于2022年6月21日周二 19:57写道:

> Hi Matyas!
>
> Thanks for starting the discussion. I think the 2 month release cycle
> sounds reasonable.
>
> I think it's important for users to have frequent operator releases as
> these affect all Flink jobs running in the environment. We should also only
> adopt a schedule that we can most likely keep.
> If we want to be successful with the proposed schedule we have to ensure
> that each release has a relatively small scope of new features and we have
> good test coverage.
>
> In addition to your suggestion I would like to add a feature-freeze for
> bigger new features after 5 weeks (1 week before cutting the release
> branch).
>
> So in practice for 1.1.0 this would look like:
>
> - Jun 6 : 1.0.0 was released
> - July 11: Feature Freeze
> - July 18: Cut release-1.1 branch
> - Aug 1: Target 1.1.0 release date
>
> Cheers,
> Gyula
>
> On Tue, Jun 21, 2022 at 9:04 AM Őrhidi Mátyás 
> wrote:
>
> > Hi Devs,
> >
> > After the successful Kubernetes Operator 1.0.0 release, which is
> > considered to be the first production grade one, it is probably a good
> time
> > now to also agree on a predictable release cadence for the Operator too,
> > similarly to the time-based release plan we have for the Flink core
> project.
> >
> > Given that the Operator itself is not strictly bound to Flink versions,
> > it can be upgraded independently from the runtime versions it manages. It
> > would benefit the community to have frequent minor releases until the
> > majority of the roadmap items are complete that also encourages users to
> > upgrade regularly in reasonable boundaries. Based on some offline
> > discussion with Gyula Fora I would like to propose the following
> > operational model for Operator releases:
> >
> > - time-based release cadence with 2 month release cycles ( This would
> give
> > us roughly 6 weeks pure dev time and leave 2 weeks for the release
> process
> > to finish)
> > - on-demand patch releases for critical issues only
> > - support the current and previous minor releases with bug fixes
> >
> > I am looking forward to your feedback and suggestions on this topic. Once
> > we have an agreement I will formalize it on a Wiki page.
> >
> > Thanks,
> > Matyas
> >
>


Re: Re: [ANNOUNCE] New Apache Flink Committers: Qingsheng Ren, Shengkai Fang

2022-06-21 Thread Yang Wang
Congratulations, Qingsheng and ShengKai.


Best,
Yang

Benchao Li  于2022年6月21日周二 19:33写道:

> Congratulations!
>
> weijie guo  于2022年6月21日周二 13:44写道:
>
> > Congratulations, Qingsheng and ShengKai!
> >
> > Best regards,
> >
> > Weijie
> >
> >
> > Yuan Mei  于2022年6月21日周二 13:07写道:
> >
> > > Congrats Qingsheng and ShengKai!
> > >
> > > Best,
> > >
> > > Yuan
> > >
> > > On Tue, Jun 21, 2022 at 11:27 AM Terry Wang 
> wrote:
> > >
> > > > Congratulations, Qingsheng and ShengKai!
> > > >
> > >
> >
>
>
> --
>
> Best,
> Benchao Li
>


Re: [VOTE] FLIP-224: Blocklist Mechanism

2022-06-19 Thread Yang Wang
+1(binding)

Best,
Yang

Yun Gao  于2022年6月17日周五 18:03写道:

> +1 (binding)
>
> Thanks for the discussion and updates!
>
> Best,
> Yun Gao
>
>
> --
> From:Peter Huang 
> Send Time:2022 Jun. 16 (Thu.) 00:05
> To:dev 
> Subject:Re: [VOTE] FLIP-224: Blocklist Mechanism
>
> +1
>
> On Wed, Jun 15, 2022 at 2:55 AM Xintong Song 
> wrote:
>
> > +1 (binding)
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Wed, Jun 15, 2022 at 5:30 PM Jiangang Liu 
> > wrote:
> >
> > > +1
> > >
> > > Chesnay Schepler  于2022年6月15日周三 17:15写道:
> > >
> > > > +1
> > > >
> > > > On 15/06/2022 10:49, Lijie Wang wrote:
> > > > > Hi everyone,
> > > > >
> > > > > We've received some additional concerns since the last vote [1],
> and
> > > > > therefore made a lot of changes to design.  You can find the
> details
> > in
> > > > [2]
> > > > > and the discussions in [3].
> > > > >
> > > > > Now I'd like to start a new vote thread for FLIP-224. The vote will
> > > last
> > > > > for at least 72 hours unless there is an objection or insufficient
> > > votes.
> > > > >
> > > > > [1]
> https://lists.apache.org/thread/3416vks1j35co9608gkmsoplvcjjz7bg
> > > > > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-224
> > > > > %3A+Blocklist+Mechanism
> > > > > [3]
> https://lists.apache.org/thread/fngkk52kjbc6b6v9nn0lkfq6hhsbgb1h
> > > > > Best,
> > > > > Lijie
> > > > >
> > > >
> > > >
> > >
> >
>
>


Re: [DISCUSS] FLIP-241: Completed Jobs Information Enhancement

2022-06-16 Thread Yang Wang
Thanks Xintong for the explanation.

It makes sense to leave the discussion about job result store in a
dedicated thread.


Best,
Yang

Xintong Song  于2022年6月16日周四 13:40写道:

> My impression of JobResultStore is more about fault tolerance and high
> availability. Using it for providing information to users sounds worth
> exploring. We probably need more time to think it through.
>
> Given that it doesn't conflict with what we have proposed in this FLIP, I'd
> suggest considering it as a separate thread and exclude it from the scope
> of this one.
>
> Best,
>
> Xintong
>
>
>
> On Thu, Jun 16, 2022 at 11:43 AM Yang Wang  wrote:
>
> > This is a very useful feature both for finished streaming and batch jobs.
> >
> > Except for the WebUI & REST API improvements, I am curious whether we
> could
> > also integrate some critical information (e.g. the latest checkpoint)
> > into the job result store[1].
> > I am just feeling this is also somehow related to "Completed Jobs
> > Information Enhancement".
> > And I think the history server is not necessary in all scenarios,
> > especially when users only want to check the job execution result.
> > [1].
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore
> >
> >
> > Best,
> > Yang
> >
> > Xintong Song  于2022年6月15日周三 15:37写道:
> >
> > > Thanks Junhan,
> > >
> > > +1 for the proposed improvements.
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Wed, Jun 15, 2022 at 3:16 PM Yangze Guo  wrote:
> > >
> > > > Thanks for driving this, Junhan.
> > > >
> > > > I think it's a valuable usability improvement for both streaming and
> > > > batch users. Looking forward to the community feedback.
> > > >
> > > > Best,
> > > > Yangze Guo
> > > >
> > > >
> > > >
> > > > On Wed, Jun 15, 2022 at 3:10 PM junhan yang <
> yangjunhan1...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > I would like to open a discussion on FLIP-241: Completed Jobs
> > > Information
> > > > > Enhancement.
> > > > >
> > > > > As far as we can tell, streaming and batch users have different
> > > interests
> > > > > in probing a job. As Flink grows into a unified streaming & batch
> > > > processor
> > > > > and is adopted by more and more batch users, the user experience of
> > > > > completed job's inspection has become more and more important.
> After
> > > > doing
> > > > > several market research, there are several potential improvements
> > > > spotted.
> > > > >
> > > > > The main purpose here is due to the involvement of WebUI & REST API
> > > > > changes, which should be openly discussed and voted on as FLIPs.
> > > > >
> > > > > You can find more details in FLIP-241 document[1]. Looking forward
> to
> > > > > your feedback.
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/x/dRD1D
> > > > >
> > > > > Best regards,
> > > > > Junhan
> > > >
> > >
> >
>


Re: [DISCUSS] Dockerimage + Helm chart only patch releases for the Kubernetes operator

2022-06-15 Thread Yang Wang
From what I have learned from the last 1.0.0 release, I think
flink-kubernetes-operator patch releases are neither too complicated nor a big
burden. The major bottleneck might be the VOTE duration (e.g. jet lag, the
weekend): we need almost one week for the release after all the blockers are
resolved.

Given that we already provide a snapshot image via GitHub Packages for every
commit, I believe it is easy for users to upgrade the operator if it is really
necessary and they cannot wait for the next release.


Best,
Yang

Chesnay Schepler  于2022年6月14日周二 20:46写道:

> My bad, our bylaws actually state that release votes must have a minimum
> 3 days duration.
>
> On 14/06/2022 14:46, Chesnay Schepler wrote:
> > Yes, pretty much.
> >
> > Mind you that the 72h voting duration is a recommendation by the ASF;
> > it's not a strict rule.
> > AFAICT we also haven't locked this down in our bylaws, apart from
> > requiring 3 votes.
> >
> > On 14/06/2022 14:25, Gyula Fóra wrote:
> >> I think what you are referring to is here:
> >> https://www.apache.org/legal/release-policy.html#source-packages
> >>
> >> Based on this we probably cannot simply release the docker image. We
> >> could
> >> decide to not release the maven artifacts though, but that seems to be a
> >> minor difference and probably not worth it.
> >>
> >> Gyula
> >>
> >> On Tue, Jun 14, 2022 at 2:19 PM Gyula Fóra  wrote:
> >>
> >>> Thanks Chesnay, these are exactly the questions I would like to clarify
> >>> because I don't really understand the limitations/boundaries of the
> >>> apache
> >>> release process.
> >>>
> >>> Is there a strict requirement to have a source release accompany the
> >>> docker image? I will have to look this up.
> >>>
> >>> Gyula
> >>>
> >>> On Tue, Jun 14, 2022 at 2:16 PM Chesnay Schepler 
> >>> wrote:
> >>>
>  On 14/06/2022 14:10, Gyula Fóra wrote:
> > For the operator the main logic (and the bugs) are part of the
> > operator
> > docker image and the helm charts associated with it. It would be nice
>  to be
> > able to have lightweight patch releases that only contain the docker
> > image + updated Helm chart.
> >
> > This would allow us to give users new docker image releases in a
> > short
> > period of time with reduced testing and voting overhead (these patch
> > releases could have a shorter voting period also).
>  I'm not sure if this is allowed because it sounds like you're
>  proposing
>  to release binaries without an associated source release
> 
> 
> >
>
>


Re: [DISCUSS] FLIP-241: Completed Jobs Information Enhancement

2022-06-15 Thread Yang Wang
This is a very useful feature for both finished streaming and batch jobs.

Besides the WebUI & REST API improvements, I am curious whether we could also
integrate some critical information (e.g. the latest checkpoint) into the
job result store[1].
I feel this is also somehow related to "Completed Jobs
Information Enhancement".
And I think the history server is not necessary for all scenarios, especially
when users only want to check the job execution result.

[1].
https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore


Best,
Yang

Xintong Song  于2022年6月15日周三 15:37写道:

> Thanks Junhan,
>
> +1 for the proposed improvements.
>
> Best,
>
> Xintong
>
>
>
> On Wed, Jun 15, 2022 at 3:16 PM Yangze Guo  wrote:
>
> > Thanks for driving this, Junhan.
> >
> > I think it's a valuable usability improvement for both streaming and
> > batch users. Looking forward to the community feedback.
> >
> > Best,
> > Yangze Guo
> >
> >
> >
> > On Wed, Jun 15, 2022 at 3:10 PM junhan yang 
> > wrote:
> > >
> > > Hi all,
> > >
> > > I would like to open a discussion on FLIP-241: Completed Jobs
> Information
> > > Enhancement.
> > >
> > > As far as we can tell, streaming and batch users have different
> interests
> > > in probing a job. As Flink grows into a unified streaming & batch
> > processor
> > > and is adopted by more and more batch users, the user experience of
> > > completed job's inspection has become more and more important. After
> > doing
> > > several market research, there are several potential improvements
> > spotted.
> > >
> > > The main purpose here is due to the involvement of WebUI & REST API
> > > changes, which should be openly discussed and voted on as FLIPs.
> > >
> > > You can find more details in FLIP-241 document[1]. Looking forward to
> > > your feedback.
> > >
> > > [1] https://cwiki.apache.org/confluence/x/dRD1D
> > >
> > > Best regards,
> > > Junhan
> >
>


Re: [ANNOUNCE] New Apache Flink PMC Member - Jingsong Lee

2022-06-15 Thread Yang Wang
Congrats, Jingsong!

Best,
Yang

Zakelly Lan  于2022年6月16日周四 11:16写道:

> Congrats & well deserved!
>
> Best,
> Zakelly
>
> On Thu, Jun 16, 2022 at 10:36 AM Guowei Ma  wrote:
>
> > Congrats, Jingsong!
> >
> > Best,
> > Guowei
> >
> >
> > On Thu, Jun 16, 2022 at 9:49 AM Hangxiang Yu 
> wrote:
> >
> > > Congrats, Jingsong!
> > >
> > > Best,
> > > Hangxiang
> > >
> > > On Thu, Jun 16, 2022 at 9:46 AM Aitozi  wrote:
> > >
> > > > Congrats, Jingsong!
> > > >
> > > > Best,
> > > > Aitozi
> > > >
> > > > Zhuoluo Yang  于2022年6月16日周四 09:26写道:
> > > >
> > > > > Many congratulations to teacher Lee!
> > > > >
> > > > > Thanks,
> > > > > Zhuoluo
> > > > >
> > > > >
> > > > > Dian Fu  于2022年6月16日周四 08:54写道:
> > > > >
> > > > > > Congratulations, Jingsong!
> > > > > >
> > > > > > Regards,
> > > > > > Dian
> > > > > >
> > > > > > On Thu, Jun 16, 2022 at 1:08 AM Yu Li  wrote:
> > > > > >
> > > > > > > Congrats, Jingsong!
> > > > > > >
> > > > > > > Best Regards,
> > > > > > > Yu
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 15 Jun 2022 at 15:26, Sergey Nuyanzin <
> > snuyan...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > Congratulations, Jingsong!
> > > > > > > >
> > > > > > > > On Wed, Jun 15, 2022 at 8:45 AM Jingsong Li <
> > > > jingsongl...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks everyone.
> > > > > > > > >
> > > > > > > > > It's great to be with you in the Flink community!
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Jingsong
> > > > > > > > >
> > > > > > > > > On Wed, Jun 15, 2022 at 2:11 PM Yun Gao
> > > > >  > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Congratulations, Jingsong!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Yun Gao
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > --
> > > > > > > > > > From:Jing Zhang 
> > > > > > > > > > Send Time:2022 Jun. 14 (Tue.) 11:05
> > > > > > > > > > To:dev 
> > > > > > > > > > Subject:Re: [ANNOUNCE] New Apache Flink PMC Member -
> > Jingsong
> > > > Lee
> > > > > > > > > >
> > > > > > > > > > Congratulations, Jingsong!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jing Zhang
> > > > > > > > > >
> > > > > > > > > > Leonard Xu  于2022年6月14日周二 10:54写道:
> > > > > > > > > >
> > > > > > > > > > > Congratulations, Jingsong!
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Leonard
> > > > > > > > > > >
> > > > > > > > > > > > 2022年6月13日 下午6:52,刘首维 
> > 写道:
> > > > > > > > > > > >
> > > > > > > > > > > > Congratulations and well deserved, Jingsong!
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Shouwei
> > > > > > > > > > > > -- Original Message --
> > > > > > > > > > > > From: "dev" <luoyu...@alumni.sjtu.edu.cn>
> > > > > > > > > > > > Sent: Monday, June 13, 2022 6:09 PM
> > > > > > > > > > > > To: "dev" <dev@flink.apache.org>
> > > > > > > > > > > > Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Jingsong Lee
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Congratulations, Jingsong!
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Yuxia
> > > > > > > > > > > >
> > > > > > > > > > > > - Original Message -
> > > > > > > > > > > > From: "Yun Tang"
> > > > > > > > > > > > To: "dev"
> > > > > > > > > > > > Sent: Monday, June 13, 2022 6:12:24 PM
> > > > > > > > > > > > Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Jingsong Lee
> > > > > > > > > > > >
> > > > > > > > > > > > Congratulations, Jingsong! Well deserved.
> > > > > > > > > > > >
> > > > > > > > > > > > Best
> > > > > > > > > > > > Yun Tang
> > > > > > > > > > > > ________________________________
> > > > > > > > > > > > From: Xingbo Huang
> > > > > > > > > > > > Sent: Monday, June 13, 2022 17:39
> > > > > > > > > > > > To: dev
> > > > > > > > > > > > Subject: Re: [ANNOUNCE] New Apache Flink PMC Member - Jingsong Lee
> > > > > > > > > > > >
> > > > > > > > > > > > Congratulations, Jingsong!
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Xingbo
> > > > > > > > > > > >
> > > > > > > > > > > > Jane Chan  > > > 17:23写道:
> > > > > > > > > > > >
> > > > > > > > > > > >  Congratulations, Jingsong!
> > > > > > > > > > > > 
> > > > > > > > > > > >  Best,
> > > > > > > > > > > >  Jane Chan
> > > > > > > > > > > > 
> > > > > > > > > > > >  On Mon, Jun 13, 2022 at 4:43 PM Shuo Cheng <
> > > > > > > > njucs...@gmail.com
> > > > > > > > > > >  wrote:
> > > > > > 

[ANNOUNCE] Apache Flink Kubernetes Operator 1.0.0 released

2022-06-05 Thread Yang Wang
The Apache Flink community is very happy to announce the release of Apache
Flink Kubernetes Operator 1.0.0.

The Flink Kubernetes Operator allows users to manage their Apache Flink
applications and their lifecycle through native k8s tooling like kubectl.
This is the first production-ready release and brings numerous improvements
and new features to almost every aspect of the operator.

Please check out the release blog post for an overview of the release:
https://flink.apache.org/news/2022/06/05/release-kubernetes-operator-1.0.0.html

The release is available for download at:
https://flink.apache.org/downloads.html

Maven artifacts for Flink Kubernetes Operator can be found at:
https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator

Official Docker image for Flink Kubernetes Operator applications can be
found at:
https://hub.docker.com/r/apache/flink-kubernetes-operator

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351500

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Regards,
Gyula & Yang


[RESULT] [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #4

2022-06-03 Thread Yang Wang
I'm happy to announce that we have unanimously approved this release.

There are 5 approving votes, 3 of which are binding:

* Marton Balassi (binding)
* Gyula Fora (binding)
* Biao Geng (non-binding)
* Jim Busche (non-binding)
* Yang Wang (binding)

There are no disapproving votes.

Thank you all for verifying the release candidate. I will proceed to finalize
the release on the weekend and announce it once everything is published.

Best,

Yang


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #4

2022-06-03 Thread Yang Wang
I am closing this VOTE since it has run for enough time and there's no more
feedback.

Best,
Yang

Yang Wang  于2022年6月2日周四 12:35写道:

> +1 (binding)
>
> Successfully verified the following:
> - Verify that the checksums and GPG files
> - Verify that the source distributions do not contain any binaries
> - Build binary and image from release source
> - Verify the NOTICE and licenses in source release and the docker image
> - Verify the helm chart values with correct appVersion and image tag
> - Operator functionality manual testing
> - Start a session FlinkDeployment and submit a SessionJob CR
> - Start a Flink Application batch job with 1.15
> - Verify the FlinkUI could be accessed via ingress
> - The operator logs is normal
>
>
> Best,
> Yang
>
> Jim Busche  于2022年6月2日周四 01:04写道:
>
>> Hi Yang,
>>
>> +1 (not-binding)
>>
>>
>>   *   Helm install looks good, logs look normal
>>   *   Podman build from source looks good
>>   *   Security scans of a built image and your
>> ghcr.io/apache/flink-kubernetes-operator:fa2cd14 container look great.
>>   *   UI and basic sample look good.
>>
>>
>>
>> Thank you, Jim
>>
>


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #4

2022-06-01 Thread Yang Wang
+1 (binding)

Successfully verified the following:
- The checksums and GPG files
- That the source distributions do not contain any binaries
- Building the binary and image from the release source
- The NOTICE and licenses in the source release and the docker image
- The helm chart values with correct appVersion and image tag
- Operator functionality manual testing:
  - Start a session FlinkDeployment and submit a SessionJob CR
  - Start a Flink Application batch job with 1.15
  - Verify the FlinkUI could be accessed via ingress
  - The operator logs are normal
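The checksum part of this list can be sketched in code. A minimal illustration, assuming a `.sha512` file in the format produced by `sha512sum`; the demo file names below are made up, and a real verification would also run `gpg --verify` on the `.asc` signatures of the actual RC tarballs:

```python
import hashlib
from pathlib import Path

def sha512_matches(artifact: Path, checksum_file: Path) -> bool:
    # A .sha512 file produced by `sha512sum` begins with the hex digest.
    recorded = checksum_file.read_text().split()[0]
    actual = hashlib.sha512(artifact.read_bytes()).hexdigest()
    return recorded == actual

# Demo with a throwaway file; real verification targets the RC artifacts.
Path("artifact.tgz").write_bytes(b"demo artifact")
digest = hashlib.sha512(b"demo artifact").hexdigest()
Path("artifact.tgz.sha512").write_text(f"{digest}  artifact.tgz\n")
print(sha512_matches(Path("artifact.tgz"), Path("artifact.tgz.sha512")))  # -> True
```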


Best,
Yang

Jim Busche  于2022年6月2日周四 01:04写道:

> Hi Yang,
>
> +1 (not-binding)
>
>
>   *   Helm install looks good, logs look normal
>   *   Podman build from source looks good
>   *   Security scans of a built image and your
> ghcr.io/apache/flink-kubernetes-operator:fa2cd14 container look great.
>   *   UI and basic sample look good.
>
>
>
> Thank you, Jim
>


[VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #4

2022-06-01 Thread Yang Wang
Hi everyone,

Please review and vote on the release candidate #4 for the version 1.0.0 of
Apache Flink Kubernetes Operator,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Release Overview**

As an overview, the release consists of the following:
a) Kubernetes Operator canonical source distribution (including the
Dockerfile), to be deployed to the release repository at dist.apache.org
b) Kubernetes Operator Helm Chart to be deployed to the release repository
at dist.apache.org
c) Maven artifacts to be deployed to the Maven Central Repository
d) Docker image to be pushed to dockerhub

**Staging Areas to Review**

The staging areas containing the above mentioned artifacts are as follows,
for your review:
* All artifacts for a,b) can be found in the corresponding dev repository
at dist.apache.org [1]
* All artifacts for c) can be found at the Apache Nexus Repository [2]
* The docker image for d) is staged on github [7]

All artifacts are signed with the key
2FF2977BBBFFDF283C6FE7C6A301006F3591EE2C [3]

Other links for your review:
* JIRA release notes [4]
* source code tag "release-1.0.0-rc4" [5]
* PR to update the website Downloads page to include Kubernetes Operator
links [6]

**Vote Duration**

Since there are no functional changes from release candidate #3, the voting
time will run for 48 hours.
It is adopted by majority approval, with at least 3 PMC affirmative votes.

**Note on Verification**

You can follow the basic verification guide here[8].
Note that you don't need to verify everything yourself, but please make
note of what you have tested together with your +- vote.

Thanks,
Yang

[1]
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.0-rc4/
[2] https://repository.apache.org/content/repositories/orgapacheflink-1506/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351500
[5]
https://github.com/apache/flink-kubernetes-operator/tree/release-1.0.0-rc4
[6] https://github.com/apache/flink-web/pull/542
[7] ghcr.io/apache/flink-kubernetes-operator:fa2cd14
[8]
https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release


Re: About Native Deployment's Autoscaling implementation

2022-06-01 Thread Yang Wang
Hi Talat,

Using subresources for autoscaling makes a lot of sense to me.

Could you be more specific why you think changing task manager count will
> not work for native deployment ?


The native K8s integration uses an active ResourceManager. This means that
the TaskManager count will be calculated as *parallelism / numTaskSlots*. If
we want to add more TaskManager pods, we need to increase the
parallelism.
Unlike the standalone deployment, there is no way to directly configure
the TaskManager count.

Given that we cannot set the replicas of TaskManager pods when using the
native K8s integration, we need to calculate and configure the parallelism
via *[min/max]Replicas * numTaskSlots*. See the prototype in Gyula's PR[1].
So the problem is how we could change the parallelism without creating the
Flink application again; we do not have a REST API for this.

[1]. https://github.com/apache/flink-kubernetes-operator/pull/227
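As a side note, the replica/parallelism arithmetic described above can be sketched as follows (an illustrative sketch only; the helper names are hypothetical and not from the operator code base):

```python
import math

def parallelism_for_replicas(replicas: int, num_task_slots: int) -> int:
    # To get N TaskManager pods from the active ResourceManager we must
    # request a parallelism of N * numTaskSlots.
    return replicas * num_task_slots

def task_managers_for_parallelism(parallelism: int, num_task_slots: int) -> int:
    # The active ResourceManager derives the TaskManager count as
    # ceil(parallelism / numTaskSlots).
    return math.ceil(parallelism / num_task_slots)

print(parallelism_for_replicas(3, 4))         # -> 12
print(task_managers_for_parallelism(10, 4))   # -> 3
```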

Best,
Yang

Talat Uyarer  于2022年6月1日周三 08:34写道:

> Hi Yang and Gyula,
>
> Yang, Could you give a little bit more information ?  What prevents us
> from changing task managers' count ? I am aware of ActiveResourceManager of
> Flink. But Flink only calls resources when it initializes a cluster.
> If we set
>
>- jobmanager.scheduler: adaptive
>- cluster.declarative-resource-management.enabled: true
>
> While deploying a Flink Native cluster. Even though it is native
> deployment. Flink will be able to add task manager add/remove behavior.
> Because basically adding/removing a task manager is similar to recovering a
> failed task manager.
>
> Could you be more specific why you think changing task manager count will
> not work for native deployment ? I will not use reactive-mode. Scaling up
> or down will be handled by HPA. We will define sub sources.[1]
> Users will give us starting points such as replicaCount and max count such
> as maxRecplicaCount. Flink clusters will be initialized by replicaCount for
> TaskManager.
>
> Gyula, I want to make HPA part of FlinkDeployment. And introduce auto
> scaling settings such as metric service endpoints and some other default
> settings such as threshold etc. to reduce complexity. Let me start
> implementing something after Yang's answer. When users enable autoscaling
> we need to also set scheduler and declarative resource management settings
> behind the scenes.
>
> Thanks
>
> [1]
> https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource
>
> On Mon, May 30, 2022 at 2:25 AM Yang Wang  wrote:
>
>> >
>> > I thought we could enable Adaptive Scheduler, so adding or removing a
>> task
>> > manager is the same as restarting a job when we use an adaptive
>> scheduler.
>> > Do I miss anything ?
>>
>>
>> It is true for standalone mode since adding/removing a TaskManager pod is
>> fully controlled by users(or external tools).
>> But it is not valid for native K8s integration[1]. Currently, we could not
>> dynamically change the TaskManager pods once the job is running.
>>
>> I really hope the HPA could work for both standalone and native mode.
>>
>>
>> [1].
>> https://flink.apache.org/2021/02/10/native-k8s-with-ha.html
>>
>> Best,
>> Yang
>>
>> Gyula Fóra  于2022年5月30日周一 12:23写道:
>>
>> > Hi Talat!
>> >
>> > Sorry for the late reply, I have been busy with some fixes for the
>> release
>> > and travelling.
>> >
>> > I think the prometheus metrics integration sounds like a great idea that
>> > would cover the needs of most users.
>> > This way users can also integrate easily with the custom Flink metrics
>> too.
>> >
>> > maxReplicas: We could add this easily to the taskManager resource specs
>> >
>> > Nice workflow picture, I would love to include this in the docs later.
>> One
>> > minor comment, should the HPA be outside of the FlinkDeployment box?
>> >
>> > Cheers,
>> > Gyula
>> >
>> > On Wed, May 25, 2022 at 7:50 PM Talat Uyarer <
>> tuya...@paloaltonetworks.com>
>> > wrote:
>> >
>> >> Hi Yang,
>> >>
>> >> I thought we could enable Adaptive Scheduler, so adding or removing a
>> task
>> >> manager is the same as restarting a job when we use an adaptive
>> scheduler.
>> >> Do I miss anything ?
>> >>
>> >> Thanks
>> >>
>> >> On Tue, May 24, 2022 at 8:16 PM Yang

[jira] [Created] (FLINK-27860) List the CSS/docs dependencies in the NOTICE

2022-05-31 Thread Yang Wang (Jira)
Yang Wang created FLINK-27860:
-

 Summary: List the CSS/docs dependencies in the NOTICE
 Key: FLINK-27860
 URL: https://issues.apache.org/jira/browse/FLINK-27860
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Yang Wang
Assignee: Yang Wang
 Fix For: kubernetes-operator-1.0.0


We should list the CSS/docs dependencies in the NOTICE file.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #3

2022-05-31 Thread Yang Wang
Thanks all for your testing and patience.

And sorry that I have to cancel this VOTE since Márton Balassi found a
license issue: we do not list the CSS/docs dependencies in the NOTICE file of
the source release[1].

I will create another release candidate today including this fix. Given
that there are no functional changes, we could set the voting period to 48
hours for the next RC.


[1].
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.0-rc3/

Best,
Yang

Jim Busche  于2022年6月1日周三 08:30写道:

> Hi Yang,  Thank you for RC3
>
> +1 (not-binding)
>
>
>   *   Podman builds look great with your .dockerignore file and “COPY .
> .”  Thank you for the fix.
>   *   Twistlock security scans look clean of both your “
> ghcr.io/apache/flink-kubernetes-operator:52b50c2 image” as well as a
> locally built image from source.
>   *   I’ve tried it on both Openshift 4.8 and 4.10 and the basic test and
> UI look good, using helm repo install from the provided helm repo as well
> as from source with the local image.
>   *   Logs look normal
>
>
>
> Thank you,
>
> Jim
>
>
>


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #3

2022-05-31 Thread Yang Wang
+1 (binding)

Successfully verified the following:
- The checksums and GPG files
- That the source distributions do not contain any binaries
- Building the binary and image from the release source
- The NOTICE and licenses in the docker image
- The helm chart values with correct appVersion and image tag
- Operator functionality manual testing:
  - Start a session FlinkDeployment and submit a SessionJob CR
  - Start a Flink Application
  - Verify the FlinkUI could be accessed via ingress
  - The operator logs are normal


Best,
Yang


Nicholas Jiang  于2022年5月31日周二 16:16写道:

> Hi Yang!
>
> +1 (not-binding)
>
> Verified the following successfully:
>
> - Build from source, build container from source
> - Run the examples for application, session and session job deployments
> successfully and without any errors.
>
> Regards,
> Nicholas Jiang
>
> On 2022/05/31 06:26:02 Yang Wang wrote:
> > Hi everyone,
> >
> > Please review and vote on the release candidate #3 for the version 1.0.0
> of
> > Apache Flink Kubernetes Operator,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Kubernetes Operator canonical source distribution (including the
> > Dockerfile), to be deployed to the release repository at dist.apache.org
> > b) Kubernetes Operator Helm Chart to be deployed to the release
> repository
> > at dist.apache.org
> > c) Maven artifacts to be deployed to the Maven Central Repository
> > d) Docker image to be pushed to dockerhub
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as
> follows,
> > for your review:
> > * All artifacts for a,b) can be found in the corresponding dev repository
> > at dist.apache.org [1]
> > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > * The docker image for d) is staged on github [7]
> >
> > All artifacts are signed with the key
> > 2FF2977BBBFFDF283C6FE7C6A301006F3591EE2C [3]
> >
> > Other links for your review:
> > * JIRA release notes [4]
> > * source code tag "release-1.0.0-rc3" [5]
> > * PR to update the website Downloads page to include Kubernetes Operator
> > links [6]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours.
> > It is adopted by majority approval, with at least 3 PMC affirmative
> votes.
> >
> > **Note on Verification**
> >
> > You can follow the basic verification guide here[8].
> > Note that you don't need to verify everything yourself, but please make
> > note of what you have tested together with your +- vote.
> >
> > Thanks,
> > Yang
> >
> > [1]
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.0-rc3/
> > [2]
> https://repository.apache.org/content/repositories/orgapacheflink-1505/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4]
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351500
> > [5]
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.0.0-rc3
> > [6] https://github.com/apache/flink-web/pull/542
> > [7] ghcr.io/apache/flink-kubernetes-operator:52b50c2
> > [8]
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> >
>


[VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #3

2022-05-31 Thread Yang Wang
Hi everyone,

Please review and vote on the release candidate #3 for the version 1.0.0 of
Apache Flink Kubernetes Operator,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Release Overview**

As an overview, the release consists of the following:
a) Kubernetes Operator canonical source distribution (including the
Dockerfile), to be deployed to the release repository at dist.apache.org
b) Kubernetes Operator Helm Chart to be deployed to the release repository
at dist.apache.org
c) Maven artifacts to be deployed to the Maven Central Repository
d) Docker image to be pushed to dockerhub

**Staging Areas to Review**

The staging areas containing the above mentioned artifacts are as follows,
for your review:
* All artifacts for a,b) can be found in the corresponding dev repository
at dist.apache.org [1]
* All artifacts for c) can be found at the Apache Nexus Repository [2]
* The docker image for d) is staged on github [7]

All artifacts are signed with the key
2FF2977BBBFFDF283C6FE7C6A301006F3591EE2C [3]

Other links for your review:
* JIRA release notes [4]
* source code tag "release-1.0.0-rc3" [5]
* PR to update the website Downloads page to include Kubernetes Operator
links [6]

**Vote Duration**

The voting time will run for at least 72 hours.
It is adopted by majority approval, with at least 3 PMC affirmative votes.

**Note on Verification**

You can follow the basic verification guide here[8].
Note that you don't need to verify everything yourself, but please make
note of what you have tested together with your +- vote.

Thanks,
Yang

[1]
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.0-rc3/
[2] https://repository.apache.org/content/repositories/orgapacheflink-1505/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351500
[5]
https://github.com/apache/flink-kubernetes-operator/tree/release-1.0.0-rc3
[6] https://github.com/apache/flink-web/pull/542
[7] ghcr.io/apache/flink-kubernetes-operator:52b50c2
[8]
https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release


Re: About Native Deployment's Autoscaling implementation

2022-05-30 Thread Yang Wang
>
> I thought we could enable Adaptive Scheduler, so adding or removing a task
> manager is the same as restarting a job when we use an adaptive scheduler.
> Do I miss anything ?


It is true for standalone mode, since adding/removing a TaskManager pod is
fully controlled by users (or external tools).
But it is not valid for the native K8s integration[1]. Currently, we cannot
dynamically change the TaskManager pods once the job is running.

I really hope the HPA could work for both standalone and native mode.


[1]. https://flink.apache.org/2021/02/10/native-k8s-with-ha.html

Best,
Yang

Gyula Fóra  于2022年5月30日周一 12:23写道:

> Hi Talat!
>
> Sorry for the late reply, I have been busy with some fixes for the release
> and travelling.
>
> I think the prometheus metrics integration sounds like a great idea that
> would cover the needs of most users.
> This way users can also integrate easily with the custom Flink metrics too.
>
> maxReplicas: We could add this easily to the taskManager resource specs
>
> Nice workflow picture, I would love to include this in the docs later. One
> minor comment, should the HPA be outside of the FlinkDeployment box?
>
> Cheers,
> Gyula
>
> On Wed, May 25, 2022 at 7:50 PM Talat Uyarer 
> wrote:
>
>> Hi Yang,
>>
>> I thought we could enable Adaptive Scheduler, so adding or removing a task
>> manager is the same as restarting a job when we use an adaptive scheduler.
>> Do I miss anything ?
>>
>> Thanks
>>
>> On Tue, May 24, 2022 at 8:16 PM Yang Wang  wrote:
>>
>> > Thanks for the interesting discussion.
>> >
>> > Compared with reactive mode, leveraging the flink-kubernetes-operator to
>> > do the job restarting/upgrading is another solution for auto-scaling.
>> > Given that fully restarting a Flink application on K8s is not too slow,
>> > this is a reasonable way.
>> > Really hope we could get some progress in such area.
>> >
>> > Best,
>> > Yang
>> >
>> > Gyula Fóra  于2022年5月25日周三 09:04写道:
>> >
>> >> Hi Talat!
>> >>
>> >> It would be great to have a HPA that works based on some flink
>> >> throughput/backlog metrics. I wonder how you are going to access the
>> Flink
>> >> metrics in the HPA, we might need some integration with the k8s metrics
>> >> system.
>> >> In any case whether we need a FLIP or not depends on the complexity, if
>> >> it's simple then we can go without a FLIP.
>> >>
>> >> Cheers,
>> >> Gyula
>> >>
>> >> On Tue, May 24, 2022 at 12:26 PM Talat Uyarer <
>> >> tuya...@paloaltonetworks.com>
>> >> wrote:
>> >>
>> >> > Hi Gyula,
>> >> >
>> >> > This seems very promising for initial scaling. We are using Flink
>> >> > Kubernetes Operators. Most probably we are very early adapters for
>> it :)
>> >> > Let me try it. Get back to you soon.
>> >> >
>> >> > My plan is building a general purpose CPU and backlog/throughput base
>> >> > autoscaling for Flink. I can create a Custom Open Source HPA on top
>> of
>> >> your
>> >> > changes. Do I need to create a FLIP for it ?
>> >> >
>> >> > Just general information about us Today we use another execution env.
>> >> if
>> >> > the Job scheduler does not support autoscaling. Having a HPA works if
>> >> your
>> >> > sources are well balanced. If there is uneven distribution on
>> sources,
>> >> > Having auto scaling feature on scheduler can help better utilization.
>> >> But
>> >> > this is not urgent. We can start using your PR at least for a while.
>> >> >
>> >> > Thanks
>> >> >
>> >> > On Mon, May 23, 2022 at 4:10 AM Gyula Fóra 
>> >> wrote:
>> >> >
>> >> >> Hi Talat!
>> >> >>
>> >> >> One other approach that we are investigating currently is combining
>> >> the Flink
>> >> >> Kubernetes Operator
>> >> >> <
>> >>
>> https://urldefense.com/v3/__https://github.com/apache/flink-kubernetes-operator__;!!Mt_FR42WkD9csi9Y!ZNBiCduZFUmuQI7_9M48gQnkxBkrLEVOIPRWZY0ad0xmltbQ6G2stlfsiw6q9bGi5fctVF2RS1YNL5EkV9nZZNwfcA$
>> >
>> >> with
>> >> >> the K8S scaling capabilities (Horizontal Pod autoscaler)
>> >>

[jira] [Created] (FLINK-27834) Flink kubernetes operator dockerfile could not work with podman

2022-05-29 Thread Yang Wang (Jira)
Yang Wang created FLINK-27834:
-

 Summary: Flink kubernetes operator dockerfile could not work with 
podman
 Key: FLINK-27834
 URL: https://issues.apache.org/jira/browse/FLINK-27834
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Yang Wang


[1/2] STEP 16/19: COPY *.git ./.git

Error: error building at STEP "COPY *.git ./.git": checking on sources under 
"/root/FLINK/release-1.0-rc2/flink-kubernetes-operator-1.0.0": Rel: can't make  
relative to /root/FLINK/release-1.0-rc2/flink-kubernetes-operator-1.0.0; 
copier: stat: ["/*.git"]: no such file or directory

 

podman version
Client:       Podman Engine
Version:      4.0.2
API Version:  4.0.2

 

 

I think the root cause is that "*.git" is not respected by podman. Maybe we 
could simply copy the whole directory when building the image.

 
{code:java}
WORKDIR /app

COPY . .

RUN --mount=type=cache,target=/root/.m2 mvn -ntp clean install -pl !flink-kubernetes-docs -DskipTests=$SKIP_TESTS
{code}
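The failure mode can be modeled in a few lines: the glob pattern itself does match a `.git` entry when one exists, and the two engines only diverge on what to do with an empty match set (Docker tolerates it and skips the COPY, podman aborts the build). A rough Python sketch, with `fnmatch` standing in for the builder's real glob engine and the file lists being illustrative, not a real build context:

```python
import fnmatch

def copy_sources(files, pattern="*.git"):
    """Return the build-context entries a `COPY <pattern>` would pick up."""
    return [f for f in files if fnmatch.fnmatch(f, pattern)]

git_checkout   = [".git", "Dockerfile", "pom.xml"]
source_release = ["Dockerfile", "pom.xml"]   # source tarballs ship no .git

print(copy_sources(git_checkout))    # ['.git'] -> COPY proceeds
print(copy_sources(source_release))  # []       -> Docker skips it, podman errors out
```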



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #2

2022-05-29 Thread Yang Wang
Thanks Jim for providing the information and Gyula for sharing the concerns.

I will create release candidate #3 after we have more progress on
FLINK-27804, once we are either sure it is resolved or have found the
root cause.

Also, I will try to make the Dockerfile work with podman.


Best,
Yang

Gyula Fóra  于2022年5月28日周六 02:37写道:

> Hi Devs!
>
> We have been performing extensive manual testing on the release.
>
> We have found some very infrequent savepoint upgrade issues with Flink 1.15
> deployments that we could not reliably reproduce so far. I have added a fix
> that hopefully eliminates the root cause (
> https://issues.apache.org/jira/browse/FLINK-27804)
> Since adding this we haven't hit the same problem again, so our confidence
> in the fix is growing :)
>
> I think it would make sense to create a new RC including this fix tomorrow,
> or whenever you have time Yang.
> We do not need to rush this release, I would prefer to take an extra 1-2
> days to eliminate these corner cases as much as possible.
>
> Cheers,
> Gyula
>
> On Fri, May 27, 2022 at 11:09 AM Jim Busche  wrote:
>
> > Hi Yang,
> >
> >
> > Oh, I’ve been using podman on Red Hat for testing:
> >
> > podman version
> >
> > Client:   Podman Engine
> >
> > Version:  4.0.2
> >
> > API Version:  4.0.2
> >
> >
> > If I use Docker (version 20.10.13 for me right now) then it builds fine
> > with that COPY git line.  Nice!
> >
> >
> >
> > To use podman, I need to either comment out the COPY git line, or mkdir
> > .git first.
> >
> >
> >
> > Thanks, Jim
> >
> >
> >
> >
> >
> >
>


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #2

2022-05-26 Thread Yang Wang
Thanks Jim for the testing.

Could you please share the docker version you are using to build the image?
It works well for "20.10.8".

*COPY *.git ./.git*

The above copy command should ignore the .git directory if it does not
exist.

Best,
Yang

Jim Busche  于2022年5月27日周五 02:57写道:

> Hi Yang,
>
>
> I still see the git issue when trying to build from source:
>
> [1/2] STEP 16/19: COPY *.git ./.git
>
> Error: error building at STEP "COPY *.git ./.git": checking on sources
> under "/root/FLINK/release-1.0-rc2/flink-kubernetes-operator-1.0.0": Rel:
> can't make  relative to
> /root/FLINK/release-1.0-rc2/flink-kubernetes-operator-1.0.0; copier: stat:
> ["/*.git"]: no such file or directory
>
>
>
> If I remove that COPY git line, or I first make an empty .git filesystem,
> then the build proceeds ok.  Not sure what all we lose in the image if the
> underlying .git items are missing.
>
>
>
> Other testing:
>
>   *   The helm install looks good
>   *   Twistlock vulnerability scan of the Debian image looks good.
>   *   Basic example deployed ok
>
> Thanks, Jim
>
>
>


[VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #2

2022-05-26 Thread Yang Wang
Hi everyone,

Please review and vote on the release candidate #2 for the version 1.0.0 of
Apache Flink Kubernetes Operator,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Release Overview**

As an overview, the release consists of the following:
a) Kubernetes Operator canonical source distribution (including the
Dockerfile), to be deployed to the release repository at dist.apache.org
b) Kubernetes Operator Helm Chart to be deployed to the release repository
at dist.apache.org
c) Maven artifacts to be deployed to the Maven Central Repository
d) Docker image to be pushed to dockerhub

**Staging Areas to Review**

The staging areas containing the above mentioned artifacts are as follows,
for your review:
* All artifacts for a,b) can be found in the corresponding dev repository
at dist.apache.org [1]
* All artifacts for c) can be found at the Apache Nexus Repository [2]
* The docker image for d) is staged on github [7]

All artifacts are signed with the key
2FF2977BBBFFDF283C6FE7C6A301006F3591EE2C [3]

Other links for your review:
* JIRA release notes [4]
* source code tag "release-1.0.0-rc2" [5]
* PR to update the website Downloads page to include Kubernetes Operator
links [6]

**Vote Duration**

The voting time will run for at least 72 hours.
It is adopted by majority approval, with at least 3 PMC affirmative votes.

**Note on Verification**

You can follow the basic verification guide here[8].
Note that you don't need to verify everything yourself, but please make
note of what you have tested together with your +- vote.

Thanks,
Yang

[1]
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.0-rc2/
[2] https://repository.apache.org/content/repositories/orgapacheflink-1504/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351500
[5]
https://github.com/apache/flink-kubernetes-operator/tree/release-1.0.0-rc2
[6] https://github.com/apache/flink-web/pull/542
[7] ghcr.io/apache/flink-kubernetes-operator:6e2b896
[8]
https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #1

2022-05-25 Thread Yang Wang
I have verified this release candidate with the following and do not find
other issues.
* Install via helm chart
* Create a Flink application and
  * Trigger savepoint manually
  * Status of finished job could be synced to CR status
* Kill the JobManager of a finished job and check that it does not trigger a fresh restart
* Create a session cluster and
  * Submit a job via SessionJob
  * Cancel the job
  * delete the SessionJob and session FlinkDeployment
* Access the Flink UI via ingress
* Run a batch job and check the status (please note that only 1.15 is
expected to work well)

We have already addressed all the known issues. So I am canceling this VOTE
and will create the release candidate #2 now.


Best,
Yang

Biao Geng  于2022年5月24日周二 17:04写道:

> Hi Yang,
> Thanks for the work!
> I successfully verified these items:
> 1. Verify that the checksums and GPG files are intact
> 2. Verify that the source distributions do not contain any binaries
> 3. Build the source distribution to ensure all source files
> 4. Validate the Maven artifacts do not contain any external dependency
> with jar
> tf
> 5. Verify Helm chart can be installed without overriding any parameters
> using helm repo add operator-1.0.0-rc1
>
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.0-rc1/
> &&
> helm install flink-kubernetes-operator
> operator-1.0.0-rc1/flink-kubernetes-operator
> 6. Verify the job using  examples/basic-checkpoint-ha.yaml and check that
> if we manually updated the configmap `flink-operator-config`, the following
> submitted job will take the new default configuration
> 7. Verified operator logs does not contain unexpected things
>
> I have not found new issues besides what Gyula has mentioned. I will follow
> our jira list to validate other new features or improvements.
>
> Best,
> Biao Geng
>
>
> Yang Wang  于2022年5月24日周二 16:08写道:
>
> > Thanks Gyula for sharing the feedback.
> >
> > I have created two tickets[1][2] to fix the problems and will integrate
> > them in the next release candidate.
> >
> > I am not closing this vote and waiting for other verification comments.
> >
> > [1]. https://issues.apache.org/jira/browse/FLINK-27746
> > [2]. https://issues.apache.org/jira/browse/FLINK-27747
> >
> > Best,
> > Yang
> >
> > Gyula Fóra  于2022年5月24日周二 01:03写道:
> >
> > > Hi Yang!
> > >
> > > Thank you for preparing the RC.
> > >
> > > I have successfully verified the following:
> > > - Signatures, Hashes
> > > - No binaries in source release
> > > - Helm Repo works, Helm install works, docker image matches release
> > commit
> > > tag
> > > - Build from source
> > > - Submit example job without errors
> > >
> > > Some problems that I have found:
> > >  - In the Helm chart release the Chart.yaml file doesn't have an apache
> > > license header (the same file in the source release has it)
> > >  - I could not build the Docker image from the source release, getting
> > the
> > > following error:
> > >
> > >
> > > > [build 11/14] COPY .git ./.git:
> > >
> > > --
> > >
> > > failed to compute cache key: "/.git" not found: not found
> > >
> > >
> > > I will continue with further functional / manual verification.
> > >
> > >
> > > Cheers,
> > >
> > > Gyula
> > >
> > > On Mon, May 23, 2022 at 5:58 AM Yang Wang 
> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > Please review and vote on the release candidate #1 for the version
> > 1.0.0
> > > of
> > > > Apache Flink Kubernetes Operator,
> > > > as follows:
> > > > [ ] +1, Approve the release
> > > > [ ] -1, Do not approve the release (please provide specific comments)
> > > >
> > > > **Release Overview**
> > > >
> > > > As an overview, the release consists of the following:
> > > > a) Kubernetes Operator canonical source distribution (including the
> > > > Dockerfile), to be deployed to the release repository at
> > dist.apache.org
> > > > b) Kubernetes Operator Helm Chart to be deployed to the release
> > > repository
> > > > at dist.apache.org
> > > > c) Maven artifacts to be deployed to the Maven Central Repository
> > > > d) Docker image to be pushed to dockerhub
> > > >
> > > > **Stagi

Re: About Native Deployment's Autoscaling implementation

2022-05-24 Thread Yang Wang
Thanks for the interesting discussion.

Compared with reactive mode, leveraging the flink-kubernetes-operator to do
the job restarting/upgrading is another solution for auto-scaling.
Given that fully restarting a Flink application on K8s is not too slow,
this is a reasonable approach.
I really hope we can make some progress in this area.

Best,
Yang

Gyula Fóra  于2022年5月25日周三 09:04写道:

> Hi Talat!
>
> It would be great to have a HPA that works based on some flink
> throughput/backlog metrics. I wonder how you are going to access the Flink
> metrics in the HPA, we might need some integration with the k8s metrics
> system.
> In any case whether we need a FLIP or not depends on the complexity, if
> it's simple then we can go without a FLIP.
>
> Cheers,
> Gyula
>
> On Tue, May 24, 2022 at 12:26 PM Talat Uyarer <
> tuya...@paloaltonetworks.com>
> wrote:
>
> > Hi Gyula,
> >
> > This seems very promising for initial scaling. We are using Flink
> > Kubernetes Operators. Most probably we are very early adapters for it :)
> > Let me try it. Get back to you soon.
> >
> > My plan is building a general purpose CPU and backlog/throughput base
> > autoscaling for Flink. I can create a Custom Open Source HPA on top of
> your
> > changes. Do I need to create a FLIP for it ?
> >
> > Just general information about us Today we use another execution env.  if
> > the Job scheduler does not support autoscaling. Having a HPA works if
> your
> > sources are well balanced. If there is uneven distribution on sources,
> > Having auto scaling feature on scheduler can help better utilization. But
> > this is not urgent. We can start using your PR at least for a while.
> >
> > Thanks
> >
> > On Mon, May 23, 2022 at 4:10 AM Gyula Fóra  wrote:
> >
> >> Hi Talat!
> >>
> >> One other approach that we are investigating currently is combining the
> Flink
> >> Kubernetes Operator
> >> <
> https://urldefense.com/v3/__https://github.com/apache/flink-kubernetes-operator__;!!Mt_FR42WkD9csi9Y!ZNBiCduZFUmuQI7_9M48gQnkxBkrLEVOIPRWZY0ad0xmltbQ6G2stlfsiw6q9bGi5fctVF2RS1YNL5EkV9nZZNwfcA$>
> with
> >> the K8S scaling capabilities (Horizontal Pod autoscaler)
> >>
> >> In this approach the HPA monitors the Taskmanager pods directly and can
> >> modify the FlinkDeployment resource replica number to trigger a stateful
> >> job scale-up/down through the operator.
> >> Obviously not as nice as the reactive mode but it works with the current
> >> Kubernetes Native implementation easily. It is also theoretically
> possible
> >> to integrate this with other custom Flink metrics but we haven't tested
> yet.
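For context, the replica number such an HPA would write back to the FlinkDeployment follows the standard Kubernetes HPA scaling rule, desired = ceil(current * currentMetric / targetMetric), clamped to the configured replica bounds. A minimal sketch of that rule (the parameter names and bounds are illustrative, this is not operator code):

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         min_replicas=1, max_replicas=10):
    """Standard HPA rule: desired = ceil(current * currentMetric / targetMetric),
    clamped to [min_replicas, max_replicas]."""
    desired = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, desired))

# E.g. 4 TaskManagers running at 90% of a 60% CPU target -> scale to 6.
print(hpa_desired_replicas(4, current_metric=90, target_metric=60))  # 6
```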
> >>
> >> I have created a POC pull request that showcases these capabilities:
> >> https://github.com/apache/flink-kubernetes-operator/pull/227
> >> <
> https://urldefense.com/v3/__https://github.com/apache/flink-kubernetes-operator/pull/227__;!!Mt_FR42WkD9csi9Y!ZNBiCduZFUmuQI7_9M48gQnkxBkrLEVOIPRWZY0ad0xmltbQ6G2stlfsiw6q9bGi5fctVF2RS1YNL5EkV9nKHxgshA$
> >
> >>
> >> If you are interested it would be nice if you could check it out and
> >> provide feedback, we will get back to refining this after our current
> >> ongoing release.
> >>
> >> Cheers,
> >> Gyula
> >>
> >> On Mon, May 23, 2022 at 12:23 AM David Morávek  wrote:
> >>
> >>> Hi Talat,
> >>>
> >>> This is definitely an interesting and rather complex topic.
> >>>
> >>> Few unstructured thoughts / notes / questions:
> >>>
> >>> - The main struggle has always been that it's hard to come up with a
> >>> generic one-size-fits-it-all metrics for autoscaling.
> >>>   - Flink doesn't have knowledge of the external environment (eg.
> >>> capacity
> >>> planning on the cluster, no notion of pre-emption), so it can not
> really
> >>> make a qualified decision in some cases.
> >>>   - ^ the above goes along the same reasoning as why we don't support
> >>> reactive mode with the session cluster (multi-job scheduling)
> >>> - The re-scaling decision logic most likely needs to be pluggable from
> >>> the
> >>> above reasons
> >>>   - We're in general fairly concerned about running any user code in JM
> >>> for
> >>> stability reasons.
> >>>   - The most flexible option would be allowing to set the desired
> >>> parallelism via rest api and leave the scaling decision to an external
> >>> process, which could be reused for both standalone and "active"
> >>> deployment
> >>> modes (there is actually a prototype by Till, that allows this [1])
> >>>
> >>> How do you intend to make an autoscaling decision? Also note that the
> >>> re-scaling is still a fairly expensive operation (especially with large
> >>> state), so you need to make sure autoscaler doesn't oscillate and
> doesn't
> >>> re-scale too often (this is also something that could vary from
> workload
> >>> to
> >>> workload).
> >>>
> >>> Note on the metrics question with an auto-scaler living in the JM:
> >>> - We shouldn't really collect the metrics into the JM, but instead JM
> can
> >>> pull then from TMs directly on-demand (basically the same thing and
> >>> 

[jira] [Created] (FLINK-27759) Rethink how to get the git commit id for docker image in Flink Kubernetes operator

2022-05-24 Thread Yang Wang (Jira)
Yang Wang created FLINK-27759:
-

 Summary: Rethink how to get the git commit id for docker image in 
Flink Kubernetes operator
 Key: FLINK-27759
 URL: https://issues.apache.org/jira/browse/FLINK-27759
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Yang Wang


Following the discussion in the PRs[1][2], we need to rethink how to get the git 
commit id properly. Currently, we rely on the .git directory, and that is a 
problem when building the image from a source release.

 

[1]. [https://github.com/apache/flink-kubernetes-operator/pull/243]

[2]. https://github.com/apache/flink-kubernetes-operator/pull/241
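Until the build-time approach is settled, one option is a best-effort lookup with an explicit fallback instead of a hard dependency on .git. A hypothetical sketch (the "unknown" fallback label is illustrative, not what the operator uses):

```python
import subprocess

def resolve_commit_id(default="unknown", cwd=None):
    """Best-effort git commit id for labeling a container image.

    Falls back to `default` when git is unavailable or there is no .git
    directory, e.g. when building from an extracted source release."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True, cwd=cwd,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return default
```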



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #1

2022-05-24 Thread Yang Wang
Thanks Gyula for sharing the feedback.

I have created two tickets[1][2] to fix the problems and will integrate
them in the next release candidate.

I am not closing this vote and waiting for other verification comments.

[1]. https://issues.apache.org/jira/browse/FLINK-27746
[2]. https://issues.apache.org/jira/browse/FLINK-27747

Best,
Yang

Gyula Fóra  于2022年5月24日周二 01:03写道:

> Hi Yang!
>
> Thank you for preparing the RC.
>
> I have successfully verified the following:
> - Signatures, Hashes
> - No binaries in source release
> - Helm Repo works, Helm install works, docker image matches release commit
> tag
> - Build from source
> - Submit example job without errors
>
> Some problems that I have found:
>  - In the Helm chart release the Chart.yaml file doesn't have an apache
> license header (the same file in the source release has it)
>  - I could not build the Docker image from the source release, getting the
> following error:
>
>
> > [build 11/14] COPY .git ./.git:
>
> --
>
> failed to compute cache key: "/.git" not found: not found
>
>
> I will continue with further functional / manual verification.
>
>
> Cheers,
>
> Gyula
>
> On Mon, May 23, 2022 at 5:58 AM Yang Wang  wrote:
>
> > Hi everyone,
> >
> > Please review and vote on the release candidate #1 for the version 1.0.0
> of
> > Apache Flink Kubernetes Operator,
> > as follows:
> > [ ] +1, Approve the release
> > [ ] -1, Do not approve the release (please provide specific comments)
> >
> > **Release Overview**
> >
> > As an overview, the release consists of the following:
> > a) Kubernetes Operator canonical source distribution (including the
> > Dockerfile), to be deployed to the release repository at dist.apache.org
> > b) Kubernetes Operator Helm Chart to be deployed to the release
> repository
> > at dist.apache.org
> > c) Maven artifacts to be deployed to the Maven Central Repository
> > d) Docker image to be pushed to dockerhub
> >
> > **Staging Areas to Review**
> >
> > The staging areas containing the above mentioned artifacts are as
> follows,
> > for your review:
> > * All artifacts for a,b) can be found in the corresponding dev repository
> > at dist.apache.org [1]
> > * All artifacts for c) can be found at the Apache Nexus Repository [2]
> > * The docker image for d) is staged on github [7]
> >
> > All artifacts are signed with the key
> > 2FF2977BBBFFDF283C6FE7C6A301006F3591EE2C [3]
> >
> > Other links for your review:
> > * JIRA release notes [4]
> > * source code tag "release-1.0.0-rc1" [5]
> > * PR to update the website Downloads page to include Kubernetes Operator
> > links [6]
> >
> > **Vote Duration**
> >
> > The voting time will run for at least 72 hours.
> > It is adopted by majority approval, with at least 3 PMC affirmative
> votes.
> >
> > **Note on Verification**
> >
> > You can follow the basic verification guide here[8].
> > Note that you don't need to verify everything yourself, but please make
> > note of what you have tested together with your +- vote.
> >
> > Thanks,
> > Yang
> >
> > [1]
> >
> >
> https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.0-rc1/
> > [2]
> > https://repository.apache.org/content/repositories/orgapacheflink-1503/
> > [3] https://dist.apache.org/repos/dist/release/flink/KEYS
> > [4]
> >
> >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351500
> > [5]
> >
> https://github.com/apache/flink-kubernetes-operator/tree/release-1.0.0-rc1
> > [6] https://github.com/apache/flink-web/pull/542
> > [7] ghcr.io/apache/flink-kubernetes-operator:2417603
> > [8]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release
> >
>


[jira] [Created] (FLINK-27747) Flink kubernetes operator helm chart release the Chart.yaml file doesn't have an apache license header

2022-05-23 Thread Yang Wang (Jira)
Yang Wang created FLINK-27747:
-

 Summary: Flink kubernetes operator helm chart release the 
Chart.yaml file doesn't have an apache license header
 Key: FLINK-27747
 URL: https://issues.apache.org/jira/browse/FLINK-27747
 Project: Flink
  Issue Type: Bug
Reporter: Yang Wang
 Fix For: kubernetes-operator-1.0.0


When verifying the 1.0.0-rc1, [~gyfora] found that the Chart.yaml file doesn't 
have an apache license header.

It seems this is caused by {{helm package}} in the {{create_source_release.sh}}.

We also have this issue in the 0.1.0 release[1].

[1]. 
https://dist.apache.org/repos/dist/release/flink/flink-kubernetes-operator-0.1.0/



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27746) Flink kubernetes operator docker image could not build with source release

2022-05-23 Thread Yang Wang (Jira)
Yang Wang created FLINK-27746:
-

 Summary: Flink kubernetes operator docker image could not build 
with source release
 Key: FLINK-27746
 URL: https://issues.apache.org/jira/browse/FLINK-27746
 Project: Flink
  Issue Type: Bug
  Components: Kubernetes Operator
Reporter: Yang Wang
 Fix For: kubernetes-operator-1.0.0


Could not build the Docker image from the source release, getting the
following error:


> [build 11/14] COPY .git ./.git:

--

failed to compute cache key: "/.git" not found: not found



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[VOTE] Apache Flink Kubernetes Operator Release 1.0.0, release candidate #1

2022-05-23 Thread Yang Wang
Hi everyone,

Please review and vote on the release candidate #1 for the version 1.0.0 of
Apache Flink Kubernetes Operator,
as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

**Release Overview**

As an overview, the release consists of the following:
a) Kubernetes Operator canonical source distribution (including the
Dockerfile), to be deployed to the release repository at dist.apache.org
b) Kubernetes Operator Helm Chart to be deployed to the release repository
at dist.apache.org
c) Maven artifacts to be deployed to the Maven Central Repository
d) Docker image to be pushed to dockerhub

**Staging Areas to Review**

The staging areas containing the above mentioned artifacts are as follows,
for your review:
* All artifacts for a,b) can be found in the corresponding dev repository
at dist.apache.org [1]
* All artifacts for c) can be found at the Apache Nexus Repository [2]
* The docker image for d) is staged on github [7]

All artifacts are signed with the key
2FF2977BBBFFDF283C6FE7C6A301006F3591EE2C [3]

Other links for your review:
* JIRA release notes [4]
* source code tag "release-1.0.0-rc1" [5]
* PR to update the website Downloads page to include Kubernetes Operator
links [6]

**Vote Duration**

The voting time will run for at least 72 hours.
It is adopted by majority approval, with at least 3 PMC affirmative votes.

**Note on Verification**

You can follow the basic verification guide here[8].
Note that you don't need to verify everything yourself, but please make
note of what you have tested together with your +- vote.

Thanks,
Yang

[1]
https://dist.apache.org/repos/dist/dev/flink/flink-kubernetes-operator-1.0.0-rc1/
[2] https://repository.apache.org/content/repositories/orgapacheflink-1503/
[3] https://dist.apache.org/repos/dist/release/flink/KEYS
[4]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351500
[5]
https://github.com/apache/flink-kubernetes-operator/tree/release-1.0.0-rc1
[6] https://github.com/apache/flink-web/pull/542
[7] ghcr.io/apache/flink-kubernetes-operator:2417603
[8]
https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Kubernetes+Operator+Release


Re: [ANNOUNCE] Kubernetes Operator release-1.0 branch cut

2022-05-22 Thread Yang Wang
All the blockers and major issues have been merged into the release-1.0 branch.
Following what we promised, I am preparing the first release candidate now
and will share it for community-wide review today.


Best,
Yang

Márton Balassi  于2022年5月18日周三 00:29写道:

> Thanks Gyula and Yang. Awesome!
>
> On Tue, May 17, 2022 at 4:46 PM Gyula Fóra  wrote:
>
> > Hi Flink devs!
> >
> > The release-1.0 branch has been forked from main and version numbers have
> > been upgraded accordingly.
> >
> > https://github.com/apache/flink-kubernetes-operator/tree/release-1.0
> >
> > The version on the main branch has been updated to 1.1-SNAPSHOT.
> >
> > From now on, for PRs that should be presented in 1.0.0, please make sure:
> > - Merge the PRs first to main, then backport to release-1.0 branch
> > - The JIRA ticket should be closed with the correct fix-versions
> >
> > There are still a few outstanding tickets, mainly docs/minor fixes but no
> > currently known blocker issues.
> >
> > We are working together with Yang to prepare the first RC by next monday.
> >
> > Cheers,
> > Gyula
> >
>


Re: [VOTE] Creating an Apache Flink slack workspace

2022-05-17 Thread Yang Wang
+1 (binding)

And thanks Xintong for driving this work.


Best,
Yang

Jingsong Li  于2022年5月17日周二 17:00写道:

> Thank Xintong for driving this work.
>
> +1
>
> Best,
> Jingsong
>
> On Tue, May 17, 2022 at 4:49 PM Martijn Visser 
> wrote:
>
> > +1 (binding)
> >
> > On Tue, 17 May 2022 at 10:38, Yu Li  wrote:
> >
> > > +1 (binding)
> > >
> > > Thanks Xintong for driving this!
> > >
> > > Best Regards,
> > > Yu
> > >
> > >
> > > On Tue, 17 May 2022 at 16:32, Robert Metzger 
> > wrote:
> > >
> > > > Thanks for starting the VOTE!
> > > >
> > > > +1 (binding)
> > > >
> > > >
> > > >
> > > > On Tue, May 17, 2022 at 10:29 AM Jark Wu  wrote:
> > > >
> > > > > Thank Xintong for driving this work.
> > > > >
> > > > > +1 from my side (binding)
> > > > >
> > > > > Best,
> > > > > Jark
> > > > >
> > > > > On Tue, 17 May 2022 at 16:24, Xintong Song 
> > > > wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > As previously discussed in [1], I would like to open a vote on
> > > creating
> > > > > an
> > > > > > Apache Flink slack workspace channel.
> > > > > >
> > > > > > The proposed actions include:
> > > > > > - Creating a dedicated slack workspace with the name Apache Flink
> > > that
> > > > is
> > > > > > controlled and maintained by the Apache Flink PMC
> > > > > > - Updating the Flink website about rules for using various
> > > > communication
> > > > > > channels
> > > > > > - Setting up an Archive for the Apache Flink slack
> > > > > > - Revisiting this initiative by the end of 2022
> > > > > >
> > > > > > The vote will last for at least 72 hours, and will be accepted
> by a
> > > > > > consensus of active PMC members.
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Xintong
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: [DISCUSS] Next Flink Kubernetes Operator release timeline

2022-05-16 Thread Yang Wang
Thanks Gyula for sharing the progress. It is very likely we could have the
first release candidate next Monday.

Best,
Yang

Gyula Fóra  于2022年5月16日周一 20:50写道:

> Hi Devs!
>
> We are on track for our planned 1.0.0 release timeline. There are no
> outstanding blocker issues on JIRA for the release.
>
> There are 3 outstanding new feature PRs. They are all in pretty good shape
> and should be merged within a day:
> https://github.com/apache/flink-kubernetes-operator/pull/213
> https://github.com/apache/flink-kubernetes-operator/pull/216
> https://github.com/apache/flink-kubernetes-operator/pull/217
>
> As we agreed previously we should not merge any more new features for
> 1.0.0 and focus our efforts on testing, bug fixes and documentation for
> this week.
>
> I will cut the release branch tomorrow once these PRs are merged. And the
> target day for the first release candidate is next Monday.
>
> The release managers for this release will be Yang Wang and myself.
>
> Cheers,
> Gyula
>
> On Wed, Apr 27, 2022 at 11:28 AM Yang Wang  wrote:
>
>> Thanks @Chesnay Schepler  for pointing out this.
>>
>> The only public interface the flink-kubernetes-operator provides is the
>> CRD[1]. We are trying to stabilize the CRD from v1beta1.
>> If more fields are introduced to support new features(e.g. standalone
>> mode,
>> SQL jobs), they should have the default value to ensure compatibility.
>> Currently, we do not have tooling to enforce these compatibility
>> guarantees, but we have created a ticket[2] to track this and hope it
>> could be resolved before releasing 1.0.0.
>>
>> Just as you said, now is also a good time to think more about the approach
>> of releases. Since flink-kubernetes-operator is much simpler than Flink,
>> we
>> could have a shorter release cycle.
>> Two months for a major release (1.0, 1.1, etc.) is reasonable to me, and
>> this could be shortened for the minor releases. Also, we need to support
>> at least the last two major versions.
>>
>> Maybe the standalone mode support is a big enough feature for version 2.0.
>>
>>
>> [1].
>>
>> https://github.com/apache/flink-kubernetes-operator/tree/main/helm/flink-kubernetes-operator/crds
>> [2]. https://issues.apache.org/jira/browse/FLINK-26955
>>
>>
>> @Hao t Chang  We do not have regular sync up meeting
>> so
>> far. But I think we could schedule some sync up for the 1.0.0 release if
>> necessary. Anyone who is interested are welcome.
>>
>>
>> Best,
>> Yang
>>
>>
>>
>>
>> Hao t Chang  于2022年4月27日周三 07:45写道:
>>
>> > Hi Gyula,
>> >
>> > Thanks for the release timeline information. I would like to learn the
>> > gathered knowledge and volunteer as well. Will there be sync up
>> > meeting/call for this collaboration ?
>> >
>> > From: Gyula Fóra 
>> > Date: Monday, April 25, 2022 at 11:22 AM
>> > To: dev 
>> > Subject: [DISCUSS] Next Flink Kubernetes Operator release timeline
>> > Hi Devs!
>> >
>> > The community has been working hard on cleaning up the operator logic
>> and
>> > adding some core features that have been missing from the preview
>> release
>> > (session jobs for example). We have also added some significant
>> > improvements around deployment/operations.
>> >
>> > With the current pace of the development I think in a few weeks we
>> should
>> > be in a good position to release next version of the operator. This
>> would
>> > also give us the opportunity to add support for the upcoming 1.15
>> release
>> > :)
>> >
>> > We have to decide on 2 main things:
>> >  1. Target release date
>> >  2. Release version
>> >
>> > With the current state of the project I am confident that we could cut a
>> > really good release candidate towards the end of May. I would suggest a
>> > feature *freeze mid May (May 16)*, with a target *RC0 date of May 23*.
>> If
>> > on May 16 we feel that we are ready we could also prepare the release
>> > candidate earlier.
>> >
>> > As for the release version, I personally feel that this is a good time
>> > for *version
>> > 1.0.0*.
>> > While 1.0.0 signals a certain confidence in the stability of the current
>> > API (compared to the preview release) I would keep the kubernetes
>> resource
>> > version v1beta1.
>> >
>> > It would also be great if someone could volunteer to join me to help
>> manage
>> > the release process this time so I can share the knowledge gathered
>> during
>> > the preview release :)
>> >
>> > Let me know what you think!
>> >
>> > Cheers,
>> > Gyula
>> >
>>
>


Re: Re: [ANNOUNCE] New Flink PMC member: Yang Wang

2022-05-11 Thread Yang Wang
Thanks for your warm welcome. It is my pleasure to work in such a nice
community.



Best,

Yang

Thomas Weise  于2022年5月11日周三 00:10写道:

> Congratulations, Yang!
>
> On Tue, May 10, 2022 at 3:15 AM Márton Balassi 
> wrote:
> >
> > Congrats, Yang. Well deserved :-)
> >
> > On Tue, May 10, 2022 at 9:16 AM Terry Wang  wrote:
> >
> > > Congrats Yang!
> > >
> > > On Mon, May 9, 2022 at 11:19 AM LuNing Wang 
> wrote:
> > >
> > > > Congrats Yang!
> > > >
> > > > Best,
> > > > LuNing Wang
> > > >
> > > > Dian Fu  于2022年5月7日周六 17:21写道:
> > > >
> > > > > Congrats Yang!
> > > > >
> > > > > Regards,
> > > > > Dian
> > > > >
> > > > > On Sat, May 7, 2022 at 12:51 PM Jacky Lau 
> > > wrote:
> > > > >
> > > > > > Congrats Yang and well Deserved!
> > > > > >
> > > > > > Best,
> > > > > > Jacky Lau
> > > > > >
> > > > > > Yun Gao  于2022年5月7日周六 10:44写道:
> > > > > >
> > > > > > > Congratulations Yang!
> > > > > > >
> > > > > > > Best,
> > > > > > > Yun Gao
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > ------ Original Mail ------
> > > > > > > Sender:David Morávek 
> > > > > > > Send Date:Sat May 7 01:05:41 2022
> > > > > > > Recipients:Dev 
> > > > > > > Subject:Re: [ANNOUNCE] New Flink PMC member: Yang Wang
> > > > > > > Nice! Congrats Yang, well deserved! ;)
> > > > > > >
> > > > > > > On Fri 6. 5. 2022 at 17:53, Peter Huang <
> > > huangzhenqiu0...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Congrats, Yang!
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Best Regards
> > > > > > > > Peter Huang
> > > > > > > >
> > > > > > > > On Fri, May 6, 2022 at 8:46 AM Yu Li 
> wrote:
> > > > > > > >
> > > > > > > > > Congrats and welcome, Yang!
> > > > > > > > >
> > > > > > > > > Best Regards,
> > > > > > > > > Yu
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, 6 May 2022 at 14:48, Paul Lam <
> paullin3...@gmail.com>
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Congrats, Yang! Well Deserved!
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Paul Lam
> > > > > > > > > >
> > > > > > > > > > > 2022年5月6日 14:38,Yun Tang  写道:
> > > > > > > > > > >
> > > > > > > > > > > Congratulations, Yang!
> > > > > > > > > > >
> > > > > > > > > > > Best
> > > > > > > > > > > Yun Tang
> > > > > > > > > > > 
> > > > > > > > > > > From: Jing Ge 
> > > > > > > > > > > Sent: Friday, May 6, 2022 14:24
> > > > > > > > > > > To: dev 
> > > > > > > > > > > Subject: Re: [ANNOUNCE] New Flink PMC member: Yang Wang
> > > > > > > > > > >
> > > > > > > > > > > Congrats Yang and well Deserved!
> > > > > > > > > > >
> > > > > > > > > > > Best regards,
> > > > > > > > > > > Jing
> > > > > > > > > > >
> > > > > > > > > > > On Fri, May 6, 2022 at 7:38 AM Lincoln Lee <
> > > > > > lincoln.8...@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Congratulations Yang!
> > > > > > > > > > >>
> > > > > > > > > > >> Best,
> > > > > > > > > > >> Lincoln Lee
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> Őrhidi Mátyás  于2022年5月6日周五
> > > > 12:46写道:
> > > > > > > > > > >>
> > > > > > > > > > >>> Congrats Yang! Well deserved!
> > > > > > > > > > >>> Best,
> > > > > > > > > > >>> Matyas
> > > > > > > > > > >>>
> > > > > > > > > > >>> On Fri, May 6, 2022 at 5:30 AM huweihua <
> > > > > > huweihua@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >>>
> > > > > > > > > > >>>> Congratulations Yang!
> > > > > > > > > > >>>>
> > > > > > > > > > >>>> Best,
> > > > > > > > > > >>>> Weihua
> > > > > > > > > > >>>>
> > > > > > > > > > >>>>
> > > > > > > > > > >>>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best Regards,
> > > Terry Wang
> > >
>


Re: Flink job restarted from empty state when execution.shutdown-on-application-finish is enabled

2022-05-11 Thread Yang Wang
I assume this is the responsibility of the job result store[1]. However, it
seems that it does not work as expected.

[1].
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435

Best,
Yang

Gyula Fóra  于2022年5月11日周三 12:55写道:

> Sorry, I messed up the email; I meant false.
>
> So when we set it to not shut down … :)
>
> Gyula
>
> On Wed, 11 May 2022 at 05:06, Yun Tang  wrote:
>
> > Hi Gyula,
> >
> > Why are you sure that the configuration of
> > execution.shutdown-on-application-finish is leading to this error? I noticed
> > that the default value of this configuration is just "true".
> >
> > From my understanding, the completed checkpoint store should only clear
> > its persisted checkpoint information on shutdown when the job status is
> > globally terminated.
> > Did you ever check the ConfigMap, which is used to store the completed
> > checkpoint store, to see whether its content was empty after you just
> > triggered a JobManager failure?
> >
> > Best
> > Yun Tang
> >
> > 
> > From: Gyula F?ra 
> > Sent: Wednesday, May 11, 2022 3:41
> > To: dev 
> > Subject: Flink job restarted from empty state when
> > execution.shutdown-on-application-finish is enabled
> >
> > Hi Devs!
> >
> > I ran into a concerning situation and would like to hear your thoughts on
> > this.
> >
> > I am running Flink 1.15 on Kubernetes native mode (using the operator but
> > that is besides the point here) with Flink Kubernetes HA enabled.
> >
> > We have enabled
> > *execution.shutdown-on-application-finish = true*
> >
> > I noticed that if after the job failed/finished, if I kill the jobmanager
> > pod (triggering a jobmanager failover), the job would be resubmitted
> from a
> > completely empty state (as if starting for the first time).
> >
> > Has anyone encountered this issue? This makes using this config option
> > pretty risky.
> >
> > Thank you!
> > Gyula
> >
>
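For reference, the combination of settings discussed in this thread can be sketched as a minimal flink-conf.yaml fragment. This is a hedged illustration, not taken from the original reports: the HA storage path is an assumed example.

```yaml
# Hypothetical flink-conf.yaml excerpt for the scenario above:
# keep the cluster running after the job finishes, with Kubernetes HA enabled
execution.shutdown-on-application-finish: false
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://my-bucket/flink-ha   # assumed path
```

With HA enabled like this, the completed checkpoint store lives in a ConfigMap, which is why checking its content after a JobManager failover (as suggested above) is a useful diagnostic step.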


Re: [DISCUSS] FLIP-224: Blacklist Mechanism

2022-05-06 Thread Yang Wang
Thanks Lijie and ZhuZhu for the explanation.

I just overlooked the "MARK_BLOCKLISTED". At the task level, that is indeed
functionality that external tools (e.g. kubectl taint) could not
support.


Best,
Yang

Lijie Wang  于2022年5月6日周五 22:18写道:

> Thanks for your feedback, Jiangang and Martijn.
>
> @Jiangang
>
>
> > For auto-detecting, I wonder how to make the strategy and mark a node
> blocked?
>
> In fact, we currently plan to not support auto-detection in this FLIP. The
> part about auto-detection may be continued in a separate FLIP in the
> future. Some guys have the same concerns as you, and the correctness and
> necessity of auto-detection may require further discussion in the future.
>
> > In session mode, multi jobs can fail on the same bad node and the node
> should be marked blocked.
> By design, the blocklist information will be shared among all jobs in a
> cluster/session. The JM will sync blocklist information with RM.
>
> @Martijn
>
> > I agree with Yang Wang on this.
> As Zhu Zhu and I mentioned above, we think that MARK_BLOCKLISTED (which just
> limits the load of the node and does not kill all the processes on it) is also
> important, and we think that external systems (*yarn rmadmin or kubectl
> taint*) cannot support it. So we think it makes sense even if only *manual*.
>
> > I also agree with Chesnay that magical mechanisms are indeed super hard
> to get right.
> Yes, as you see, Jiangang(and a few others) have the same concern.
> However, we currently plan to not support auto-detection in this FLIP, and
> only *manually*. In addition, I'd like to say that the FLIP provides a
> mechanism to support MARK_BLOCKLISTED and
> MARK_BLOCKLISTED_AND_EVACUATE_TASKS,
> the auto-detection may be done by external systems.
>
> Best,
> Lijie
>
> Martijn Visser  于2022年5月6日周五 19:04写道:
>
> > > If we only support to block nodes manually, then I could not see
> > the obvious advantages compared with current SRE's approach(via *yarn
> > rmadmin or kubectl taint*).
> >
> > I agree with Yang Wang on this.
> >
> > >  To me this sounds yet again like one of those magical mechanisms that
> > will rarely work just right.
> >
> > I also agree with Chesnay that magical mechanisms are indeed super hard
> to
> > get right.
> >
> > Best regards,
> >
> > Martijn
> >
> > On Fri, 6 May 2022 at 12:03, Jiangang Liu 
> > wrote:
> >
> >> Thanks for the valuable design. Auto-detection can save a great deal of
> >> work for us. We have implemented a similar feature in our internal Flink
> >> version.
> >> Below is something that I care about:
> >>
> >>1. For auto-detecting, I wonder how to make the strategy and mark a
> >> node
> >>blocked? Sometimes the blocked node is hard to be detected, for
> >> example,
> >>the upper node or the down node will be blocked when network
> >> unreachable.
> >>2. I see that the strategy is made in JobMaster side. How about
> >>implementing the similar logic in resource manager? In session mode,
> >> multi
> >>jobs can fail on the same bad node and the node should be marked
> >> blocked.
> >>If the job makes the strategy, the node may be not marked blocked if
> >> the
> >>fail times don't exceed the threshold.
> >>
> >>
> >> Zhu Zhu  于2022年5月5日周四 23:35写道:
> >>
> >> > Thank you for all your feedback!
> >> >
> >> > Besides the answers from Lijie, I'd like to share some of my thoughts:
> >> > 1. Whether to enable automatical blocklist
> >> > Generally speaking, it is not a goal of FLIP-224.
> >> > The automatical way should be something built upon the blocklist
> >> > mechanism and well decoupled. It was designed to be a configurable
> >> > blocklist strategy, but I think we can further decouple it by
> >> > introducing a abnormal node detector, as Becket suggested, which just
> >> > uses the blocklist mechanism once bad nodes are detected. However, it
> >> > should be a separate FLIP with further dev discussions and feedback
> >> > from users. I also agree with Becket that different users have
> different
> >> > requirements, and we should listen to them.
> >> >
> >> > 2. Is it enough to just take away abnormal nodes externally
> >> > My answer is no. As Lijie has mentioned, we need a way to avoid
> >> > deploying tasks to temporary hot nodes. In this case, users may just
> >> > want to l

Re: [ANNOUNCE] Apache Flink 1.15.0 released

2022-05-05 Thread Yang Wang
Congratulations!

Thanks Yun Gao, Till and Joe for driving this release and everyone who made
this release happen.



Best,
Yang

Jingsong Li  于2022年5月5日周四 16:04写道:

> Cheers! Congratulations!
>
> Thank you very much! And thank all who contributed to this release.
>
> Best,
> Jingsong
>
> On Thu, May 5, 2022 at 3:57 PM Xintong Song  wrote:
> >
> > Congratulations~!
> >
> > Thank the release managers, and thank all who contributed to this
> release.
> >
> > Thank you~
> >
> > Xintong Song
> >
> >
> >
> > On Thu, May 5, 2022 at 3:45 PM Guowei Ma  wrote:
> >
> > > Hi, Yun
> > >
> > > Great job!
> > > Thank you very much for your efforts to release Flink-1.15 during this
> > > time.
> > > Thanks also to all the contributors who worked on this release!
> > >
> > > Best,
> > > Guowei
> > >
> > >
> > > On Thu, May 5, 2022 at 3:24 PM Peter Schrott 
> > > wrote:
> > >
> > > > Great!
> > > >
> > > > Will install it on the cluster asap! :)
> > > >
> > > > One thing I noticed: the linked release notes in the blog
> announcement
> > > > under "Upgrade Notes" result in a 404
> > > > (
> > > >
> > > >
> > >
> https://nightlies.apache.org/flink/flink-docs-release-1.15/release-notes/flink-1.15/
> > > > )
> > > >
> > > > They are also not linked on the main page:
> > > > https://nightlies.apache.org/flink/flink-docs-release-1.15/
> > > >
> > > > Keep it up!
> > > > Peter
> > > >
> > > >
> > > > On Thu, May 5, 2022 at 8:43 AM Martijn Visser  >
> > > > wrote:
> > > >
> > > > > Thank you Yun Gao, Till and Joe for driving this release. Your
> efforts
> > > > are
> > > > > greatly appreciated!
> > > > >
> > > > > To everyone who has opened Jira tickets, provided PRs, reviewed
> code,
> > > > > written documentation or anything contributed in any other way,
> this
> > > > > release was (once again) made possible by you! Thank you.
> > > > >
> > > > > Best regards,
> > > > >
> > > > > Martijn
> > > > >
> > > > > Op do 5 mei 2022 om 08:38 schreef Yun Gao 
> > > > >
> > > > >> The Apache Flink community is very happy to announce the release
> of
> > > > >> Apache Flink 1.15.0, which is the first release for the Apache
> Flink
> > > > >> 1.15 series.
> > > > >>
> > > > >> Apache Flink® is an open-source stream processing framework for
> > > > >> distributed, high-performing, always-available, and accurate data
> > > > >> streaming applications.
> > > > >>
> > > > >> The release is available for download at:
> > > > >> https://flink.apache.org/downloads.html
> > > > >>
> > > > >> Please check out the release blog post for an overview of the
> > > > >> improvements for this release:
> > > > >> https://flink.apache.org/news/2022/05/05/1.15-announcement.html
> > > > >>
> > > > >> The full release notes are available in Jira:
> > > > >>
> > > > >>
> > > >
> > >
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12350442
> > > > >>
> > > > >> We would like to thank all contributors of the Apache Flink
> community
> > > > >> who made this release possible!
> > > > >>
> > > > >> Regards,
> > > > >> Joe, Till, Yun Gao
> > > > >>
> > > > > --
> > > > >
> > > > > Martijn Visser | Product Manager
> > > > >
> > > > > mart...@ververica.com
> > > > >
> > > > > 
> > > > >
> > > > >
> > > > > Follow us @VervericaData
> > > > >
> > > > > --
> > > > >
> > > > > Join Flink Forward  - The Apache Flink
> > > > > Conference
> > > > >
> > > > > Stream Processing | Event Driven | Real Time
> > > > >
> > > > >
> > > >
> > >
>


Re: [VOTE] FLIP-225: Implement standalone mode support in the kubernetes operator

2022-05-05 Thread Yang Wang
+1 (binding)

Best,
Yang

Danny Cranmer  于2022年5月4日周三 20:54写道:

> +1 (binding)
>
> Thanks,
> Danny
>
> On Wed, May 4, 2022 at 1:34 PM Gyula Fóra  wrote:
>
> > +1
> >
> > Gyula
> >
> > On Wed, May 4, 2022 at 2:32 PM Jassat, Usamah  >
> > wrote:
> >
> > > Hi everyone,
> > >
> > > Thanks for the feedback for FLIP-225: Implement standalone mode support
> > in
> > > the kubernetes operator [1] on the discussion thread [2]
> > >
> > > I’d like to start a vote for it. The vote will be open for at-least 72
> > > hours unless there is an objection or not enough votes.
> > >
> > > [1] (
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-225%3A+Implement+standalone+mode+support+in+the+kubernetes+operator
> > > )
> > >
> > > [2] (https://lists.apache.org/thread/rv964g6rq5bkc8kwx36y80nwfqcgn2s4)
> > >
> >
>


[jira] [Created] (FLINK-27491) Support env replacement in flink-kubernetes-operator CR

2022-05-04 Thread Yang Wang (Jira)
Yang Wang created FLINK-27491:
-

 Summary: Support env replacement in flink-kubernetes-operator CR
 Key: FLINK-27491
 URL: https://issues.apache.org/jira/browse/FLINK-27491
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Yang Wang


Flink deployment resources support env interpolation natively using $() 
syntax[1]. Users expected this to "just work" like other resources when using 
the operator, but it does not.

It would be a great addition, simplifying job startup decision-making while 
following existing conventions.

 

 
{code:java}
job:
  jarURI: local:///my.jar
  entryClass: my.JobMainKt
  args:
    - "--kafka.bootstrap.servers"
    - "my.kafka.host:9093"
    - "--kafka.sasl.username"
    - "$(KAFKA_SASL_USERNAME)"
    - "--kafka.sasl.password"
    - "$(KAFKA_SASL_PASSWORD)" {code}
 
[1]. 
[https://kubernetes.io/docs/tasks/inject-data-application/_print/#use-environment-variables-to-define-arguments]
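The referenced $() interpolation resolves against environment variables already defined on the container. As a hedged sketch (not part of the original issue), the variables used in the args above might be supplied in the CR's pod template like this; the Secret name "kafka-credentials" and its keys are assumptions for illustration:

```yaml
# Hypothetical podTemplate excerpt defining the env vars interpolated above;
# the Secret name and keys are illustrative, not from the issue
podTemplate:
  spec:
    containers:
      - name: flink-main-container
        env:
          - name: KAFKA_SASL_USERNAME
            valueFrom:
              secretKeyRef:
                name: kafka-credentials
                key: username
          - name: KAFKA_SASL_PASSWORD
            valueFrom:
              secretKeyRef:
                name: kafka-credentials
                key: password
```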
 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


Re: [DISCUSS] DockerHub repository maintainers

2022-05-03 Thread Yang Wang
The flink-kubernetes-operator project is only published
via apache/flink-kubernetes-operator on Docker Hub and GitHub Packages.
We do not see obvious advantages to using Docker Hub official images.

Best,
Yang

Xintong Song  于2022年4月28日周四 19:27写道:

> I agree with you that doing QA for the image after the release has been
> finalized doesn't feel right. IIUR, that is mostly because official image
> PR needs 1) the binary release being deployed and propagated and 2) the
> corresponding git commit being specified. I'm not completely sure about
> this. Maybe we can improve the process by investigating more about the
> feasibility of pre-verifying an official image PR before finalizing the
> release. It's definitely a good thing to do if possible.
>
> I also agree that QA from DockerHub folks is valuable to us.
>
> I'm not against publishing official-images, and I'm not against working
> closely with the DockerHub folks to improve the process of delivering the
> official image. However, I don't think these should become reasons that we
> don't release our own apache/flink images.
>
> Taking the 1.12.0 as an example, admittedly it would be nice for us to
> comply with the DockerHub folks' standards and not have a
> just-for-kubernetes command in our entrypoint. However, this is IMO far
> less important compared to delivering the image to our users timely. I
> guess that's where the DockerHub folks and us have different
> priorities, and that's why I think we should have a path that is fully
> controlled by this community to deliver images. We could take their
> valuable inputs and improve afterwards. Actually, that's what we did for
> 1.12.0 by starting to release to apache/flink.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Thu, Apr 28, 2022 at 6:30 PM Chesnay Schepler 
> wrote:
>
> > I still think that's mostly a process issue.
> > Of course we can be blind-sided if we do the QA for a release artifact
> > after the release has been finalized.
> > But that's a clearly broken process from the get-go.
> >
> > At the very least we should already open a PR when the RC is created to
> > get earlier feedback.
> >
> > Moreover, nowadays the docker images are way slimmer and we are much
> > more careful on what is actually added to the scripts.
> >
> > Finally, the problems they found did show that their QA is very valuable
> > to us. And side-stepping that for such an essential piece of a release
> > isn't a good idea imo.
> >
> > On 28/04/2022 11:31, Xintong Song wrote:
> > > I'm overall against only releasing to official-images.
> > >
> > > We started releasing to apache/flink, in addition to the
> official-image,
> > in
> > > 1.12.0. That was because releasing the official-image needs approval
> from
> > > the DockerHub folks, which is not under control of the Flink community.
> > For
> > > 1.12.0 there were unfortunately some divergences between us and the
> > > DockerHub folks, and it ended-up taking us nearly 2 months to get that
> > > official-image PR merged [1][2]. Many users, especially those who need
> > > Flink's K8s & Native-K8s deployment modes, were asking for the image
> > after
> > > 1.12.0 was announced.
> > >
> > > One could argue that what happened for 1.12.0 is not a regular case.
> > > However, I'd like to point out that the docker images are not something
> > > nice-to-have, but a practically necessary piece of the release for the
> > k8s
> > > / native-k8s deployments to work. I'm strongly against a release
> process
> > > where such an important piece depends on the approval of a 3rd party.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-20650
> > >
> > > [2] https://github.com/docker-library/official-images/pull/9249
> > >
> > >
> > >
> > > On Thu, Apr 28, 2022 at 2:43 PM Chesnay Schepler 
> > wrote:
> > >
> > >> We could just stop releasing to apache/flink and only go for the
> > >> official-images route.
> > >>
> > >> On 28/04/2022 07:43, Xintong Song wrote:
> > >>> Forgot to mention that, we have also proposed to use one shared
> account
> > >> and
> > >>> limit its access to the PMC members, like what we do with the PyPI
> > >> account.
> > >>> Unfortunately, INFRA rejected this proposal [1].
> > >>>
> > >>>
> > >>> Thank you~
> > >>>
> > >>> Xintong Song
> > >>>
> > >>>
> > >>> [1] https://issues.apache.org/jira/browse/INFRA-23208
> > >>>
> > >>> On Thu, Apr 28, 2022 at 1:39 PM Xintong Song 
> > >> wrote:
> >  Hi devs,
> > 
> >  I'd like to start a discussion about maintainers for DockerHub
> >  repositories under the *apache* namespace [1].
> > 
> >  Currently, the Flink community maintains various repositories
> (flink,
> >  flink-statefun, flink-statefun-playground, and
> > >> flink-kubernetes-operator)
> >  on DockerHub under the *apache* namespace. There's a limitation on
> how
> > >> many
> >  members the *apache* namespace can add, and recently INFRA is
> > >> complaining
> > 

Re: [DISCUSS] FLIP-224: Blacklist Mechanism

2022-05-03 Thread Yang Wang
Thanks Lijie and Zhu for creating the proposal.

I want to share some thoughts about Flink cluster operations.

In the production environment, the SRE(aka Site Reliability Engineer)
already has many tools to detect the unstable nodes, which could take the
system logs/metrics into consideration.
Then they use graceful decommission in YARN and taints in K8s to prevent new
allocations on these unstable nodes.
At last, they will evict all the containers and pods running on these nodes.
This mechanism also works for planned maintenance. So I am afraid this is
not the typical use case for FLIP-224.
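On Kubernetes, that SRE workflow typically boils down to a taint (blocking new scheduling) followed by eviction, e.g. via kubectl drain. Expressed declaratively, the taint looks roughly like the following sketch; the node name, taint key, and value are assumptions for illustration:

```yaml
# Hypothetical Node manifest excerpt: NoSchedule keeps new Flink pods
# off the node until the SRE removes the taint; key/value are illustrative
apiVersion: v1
kind: Node
metadata:
  name: unstable-node-1
spec:
  taints:
    - key: node-health
      value: degraded
      effect: NoSchedule
```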

If we only support blocking nodes manually, then I could not see
the obvious advantages compared with the current SRE approach (via *yarn
rmadmin or kubectl taint*).
At least, we need to have a pluggable component which could expose
potentially unstable nodes automatically and block them if enabled explicitly.


Best,
Yang



Becket Qin  于2022年5月2日周一 16:36写道:

> Thanks for the proposal, Lijie.
>
> This is an interesting feature and discussion, and somewhat related to the
> design principle about how people should operate Flink.
>
> I think there are three things involved in this FLIP.
>  a) Detect and report the unstable node.
>  b) Collect the information of the unstable node and form a blocklist.
>  c) Take the action to block nodes.
>
> My two cents:
>
> 1. It looks like people all agree that Flink should have c). It is not only
> useful for cases of node failures, but also handy for some planned
> maintenance.
>
> 2. People have different opinions on b), i.e. who should be the brain to
> make the decision to block a node. I think this largely depends on who we
> talk to. Different users would probably give different answers. For people
> who do have a centralized node health management service, let Flink do just
> do a) and c) would be preferred. So essentially Flink would be one of the
> sources that may detect unstable nodes, report it to that service, and then
> take the command from that service to block the problematic nodes. On the
> other hand, for users who do not have such a service, simply letting Flink
> be clever by itself to block the suspicious nodes might be desired to
> ensure the jobs are running smoothly.
>
> So that indicates a) and b) here should be pluggable / optional.
>
> In light of this, maybe it would make sense to have something pluggable
> like a UnstableNodeReporter which exposes unstable nodes actively. (A more
> general interface should be JobInfoReporter which can be used to report
> any information of type T. But I'll just keep the scope relevant to this
> FLIP here). Personally speaking, I think it is OK to have a default
> implementation of a reporter which just tells Flink to take action to block
> problematic nodes and also unblocks them after timeout.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
>
> On Mon, May 2, 2022 at 3:27 PM Роман Бойко  wrote:
>
> > Thanks for good initiative, Lijie and Zhu!
> >
> > If it's possible I'd like to participate in development.
> >
> > I agree with 3rd point of Konstantin's reply - we should consider to move
> > somehow the information of blocklisted nodes/TMs from active
> > ResourceManager to non-active ones. Probably storing inside
> > Zookeeper/Configmap might be helpful here.
> >
> > And I agree with Martijn that a lot of organizations don't want to expose
> > such API for a cluster user group. But I think it's necessary to have the
> > mechanism for unblocking the nodes/TMs anyway for avoiding incorrect
> > automatic behaviour.
> >
> > And another one small suggestion - I think it would be better to extend
> the
> > *BlocklistedItem* class with the *endTimestamp* field and fill it at the
> > item creation. This simple addition will allow to:
> >
> >-
> >
> >Provide the ability to users to setup the exact time of blocklist end
> >through RestAPI
> >-
> >
> >Not being tied to a single value of
> >*cluster.resource-blacklist.item.timeout*
> >
> >
> > On Mon, 2 May 2022 at 14:17, Chesnay Schepler 
> wrote:
> >
> > > I do share the concern between blurring the lines a bit.
> > >
> > > That said, I'd prefer to not have any auto-detection and only have an
> > > opt-in mechanism
> > > to manually block processes/nodes. To me this sounds yet again like one
> > > of those
> > > magical mechanisms that will rarely work just right.
> > > An external system can leverage way more information after all.
> > >
> > > Moreover, I'm quite concerned about the complexity of this proposal.
> > > Tracking on both the RM/JM side; syncing between components;
> adjustments
> > > to the
> > > slot and resource protocol.
> > >
> > > In a way it seems overly complicated.
> > >
> > > If we look at it purely from an active resource management perspective,
> > > then there
> > > isn't really a need to touch the slot protocol at all (or in fact to
> > > anything in the JobMaster),
> > > because there isn't any point in keeping around blocked 

Re: [DISCUSS] Next Flink Kubernetes Operator release timeline

2022-04-27 Thread Yang Wang
Thanks @Chesnay Schepler  for pointing out this.

The only public interface the flink-kubernetes-operator provides is the
CRD[1]. We are trying to stabilize the CRD from v1beta1.
If more fields are introduced to support new features (e.g. standalone mode,
SQL jobs), they should have default values to ensure compatibility.
Currently, we do not have any tooling to enforce the compatibility
guarantees, but we have created a ticket[2] to follow this and hope it
can be resolved before releasing 1.0.0.
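The defaulting approach described above can be sketched as a hypothetical fragment of the CRD's OpenAPI v3 schema; the field name "mode" and its default are assumptions for illustration, not the actual CRD:

```yaml
# Hypothetical CRD schema excerpt: a newly introduced field carries a default,
# so CRs written against the older schema stay valid after the upgrade
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      properties:
        mode:
          type: string
          default: native
```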

Just as you said, now is also a good time to think more about the approach
of releases. Since flink-kubernetes-operator is much simpler than Flink, we
could have a shorter release cycle.
Two months for a major release (1.0, 1.1, etc.) is reasonable to me, and this
could be shortened for the minor releases. We also need to support at least
the last two major versions.

Maybe the standalone mode support is a big enough feature for version 2.0.


[1].
https://github.com/apache/flink-kubernetes-operator/tree/main/helm/flink-kubernetes-operator/crds
[2]. https://issues.apache.org/jira/browse/FLINK-26955


@Hao t Chang  We do not have a regular sync-up meeting so
far, but I think we could schedule some sync-ups for the 1.0.0 release if
necessary. Anyone who is interested is welcome.


Best,
Yang




Hao t Chang  于2022年4月27日周三 07:45写道:

> Hi Gyula,
>
> Thanks for the release timeline information. I would like to learn the
> gathered knowledge and volunteer as well. Will there be sync up
> meeting/call for this collaboration ?
>
> From: Gyula Fóra 
> Date: Monday, April 25, 2022 at 11:22 AM
> To: dev 
> Subject: [DISCUSS] Next Flink Kubernetes Operator release timeline
> Hi Devs!
>
> The community has been working hard on cleaning up the operator logic and
> adding some core features that have been missing from the preview release
> (session jobs for example). We have also added some significant
> improvements around deployment/operations.
>
> With the current pace of the development I think in a few weeks we should
> be in a good position to release next version of the operator. This would
> also give us the opportunity to add support for the upcoming 1.15 release
> :)
>
> We have to decide on 2 main things:
>  1. Target release date
>  2. Release version
>
> With the current state of the project I am confident that we could cut a
> really good release candidate towards the end of May. I would suggest a
> feature *freeze mid May (May 16)*, with a target *RC0 date of May 23*. If
> on May 16 we feel that we are ready we could also prepare the release
> candidate earlier.
>
> As for the release version, I personally feel that this is a good time
> for *version
> 1.0.0*.
> While 1.0.0 signals a certain confidence in the stability of the current
> API (compared to the preview release) I would keep the kubernetes resource
> version v1beta1.
>
> It would also be great if someone could volunteer to join me to help manage
> the release process this time so I can share the knowledge gathered during
> the preview release :)
>
> Let me know what you think!
>
> Cheers,
> Gyula
>


[jira] [Created] (FLINK-27422) Do not create temporary pod template files for JobManager and TaskManager if not configured explicitly

2022-04-26 Thread Yang Wang (Jira)
Yang Wang created FLINK-27422:
-

 Summary: Do not create temporary pod template files for JobManager 
and TaskManager if not configured explicitly
 Key: FLINK-27422
 URL: https://issues.apache.org/jira/browse/FLINK-27422
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Yang Wang
 Fix For: kubernetes-operator-1.0.0


We do not need to create temporary pod template files for JobManager and 
TaskManager if it is not configured explicitly via 
{{.spec.JobManagerSpec.podTemplate}} or 
{{.spec.TaskManagerSpec.podTemplate}}.





Re: [DISCUSS] FLIP-223: Implement standalone mode support in the kubernetes operator

2022-04-26 Thread Yang Wang
Thanks for creating the FLIP-223 and starting the discussion.

I have some quick questions.

# The TaskManager replicas


The TaskManager replicas need to be configured for both standalone session
and application clusters, because they cannot be calculated if the
parallelism is set via Java code.


# How the JobManager and TaskManager pods are managed?

We could use a k8s Deployment to manage the JobManager pods. Of course, a k8s
Job, StatefulSet also make sense.


What would you like to do for the TaskManager pods?



# Version support

Native support could work from 1.13 and I have created a ticket for this[1].


Considering the last-state upgrade mode, K8s HA should be enabled. I am
afraid that even standalone mode could not work for Flink versions before 1.12.

Do you want to introduce ZooKeeper HA or add some limitations on the
version choice?




[1]. https://issues.apache.org/jira/browse/FLINK-27412



Best,

Yang

Jassat, Usamah  于2022年4月25日周一 18:25写道:

> Hi everyone,
>
> We would like to start the discussion of the adding standalone mode
> support to the Flink Kubernetes operator. Standalone mode was initially
> considered as part of FLIP-212 but decided to be out of scope to focus on
> Flink native k8s integration for that FLIP [1]. Standalone support will
> also open the door to supporting previous Flink versions in the operator
> which I would also like to open discussion about.
>
> I have created a FLIP with the details on the general changes that we are
> proposing:
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-223%3A+Implement+standalone+mode+support+in+the+kubernetes+operator
>
>
> Looking forward to your feedback.
>
> Regards,
> Usamah
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
>

