[jira] [Created] (FLINK-25839) 'Run kubernetes application HA test' failed on azure due to could not get 3 completed checkpoints in 120 sec

2022-01-26 Thread Yun Gao (Jira)
Yun Gao created FLINK-25839:
---

 Summary: 'Run kubernetes application HA test' failed on azure due 
to could not get 3 completed checkpoints in 120 sec
 Key: FLINK-25839
 URL: https://issues.apache.org/jira/browse/FLINK-25839
 Project: Flink
  Issue Type: Bug
  Components: Deployment / Kubernetes
Affects Versions: 1.15.0
Reporter: Yun Gao



{code:java}
Jan 27 02:07:33 deployment.apps/flink-native-k8s-application-ha-1 condition met
Jan 27 02:07:33 Waiting for job 
(flink-native-k8s-application-ha-1-d8dc997d5-v8cpz) to have at least 3 
completed checkpoints ...
Jan 27 02:09:45 Could not get 3 completed checkpoints in 120 sec
Jan 27 02:09:45 Stopping job timeout watchdog (with pid=217858)
Jan 27 02:09:45 Debugging failed Kubernetes test:
Jan 27 02:09:45 Currently existing Kubernetes resources
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30261=logs=af885ea8-6b05-5dc2-4a37-eab9c0d1ab09=f779a55a-0ffe-5bbc-8824-3a79333d4559=5376



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-01-26 Thread Aitozi
Hi Jacky,

Thanks for bringing up this discussion. I think it's a useful feature
that can make performance tuning more portable. +1 for this.

Best,
Aitozi

Jacky Lau <281293...@qq.com.invalid> wrote on Mon, Jan 24, 2022 at 16:48:

> Hi All,
>   I would like to start the discussion on FLIP-213
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs>,
> which aims to provide a TaskManager-level (process-level) flame graph based
> on async-profiler, which is one of the most popular Java performance tools
> and is used by both Arthas and IntelliJ. We already support this in our
> company, Ant Group.
> Flink already supports FLIP-165: Operator's Flame Graphs, which draws the
> flame graph with the front-end library d3-flame-graph and has some problems
> for jobs with large parallelism.
> Please be aware that the FLIP wiki page is not fully finished yet, since I
> don't know whether it will be accepted by the Flink community.
> Feel free to add your thoughts to make this feature better! I am looking
> forward to all your responses. Thanks very much!
>
>
>
>
> Best, Jacky Lau


[jira] [Created] (FLINK-25838) BatchArrowPythonOverWindowAggregateFunctionOperatorTest. testFinishBundleTriggeredByTime failed in Azure

2022-01-26 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-25838:


 Summary: BatchArrowPythonOverWindowAggregateFunctionOperatorTest. 
testFinishBundleTriggeredByTime failed in Azure
 Key: FLINK-25838
 URL: https://issues.apache.org/jira/browse/FLINK-25838
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.14.3, 1.15.0
Reporter: Huang Xingbo
Assignee: Huang Xingbo


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30261=logs=5cae8624-c7eb-5c51-92d3-4d2dacedd221=5acec1b4-945b-59ca-34f8-168928ce5199

{code:java}
2022-01-27T05:23:36.0047569Z Jan 27 05:23:35 [ERROR] Tests run: 4, Failures: 1, 
Errors: 0, Skipped: 0, Time elapsed: 1.226 s <<< FAILURE! - in 
org.apache.flink.table.runtime.operators.python.aggregate.arrow.batch.BatchArrowPythonOverWindowAggregateFunctionOperatorTest
2022-01-27T05:23:36.0060644Z Jan 27 05:23:36 [ERROR] 
org.apache.flink.table.runtime.operators.python.aggregate.arrow.batch.BatchArrowPythonOverWindowAggregateFunctionOperatorTest.testFinishBundleTriggeredByTime
  Time elapsed: 0.11 s  <<< FAILURE!
2022-01-27T05:23:36.0061804Z Jan 27 05:23:36 java.lang.AssertionError: 
expected:<4> but was:<3>
2022-01-27T05:23:36.0062505Z Jan 27 05:23:36at 
org.junit.Assert.fail(Assert.java:89)
2022-01-27T05:23:36.0063140Z Jan 27 05:23:36at 
org.junit.Assert.failNotEquals(Assert.java:835)
2022-01-27T05:23:36.0065604Z Jan 27 05:23:36at 
org.junit.Assert.assertEquals(Assert.java:647)
2022-01-27T05:23:36.0066132Z Jan 27 05:23:36at 
org.junit.Assert.assertEquals(Assert.java:633)
2022-01-27T05:23:36.0067101Z Jan 27 05:23:36at 
org.apache.flink.table.runtime.util.RowDataHarnessAssertor.assertOutputEquals(RowDataHarnessAssertor.java:80)
2022-01-27T05:23:36.0068344Z Jan 27 05:23:36at 
org.apache.flink.table.runtime.util.RowDataHarnessAssertor.assertOutputEquals(RowDataHarnessAssertor.java:60)
2022-01-27T05:23:36.0070375Z Jan 27 05:23:36at 
org.apache.flink.table.runtime.operators.python.aggregate.arrow.ArrowPythonAggregateFunctionOperatorTestBase.assertOutputEquals(ArrowPythonAggregateFunctionOperatorTestBase.java:62)
2022-01-27T05:23:36.0071970Z Jan 27 05:23:36at 
org.apache.flink.table.runtime.operators.python.aggregate.arrow.batch.BatchArrowPythonOverWindowAggregateFunctionOperatorTest.testFinishBundleTriggeredByTime(BatchArrowPythonOverWindowAggregateFunctionOperatorTest.java:161)
2022-01-27T05:23:36.0073109Z Jan 27 05:23:36at 
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
2022-01-27T05:23:36.0073968Z Jan 27 05:23:36at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
2022-01-27T05:23:36.0074876Z Jan 27 05:23:36at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
2022-01-27T05:23:36.0075802Z Jan 27 05:23:36at 
java.lang.reflect.Method.invoke(Method.java:498)
2022-01-27T05:23:36.0076471Z Jan 27 05:23:36at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
2022-01-27T05:23:36.0077209Z Jan 27 05:23:36at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
2022-01-27T05:23:36.0077932Z Jan 27 05:23:36at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
2022-01-27T05:23:36.0078998Z Jan 27 05:23:36at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
2022-01-27T05:23:36.0079682Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2022-01-27T05:23:36.0080368Z Jan 27 05:23:36at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
2022-01-27T05:23:36.0081041Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
2022-01-27T05:23:36.0081723Z Jan 27 05:23:36at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
2022-01-27T05:23:36.0082444Z Jan 27 05:23:36at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
2022-01-27T05:23:36.0083105Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
2022-01-27T05:23:36.0083742Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
2022-01-27T05:23:36.0084381Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
2022-01-27T05:23:36.0085033Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
2022-01-27T05:23:36.0088283Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
2022-01-27T05:23:36.0089997Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
2022-01-27T05:23:36.0090530Z Jan 27 05:23:36at 
org.junit.runners.ParentRunner.run(ParentRunner.java:413)

[jira] [Created] (FLINK-25837) RPC Serialization failure when multiple job managers

2022-01-26 Thread Matthew McMahon (Jira)
Matthew McMahon created FLINK-25837:
---

 Summary: RPC Serialization failure when multiple job managers
 Key: FLINK-25837
 URL: https://issues.apache.org/jira/browse/FLINK-25837
 Project: Flink
  Issue Type: Bug
Affects Versions: 1.14.3
Reporter: Matthew McMahon


I'm using 1.14.3 with the Lyft FlinkK8sOperator to run the cluster.

Previously with 1.10.1 I had no problems, but now it seems that when I have multiple 
Job Managers deployed, I continually see this error, which prevents the 
jobs from starting.

{code}
org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Failed to 
serialize the result for RPC call : requestMultipleJobDetails.
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.serializeRemoteResultAndVerifySize(AkkaRpcActor.java:417)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$sendAsyncResponse$2(AkkaRpcActor.java:373)
at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
at 
java.util.concurrent.CompletableFuture.uniHandleStage(CompletableFuture.java:848)
at 
java.util.concurrent.CompletableFuture.handle(CompletableFuture.java:2168)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.sendAsyncResponse(AkkaRpcActor.java:365)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:332)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217)
at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175)
Caused by: java.io.NotSerializableException: java.util.HashMap$Values
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1184)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at 
org.apache.flink.util.InstantiationUtil.serializeObject(InstantiationUtil.java:632)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcSerializedValue.valueOf(AkkaRpcSerializedValue.java:66)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.serializeRemoteResultAndVerifySize(AkkaRpcActor.java:400)
... 29 more
{code}

I don't see where a HashMap exists in the requestMultipleJobDetails path.
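For context, {{java.util.HashMap$Values}} is the non-serializable view class returned 
by {{HashMap.values()}}, so any RPC result that (directly or transitively) carries such 
a view fails exactly like this. Below is a minimal, self-contained sketch of the failure 
mode and the usual fix of copying the view into a concrete collection; it is an 
illustration only, not the actual Flink code.
{code:java}
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class HashMapValuesSerializationDemo {

    public static void main(String[] args) {
        Map<String, String> jobs = new HashMap<>();
        jobs.put("job-1", "RUNNING");

        // HashMap.values() returns the internal view class java.util.HashMap$Values,
        // which does not implement Serializable.
        Collection<String> view = jobs.values();
        System.out.println(isSerializable(view)); // false -> NotSerializableException path

        // Copying the view into a concrete serializable collection avoids the failure.
        Collection<String> copy = new ArrayList<>(jobs.values());
        System.out.println(isSerializable(copy)); // true
    }

    private static boolean isSerializable(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (Exception e) {
            return false;
        }
    }
}
{code}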



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25836) test_keyed_process_function_with_state of BatchModeDataStreamTests failed in PyFlink

2022-01-26 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-25836:


 Summary: test_keyed_process_function_with_state of 
BatchModeDataStreamTests failed in PyFlink
 Key: FLINK-25836
 URL: https://issues.apache.org/jira/browse/FLINK-25836
 Project: Flink
  Issue Type: Bug
  Components: API / Python
Affects Versions: 1.15.0
Reporter: Huang Xingbo
Assignee: Huang Xingbo
 Fix For: 1.15.0


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30264=logs=9cada3cb-c1d3-5621-16da-0f718fb86602=c67e71ed-6451-5d26-8920-5a8cf9651901



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25835) The task initialization duration is recorded in logs

2022-01-26 Thread Bo Cui (Jira)
Bo Cui created FLINK-25835:
--

 Summary: The task initialization duration is recorded in logs
 Key: FLINK-25835
 URL: https://issues.apache.org/jira/browse/FLINK-25835
 Project: Flink
  Issue Type: Improvement
  Components: Runtime / State Backends
Affects Versions: 1.12.2, 1.15.0
Reporter: Bo Cui


[https://github.com/apache/flink/blob/a543e658acfbc22c1579df0d043654037b9ec4b0/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/StreamTask.java#L644]

We are testing the state backend initialization time for different data volumes. 
However, neither the task initialization time nor the time taken to restore the 
state backend can be obtained from the log file.
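For illustration, a minimal sketch of the kind of timing log being asked for 
(hypothetical names, not the actual StreamTask code):
{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class RestoreTimingSketch {

    private static final Logger LOG = LoggerFactory.getLogger(RestoreTimingSketch.class);

    // Hypothetical helper: time an initialization/restore step and log its duration,
    // so the state backend restore time can be read directly from the task logs.
    public static void timedRestore(String taskName, Runnable restoreStep) {
        long start = System.currentTimeMillis();
        restoreStep.run();
        long durationMs = System.currentTimeMillis() - start;
        LOG.info("Task {} finished state initialization/restore in {} ms.", taskName, durationMs);
    }
}
{code}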



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25834) 'flink run' command can not use 'pipeline.classpaths' in flink-conf.yaml

2022-01-26 Thread Ada Wong (Jira)
Ada Wong created FLINK-25834:


 Summary: 'flink run' command can not use 'pipeline.classpaths' in 
flink-conf.yaml
 Key: FLINK-25834
 URL: https://issues.apache.org/jira/browse/FLINK-25834
 Project: Flink
  Issue Type: Bug
  Components: Client / Job Submission, Command Line Client
Affects Versions: 1.14.3
Reporter: Ada Wong


When we use 'flink run' or the CliFrontend class to submit a job and do not set 
-C/-classpaths, the 'pipeline.classpaths' setting is effectively disabled.

Example:

 flink-conf.yaml content:
{code:java}
pipeline.classpaths: 
file:///opt/flink-1.14.2/other/flink-sql-connector-elasticsearch7_2.12-1.14.2.jar{code}
submit command:
{code:java}
bin/flink run 
/flink14-sql/target/flink14-sql-1.0-SNAPSHOT-jar-with-dependencies.jar{code}
It will throw an Elasticsearch 7 class-not-found exception.

There are two reasons for this:
 # ProgramOptions#applyToConfiguration uses a zero-size list to overwrite the 
'pipeline.classpaths' value in the configuration.
 # ProgramOptions#buildProgram does not set 'pipeline.classpaths' into the 
PackagedProgram.

To solve problem 1), could we add an early return in 
ConfigUtils#encodeCollectionToConfig() when the list size is zero? A sketch of this 
idea is shown below.

To solve problem 2), could we append the 'pipeline.classpaths' values to the 
classpaths and pass them to setUserClassPaths together?
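A minimal sketch of the idea behind 1), with a hypothetical signature modeled on the 
description above (not the actual ConfigUtils code): return early for an empty 
collection so an existing 'pipeline.classpaths' value is not overwritten.
{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public final class EncodeCollectionSketch {

    // Hypothetical stand-in for ConfigUtils#encodeCollectionToConfig.
    public static <IN> void encodeCollectionToConfig(
            Map<String, List<String>> configuration,
            String key,
            Collection<IN> values,
            Function<IN, String> mapper) {
        if (values == null || values.isEmpty()) {
            // Early return: keep whatever is already configured, e.g. the
            // 'pipeline.classpaths' value read from flink-conf.yaml.
            return;
        }
        List<String> encoded = new ArrayList<>();
        for (IN value : values) {
            encoded.add(mapper.apply(value));
        }
        configuration.put(key, encoded);
    }
}
{code}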



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25833) FlinkUserCodeClassLoader is not registered as parallel capable

2022-01-26 Thread nicolasyang (Jira)
nicolasyang created FLINK-25833:
---

 Summary: FlinkUserCodeClassLoader is not registered as parallel 
capable
 Key: FLINK-25833
 URL: https://issues.apache.org/jira/browse/FLINK-25833
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Reporter: nicolasyang


A classloader can be registered as parallel capable only if all of its 
superclasses are registered as parallel capable [1]. FlinkUserCodeClassLoader must 
therefore be registered to make ChildFirstClassLoader and ParentFirstClassLoader 
parallel capable.

[1] 
https://docs.oracle.com/javase/8/docs/api/java/lang/ClassLoader.html#registerAsParallelCapable--
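As an illustration of the required change (a sketch with placeholder class names, not 
the actual Flink classes): each level of the classloader hierarchy has to call 
registerAsParallelCapable() in its own static initializer, and the call in a subclass 
only succeeds once its superclass is registered as well.
{code:java}
import java.net.URL;
import java.net.URLClassLoader;

// Placeholder for FlinkUserCodeClassLoader: registering here is the missing piece.
abstract class UserCodeClassLoaderSketch extends URLClassLoader {
    static {
        ClassLoader.registerAsParallelCapable();
    }

    UserCodeClassLoaderSketch(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }
}

// Placeholder for ChildFirstClassLoader / ParentFirstClassLoader.
class ChildFirstClassLoaderSketch extends UserCodeClassLoaderSketch {
    static {
        // Only takes effect because the superclass is registered as parallel capable too.
        ClassLoader.registerAsParallelCapable();
    }

    ChildFirstClassLoaderSketch(URL[] urls, ClassLoader parent) {
        super(urls, parent);
    }
}
{code}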




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25832) When the TaskManager is closed, its associated slot is not set to the released state.

2022-01-26 Thread john (Jira)
john created FLINK-25832:


 Summary: When the TaskManager is closed, its associated slot is 
not set to the released state.
 Key: FLINK-25832
 URL: https://issues.apache.org/jira/browse/FLINK-25832
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Task
Affects Versions: 1.14.3, 1.14.2
Reporter: john
 Attachments: image-2022-01-27-10-55-14-758.png, 
image-2022-01-27-10-55-59-119.png, image-2022-01-27-10-57-26-223.png

I deployed a standalone Flink cluster on K8s and enabled 
scheduler-mode=reactive. When a TaskManager is shut down, I actively call the 
closeTaskManagerConnection method of the ResourceManager. However, when the 
AdaptiveScheduler then starts to restart the job, it calls the cancel method of 
Execution, and this method does not check whether the status of the associated slot 
is still Alive. The TaskManager to which the slot belongs has already been closed, 
so an RpcTimeout is triggered at this point.
But even when I change the cancel method of Execution to check whether the status of 
the slot is Alive before cancelling, repeating the above operation still does not 
help; that is, the RpcTimeout is still triggered. My problem is that actively calling 
the ResourceManager's closeTaskManagerConnection method does not affect the state of 
its associated allocated slots. I think this is a bug. We should optimize the 
behavior of cancel to speed up its execution.

!image-2022-01-27-10-55-59-119.png!

!image-2022-01-27-10-57-26-223.png!!image-2022-01-27-10-55-14-758.png!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-01-26 Thread Yun Tang
Hi Jacky,

I like the idea of leveraging async-profiler for debugging, as this professional 
tool can also capture native CPU calls apart from the stacks within the JVM, which 
could help profile native components such as RocksDB.

However, as Alexander raised, how would we handle the security and 
multi-platform problems?


Best,
Yun Tang

From: David Morávek 
Sent: Thursday, January 27, 2022 2:56
To: dev 
Subject: Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

I'd second Alex's concerns. Is there a reason why you can't use the
async-profiler directly? In what kind of environment are your Flink
clusters running (YARN / k8s / ...)?

Best,
D.

On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov 
wrote:

> Hi Jacky,
>
> Could you please clarify what kind of *problems* you experience with the
> large parallelism? You referred to D3, is it something related to rendering
> on the browser side or is it about the samples collection process? Were you
> able to identify the bottleneck?
>
> Fundamentally I have some concerns regarding the proposed approach:
> 1. Calling shell scripts triggered via the web UI is a security concern and
> it needs to be evaluated carefully if it could introduce any unexpected
> attack vectors (depending on the implementation, passed parameters etc.)
> 2. My understanding is that the async-profiler implementation is
> system-dependent. How do you propose to handle multiple architectures?
> Would you like to ship each available implementation within Flink? [1]
> 3. Do you plan to make use of full async-profiler features including native
> calls sampling with perf_events? If so, the issue I see is that some
> environments restrict ptrace calls by default [2]
>
> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> [2]
>
> https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
>
>
> Best,
> Alexander Fedulov
>
> On Wed, Jan 26, 2022 at 1:59 PM 李森  wrote:
>
> > This is an expected feature, as we also experienced browser crashes on
> > existing operator-level flame graphs
> >
> > Best,
> > Echo Lee
> >
> > > On Jan 24, 2022, at 6:16 PM, David Morávek wrote:
> > >
> > > Hi Jacky,
> > >
> > > The link seems to be broken, here is the correct one [1].
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >
> > > Best,
> > > D.
> > >
> > >> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <281293...@qq.com.invalid>
> > wrote:
> > >>
> > >> Hi All,
> > >>   I would like to start the discussion on FLIP-213 <
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >> ;
> > >> which aims to provide taskmanager level(process level) flame
> graph
> > >> by async profiler, which is most popular tool in java performance. and
> > the
> > >> arthas and intellij both use it.
> > >> And we support it in our ant group company.
> > >>  AndFlink supports FLIP-165: Operator's Flame Graphs
> > >> now. and it draw flame graph by thefront-end
> > >> librariesd3-flame-graph, which has some problem in jobs
> > >> oflarge of parallelism.
> > >>  Please be aware that the FLIP wiki area is not fully done
> > >> since i don't konw whether it will accept by
> flinkcommunity.
> > >>  Feel free to add your thoughts to make this feature
> > better! i
> > >> am looking forward to all your response. Thanks too much!
> > >>
> > >>
> > >>
> > >>
> > >> Best Jacky Lau
> >
>


Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-26 Thread David Morávek
Thanks for the update, I'll go over it tomorrow.

On Wed, Jan 26, 2022 at 5:33 PM Gabor Somogyi 
wrote:

> Hi All,
>
> Since it has turned out that DTM can't be added as member of JobMaster
> <
> https://github.com/gaborgsomogyi/flink/blob/8ab75e46013f159778ccfce52463e7bc63e395a9/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L176
> >
> I've
> came up with a better proposal.
> David, thanks for pinpointing this out, you've caught a bug in the early
> phase!
>
> Namely ResourceManager
> <
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
> >
> is
> a single instance class where DTM can be added as member variable.
> It has a list of all already registered TMs and new TM registration is also
> happening here.
> The following can be added from logic perspective to be more specific:
> * Create new DTM instance in ResourceManager
> <
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L124
> >
> and
> start it (re-occurring thread to obtain new tokens)
> * Add a new function named "updateDelegationTokens" to TaskExecutorGateway
> <
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutorGateway.java#L54
> >
> * Call "updateDelegationTokens" on all registered TMs to propagate new DTs
> * In case of new TM registration call "updateDelegationTokens" before
> registration succeeds to setup new TM properly
>
> This way:
> * only a single DTM would live within a cluster which is the expected
> behavior
> * DTM is going to be added to a central place where all deployment target
> can make use of it
> * DTs are going to be pushed to TMs which would generate less network
> traffic than pull based approach
> (please see my previous mail where I've described both approaches)
> * HA scenario is going to be consistent because such
> <
> https://github.com/apache/flink/blob/674bc96662285b25e395fd3dddf9291a602fc183/flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java#L1069
> >
> a solution can be added to "updateDelegationTokens"
>
> @David or all others plz share whether you agree on this or you have better
> idea/suggestion.
>
> BR,
> G
>
>
> On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi 
> wrote:
>
> > First of all thanks for investing your time and helping me out. As I see
> > you have pretty solid knowledge in the RPC area.
> > I would like to rely on your knowledge since I'm learning this part.
> >
> > > - Do we need to introduce a new RPC method or can we for example
> > piggyback
> > on heartbeats?
> >
> > I'm fine with either solution but one thing is important conceptually.
> > There are fundamentally 2 ways how tokens can be updated:
> > - Push way: When there are new DTs then JM JVM pushes DTs to TM JVMs.
> This
> > is the preferred one since tiny amount of control logic needed.
> > - Pull way: Each time a TM would like to poll JM whether there are new
> > tokens and each TM wants to decide alone whether DTs needs to be updated
> or
> > not.
> > As you've mentioned here some ID needs to be generated, it would
> generated
> > quite some additional network traffic which can be definitely avoided.
> > As a final thought in Spark we've had this way of DT propagation logic
> and
> > we've had major issues with it.
> >
> > So all in all DTM needs to obtain new tokens and there must a way to send
> > this data to all TMs from JM.
> >
> > > - What delivery semantics are we looking for? (what if we're only able
> to
> > update subset of TMs / what happens if we exhaust retries / should we
> even
> > have the retry mechanism whatsoever) - I have a feeling that somehow
> > leveraging the existing heartbeat mechanism could help to answer these
> > questions
> >
> > Let's go through these questions one by one.
> > > What delivery semantics are we looking for?
> >
> > DTM must receive an exception when at least one TM was not able to get
> DTs.
> >
> > > what if we're only able to update subset of TMs?
> >
> > Such case DTM will reschedule token obtain after
> > "security.kerberos.tokens.retry-wait" time.
> >
> > > what happens if we exhaust retries?
> >
> > There is no number of retries. In default configuration tokens needs to
> be
> > re-obtained after one day.
> > DTM tries to obtain new tokens after 1day * 0.75
> > (security.kerberos.tokens.renewal-ratio) = 18 hours.
> > When fails it retries after "security.kerberos.tokens.retry-wait" which
> is
> > 1 hour by default.
> > If it never succeeds then authentication error is going to happen on the
> > TM side and the workload is
> > going to stop.
> >
> > > should we even have the retry mechanism whatsoever?
> >
> > Yes, because there are always temporary cluster issues.
> >
> > 

Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-01-26 Thread David Morávek
I'd second Alex's concerns. Is there a reason why you can't use the
async-profiler directly? In what kind of environment are your Flink
clusters running (YARN / k8s / ...)?

Best,
D.

On Wed, Jan 26, 2022 at 4:32 PM Alexander Fedulov 
wrote:

> Hi Jacky,
>
> Could you please clarify what kind of *problems* you experience with the
> large parallelism? You referred to D3, is it something related to rendering
> on the browser side or is it about the samples collection process? Were you
> able to identify the bottleneck?
>
> Fundamentally I have some concerns regarding the proposed approach:
> 1. Calling shell scripts triggered via the web UI is a security concern and
> it needs to be evaluated carefully if it could introduce any unexpected
> attack vectors (depending on the implementation, passed parameters etc.)
> 2. My understanding is that the async-profiler implementation is
> system-dependent. How do you propose to handle multiple architectures?
> Would you like to ship each available implementation within Flink? [1]
> 3. Do you plan to make use of full async-profiler features including native
> calls sampling with perf_events? If so, the issue I see is that some
> environments restrict ptrace calls by default [2]
>
> [1] https://github.com/jvm-profiling-tools/async-profiler#download
> [2]
>
> https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces
>
>
> Best,
> Alexander Fedulov
>
> On Wed, Jan 26, 2022 at 1:59 PM 李森  wrote:
>
> > This is an expected feature, as we also experienced browser crashes on
> > existing operator-level flame graphs
> >
> > Best,
> > Echo Lee
> >
> > > On Jan 24, 2022, at 6:16 PM, David Morávek wrote:
> > >
> > > Hi Jacky,
> > >
> > > The link seems to be broken, here is the correct one [1].
> > >
> > > [1]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >
> > > Best,
> > > D.
> > >
> > >> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <281293...@qq.com.invalid>
> > wrote:
> > >>
> > >> Hi All,
> > >>   I would like to start the discussion on FLIP-213 <
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> > >> ;
> > >> which aims to provide taskmanager level(process level) flame
> graph
> > >> by async profiler, which is most popular tool in java performance. and
> > the
> > >> arthas and intellij both use it.
> > >> And we support it in our ant group company.
> > >>  AndFlink supports FLIP-165: Operator's Flame Graphs
> > >> now. and it draw flame graph by thefront-end
> > >> librariesd3-flame-graph, which has some problem in jobs
> > >> oflarge of parallelism.
> > >>  Please be aware that the FLIP wiki area is not fully done
> > >> since i don't konw whether it will accept by
> flinkcommunity.
> > >>  Feel free to add your thoughts to make this feature
> > better! i
> > >> am looking forward to all your response. Thanks too much!
> > >>
> > >>
> > >>
> > >>
> > >> Best Jacky Lau
> >
>


Re: [VOTE] FLIP-203: Incremental savepoints

2022-01-26 Thread Anton Kalashnikov

+1 (non-binding)

Thanks Piotr.
--
Best regards,
Anton Kalashnikov

On 26.01.2022 11:21, David Anderson wrote:

+1 (non-binding)

I'm pleased to see this significant improvement coming along, as well as
the effort made in the FLIP to document what is and isn't supported (and
where ??? remain).

On Wed, Jan 26, 2022 at 10:58 AM Yu Li  wrote:


+1 (binding)

Thanks for driving this Piotr! Just one more (belated) suggestion: in the
"Checkpoint vs savepoint guarantees" section, there are still question
marks scattered in the table, and I suggest putting all TODO works into the
"Limitations" section, or adding a "Future Work" section, for easier later
tracking.

Best Regards,
Yu


On Mon, 24 Jan 2022 at 18:48, Konstantin Knauf  wrote:


Thanks, Piotr. Proposal looks good.

+1 (binding)

On Mon, Jan 24, 2022 at 11:20 AM David Morávek  wrote:


+1 (non-binding)

Best,
D.

On Mon, Jan 24, 2022 at 10:54 AM Dawid Wysakowicz <

dwysakow...@apache.org>

wrote:


+1 (binding)

Best,

Dawid

On 24/01/2022 09:56, Piotr Nowojski wrote:

Hi,

As there seems to be no further questions about the FLIP-203 [1] I

would

propose to start a voting thread for it.

For me there are still two unanswered questions, whether we want to

support

schema evolution and State Processor API with native format

snapshots

or

not. But I would propose to tackle them as follow ups, since those

are

pre-existing issues of the native format checkpoints, and could be

done

completely independently of providing the native format support in
savepoints.

Best,
Piotrek

[1]


https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints


--

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk


Re: [DISCUSS] FLIP-211: Kerberos delegation token framework

2022-01-26 Thread Gabor Somogyi
Hi All,

Since it has turned out that the DTM can't be added as a member of JobMaster,
I've come up with a better proposal.
David, thanks for pointing this out; you've caught a bug in the early phase!

Namely, ResourceManager is a single-instance class where the DTM can be added as a
member variable. It holds a list of all already registered TMs, and new TM
registration also happens there.
To be more specific, the following can be added from a logic perspective:
* Create a new DTM instance in ResourceManager and start it (a recurring thread to
obtain new tokens)
* Add a new function named "updateDelegationTokens" to TaskExecutorGateway
* Call "updateDelegationTokens" on all registered TMs to propagate new DTs
* In case of a new TM registration, call "updateDelegationTokens" before the
registration succeeds to set up the new TM properly

This way:
* only a single DTM would live within a cluster, which is the expected behavior
* the DTM is added to a central place where all deployment targets can make use of it
* DTs are pushed to the TMs, which generates less network traffic than a pull-based
approach (please see my previous mail where I've described both approaches)
* the HA scenario is going to be consistent, because such a solution can be added to
"updateDelegationTokens"
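As a rough illustration of this push-based flow (hypothetical class and method names
modeled on the bullet points above, not Flink's actual RPC classes):

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

interface TaskExecutorGatewaySketch {
    // New RPC proposed above: push freshly obtained delegation tokens to a TaskManager.
    CompletableFuture<Void> updateDelegationTokens(byte[] serializedTokens);
}

class DelegationTokenManagerSketch {

    private final Map<String, TaskExecutorGatewaySketch> registeredTaskManagers =
            new ConcurrentHashMap<>();

    void registerTaskManager(String id, TaskExecutorGatewaySketch gateway, byte[] currentTokens) {
        // Push the current tokens before the registration succeeds, so the new TM is set up properly.
        gateway.updateDelegationTokens(currentTokens).join();
        registeredTaskManagers.put(id, gateway);
    }

    void onTokensObtained(byte[] newTokens) {
        // Called by the recurring renewal thread; if any TM fails, join() throws and the
        // renewal can be re-scheduled after security.kerberos.tokens.retry-wait.
        CompletableFuture
                .allOf(registeredTaskManagers.values().stream()
                        .map(gw -> gw.updateDelegationTokens(newTokens))
                        .toArray(CompletableFuture[]::new))
                .join();
    }
}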

@David or anyone else, please share whether you agree with this or have a better
idea/suggestion.

BR,
G


On Tue, Jan 25, 2022 at 11:00 AM Gabor Somogyi 
wrote:

> First of all thanks for investing your time and helping me out. As I see
> you have pretty solid knowledge in the RPC area.
> I would like to rely on your knowledge since I'm learning this part.
>
> > - Do we need to introduce a new RPC method or can we for example
> piggyback
> on heartbeats?
>
> I'm fine with either solution but one thing is important conceptually.
> There are fundamentally 2 ways how tokens can be updated:
> - Push way: When there are new DTs then JM JVM pushes DTs to TM JVMs. This
> is the preferred one since tiny amount of control logic needed.
> - Pull way: Each time a TM would like to poll JM whether there are new
> tokens and each TM wants to decide alone whether DTs needs to be updated or
> not.
> As you've mentioned here some ID needs to be generated, it would generated
> quite some additional network traffic which can be definitely avoided.
> As a final thought in Spark we've had this way of DT propagation logic and
> we've had major issues with it.
>
> So all in all DTM needs to obtain new tokens and there must a way to send
> this data to all TMs from JM.
>
> > - What delivery semantics are we looking for? (what if we're only able to
> update subset of TMs / what happens if we exhaust retries / should we even
> have the retry mechanism whatsoever) - I have a feeling that somehow
> leveraging the existing heartbeat mechanism could help to answer these
> questions
>
> Let's go through these questions one by one.
> > What delivery semantics are we looking for?
>
> DTM must receive an exception when at least one TM was not able to get DTs.
>
> > what if we're only able to update subset of TMs?
>
> Such case DTM will reschedule token obtain after
> "security.kerberos.tokens.retry-wait" time.
>
> > what happens if we exhaust retries?
>
> There is no number of retries. In default configuration tokens needs to be
> re-obtained after one day.
> DTM tries to obtain new tokens after 1day * 0.75
> (security.kerberos.tokens.renewal-ratio) = 18 hours.
> When fails it retries after "security.kerberos.tokens.retry-wait" which is
> 1 hour by default.
> If it never succeeds then authentication error is going to happen on the
> TM side and the workload is
> going to stop.
>
> > should we even have the retry mechanism whatsoever?
>
> Yes, because there are always temporary cluster issues.
>
> > What does it mean for the running application (how does this look like
> from
> the user perspective)? As far as I remember the logs are only collected
> ("aggregated") after the container is stopped, is that correct?
>
> With default config it works like that but it can be forced to aggregate
> at specific intervals.
> A useful feature is 

[jira] [Created] (FLINK-25831) ExecutionVertex.getLatestPriorAllocation fails if there is an unsuccessful restart attempt

2022-01-26 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-25831:
-

 Summary: ExecutionVertex.getLatestPriorAllocation fails if there 
is an unsuccessful restart attempt
 Key: FLINK-25831
 URL: https://issues.apache.org/jira/browse/FLINK-25831
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Coordination
Affects Versions: 1.14.3, 1.15.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann


The {{ExecutionVertex.getLatestPriorAllocation}} does not return the latest 
prior allocation if there was an unsuccessful restart attempt in between. The 
problem is that we only look at the last {{Execution}}. Due to this, 
{{ExecutionVertex.getLatestPriorAllocation}} sometimes returns {{null}} even 
though there is a prior {{AllocationID}}.
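For illustration only (hypothetical types, not the actual {{ExecutionVertex}} 
internals), the fix implied above amounts to walking the execution attempts from 
newest to oldest instead of inspecting only the last one:
{code:java}
import java.util.List;
import java.util.Optional;

public final class PriorAllocationLookupSketch {

    // Hypothetical stand-in for an Execution attempt and its assigned AllocationID.
    public static final class Attempt {
        final String allocationId; // null if the attempt never got a slot, e.g. a failed restart
        public Attempt(String allocationId) {
            this.allocationId = allocationId;
        }
    }

    // Scan backwards through all attempts instead of looking only at the latest one,
    // which may belong to an unsuccessful restart and therefore carry no allocation.
    public static Optional<String> getLatestPriorAllocation(List<Attempt> attemptsOldestFirst) {
        for (int i = attemptsOldestFirst.size() - 1; i >= 0; i--) {
            String allocationId = attemptsOldestFirst.get(i).allocationId;
            if (allocationId != null) {
                return Optional.of(allocationId);
            }
        }
        return Optional.empty();
    }
}
{code}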



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-01-26 Thread Alexander Fedulov
Hi Jacky,

Could you please clarify what kind of *problems* you experience with the
large parallelism? You referred to D3, is it something related to rendering
on the browser side or is it about the samples collection process? Were you
able to identify the bottleneck?

Fundamentally I have some concerns regarding the proposed approach:
1. Calling shell scripts triggered via the web UI is a security concern and
it needs to be evaluated carefully if it could introduce any unexpected
attack vectors (depending on the implementation, passed parameters etc.)
2. My understanding is that the async-profiler implementation is
system-dependent. How do you propose to handle multiple architectures?
Would you like to ship each available implementation within Flink? [1]
3. Do you plan to make use of full async-profiler features including native
calls sampling with perf_events? If so, the issue I see is that some
environments restrict ptrace calls by default [2]

[1] https://github.com/jvm-profiling-tools/async-profiler#download
[2]
https://kubernetes.io/docs/concepts/policy/pod-security-policy/#host-namespaces


Best,
Alexander Fedulov

On Wed, Jan 26, 2022 at 1:59 PM 李森  wrote:

> This is an expected feature, as we also experienced browser crashes on
> existing operator-level flame graphs
>
> Best,
> Echo Lee
>
> > > On Jan 24, 2022, at 6:16 PM, David Morávek wrote:
> >
> > Hi Jacky,
> >
> > The link seems to be broken, here is the correct one [1].
> >
> > [1]
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> >
> > Best,
> > D.
> >
> >> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <281293...@qq.com.invalid>
> wrote:
> >>
> >> Hi All,
> >>   I would like to start the discussion on FLIP-213 <
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> >> ;
> >> which aims to provide taskmanager level(process level) flame graph
> >> by async profiler, which is most popular tool in java performance. and
> the
> >> arthas and intellij both use it.
> >> And we support it in our ant group company.
> >>  AndFlink supports FLIP-165: Operator's Flame Graphs
> >> now. and it draw flame graph by thefront-end
> >> librariesd3-flame-graph, which has some problem in jobs
> >> oflarge of parallelism.
> >>  Please be aware that the FLIP wiki area is not fully done
> >> since i don't konw whether it will accept by flinkcommunity.
> >>  Feel free to add your thoughts to make this feature
> better! i
> >> am looking forward to all your response. Thanks too much!
> >>
> >>
> >>
> >>
> >> Best Jacky Lau
>


[jira] [Created] (FLINK-25830) Typo in the "Metric Reporters" page of the documentation

2022-01-26 Thread ChengKai Yang (Jira)
ChengKai Yang created FLINK-25830:
-

 Summary: Typo in the "Metric Reporters" page of the documentation
 Key: FLINK-25830
 URL: https://issues.apache.org/jira/browse/FLINK-25830
 Project: Flink
  Issue Type: Improvement
  Components: Documentation
Reporter: ChengKai Yang


Hi, I think I found a typo in the documentation:

[This page:"Report 
page"|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/metric_reporters/#reporter]
 .

In the YAML part:
{code:java}
```yaml
metrics.reporters: my_jmx_reporter,my_other_reporter

metrics.reporter.my_jmx_reporter.factory.class: 
org.apache.flink.metrics.jmx.JMXReporterFactory
metrics.reporter.my_jmx_reporter.port: 9020-9040
metrics.reporter.my_jmx_reporter.scope.variables.excludes:job_id;task_attempt_num
metrics.reporter.my_jmx_reporter.scope.variables.additional: 
cluster_name:my_test_cluster,tag_name:tag_value

metrics.reporter.my_other_reporter.class: 
org.apache.flink.metrics.graphite.GraphiteReporter
metrics.reporter.my_other_reporter.host: 192.168.1.1
metrics.reporter.my_other_reporter.port: 1
```
{code}
This property: 
metrics.reporter.my_jmx_reporter.scope.variables.excludes:job_id;task_attempt_num

If the user copies this property to conf/flink-conf.yaml, they will get 
a warning in the log like this:
{code:java}
2022-01-26 22:45:31,642 WARN  
org.apache.flink.configuration.GlobalConfiguration           [] - Error while 
trying to split key and value in configuration file 
/opt/module/flink-1.14.2/conf/flink-conf.yaml:277: 
"metrics.reporter.my_jmx_reporter.scope.variables.excludes:job_id;task_attempt_num"
 {code}
So there should be a space between 
"metrics.reporter.my_jmx_reporter.scope.variables.excludes:" and 
"job_id;task_attempt_num", as shown below.

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25829) Do not deploy empty statefun-sdk-js jar

2022-01-26 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-25829:
-

 Summary: Do not deploy empty statefun-sdk-js jar
 Key: FLINK-25829
 URL: https://issues.apache.org/jira/browse/FLINK-25829
 Project: Flink
  Issue Type: Improvement
  Components: Stateful Functions
Affects Versions: statefun-3.2.0, statefun-3.3.0
Reporter: Till Rohrmann
Assignee: Till Rohrmann


The {{statefun-sdk-js}} module does not produce a meaningful jar as output. 
Therefore, we can skip deploying it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25828) build_stateful_functions.sh can fail if target contains source or javadoc jar

2022-01-26 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-25828:
-

 Summary: build_stateful_functions.sh can fail if target contains 
source or javadoc jar
 Key: FLINK-25828
 URL: https://issues.apache.org/jira/browse/FLINK-25828
 Project: Flink
  Issue Type: New Feature
  Components: Stateful Functions
Affects Versions: statefun-3.2.0, statefun-3.3.0
Reporter: Till Rohrmann


{{build_stateful_functions.sh}} can fail if the target directory contains a source or 
javadoc jar. I suggest excluding these files from the search.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25826) Handle symbols at a central place with serializable format

2022-01-26 Thread Timo Walther (Jira)
Timo Walther created FLINK-25826:


 Summary: Handle symbols at a central place with serializable format
 Key: FLINK-25826
 URL: https://issues.apache.org/jira/browse/FLINK-25826
 Project: Flink
  Issue Type: Sub-task
  Components: Table SQL / Planner
Reporter: Timo Walther
Assignee: Timo Walther


Symbols are quite messy in the code base. We should unify all locations and 
define a serializable format for the JSON plan.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25825) MySqlCatalogITCase fails on azure

2022-01-26 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-25825:
-

 Summary: MySqlCatalogITCase fails on azure
 Key: FLINK-25825
 URL: https://issues.apache.org/jira/browse/FLINK-25825
 Project: Flink
  Issue Type: Bug
  Components: Connectors / JDBC
Affects Versions: 1.15.0
Reporter: Roman Khachatryan
 Fix For: 1.15.0


https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30189=logs=e9af9cde-9a65-5281-a58e-2c8511d36983=c520d2c3-4d17-51f1-813b-4b0b74a0c307=13677
 
{code}
2022-01-26T06:04:42.8019913Z Jan 26 06:04:42 [ERROR] 
org.apache.flink.connector.jdbc.catalog.MySqlCatalogITCase.testFullPath  Time 
elapsed: 2.166 *s  <<< FAILURE!
2022-01-26T06:04:42.8025522Z Jan 26 06:04:42 java.lang.AssertionError: 
expected: java.util.ArrayList<[+I[1, -1, 1, null, true, null, hello, 2021-0 
8-04, 2021-08-04T01:54:16, -1, 1, -1.0, 1.0, enum2, -9.1, 9.1, -1, 1, -1, 1, 
\{"k1": "v1"}, null, col_longtext, null, -1, 1, col_mediumtext, -99, 9 9, -1.0, 
1.0, set_ele1, -1, 1, col_text, 10:32:34, 2021-08-04T01:54:16, col_tinytext, 
-1, 1, null, col_varchar, 2021-08-04T01:54:16.463, 09:33:43,  
2021-08-04T01:54:16.463, null], +I[2, -1, 1, null, true, null, hello, 
2021-08-04, 2021-08-04T01:53:19, -1, 1, -1.0, 1.0, enum2, -9.1, 9.1, -1, 1,  
-1, 1, \{"k1": "v1"}, null, col_longtext, null, -1, 1, col_mediumtext, -99, 99, 
-1.0, 1.0, set_ele1,set_ele12, -1, 1, col_text, 10:32:34, 2021-08- 04T01:53:19, 
col_tinytext, -1, 1, null, col_varchar, 2021-08-04T01:53:19.098, 09:33:43, 
2021-08-04T01:53:19.098, null]]> but was: java.util.ArrayL ist<[+I[1, -1, 1, 
null, true, null, hello, 2021-08-04, 2021-08-04T01:54:16, -1, 1, -1.0, 1.0, 
enum2, -9.1, 9.1, -1, 1, -1, 1, \{"k1": "v1"}, null,  col_longtext, null, -1, 
1, col_mediumtext, -99, 99, -1.0, 1.0, set_ele1, -1, 1, col_text, 10:32:34, 
2021-08-04T01:54:16, col_tinytext, -1, 1, null , col_varchar, 
2021-08-04T01:54:16.463, 09:33:43, 2021-08-04T01:54:16.463, null], +I[2, -1, 1, 
null, true, null, hello, 2021-08-04, 2021-08-04T01: 53:19, -1, 1, -1.0, 1.0, 
enum2, -9.1, 9.1, -1, 1, -1, 1, \{"k1": "v1"}, null, col_longtext, null, -1, 1, 
col_mediumtext, -99, 99, -1.0, 1.0, set_el e1,set_ele12, -1, 1, col_text, 
10:32:34, 2021-08-04T01:53:19, col_tinytext, -1, 1, null, col_varchar, 
2021-08-04T01:53:19.098, 09:33:43, 2021-08-0 4T01:53:19.098, null]]>
2022-01-26T06:04:42.8029336Z Jan 26 06:04:42    at 
org.junit.Assert.fail(Assert.java:89)
2022-01-26T06:04:42.8029824Z Jan 26 06:04:42    at 
org.junit.Assert.failNotEquals(Assert.java:835)
2022-01-26T06:04:42.8030319Z Jan 26 06:04:42    at 
org.junit.Assert.assertEquals(Assert.java:120)
2022-01-26T06:04:42.8030815Z Jan 26 06:04:42    at 
org.junit.Assert.assertEquals(Assert.java:146)
2022-01-26T06:04:42.8031419Z Jan 26 06:04:42    at 
org.apache.flink.connector.jdbc.catalog.MySqlCatalogITCase.testFullPath(MySqlCatalogITCase.java*:306)
{code}
 
{code}
2022-01-26T06:04:43.2899378Z Jan 26 06:04:43 [ERROR] Failures:
2022-01-26T06:04:43.2907942Z Jan 26 06:04:43 [ERROR]   
MySqlCatalogITCase.testFullPath:306 expected: java.util.ArrayList<[+I[1, -1, 1, 
null, true,
2022-01-26T06:04:43.2914065Z Jan 26 06:04:43 [ERROR]   
MySqlCatalogITCase.testGetTable:253 expected:<(
2022-01-26T06:04:43.2983567Z Jan 26 06:04:43 [ERROR]   
MySqlCatalogITCase.testSelectToInsert:323 expected: java.util.ArrayList<[+I[1, 
-1, 1, null,
2022-01-26T06:04:43.2997373Z Jan 26 06:04:43 [ERROR]   
MySqlCatalogITCase.testWithoutCatalog:291 expected: java.util.ArrayList<[+I[1, 
-1, 1, null,
2022-01-26T06:04:43.3010450Z Jan 26 06:04:43 [ERROR]   
MySqlCatalogITCase.testWithoutCatalogDB:278 expected: 
java.util.ArrayList<[+I[1, -1, 1, nul
{code}
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25824) E2e test phase fails on AZP after "Unable to locate package moreutils"

2022-01-26 Thread Roman Khachatryan (Jira)
Roman Khachatryan created FLINK-25824:
-

 Summary: E2e test phase fails on AZP after "Unable to locate 
package moreutils"
 Key: FLINK-25824
 URL: https://issues.apache.org/jira/browse/FLINK-25824
 Project: Flink
  Issue Type: Bug
  Components: Test Infrastructure
Reporter: Roman Khachatryan
 Fix For: 1.15.0


[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30209=logs=bea52777-eaf8-5663-8482-18fbc3630e81=b2642e3a-5b86-574d-4c8a-f7e2842bfb14=17]

[https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30209=logs=af184cdd-c6d8-5084-0b69-7e9c67b35f7a=160c9ae5-96fd-516e-1c91-deb81f59292a=17]

 
{code:java}
E: Unable to locate package moreutils
Running command 'flink-end-to-end-tests/run-nightly-tests.sh 1' with a timeout 
of 287 minutes.
./tools/azure-pipelines/uploading_watchdog.sh: line 76: ts: command not found
/home/vsts/work/1/s/flink-end-to-end-tests/../tools/ci/maven-utils.sh: line 96: 
NPM_PROXY_PROFILE_ACTIVATION: command not found
The STDIO streams did not close within 10 seconds of the exit event from 
process '/usr/bin/bash'. This may indicate a child process inherited the STDIO 
streams and has not yet exited.
##[error]Bash exited with code '141'. {code}
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25823) Remove Mesos from Flink Landing page

2022-01-26 Thread Matthias Pohl (Jira)
Matthias Pohl created FLINK-25823:
-

 Summary: Remove Mesos from Flink Landing page
 Key: FLINK-25823
 URL: https://issues.apache.org/jira/browse/FLINK-25823
 Project: Flink
  Issue Type: Bug
  Components: Documentation
Reporter: Matthias Pohl


[Flink's landing page|https://flink.apache.org/] overview image still includes 
Mesos



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] FLIP-213: TaskManager's Flame Graphs

2022-01-26 Thread 李森
This is a feature we have been expecting, as we have also experienced browser crashes 
with the existing operator-level flame graphs.

Best,
Echo Lee

> On Jan 24, 2022, at 6:16 PM, David Morávek wrote:
> 
> Hi Jacky,
> 
> The link seems to be broken, here is the correct one [1].
> 
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
> 
> Best,
> D.
> 
>> On Mon, Jan 24, 2022 at 9:48 AM Jacky Lau <281293...@qq.com.invalid> wrote:
>> 
>> Hi All,
>>   I would like to start the discussion on FLIP-213 <
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-213%3A+TaskManager%27s+Flame+Graphs
>> ;
>> which aims to provide taskmanager level(process level) flame graph
>> by async profiler, which is most popular tool in java performance. and the
>> arthas and intellij both use it.
>> And we support it in our ant group company.
>>  AndFlink supports FLIP-165: Operator's Flame Graphs
>> now. and it draw flame graph by thefront-end
>> librariesd3-flame-graph, which has some problem in jobs
>> oflarge of parallelism.
>>  Please be aware that the FLIP wiki area is not fully done
>> since i don't konw whether it will accept by flinkcommunity.
>>  Feel free to add your thoughts to make this feature better! i
>> am looking forward to all your response. Thanks too much!
>> 
>> 
>> 
>> 
>> Best Jacky Lau


Re: [VOTE] Apache Flink Stateful Functions 3.2.0, release candidate #1

2022-01-26 Thread Till Rohrmann
Hi everyone,

a quick update on the vote:

The correct link for the artifacts at the Apache Nexus repository is
https://repository.apache.org/content/repositories/orgapacheflink-1485/.

Moreover, there is now also a tag for the GoLang SDK:
https://github.com/apache/flink-statefun/tree/statefun-sdk-go/v3.2.0-rc1.

Cheers,
Till

On Tue, Jan 25, 2022 at 10:49 PM Till Rohrmann  wrote:

> Hi everyone,
>
> Please review and vote on the release candidate #1 for the version 3.2.0
> of Apache Flink Stateful Functions, as follows:
>
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> **Release Overview**
>
> As an overview, the release consists of the following:
> a) Stateful Functions canonical source distribution, to be deployed to the
> release repository at dist.apache.org
> b) Stateful Functions Python SDK distributions to be deployed to PyPI
> c) Maven artifacts to be deployed to the Maven Central Repository
> d) New Dockerfiles for the release
> e) GoLang SDK (contained in the repository)
> f) JavaScript SDK (contained in the repository; will be uploaded to npm
> after the release)
>
> **Staging Areas to Review**
>
> The staging areas containing the above mentioned artifacts are as follows,
> for your review:
> * All artifacts for a) and b) can be found in the corresponding dev
> repository at dist.apache.org [2]
> * All artifacts for c) can be found at the Apache Nexus Repository [3]
>
> All artifacts are signed with the key
> B9499FA69EFF5DEEEBC3C1F5BA7E4187C6F73D82 [4]
>
> Other links for your review:
> * JIRA release notes [5]
> * source code tag "release-3.2.0-rc1" [6]
> * PR for the new Dockerfiles [7]
> * PR to update the website Downloads page to include Stateful Functions
> links [8]
> * GoLang SDK [9]
> * JavaScript SDK [10]
>
> **Vote Duration**
>
> The voting time will run for at least 72 hours.
> It is adopted by majority approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Till
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/Verifying+a+Flink+Stateful+Functions+Release
> [2] https://dist.apache.org/repos/dist/dev/flink/flink-statefun-3.2.0-rc1/
> [3]
> https://repository.apache.org/content/repositories/orgapacheflink-1483/
> [4] https://dist.apache.org/repos/dist/release/flink/KEYS
> [5]
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12350540
> [6] https://github.com/apache/flink-statefun/tree/release-3.2.0-rc1
> [7] https://github.com/apache/flink-statefun-docker/pull/19
> [8] https://github.com/apache/flink-web/pull/501
> [9]
> https://github.com/apache/flink-statefun/tree/release-3.2.0-rc1/statefun-sdk-go
> [10]
> https://github.com/apache/flink-statefun/tree/release-3.2.0-rc1/statefun-sdk-js
>


[jira] [Created] (FLINK-25822) Migrate away from using Source- and SinkFunctions for internal tests

2022-01-26 Thread Alexander Fedulov (Jira)
Alexander Fedulov created FLINK-25822:
-

 Summary: Migrate away from using Source- and SinkFunctions for 
internal tests
 Key: FLINK-25822
 URL: https://issues.apache.org/jira/browse/FLINK-25822
 Project: Flink
  Issue Type: Improvement
  Components: API / DataStream, Table SQL / API, Tests
Reporter: Alexander Fedulov


The Source- and SinkFunction interfaces are going to be deprecated with the 
adoption of the new Source and Sink interfaces. Currently, the 
Source-/SinkFunction interfaces are used as the main testing facilities for internal 
purposes. New utilities based on the new interfaces have to be developed and the 
existing tests migrated to them.


Example: SinkFunctionProvider used in TestValuesTableFactory



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

2022-01-26 Thread Yang Wang
Hi Biao,

# 1 Flink Native vs Standalone integration
I think we have got a trend in this discussion[1] that the newly introduced
Flink K8s operator will start with native K8s integration first.
Do you have some concerns about this?

# 2 K8S StatefulSet v.s. K8S Deployment
IIUC, the FlinkDeployment is just a custom resource name. It does not mean
that we need to create a corresponding K8s deployment for JobManager or
TaskManager.
If we are using native K8s integration, the JobManager is started with K8s
deployment while TaskManagers are naked pods managed by
FlinkResourceManager.

Actually, I think "FlinkDeployment" is easier to understand than
"FlinkStatefulSet" :)


[1]. https://lists.apache.org/thread/l1dkp8v4bhlcyb4tdts99g7w4wdglfy4


Best,
Yang

Biao Geng wrote on Wed, Jan 26, 2022 at 18:00:

> Hi Thomas,
> Thanks a lot for the great efforts in this well-organized FLIP! After
> reading the FLIP carefully, I think Yang has given some great feedback and
> I just want to share some of my concerns:
> # 1 Flink Native vs Standalone integration
> I believe it is reasonable to support both modes in the long run but in the
> FLIP and previous thread[1], it seems that we have not made a decision on
> which one to implement initially. The FLIP mentioned "Maybe start with
> support for Flink Native" for reusing codes in [2]. Is it the selected one
> finally?
> # 2 K8S StatefulSet v.s. K8S Deployment
> In the CR Example, I notice that the kind we use is FlinkDeployment. I
> would like to check if we have made the decision to use K8S Deployment
> workload resource. As the name implies, StatefulSet is for stateful apps
> while Deployment is usually for stateless apps. I think it is worthwhile to
> consider the choice more carefully due to some user case in gcp
> operator[3], which may influence our other design choices(like the Flink
> application deletion strategy).
>
> Again, thanks for the work and I believe this FLIP is pretty useful for
> many customers and I hope I can make some contributions to this FLIP impl!
>
> Best regard,
> Biao Geng
>
> [1] https://lists.apache.org/thread/l1dkp8v4bhlcyb4tdts99g7w4wdglfy4
> [2] https://github.com/wangyang0918/flink-native-k8s-operator
> [3] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/354
>
> Yang Wang wrote on Wed, Jan 26, 2022 at 15:25:
>
> > Thanks Thomas for creating FLIP-212 to introduce the Flink Kubernetes
> > Operator.
> >
> > The proposal looks already very good to me and has integrated all the
> input
> > in the previous discussion(e.g. native K8s VS standalone, Go VS java).
> >
> > I read the FLIP carefully and have some questions that need to be
> > clarified.
> >
> > # How do we run a Flink job from a CR?
> > 1. Start a session cluster and then followed by submitting the Flink job
> > via rest API
> > 2. Start a Flink application cluster which bundles one or more Flink jobs
> > It is not clear enough to me which way we will choose. It seems that the
> > existing google/lyft K8s operator is using #1. But I lean to #2 in the
> new
> > introduced K8s operator.
> > If #2 is the case, how could we get the job status when it finished or
> > failed? Maybe FLINK-24113[1] and FLINK-25715[2] could help. Or we may
> need
> > to enable the Flink history server[3].
> >
> >
> > # ApplicationDeployer Interface or "flink run-application" /
> > "kubernetes-session.sh"
> > How do we start the Flink application or session cluster?
> > It will be great if we have the public and stable interfaces for
> deployment
> > in Flink. But currently we only have an internal interface
> > *ApplicationDeployer* to deploy the application cluster and
> > no interfaces for deploying session cluster.
> > Of cause, we could also use the CLI command for submission. However, it
> > will have poor performance when launching multiple applications.
> >
> >
> > # Pod Template
> > Is the pod template in the CR the same as what Flink already supports[4]?
> > If so, I am afraid that not every arbitrary field (e.g. CPU/memory
> > resources) could take effect.
> >
> >
> > [1]. https://issues.apache.org/jira/browse/FLINK-24113
> > [2]. https://issues.apache.org/jira/browse/FLINK-25715
> > [3].
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/advanced/historyserver/
> > [4].
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/native_kubernetes/#pod-template
> >
> >
> >
> > Best,
> > Yang
> >
> >
> > Thomas Weise wrote on Tue, Jan 25, 2022 at 13:08:
> >
> > > Hi,
> > >
> > > As promised in [1] we would like to start the discussion on the
> > > addition of a Kubernetes operator to the Flink project as FLIP-212:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > >
> > > Please note that the FLIP is currently focussed on the overall
> > > direction; the intention is to fill in more details once we converge
> > > on the high level plan.
> > >
> > > Thanks and looking forward to a lively discussion!

[jira] [Created] (FLINK-25821) Add the doc of execution mode in PyFlink

2022-01-26 Thread Huang Xingbo (Jira)
Huang Xingbo created FLINK-25821:


 Summary: Add the doc of execution mode in PyFlink
 Key: FLINK-25821
 URL: https://issues.apache.org/jira/browse/FLINK-25821
 Project: Flink
  Issue Type: Sub-task
  Components: API / Python, Documentation
Affects Versions: 1.15.0
Reporter: Huang Xingbo
Assignee: Huang Xingbo
 Fix For: 1.15.0






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [VOTE] FLIP-203: Incremental savepoints

2022-01-26 Thread David Anderson
+1 (non-binding)

I'm pleased to see this significant improvement coming along, as well as
the effort made in the FLIP to document what is and isn't supported (and
where ??? remain).

On Wed, Jan 26, 2022 at 10:58 AM Yu Li  wrote:

> +1 (binding)
>
> Thanks for driving this Piotr! Just one more (belated) suggestion: in the
> "Checkpoint vs savepoint guarantees" section, there are still question
> marks scattered in the table, and I suggest putting all TODO items into the
> "Limitations" section, or adding a "Future Work" section, for easier later
> tracking.
>
> Best Regards,
> Yu
>
>
> On Mon, 24 Jan 2022 at 18:48, Konstantin Knauf  wrote:
>
> > Thanks, Piotr. Proposal looks good.
> >
> > +1 (binding)
> >
> > On Mon, Jan 24, 2022 at 11:20 AM David Morávek  wrote:
> >
> > > +1 (non-binding)
> > >
> > > Best,
> > > D.
> > >
> > > On Mon, Jan 24, 2022 at 10:54 AM Dawid Wysakowicz <
> > dwysakow...@apache.org>
> > > wrote:
> > >
> > > > +1 (binding)
> > > >
> > > > Best,
> > > >
> > > > Dawid
> > > >
> > > > On 24/01/2022 09:56, Piotr Nowojski wrote:
> > > > > Hi,
> > > > >
> > > > > As there seem to be no further questions about FLIP-203 [1], I
> > > would
> > > > > propose to start a voting thread for it.
> > > > >
> > > > > For me there are still two unanswered questions, whether we want to
> > > > support
> > > > > schema evolution and State Processor API with native format
> snapshots
> > > or
> > > > > not. But I would propose to tackle them as follow ups, since those
> > are
> > > > > pre-existing issues of the native format checkpoints, and could be
> > done
> > > > > completely independently of providing the native format support in
> > > > > savepoints.
> > > > >
> > > > > Best,
> > > > > Piotrek
> > > > >
> > > > > [1]
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints
> > > > >
> > > >
> > >
> >
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
> >
>


[jira] [Created] (FLINK-25820) Introduce Table Store Flink Source

2022-01-26 Thread Jingsong Lee (Jira)
Jingsong Lee created FLINK-25820:


 Summary: Introduce Table Store Flink Source
 Key: FLINK-25820
 URL: https://issues.apache.org/jira/browse/FLINK-25820
 Project: Flink
  Issue Type: Sub-task
  Components: Table Store
Reporter: Jingsong Lee
Assignee: Jingsong Lee
 Fix For: table-store-0.1.0


Introduce FLIP-27 source implementation for table file store.
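
As an editorial illustration of what a "FLIP-27 source implementation" entails, here
is a hypothetical skeleton only: the class and split names are made up, while the
method signatures follow the public org.apache.flink.api.connector.source.Source
interface. The actual table store source will of course differ.

{code:java}
import org.apache.flink.api.connector.source.Boundedness;
import org.apache.flink.api.connector.source.Source;
import org.apache.flink.api.connector.source.SourceReader;
import org.apache.flink.api.connector.source.SourceReaderContext;
import org.apache.flink.api.connector.source.SourceSplit;
import org.apache.flink.api.connector.source.SplitEnumerator;
import org.apache.flink.api.connector.source.SplitEnumeratorContext;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.table.data.RowData;

public class FileStoreSource implements Source<RowData, FileStoreSource.FileStoreSplit, Void> {

    /** Hypothetical split: one file (or file range) of the table store. */
    public static class FileStoreSplit implements SourceSplit {
        private final String path;

        public FileStoreSplit(String path) {
            this.path = path;
        }

        @Override
        public String splitId() {
            return path;
        }
    }

    @Override
    public Boundedness getBoundedness() {
        // A first version could read a static snapshot of the file store.
        return Boundedness.BOUNDED;
    }

    @Override
    public SourceReader<RowData, FileStoreSplit> createReader(SourceReaderContext context) {
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public SplitEnumerator<FileStoreSplit, Void> createEnumerator(
            SplitEnumeratorContext<FileStoreSplit> context) {
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public SplitEnumerator<FileStoreSplit, Void> restoreEnumerator(
            SplitEnumeratorContext<FileStoreSplit> context, Void checkpoint) {
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public SimpleVersionedSerializer<FileStoreSplit> getSplitSerializer() {
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public SimpleVersionedSerializer<Void> getEnumeratorCheckpointSerializer() {
        throw new UnsupportedOperationException("sketch only");
    }
}
{code}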



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: Looking for Maintainers for Flink on YARN

2022-01-26 Thread Biao Geng
Hi Konstantin,
I am willing to be one of the maintainers of Flink on YARN. I have some
relevant experience in maintaining Flink on YARN clusters at Alibaba and I hope
I can make contributions to some of the cases in your list.

Best,
Biao Geng

Konstantin Knauf wrote on Wed, Jan 26, 2022 at 17:17:

> Hi everyone,
>
> We are seeing an increasing number of test instabilities related to YARN
> [1]. Does someone in this group have the time to pick these up? The Flink
> Confluence contains a guide on how to triage test instability tickets [2].
>
> Thanks,
>
> Konstantin
>
> [1]
>
> https://issues.apache.org/jira/browse/FLINK-25514?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20%3D%20%22Deployment%20%2F%20YARN%22%20AND%20labels%20%3D%20test-stability
> [2]
>
> https://cwiki.apache.org/confluence/display/FLINK/Triage+Test+Instability+Tickets
>
> On Mon, Sep 13, 2021 at 2:22 PM 柳尘  wrote:
>
> > Thanks to Konstantin for raising this question, and to Marton and Gabor
> > for strengthening the effort!
> >
> > If I can help, please let me know, so that I can better participate in
> > the work.
> >
> > Best,
> > cheng xingyuan
> >
> >
> > > On Jul 29, 2021 at 4:15 PM, Konstantin Knauf wrote:
> > >
> > > Dear community,
> > >
> > > We are looking for community members, who would like to maintain
> Flink's
> > > YARN support going forward. So far, this has been handled by teams at
> > > Ververica & Alibaba. The focus of these teams has shifted over the past
> > > months so that we only have little time left for this topic. Still, we
> > > think, it is important to maintain high quality support for Flink on
> > YARN.
> > >
> > > What does "Maintaining Flink on YARN" mean? There are no known bigger
> > > efforts outstanding. We are mainly talking about addressing
> > > "test-stability" issues, bugs, version upgrades, community
> contributions
> > &
> > > smaller feature requests. The prioritization of these would be up to
> the
> > > future maintainers, except "test-stability" issues which are important
> to
> > > address for overall productivity.
> > >
> > > If a group of community members forms itself, we are happy to give an
> > > introduction to relevant pieces of the code base, principles,
> > assumptions,
> > > ... and hand over open threads.
> > >
> > > If you would like to take on this responsibility or can join this
> effort
> > in
> > > a supporting role, please reach out!
> > >
> > > Cheers,
> > >
> > > Konstantin
> > > for the Deployment & Coordination Team at Ververica
> > >
> > > --
> > >
> > > Konstantin Knauf
> > >
> > > https://twitter.com/snntrable
> > >
> > > https://github.com/knaufk
> >
> >
>
> --
>
> Konstantin Knauf | Head of Product
>
> +49 160 91394525
>
>
> Follow us @VervericaData Ververica 
>
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Karl Anton Wehner, Holger Temme, Yip Park Tung Jason,
> Jinwei (Kevin) Zhang
>


[jira] [Created] (FLINK-25819) NetworkBufferPoolTest.testIsAvailableOrNotAfterRequestAndRecycleMultiSegments fails on AZP

2022-01-26 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-25819:
-

 Summary: 
NetworkBufferPoolTest.testIsAvailableOrNotAfterRequestAndRecycleMultiSegments 
fails on AZP
 Key: FLINK-25819
 URL: https://issues.apache.org/jira/browse/FLINK-25819
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Network
Affects Versions: 1.14.3
Reporter: Till Rohrmann


The 
{{NetworkBufferPoolTest.testIsAvailableOrNotAfterRequestAndRecycleMultiSegments}}
 fails on AZP with:

{code}
Jan 26 07:57:03 [ERROR] testIsAvailableOrNotAfterRequestAndRecycleMultiSegments 
 Time elapsed: 10.028 s  <<< ERROR!
Jan 26 07:57:03 org.junit.runners.model.TestTimedOutException: test timed out 
after 10 seconds
Jan 26 07:57:03 at sun.misc.Unsafe.park(Native Method)
Jan 26 07:57:03 at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
Jan 26 07:57:03 at 
java.util.concurrent.FutureTask.awaitDone(FutureTask.java:426)
Jan 26 07:57:03 at 
java.util.concurrent.FutureTask.get(FutureTask.java:204)
Jan 26 07:57:03 at 
org.junit.internal.runners.statements.FailOnTimeout.getResult(FailOnTimeout.java:167)
Jan 26 07:57:03 at 
org.junit.internal.runners.statements.FailOnTimeout.evaluate(FailOnTimeout.java:128)
Jan 26 07:57:03 at 
org.junit.rules.ExpectedException$ExpectedExceptionStatement.evaluate(ExpectedException.java:258)
Jan 26 07:57:03 at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
Jan 26 07:57:03 at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
Jan 26 07:57:03 at 
java.util.concurrent.FutureTask.run(FutureTask.java:266)
Jan 26 07:57:03 at java.lang.Thread.run(Thread.java:748)
{code}

https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=30187=logs=4d4a0d10-fca2-5507-8eed-c07f0bdf4887=7b25afdf-cc6c-566f-5459-359dc2585798=7350



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

2022-01-26 Thread Biao Geng
Hi Thomas,
Thanks a lot for the great efforts in this well-organized FLIP! After
reading the FLIP carefully, I think Yang has given some great feedback and
I just want to share some of my concerns:
# 1 Flink Native vs Standalone integration
I believe it is reasonable to support both modes in the long run but in the
FLIP and previous thread[1], it seems that we have not made a decision on
which one to implement initially. The FLIP mentioned "Maybe start with
support for Flink Native" for reusing code in [2]. Is it the one that was
finally selected?
# 2 K8S StatefulSet v.s. K8S Deployment
In the CR Example, I notice that the kind we use is FlinkDeployment. I
would like to check if we have made the decision to use K8S Deployment
workload resource. As the name implies, StatefulSet is for stateful apps
while Deployment is usually for stateless apps. I think it is worthwhile to
consider the choice more carefully due to some use cases in the gcp
operator[3], which may influence our other design choices (like the Flink
application deletion strategy).

Again, thanks for the work and I believe this FLIP is pretty useful for
many customers and I hope I can make some contributions to this FLIP impl!

Best regard,
Biao Geng

[1] https://lists.apache.org/thread/l1dkp8v4bhlcyb4tdts99g7w4wdglfy4
[2] https://github.com/wangyang0918/flink-native-k8s-operator
[3] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator/pull/354

Yang Wang wrote on Wed, Jan 26, 2022 at 15:25:

> Thanks Thomas for creating FLIP-212 to introduce the Flink Kubernetes
> Operator.
>
> The proposal already looks very good to me and has integrated all the input
> in the previous discussion (e.g. native K8s vs. standalone, Go vs. Java).
>
> I read the FLIP carefully and have some questions that need to be
> clarified.
>
> # How do we run a Flink job from a CR?
> 1. Start a session cluster and then submit the Flink job
> via the REST API
> 2. Start a Flink application cluster which bundles one or more Flink jobs
> It is not clear enough to me which way we will choose. It seems that the
> existing google/lyft K8s operator is using #1. But I lean towards #2 in the
> newly introduced K8s operator.
> If #2 is the case, how could we get the job status when it finishes or
> fails? Maybe FLINK-24113[1] and FLINK-25715[2] could help. Or we may need
> to enable the Flink history server[3].
>
>
> # ApplicationDeployer Interface or "flink run-application" /
> "kubernetes-session.sh"
> How do we start the Flink application or session cluster?
> It will be great if we have the public and stable interfaces for deployment
> in Flink. But currently we only have an internal interface
> *ApplicationDeployer* to deploy the application cluster and
> no interfaces for deploying a session cluster.
> Of course, we could also use the CLI command for submission. However, it
> will have poor performance when launching multiple applications.
>
>
> # Pod Template
> Is the pod template in the CR the same as what Flink already supports[4]?
> If so, I am afraid that not every arbitrary field (e.g. CPU/memory
> resources) could take effect.
>
>
> [1]. https://issues.apache.org/jira/browse/FLINK-24113
> [2]. https://issues.apache.org/jira/browse/FLINK-25715
> [3].
>
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/advanced/historyserver/
> [4].
>
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/native_kubernetes/#pod-template
>
>
>
> Best,
> Yang
>
>
> Thomas Weise wrote on Tue, Jan 25, 2022 at 13:08:
>
> > Hi,
> >
> > As promised in [1] we would like to start the discussion on the
> > addition of a Kubernetes operator to the Flink project as FLIP-212:
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> >
> > Please note that the FLIP is currently focussed on the overall
> > direction; the intention is to fill in more details once we converge
> > on the high level plan.
> >
> > Thanks and looking forward to a lively discussion!
> >
> > Thomas
> >
> > [1] https://lists.apache.org/thread/l1dkp8v4bhlcyb4tdts99g7w4wdglfy4
> >
>


Re: [VOTE] FLIP-203: Incremental savepoints

2022-01-26 Thread Yu Li
+1 (binding)

Thanks for driving this Piotr! Just one more (belated) suggestion: in the
"Checkpoint vs savepoint guarantees" section, there are still question
marks scattered in the table, and I suggest putting all TODO items into the
"Limitations" section, or adding a "Future Work" section, for easier later
tracking.

Best Regards,
Yu


On Mon, 24 Jan 2022 at 18:48, Konstantin Knauf  wrote:

> Thanks, Piotr. Proposal looks good.
>
> +1 (binding)
>
> On Mon, Jan 24, 2022 at 11:20 AM David Morávek  wrote:
>
> > +1 (non-binding)
> >
> > Best,
> > D.
> >
> > On Mon, Jan 24, 2022 at 10:54 AM Dawid Wysakowicz <
> dwysakow...@apache.org>
> > wrote:
> >
> > > +1 (binding)
> > >
> > > Best,
> > >
> > > Dawid
> > >
> > > On 24/01/2022 09:56, Piotr Nowojski wrote:
> > > > Hi,
> > > >
> > > > As there seem to be no further questions about FLIP-203 [1], I
> > would
> > > > propose to start a voting thread for it.
> > > >
> > > > For me there are still two unanswered questions, whether we want to
> > > support
> > > > schema evolution and State Processor API with native format snapshots
> > or
> > > > not. But I would propose to tackle them as follow ups, since those
> are
> > > > pre-existing issues of the native format checkpoints, and could be
> done
> > > > completely independently of providing the native format support in
> > > > savepoints.
> > > >
> > > > Best,
> > > > Piotrek
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-203%3A+Incremental+savepoints
> > > >
> > >
> >
>
>
> --
>
> Konstantin Knauf
>
> https://twitter.com/snntrable
>
> https://github.com/knaufk
>


Re: Looking for Maintainers for Flink on YARN

2022-01-26 Thread Konstantin Knauf
The link to the filter is not correct. Here is the correct link:
https://issues.apache.org/jira/issues/?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20%3D%20%22Deployment%20%2F%20YARN%22%20AND%20labels%20%3D%20test-stability

On Wed, Jan 26, 2022 at 10:17 AM Konstantin Knauf 
wrote:

> Hi everyone,
>
> We are seeing an increasing number of test instabilities related to YARN
> [1]. Does someone in this group have the time to pick these up? The Flink
> Confluence contains a guide on how to triage test instability tickets [2].
>
> Thanks,
>
> Konstantin
>
> [1]
> https://issues.apache.org/jira/browse/FLINK-25514?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20%3D%20%22Deployment%20%2F%20YARN%22%20AND%20labels%20%3D%20test-stability
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/Triage+Test+Instability+Tickets
>
> On Mon, Sep 13, 2021 at 2:22 PM 柳尘  wrote:
>
>> Thanks to Konstantin for raising this question, and to Marton and Gabor
>> for strengthening the effort!
>>
>> If I can help, please let me know, so that I can better participate in
>> the work.
>>
>> Best,
>> cheng xingyuan
>>
>>
>> > On Jul 29, 2021 at 4:15 PM, Konstantin Knauf wrote:
>> >
>> > Dear community,
>> >
>> > We are looking for community members, who would like to maintain Flink's
>> > YARN support going forward. So far, this has been handled by teams at
>> > Ververica & Alibaba. The focus of these teams has shifted over the past
>> > months so that we only have little time left for this topic. Still, we
>> > think, it is important to maintain high quality support for Flink on
>> YARN.
>> >
>> > What does "Maintaining Flink on YARN" mean? There are no known bigger
>> > efforts outstanding. We are mainly talking about addressing
>> > "test-stability" issues, bugs, version upgrades, community
>> contributions &
>> > smaller feature requests. The prioritization of these would be up to the
>> > future maintainers, except "test-stability" issues which are important
>> to
>> > address for overall productivity.
>> >
>> > If a group of community members forms itself, we are happy to give an
>> > introduction to relevant pieces of the code base, principles,
>> assumptions,
>> > ... and hand over open threads.
>> >
>> > If you would like to take on this responsibility or can join this
>> effort in
>> > a supporting role, please reach out!
>> >
>> > Cheers,
>> >
>> > Konstantin
>> > for the Deployment & Coordination Team at Ververica
>> >
>> > --
>> >
>> > Konstantin Knauf
>> >
>> > https://twitter.com/snntrable
>> >
>> > https://github.com/knaufk
>>
>>
>
> --
>
> Konstantin Knauf | Head of Product
>
> +49 160 91394525
>
>
> Follow us @VervericaData Ververica 
>
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Karl Anton Wehner, Holger Temme, Yip Park Tung Jason,
> Jinwei (Kevin) Zhang
>


-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk


[jira] [Created] (FLINK-25818) Add explanation how Kafka Source deals with idleness when parallelism is higher than the number of partitions

2022-01-26 Thread Martijn Visser (Jira)
Martijn Visser created FLINK-25818:
--

 Summary: Add explanation how Kafka Source deals with idleness when 
parallelism is higher than the number of partitions
 Key: FLINK-25818
 URL: https://issues.apache.org/jira/browse/FLINK-25818
 Project: Flink
  Issue Type: Improvement
  Components: Connectors / Kafka, Documentation
Reporter: Martijn Visser


Add a section to the Kafka Source documentation to explain what happens with 
the Kafka Source with regard to idleness when the parallelism is higher than 
the number of partitions.
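
For context, the behaviour to document can be illustrated with the existing idleness
API: a subtask of the new KafkaSource that is assigned no partition emits no
watermarks, so downstream event-time operators may stall unless idleness is declared.
A minimal editorial sketch follows; the bootstrap servers, topic, group id and
durations are hypothetical.

{code:java}
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaIdlenessSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source =
                KafkaSource.<String>builder()
                        .setBootstrapServers("localhost:9092")
                        .setTopics("input-topic")
                        .setGroupId("my-group")
                        .setStartingOffsets(OffsetsInitializer.earliest())
                        .setValueOnlyDeserializer(new SimpleStringSchema())
                        .build();

        // withIdleness marks subtasks that see no records (e.g. readers that own no
        // partition because parallelism > number of partitions) as idle, so they do
        // not hold back the watermark of downstream operators.
        WatermarkStrategy<String> watermarks =
                WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withIdleness(Duration.ofMinutes(1));

        env.fromSource(source, watermarks, "Kafka Source").print();

        env.execute("Kafka idleness sketch");
    }
}
{code}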



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-25817) FLIP-201: Persist local state in working directory

2022-01-26 Thread Till Rohrmann (Jira)
Till Rohrmann created FLINK-25817:
-

 Summary: FLIP-201: Persist local state in working directory
 Key: FLINK-25817
 URL: https://issues.apache.org/jira/browse/FLINK-25817
 Project: Flink
  Issue Type: New Feature
  Components: Runtime / Coordination
Affects Versions: 1.15.0
Reporter: Till Rohrmann


This issue is the umbrella ticket for 
[FLIP-201|https://cwiki.apache.org/confluence/x/wJuqCw] which aims at adding 
support for persisting local state in Flink's working directory. This would 
enable Flink in certain scenarios to recover locally even in case of process 
failures.
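
As an editorial sketch of how this could look for users: the local-recovery flag
below already exists today, while the "process.working-dir" key is the working
directory option proposed in the related FLIP-198 work and is purely an assumption
here (the final name and semantics may differ).

{code:java}
import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;

public class LocalStateWorkingDirSketch {

    public static Configuration sketch() {
        Configuration conf = new Configuration();
        // Keep task-local copies of the state so that recovery does not always need
        // to download from the checkpoint storage.
        conf.set(CheckpointingOptions.LOCAL_RECOVERY, true);
        // Assumption: point the process working directory at a volume that survives
        // process restarts, so that the local state persisted there can be reused.
        conf.setString("process.working-dir", "/var/lib/flink/workdir");
        return conf;
    }
}
{code}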



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


Re: Looking for Maintainers for Flink on YARN

2022-01-26 Thread Konstantin Knauf
Hi everyone,

We are seeing an increasing number of test instabilities related to YARN
[1]. Does someone in this group have the time to pick these up? The Flink
Confluence contains a guide on how to triage test instability tickets [2].

Thanks,

Konstantin

[1]
https://issues.apache.org/jira/browse/FLINK-25514?jql=project%20%3D%20FLINK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20%3D%20%22Deployment%20%2F%20YARN%22%20AND%20labels%20%3D%20test-stability
[2]
https://cwiki.apache.org/confluence/display/FLINK/Triage+Test+Instability+Tickets

On Mon, Sep 13, 2021 at 2:22 PM 柳尘  wrote:

> Thanks to Konstantin for raising this question, and to Marton and Gabor
> for strengthening the effort!
>
> If I can help, please let me know, so that I can better participate in
> the work.
>
> Best,
> cheng xingyuan
>
>
> > On Jul 29, 2021 at 4:15 PM, Konstantin Knauf wrote:
> >
> > Dear community,
> >
> > We are looking for community members, who would like to maintain Flink's
> > YARN support going forward. So far, this has been handled by teams at
> > Ververica & Alibaba. The focus of these teams has shifted over the past
> > months so that we only have little time left for this topic. Still, we
> > think, it is important to maintain high quality support for Flink on
> YARN.
> >
> > What does "Maintaining Flink on YARN" mean? There are no known bigger
> > efforts outstanding. We are mainly talking about addressing
> > "test-stability" issues, bugs, version upgrades, community contributions
> &
> > smaller feature requests. The prioritization of these would be up to the
> > future maintainers, except "test-stability" issues which are important to
> > address for overall productivity.
> >
> > If a group of community members forms itself, we are happy to give an
> > introduction to relevant pieces of the code base, principles,
> assumptions,
> > ... and hand over open threads.
> >
> > If you would like to take on this responsibility or can join this effort
> in
> > a supporting role, please reach out!
> >
> > Cheers,
> >
> > Konstantin
> > for the Deployment & Coordination Team at Ververica
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
>
>

-- 

Konstantin Knauf | Head of Product

+49 160 91394525


Follow us @VervericaData Ververica 


--

Join Flink Forward  - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Karl Anton Wehner, Holger Temme, Yip Park Tung Jason,
Jinwei (Kevin) Zhang


Re: [DISCUSS] Pushing Apache Flink 1.15 Feature Freeze

2022-01-26 Thread David Morávek
+1, especially for the reasons Yuan has mentioned

D.

On Wed, Jan 26, 2022 at 9:15 AM Yu Li  wrote:

> +1 to extend the feature freeze date to Feb. 14th, which might be a good
> Valentine's Day present for all Flink developers as well (smile).
>
> Best Regards,
> Yu
>
>
> On Wed, 26 Jan 2022 at 14:50, Yuan Mei  wrote:
>
> > +1 extending feature freeze for one week.
> >
> > Code Freeze on the 6th (end of Spring Festival) is equivalent to saying code
> > freeze at the end of this week for Chinese buddies, since Spring Festival
> > starts next week.
> > It also means they should be partially available during the holiday,
> > otherwise they would block the release if any unexpected issues arise.
> >
> > The situation sounds a bit stressful and can be resolved very well by
> > extending the freeze date for a bit.
> >
> > Best
> > Yuan
> >
> > On Wed, Jan 26, 2022 at 11:18 AM Yun Tang  wrote:
> >
> > > Since the official Spring Festival holidays in China start from Jan 31st
> > > to Feb 6th, and many developers in China would enjoy the holidays at
> that
> > > time.
> > > +1 for extending the feature freeze.
> > >
> > > Best
> > > Yun Tang
> > > 
> > > From: Jingsong Li 
> > > Sent: Wednesday, January 26, 2022 10:32
> > > To: dev 
> > > Subject: Re: [DISCUSS] Pushing Apache Flink 1.15 Feature Freeze
> > >
> > > +1 for extending the feature freeze.
> > >
> > > Thanks Joe for driving.
> > >
> > > Best,
> > > Jingsong
> > >
> > > On Wed, Jan 26, 2022 at 12:04 AM Martijn Visser  >
> > > wrote:
> > > >
> > > > Hi all,
> > > >
> > > > +1 for extending the feature freeze. We could use the time to try to
> > wrap
> > > > up some important SQL related features and improvements.
> > > >
> > > > Best regards,
> > > >
> > > > Martijn
> > > >
> > > > On Tue, 25 Jan 2022 at 16:38, Johannes Moser 
> > wrote:
> > > >
> > > > > Dear Flink community,
> > > > >
> > > > > as mentioned in the summary mail earlier some contributors voiced
> > that
> > > > > they would benefit from pushing the feature freeze for 1.15 by a
> > week.
> > > > > This would mean Monday, 14th of February 2022, end of business
> CEST.
> > > > >
> > > > > Please let us know in case you got any concerns.
> > > > >
> > > > >
> > > > > Best,
> > > > > Till, Yun Gao & Joe
> > >
> >
>


Re: [DISCUSS] Pushing Apache Flink 1.15 Feature Freeze

2022-01-26 Thread Yu Li
+1 to extend the feature freeze date to Feb. 14th, which might be a good
Valentine's Day present for all Flink developers as well (smile).

Best Regards,
Yu


On Wed, 26 Jan 2022 at 14:50, Yuan Mei  wrote:

> +1 extending feature freeze for one week.
>
> Code Freeze on the 6th (end of Spring Festival) is equivalent to saying code
> freeze at the end of this week for Chinese buddies, since Spring Festival
> starts next week.
> It also means they should be partially available during the holiday,
> otherwise they would block the release if any unexpected issues arise.
>
> The situation sounds a bit stressful and can be resolved very well by
> extending the freeze date for a bit.
>
> Best
> Yuan
>
> On Wed, Jan 26, 2022 at 11:18 AM Yun Tang  wrote:
>
> > Since the official Spring Festival holidays in China start from Jan 31st
> > to Feb 6th, and many developers in China would enjoy the holidays at that
> > time.
> > +1 for extending the feature freeze.
> >
> > Best
> > Yun Tang
> > 
> > From: Jingsong Li 
> > Sent: Wednesday, January 26, 2022 10:32
> > To: dev 
> > Subject: Re: [DISCUSS] Pushing Apache Flink 1.15 Feature Freeze
> >
> > +1 for extending the feature freeze.
> >
> > Thanks Joe for driving.
> >
> > Best,
> > Jingsong
> >
> > On Wed, Jan 26, 2022 at 12:04 AM Martijn Visser 
> > wrote:
> > >
> > > Hi all,
> > >
> > > +1 for extending the feature freeze. We could use the time to try to
> wrap
> > > up some important SQL related features and improvements.
> > >
> > > Best regards,
> > >
> > > Martijn
> > >
> > > On Tue, 25 Jan 2022 at 16:38, Johannes Moser 
> wrote:
> > >
> > > > Dear Flink community,
> > > >
> > > > as mentioned in the summary mail earlier some contributors voiced
> that
> > > > they would benefit from pushing the feature freeze for 1.15 by a
> week.
> > > > This would mean Monday, 14th of February 2022, end of business CEST.
> > > >
> > > > Please let us know in case you got any concerns.
> > > >
> > > >
> > > > Best,
> > > > Till, Yun Gao & Joe
> >
>