Re: Profiling on flink jobs

2023-12-01 Thread Matthias Pohl via user
I missed the Reply All button in my previous message. For the sake of
transparency, here's my previous email sent to the user ML once more:

Hi Oscar,
sorry for the late reply. I didn't see that you posted the question at the
beginning of the month already.

I used jmap [1] in the past to get some statistics out and generate *.hprof
files. I haven't looked into creating dump files as documented in [2].
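
For reference, this is roughly how I've used jmap in such cases (the PID
and output path are placeholders; adjust them to your setup):

  # find the PID of the JobManager/TaskManager process
  jps -l
  # print a histogram of live objects on the heap
  jmap -histo:live <pid>
  # write a heap dump in hprof format
  jmap -dump:live,format=b,file=/tmp/taskmanager.hprof <pid>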

env.java.opts.all will be passed to each Java process that's triggered
within Apache Flink. "dumponexit" (which is used in the documented
parameter list) suggests that the dump file is created when the JVM
process exits. Without investigating in more detail how the Java Flight
Recorder works, I'd assume that the dump file (*.jfr in your case) is
created when the JobManager/TaskManager process is killed rather than when
an individual job is cancelled. Cancelling the job should only trigger the
file creation if you're using Flink in Application Mode, because
terminating the job triggers the shutdown of the entire Flink cluster in
that case.

Best,
Matthias

[1]
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr014.html
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/application_profiling/

On Thu, Nov 9, 2023 at 9:39 AM Oscar Perez via user 
wrote:

> hi!  I am trying to do profiling on one of our flink jobs
> according to these docs:
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/application_profiling/
> We are using OpenJDK 8.0. I am adding this line to the flink properties file
> in docker-compose:
>
> env.java.opts.all: "-XX:+UnlockCommercialFeatures 
> -XX:+UnlockDiagnosticVMOptions -XX:+FlightRecorder -XX:+DebugNonSafepoints 
> -XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/tmp/dump.jfr"
>
> I would expect to see the dump.jfr file created once I cancel the job, but
> unfortunately I don't see anything created. How can I get a valid
> profile file? Thanks!
> Regards,
> Oscar
>


Re: Doubts about state and table API

2023-11-29 Thread Matthias Pohl via user
Hi Oscar,
could you provide the Java code to illustrate what you were doing?
The difference between version A and B might be especially helpful. I
assume you already looked into the FAQ about operator IDs [1]?

Adding the JM and TM logs might help as well to investigate the issue, as
Yu Chen mentioned.

Best,
Matthias

On Sun, Nov 26, 2023 at 2:18 PM Yu Chen  wrote:

> Hi Oscar,
>
> The Operator ID of the SQL job was generated by
> `StreamingJobGraphGenerator`; it is related to the topology of the
> stream graph.
> If you would like to confirm whether or not the problem was caused by a
> changed operator id, please remove --allowNonRestoredState, and you will
> get an exception naming the operator id that failed to restore.
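>
> A minimal sketch of that check (savepoint path and job jar are
> placeholders):
>
>   # without -n/--allowNonRestoredState: fails fast and names the
>   # operator id whose state cannot be mapped
>   flink run -s s3://bucket/savepoints/savepoint-ab12cd myJob.jar
>
>   # with -n: state that cannot be mapped is silently dropped on restore
>   flink run -s s3://bucket/savepoints/savepoint-ab12cd -n myJob.jar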
>
> However, the loss of the operator state would only produce some erroneous
> results and would not result in `not able to return any row`. It would be
> better to provide logs from after the restore to locate the specific problem.
>
> Best,
> Yu Chen
> --
> *From:* Oscar Perez via user 
> *Sent:* November 25, 2023 0:08
> *To:* Oscar Perez via user 
> *Subject:* Doubts about state and table API
>
> Hi,
>
> We are having a job in production where we use table API to join multiple
> topics. The query looks like this:
>
>
> SELECT *
> FROM topic1 AS t1
> JOIN topic2 AS t2 ON t1.userId = t2.userId
> JOIN topic3 AS t3 ON t1.userId = t3.accountUserId
>
>
> This works and produces an EnrichedActivity any time any of the topics
> receives a new event, which is what we expect. This SQL query is linked to
> a processor function and the processElement gets triggered whenever a new
> EnrichedActivity occurs
>
> We have experienced an issue a couple of times in production where we have
> deployed a new version from savepoint and then suddenly we
> stopped receiving EnrichedActivities in the process function.
>
> Our assumption is that this is related to the table API state and that
> some operators are lost from going from one savepoint to new deployment.
>
> Let me illustrate with one example:
>
> version A of the job is deployed
> version B of the job is deployed
>
> version B's UID for some table api operators changes, and this operator is
> removed when deploying version B as it is unable to be mapped (we have the
> --allowNonRestoredState flag enabled)
>
> The state for the table api stores both the committed offsets and the
> contents of the topics, but just the contents are lost while the committed
> offsets are still present
>
> Therefore, when doing the join of the query, it is not able to return any
> row as it is unable to get data from topic2 or topic3.
>
> Can this be the case?
> We are having a hard time trying to understand how the table api and state
> works internally so any help in this regard would be truly helpful!
>
> Thanks,
> Oscar
>
>
>


Re: Java 17 as default

2023-11-29 Thread Matthias Pohl via user
The 1.18 Docker images were pushed on Oct 31. This also included Java 17
images [1].

[1] https://hub.docker.com/_/flink/tags?page=1&name=java17
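
For example (exact tag taken from that listing; double-check that it
matches the Scala/Java combination you need):

  docker pull flink:1.18.0-scala_2.12-java17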

On Wed, Nov 15, 2023 at 7:56 AM Tauseef Janvekar 
wrote:

> Dear Team,
>
> I saw the documentation for 1.18: Java 17 is not supported and the
> image is created from Java 11. I guess there is a separate docker image for
> java_17.
> When do we plan to release the main image with Java 17?
>
> Thanks,
> Tauseef
>


Re: [DISCUSS][FLINK-33240] Document deprecated options as well

2023-10-30 Thread Matthias Pohl via user
Thanks for your proposal, Zhanghao Chen. I think it adds more transparency
to the configuration documentation.

+1 from my side on the proposal

On Wed, Oct 11, 2023 at 2:09 PM Zhanghao Chen 
wrote:

> Hi Flink users and developers,
>
> Currently, Flink won't generate documentation for deprecated options. This
> might confuse users when upgrading from an older version of Flink: they have
> to either carefully read the release notes or check the source code for
> upgrade guidance on deprecated options.
>
> I propose to document deprecated options as well, with a "(deprecated)"
> tag placed at the beginning of the option description to highlight the
> deprecation status [1].
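>
> To illustrate (the option name and description text are made up), a
> generated entry might then read:
>
>   some.removed.option: (deprecated) Use some.new.option instead.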
>
> Looking forward to your feedback on it.
>
> [1] https://issues.apache.org/jira/browse/FLINK-33240
>
> Best,
> Zhanghao Chen
>


Re: [ANNOUNCE] Flink Table Store Joins Apache Incubator as Apache Paimon(incubating)

2023-03-27 Thread Matthias Pohl via user
Congratulations and good luck with pushing the project forward.

On Mon, Mar 27, 2023 at 2:35 PM Jing Ge via user 
wrote:

> Congrats!
>
> Best regards,
> Jing
>
> On Mon, Mar 27, 2023 at 2:32 PM Leonard Xu  wrote:
>
>> Congratulations!
>>
>>
>> Best,
>> Leonard
>>
>> On Mar 27, 2023, at 5:23 PM, Yu Li  wrote:
>>
>> Dear Flinkers,
>>
>>
>>
>> As you may have noticed, we are pleased to announce that Flink Table Store 
>> has joined the Apache Incubator as a separate project called Apache 
>> Paimon(incubating) [1] [2] [3]. The new project still aims at building a 
>> streaming data lake platform for high-speed data ingestion, change data 
>> tracking and efficient real-time analytics, with the vision of supporting a 
>> larger ecosystem and establishing a vibrant and neutral open source 
>> community.
>>
>>
>>
>> We would like to thank everyone for their great support and efforts for the 
>> Flink Table Store project, and warmly welcome everyone to join the 
>> development and activities of the new project. Apache Flink will continue to 
>> be one of the first-class citizens supported by Paimon, and we believe that 
>> the Flink and Paimon communities will maintain close cooperation.
>>
>>
>> Dear Flinkers,
>>
>>
>> As you may have noticed, we are pleased to announce that Flink Table Store
>> has officially joined the Apache Incubator for independent incubation
>> [1] [2] [3]. The new project is named Apache Paimon(incubating) and remains
>> committed to building a next-generation streaming lakehouse platform for
>> high-speed data ingestion, streaming data subscription, and efficient
>> real-time analytics. Furthermore, the new project will support a richer
>> ecosystem and build a vibrant and neutral open source community.
>>
>>
>> Here we would like to thank everyone for the great support of and
>> investment in the Flink Table Store project, and we warmly welcome everyone
>> to join the development and community activities of the new project. Apache
>> Flink will continue to be one of the main compute engines supported by
>> Paimon, and we believe the Flink and Paimon communities will maintain close
>> cooperation.
>>
>>
>> Best Regards,
>> Yu (on behalf of the Apache Flink PMC and Apache Paimon PPMC)
>>
>> [1] https://paimon.apache.org/
>> [2] https://github.com/apache/incubator-paimon
>> [3] https://cwiki.apache.org/confluence/display/INCUBATOR/PaimonProposal
>>
>>
>>


Re: Issue with the flink version 1.10.1

2023-03-27 Thread Matthias Pohl via user
Hi Kiran,
it's really hard to come up with an answer based on your description.
Usually, it helps to share some logs with the exact error that's appearing
and a clear description on what you're observing and what you're expecting.
A plain "no jobs are running" is too general to come up with a conclusion.
Sorry.

Additionally, let me state that Flink 1.10 and 1.9 are quite old versions.
The community doesn't support those versions anymore. It might be the case
that you're running into issues that are already fixed in newer versions.
Investigating code from years ago can be quite tedious.

Best,
Matthias

On Mon, Mar 27, 2023 at 2:29 PM Kiran Kumar Kathe <
kirankumarkathe...@gmail.com> wrote:

> When I submit a job using flink version 1.10.1, it is not updating the
> jobs that are running and completed successfully in the Web UI of the YARN
> resource manager. But when I use flink version 1.9.3 it works fine
> and I am able to see the jobs that are running and completed in the
> YARN resource manager Web UI. To find out why this is happening, I
> tried replacing the application folders, and when I use
> the flink_dist jar of version 1.9.3 in place of the flink_dist of version
> 1.10.1 in the lib folder, it runs fine and I am able to see the jobs running
> and completed. Is this the right way? If not, will I face any compatibility
> issues in the future with this change of the flink_dist jar in the lib folder?
>


Re: [ANNOUNCE] Apache Flink 1.17.0 released

2023-03-27 Thread Matthias Pohl via user
Here are a few things I noticed from the 1.17 release retrospectively which
I want to share (other release managers might have a different view or
might disagree):

- Google Meet might not be the best choice for the release sync. We need to
be able to invite attendees even if the creator of the meeting isn't
available (maybe try Zoom or jitsi instead?)

- Release sync every 2 weeks and a switch to weekly after feature freeze
felt reasonable

- Slack worked well as a collaboration tool to document the monitoring
tasks (#builds [1], #flink-dev-benchmarks [2]) in a team with multiple
release managers

- The Slack Azure Pipeline bot seems to be buggy. It swallows some build
failures. It's not a severe issue, though. We created #builds-debug [3] to
monitor whether it's happening consistently. The issue is covered in
FLINK-30733 [4].

- Having dedicated people for monitoring the build failures helps getting a
more consistent picture of test instabilities

- We experienced occasional issues in the manual steps of the release
creation in the past (e.g. japicmp config was not properly pushed).
Creating Jira issues for the release helped to make the release creation
more transparent and made the steps more reviewable [5][6][7][8].
Additionally, it helped to distribute subtasks to different people with
Jira being the tool for documentation and synchronization. That's
especially helpful when there is more than one person in charge of creating
the release.

- We occasionally had committers merge backports without PRs
during the 1.17 release, which broke the master/release branches (probably,
changes were done locally before merging that were not part of the PR, to
have a faster backport experience). It might make sense to remind everyone
that this should be avoided. Not sure whether we want/can restrict that.

- We observed a good response on fixing test instabilities by the end of
the release cycle but had some long-running issues earlier in the cycle
which caused extra effort for the release managers due to recurring test
failures.

- Release testing picked up “slowly”: Initially, we planned 2 weeks for
release testing. But there was not really any progress (tickets being
created and worked on) in the first week. In the end, we had to extend the
phase by another week resulting in 3 instead of 2 weeks of release testing.
I guess we could encourage the community to create release testing tasks
earlier and label them properly to be able to monitor the effort. That
would even enable us to do release testing for a certain feature after the
feature is done and not necessarily only at the end of the release cycle.

- Manual test data generation is tedious (FLINK-31593 [9]). But this should
be fixed in 1.18 with FLINK-27518 [10] being almost done.

- We started creating documentation for release management [11]. The goal
is to collect what tasks are there to help support a Flink release to
encourage newcomers to pick up the task.

I'm going to add these to the Flink 1.17 release documentation [12] as
feedback as well.

Best,
Matthias

[1] https://apache-flink.slack.com/archives/C03MR1HQHK2
[2] https://apache-flink.slack.com/archives/C0471S0DFJ9
[3] https://apache-flink.slack.com/archives/C04LZM3EE9E
[4] https://issues.apache.org/jira/browse/FLINK-30733
[5] https://issues.apache.org/jira/browse/FLINK-31146
[6] https://issues.apache.org/jira/browse/FLINK-31154
[7] https://issues.apache.org/jira/browse/FLINK-31562
[8] https://issues.apache.org/jira/browse/FLINK-31567
[9] https://issues.apache.org/jira/browse/FLINK-31593
[10] https://issues.apache.org/jira/browse/FLINK-27518
[11]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Release+Management
[12] https://cwiki.apache.org/confluence/display/FLINK/1.17+Release

On Sat, Mar 25, 2023 at 8:29 AM Hang Ruan  wrote:

> Thanks for the great work ! Congrats all!
>
> Best,
> Hang
>
Panagiotis Garefalakis wrote on Sat, Mar 25, 2023 at 03:22:
>
>> Congrats all! Well done!
>>
>> Cheers,
>> Panagiotis
>>
>> On Fri, Mar 24, 2023 at 2:46 AM Qingsheng Ren  wrote:
>>
>> > I'd like to say thank you to all contributors of Flink 1.17. Your
>> support
>> > and great work together make this giant step forward!
>> >
>> > Also like Matthias mentioned, feel free to leave us any suggestions and
>> > let's improve the releasing procedure together.
>> >
>> > Cheers,
>> > Qingsheng
>> >
>> > On Fri, Mar 24, 2023 at 5:00 PM Etienne Chauchot 
>> > wrote:
>> >
>> >> Congrats to all the people involved!
>> >>
>> >> Best
>> >>
>> >> Etienne
>> >>
>> >> Le 23/03/2023 à 10:19, Leonard Xu a écrit :
>> >> > The Apache Flink community is very happy to announce the release of
>> >> Apache Flink 1.17.0, which is the first release for the Apache Flink
>> 1.17
>> >> series.
>> >> >
>> >> > Apache Flink® is an open-source unified stream and batch data
>> >> processing framework for distributed, high-performing,
>> always-available,
>> >> and accurate data applications.
>> >> >
>> >> > The release is available 

Re: [ANNOUNCE] Apache Flink 1.17.0 released

2023-03-23 Thread Matthias Pohl via user
Thanks for getting this release over the finish line.

One additional thing:
Feel free to reach out to the release managers (or respond to this thread)
with feedback on the release process. Our goal is to constantly improve the
release process. Feedback on what could be improved or things that didn't
go so well during the 1.17.0 release cycle are much appreciated.

Best,
Matthias

On Thu, Mar 23, 2023 at 11:02 AM Jing Ge via user 
wrote:

> Excellent work! Congratulations! Appreciate the hard work and
> contributions of everyone in the Apache Flink community who helped make
> this release possible. Looking forward to those new features. Cheers!
>
> Best regards,
> Jing
>
> On Thu, Mar 23, 2023 at 10:24 AM Leonard Xu  wrote:
>
>> The Apache Flink community is very happy to announce the release of Apache 
>> Flink
>> 1.17.0, which is the first release for the Apache Flink 1.17 series.
>>
>> Apache Flink® is an open-source unified stream and batch data processing 
>> framework
>> for distributed, high-performing, always-available, and accurate data
>> applications.
>>
>> The release is available for download at:
>>
>> *https://flink.apache.org/downloads.html
>> *
>> Please check out the release blog post for an overview of the improvements
>> for this release:
>>
>> *https://flink.apache.org/2023/03/23/announcing-the-release-of-apache-flink-1.17/
>> *
>> The full release notes are available in Jira:
>>
>> *https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522=12351585
>> 
>> *
>> We would like to thank all contributors of the Apache Flink community who
>> made this release possible!
>>
>> Best regards,
>> Qingsheng, Martijn, Matthias and Leonard
>>
>


Re: Job Cancellation Failing

2023-02-21 Thread Matthias Pohl via user
I noticed a test instability that sounds quite similar to what you're
experiencing. I created FLINK-31168 [1] to follow up on this one.

[1] https://issues.apache.org/jira/browse/FLINK-31168

On Mon, Feb 20, 2023 at 4:50 PM Matthias Pohl 
wrote:

> What do you mean by "earlier it used to fail due to ExecutionGraphStore
> not existing in /tmp" folder? Did you get the error message "Could not
> create executionGraphStorage directory in /tmp." and creating this folder
> fixed the issue?
>
> It also looks like the stacktrace doesn't match any of the 1.15 versions
> in terms of line numbers. Or I might be missing something here. Could you
> provide the exact Flink version you're using?
>
> It might also help to share the JobManager logs to understand the context
> in which the cancel operation was triggered.
>
> Matthias
>
> On Mon, Feb 20, 2023 at 1:53 AM Puneet Duggal 
> wrote:
>
>> Flink Cluster Context:
>>
>>
>>- Flink Version - 1.15
>>- Deployment Mode - Session
>>- Number of Job Managers - 3 (HA)
>>- Number of Task Managers - 1
>>
>>
>> Cancellation of Job fails due to following
>>
>> org.apache.flink.runtime.rest.NotFoundException: Job
>> 1cb2185d4d72c8c6f0a3a549d7de4ef0 not found
>> at
>> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:99)
>> at
>> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
>> at
>> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>> at
>> org.apache.flink.runtime.rest.handler.legacy.DefaultExecutionGraphCache.lambda$getExecutionGraphInternal$0(DefaultExecutionGraphCache.java:109)
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252)
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>> at
>> org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387)
>> at
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93)
>> at
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>> at
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
>> at
>> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>> at
>> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>> at
>> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>> at
>> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>> at
>> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45)
>> at akka.dispatch.OnComplete.internal(Future.scala:299)
>> at akka.dispatch.OnComplete.internal(Future.scala:297)
>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224)
>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221)
>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
>> at
>> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65)
>> at
>> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
>> at
>> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
>> at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:621)
>> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:118)
>> at
>> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:1144)
>> at akka.actor.Actor.aroundReceive(Actor.scala:537)
>> at akka.actor.Actor.aroundReceive$(Actor.scala:535)
>> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:540)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
>> at 

Re: Job Cancellation Failing

2023-02-20 Thread Matthias Pohl via user
What do you mean by "earlier it used to fail due to ExecutionGraphStore not
existing in /tmp" folder? Did you get the error message "Could not create
executionGraphStorage directory in /tmp." and creating this folder fixed
the issue?

It also looks like the stacktrace doesn't match any of the 1.15 versions in
terms of line numbers. Or I might be missing something here. Could you provide
the exact Flink version you're using?

It might also help to share the JobManager logs to understand the context in
which the cancel operation was triggered.

Matthias

On Mon, Feb 20, 2023 at 1:53 AM Puneet Duggal 
wrote:

> Flink Cluster Context:
>
>
>- Flink Version - 1.15
>- Deployment Mode - Session
>- Number of Job Managers - 3 (HA)
>- Number of Task Managers - 1
>
>
> Cancellation of Job fails due to following
>
> org.apache.flink.runtime.rest.NotFoundException: Job
> 1cb2185d4d72c8c6f0a3a549d7de4ef0 not found
> at
> org.apache.flink.runtime.rest.handler.job.AbstractExecutionGraphHandler.lambda$handleRequest$1(AbstractExecutionGraphHandler.java:99)
> at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
> at
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at
> org.apache.flink.runtime.rest.handler.legacy.DefaultExecutionGraphCache.lambda$getExecutionGraphInternal$0(DefaultExecutionGraphCache.java:109)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at
> org.apache.flink.runtime.rpc.akka.AkkaInvocationHandler.lambda$invokeRpc$1(AkkaInvocationHandler.java:252)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at
> org.apache.flink.util.concurrent.FutureUtils.doForward(FutureUtils.java:1387)
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$null$1(ClassLoadingUtils.java:93)
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.lambda$guardCompletionWithContextClassLoader$2(ClassLoadingUtils.java:92)
> at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at
> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$1.onComplete(AkkaFutureUtils.java:45)
> at akka.dispatch.OnComplete.internal(Future.scala:299)
> at akka.dispatch.OnComplete.internal(Future.scala:297)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:224)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:221)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
> at
> org.apache.flink.runtime.concurrent.akka.AkkaFutureUtils$DirectExecutionContext.execute(AkkaFutureUtils.java:65)
> at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68)
> at
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
> at
> scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
> at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284)
> at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:621)
> at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:118)
> at
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:1144)
> at akka.actor.Actor.aroundReceive(Actor.scala:537)
> at akka.actor.Actor.aroundReceive$(Actor.scala:535)
> at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:540)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
> at akka.actor.ActorCell.invoke(ActorCell.scala:548)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
> at akka.dispatch.Mailbox.run(Mailbox.scala:231)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
> at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> at
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
> at 

Re: Blob server connection problem

2023-01-24 Thread Matthias Pohl via user
We had issues like that in the past (e.g. FLINK-24923 [1], FLINK-10683
[2]). The error you're observing is caused by an unexpected byte being read
from the socket. The BlobServer protocol expects either 0 (for put
messages) or 1 (for get messages) to be read as the header of a new
message block [3].
Reading a different value might mean that some other process is sending
data to the port the BlobServer is listening on. (Incidentally, byte 71 is
the ASCII code for 'G', which could point to something like an HTTP GET
request hitting the port.) Could you check your network traffic?
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-24923
[2] https://issues.apache.org/jira/browse/FLINK-10683
[3]
https://github.com/apache/flink/blob/ab264e4ab5a3bc6961a5128b1c7e19752508a7ca/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServerConnection.java#L115

On Fri, Jan 20, 2023 at 11:26 PM Yang Liu  wrote:

> Hello,
>
> Is anyone familiar with the "blob server connection"? We have constantly
> been seeing the "Error while executing Blob connection" error, which
> sometimes causes a job to get stuck in the middle of a run if there are too
> many connection errors, eventually causing a failure. Most of the time
> the streaming run mode can recover from that failure in the subsequent
> iterations of runs, but that slows down the entire process. We tried
> adjusting blob.fetch.num-concurrent and some other blob parameters, but
> it was not very helpful, so we want to know what might be the root cause of
> the issue. Are there any Flink metrics or tools to help us monitor the blob
> server connections?
>
> We use:
>
>- Flink Kubernetes Operator
>- Flink 1.15.3 and 1.16.0
>- Kafka, filesystem(S3)
>- Hudi 0.11.1
>
> Full error message:
>
> java.io.IOException: Unknown operation 71
>   at 
> org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:116)
>  [flink-dist-1.15.3.jar:1.15.3]
> 2023-01-19 16:44:37,448 ERROR 
> org.apache.flink.runtime.blob.BlobServerConnection   [] - Error while 
> executing BLOB connection.
>
>
> Best regards,
> Yang
>


Re: How does Flink plugin system work?

2023-01-02 Thread Matthias Pohl via user
Yes, Ruibin confirmed in a private message that using the factory class
works. But thanks for digging into it once more, Yanfei. I failed to
consider in my previous message that the plugin classes are loaded using
their own class loaders which, indeed, can result in a
ClassNotFoundException being thrown.

Best,
Matthias

On Tue, Jan 3, 2023 at 4:45 AM Yanfei Lei  wrote:

> Hi Ruibin,
>
> "metrics.reporter.prom.class" is deprecated in 1.16, maybe "
> metrics.reporter.prom.factory.class"[1] can solve your problem.
> After reading the related code[2], I think the root cause is that  "
> metrics.reporter.prom.class" would load the code via flink's classpath
> instead of MetricReporterFactory, due to "Plugins cannot access classes
> from other plugins or from Flink that have not been specifically
> whitelisted"[3], so ClassNotFoundException is thrown.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/metric_reporters/#prometheus
> [2]
> https://github.com/apache/flink/blob/release-1.16/flink-runtime/src/main/java/org/apache/flink/runtime/metrics/ReporterSetup.java#L457
> [3]
> https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/
>
Matthias Pohl via user wrote on Mon, Jan 2, 2023 at 20:27:
>
>> Hi Ruibin,
>> could you switch to using the currently supported way for instantiating
>> reporters using the factory configuration parameter [1][2]?
>>
>> Based on the ClassNotFoundException, your suspicion might be right that
>> the plugin didn't make it onto the classpath. Could you share the
>> startup logs of the JM and TMs. That might help getting a bit more context
>> on what's going on. Your approach on integrating the reporter through the
>> plugin system [3] sounds about right as far as I can see.
>>
>> Matthias
>>
>> [1]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#factory-class
>> [2]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#prometheus
>> [3]
>> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/plugins/
>>
>> On Fri, Dec 30, 2022 at 11:42 AM Ruibin Xing  wrote:
>>
>>> Hi community,
>>>
>>> I am having difficulty understanding the Flink plugin system. I am
>>> attempting to enable the Prometheus exporter with the official Flink image
>>> 1.16.0, but I am experiencing issues with library dependencies. According
>>> to the plugin documentation (
>>> https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/),
>>> as long as the library is located in the /opt/flink/plugins/
>>> directory, Flink should automatically load it, similar to how it loads
>>> libraries in the /opt/flink/lib directory. However, Flink does not seem to
>>> detect the plugin.
>>>
>>> Here is the directory structure for /opt/flink:
>>> > tree /opt/flink
>>> .
>>> 
>>> ├── plugins
>>> │   ├── metrics-prometheus
>>> │   │   └── flink-metrics-prometheus-1.16.0.jar
>>> ...
>>>
>>> And here is the related Flink configuration:
>>> > metrics.reporter.prom.class:
>>> org.apache.flink.metrics.prometheus.PrometheusReporter
>>>
>>> The error logs in the task manager show the following:
>>> 2022-12-30 10:03:55,840 WARN
>>>  org.apache.flink.runtime.metrics.ReporterSetup   [] - The
>>> reporter configuration of 'prom' configures the reporter class, which is a
>>> deprecated approach to configure reporters. Please configure a factory
>>> class instead: 'metrics.reporter.prom.factory.class: ' to
>>> ensure that the configuration continues to work with future versions.
>>> 2022-12-30 10:03:55,841 ERROR
>>> org.apache.flink.runtime.metrics.ReporterSetup   [] - Could not
>>> instantiate metrics reporter prom. Metrics might not be exposed/reported.
>>> java.lang.ClassNotFoundException:
>>> org.apache.flink.metrics.prometheus.PrometheusReporter
>>> at jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source)
>>> ~[?:?]
>>> at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown
>>> Source) ~[?:?]
>>> at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
>>> at java.lang.Class.forName0(Native Method) ~[?:?]
>>> at java.lang.Class.forName(Unknown Source) ~[?:?]
>>> at
>>> org.apache.flink.runtime.metrics.ReporterSetup.loadViaReflection(ReporterSetup.java:456)
>>

Re: The use of zookeeper in flink

2023-01-02 Thread Matthias Pohl via user
And I screwed up the reply again. -.- Here's my previous response for the
ML thread and not only spoon_lz:

Hi spoon_lz,
Thanks for reaching out to the community and sharing your use case. You're
right about the fact that Flink's HA feature relies on the leader election.
The HA backend not being responsive for too long might cause problems. I'm
not sure I fully understand your point that the standby JobManagers
struggling with the ZK outage shouldn't affect the running jobs. If ZK is
not responding for the standby JMs, the actual JM leader should be affected
as well, which, as a consequence, would affect the job execution. But I
might be misunderstanding your post. Logs would be helpful to get a better
understanding of your post's context.

Best,
Matthias

FYI: There is also (a kind of stalled) discussion in the dev ML [1] about
recovery of too many jobs affecting Flink's performance.

[1] https://lists.apache.org/thread/r3fnw13j5h04z87lb34l42nvob4pq2xj

On Thu, Dec 29, 2022 at 8:55 AM spoon_lz  wrote:

> Hi All,
> We use zookeeper to achieve high availability of jobs. Recently, a failure
> occurred in our flink cluster: due to abnormal downtime of the
> zookeeper service, all the flink jobs using this zookeeper failed over. The
> failover startup of a large number of jobs in a short period of time put
> too much pressure on the cluster, which in turn caused the cluster to crash.
> Afterwards, I checked the HA functions backed by zk:
> 1. Leader election
> 2. Service discovery
> 3. State persistence
>
> The unavailability of the zookeeper service leads to failover of the flink
> job. That seems to be due to the first point: JM cannot confirm whether it
> is Active or Standby; the other two points should not be affected. But we
> didn't use the Standby JobManager.
> So in my opinion, if the Standby JobManager is not used, whether the zk
> service is available should not affect the jobs that are running
> normally (of course, it is understandable that a job cannot be recovered
> correctly if an exception occurs). I don't know if there is a way to
> achieve a similar purpose.
>


Re: How does Flink plugin system work?

2023-01-02 Thread Matthias Pohl via user
Hi Ruibin,
could you switch to using the currently supported way for instantiating
reporters using the factory configuration parameter [1][2]?

Based on the ClassNotFoundException, your suspicion might be right that the
plugin didn't make it onto the classpath. Could you share the startup logs
of the JM and TMs. That might help getting a bit more context on what's
going on. Your approach on integrating the reporter through the plugin
system [3] sounds about right as far as I can see.

Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#factory-class
[2]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/metric_reporters/#prometheus
[3]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/filesystems/plugins/
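
For reference, the factory-based setup in flink-conf.yaml would look
roughly like this (see [2] for the details; the port range is just an
example):

  metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
  metrics.reporter.prom.port: 9250-9260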

On Fri, Dec 30, 2022 at 11:42 AM Ruibin Xing  wrote:

> Hi community,
>
> I am having difficulty understanding the Flink plugin system. I am
> attempting to enable the Prometheus exporter with the official Flink image
> 1.16.0, but I am experiencing issues with library dependencies. According
> to the plugin documentation (
> https://nightlies.apache.org/flink/flink-docs-stable/docs/deployment/filesystems/plugins/),
> as long as the library is located in the /opt/flink/plugins/
> directory, Flink should automatically load it, similar to how it loads
> libraries in the /opt/flink/lib directory. However, Flink does not seem to
> detect the plugin.
>
> Here is the directory structure for /opt/flink:
> > tree /opt/flink
> .
> 
> ├── plugins
> │   ├── metrics-prometheus
> │   │   └── flink-metrics-prometheus-1.16.0.jar
> ...
>
> And here is the related Flink configuration:
> > metrics.reporter.prom.class:
> org.apache.flink.metrics.prometheus.PrometheusReporter
>
> The error logs in the task manager show the following:
> 2022-12-30 10:03:55,840 WARN
>  org.apache.flink.runtime.metrics.ReporterSetup   [] - The
> reporter configuration of 'prom' configures the reporter class, which is a
> deprecated approach to configure reporters. Please configure a factory
> class instead: 'metrics.reporter.prom.factory.class: ' to
> ensure that the configuration continues to work with future versions.
> 2022-12-30 10:03:55,841 ERROR
> org.apache.flink.runtime.metrics.ReporterSetup   [] - Could not
> instantiate metrics reporter prom. Metrics might not be exposed/reported.
> java.lang.ClassNotFoundException:
> org.apache.flink.metrics.prometheus.PrometheusReporter
> at jdk.internal.loader.BuiltinClassLoader.loadClass(Unknown Source) ~[?:?]
> at jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(Unknown
> Source) ~[?:?]
> at java.lang.ClassLoader.loadClass(Unknown Source) ~[?:?]
> at java.lang.Class.forName0(Native Method) ~[?:?]
> at java.lang.Class.forName(Unknown Source) ~[?:?]
> at
> org.apache.flink.runtime.metrics.ReporterSetup.loadViaReflection(ReporterSetup.java:456)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.runtime.metrics.ReporterSetup.loadReporter(ReporterSetup.java:409)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.runtime.metrics.ReporterSetup.setupReporters(ReporterSetup.java:328)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.runtime.metrics.ReporterSetup.fromConfiguration(ReporterSetup.java:209)
> ~[flink-dist-1.16.0.jar:1.16.0]
>
> The Java commands for Flink process:
> flink  1  3.0  4.6 2168308 765936 ?  Ssl  10:03   1:08
> /opt/java/openjdk/bin/java -XX:+UseG1GC -Xmx697932173 -Xms697932173
> -XX:MaxDirectMemorySize=300647712 -XX:MaxMetaspaceSize=268435456
> -Dlog.file=/opt/flink/log/flink--kubernetes-taskmanager-0-checkpoint-ha-example-taskmanager-1-1.log
> -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
> -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties
> -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
> -classpath
> /opt/flink/lib/flink-cep-1.16.0.jar:/opt/flink/lib/flink-connector-files-1.16.0.jar:/opt/flink/lib/flink-csv-1.16.0.jar:/opt/flink/lib/flink-json-1.16.0.jar:/opt/flink/lib/flink-scala_2.12-1.16.0.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.4.1-10.0.jar:/opt/flink/lib/flink-shaded-zookeeper-3.5.9.jar:/opt/flink/lib/flink-table-api-java-uber-1.16.0.jar:/opt/flink/lib/flink-table-planner-loader-1.16.0.jar:/opt/flink/lib/flink-table-runtime-1.16.0.jar:/opt/flink/lib/log4j-1.2-api-2.17.1.jar:/opt/flink/lib/log4j-api-2.17.1.jar:/opt/flink/lib/log4j-core-2.17.1.jar:/opt/flink/lib/log4j-slf4j-impl-2.17.1.jar:/opt/flink/lib/flink-dist-1.16.0.jar
> org.apache.flink.kubernetes.taskmanager.KubernetesTaskExecutorRunner
> --configDir /opt/flink/conf -Djobmanager.rpc.address=172.17.0.7
> -Dpipeline.classpaths= -Djobmanager.memory.off-heap.size=134217728b
> -Dweb.tmpdir=/tmp/flink-web-57b9e638-f313-4389-a75b-988509697edb
> -Djobmanager.rpc.port=6123
> -D.pipeline.job-id=a6f1c9fb
> -Drest.address=172.17.0.7 

Re: Cleanup for high-availability.storageDir

2022-12-08 Thread Matthias Pohl via user
Yes, the wrong button was pushed when replying last time. -.-

Looking into the code once again [1], you're right. It looks like for
"last-state", no job is cancelled; the cluster deployment is just
deleted. I was assuming that the artifacts which the documentation on the
JobResultStore resource leak [2] refers to are the JobResultStoreEntry
files rather than other artifacts (e.g. jobgraphs). But yeah, if we only
delete the deployment, no Flink-internal cleanup is done.

I'm wondering what the reasoning behind that is.

[1]
https://github.com/apache/flink-kubernetes-operator/blob/ea01e294cf1b68d597244d0a11b3c81822a163e7/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/AbstractFlinkService.java#L336
[2]
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/#jobresultstore-resource-leak

On Thu, Dec 8, 2022 at 11:04 AM Alexis Sarda-Espinosa <
sarda.espin...@gmail.com> wrote:

> Hi Matthias,
>
> I think you didn't include the mailing list in your response.
>
> According to my experiments, using last-state means the operator simply
> deletes the Flink pods, and I believe that doesn't count as Cancelled, so
> the artifacts for blobs and submitted job graphs are not cleaned up. I
> imagine the same logic Gyula mentioned before applies, namely keep the
> latest one and clean the older ones.
>
> Regards,
> Alexis.
>
> Am Do., 8. Dez. 2022 um 10:37 Uhr schrieb Matthias Pohl <
> matthias.p...@aiven.io>:
>
>> I see, I confused the Flink-internal recovery with what the Flink
>> Kubernetes Operator does for redeploying the Flink job. AFAIU, when you do
>> an upgrade of your job, the operator will cancel the Flink job (I'm
>> assuming now that you use Flink's Application mode rather than Session
>> mode). The operator cancels your job and shuts down the cluster.
>> Checkpoints are retained and, therefore, can be used as the so-called "last
>> state" when redeploying your job using the same Job ID. In that case, the
>> corresponding JobGraph and other BLOBs should be cleaned up by Flink
>> itself. The checkpoint files are retained, i.e. they survive the Flink
>> cluster shutdown.
>>
>> When redeploying the Flink cluster with the (updated) job, a new JobGraph
>> file is created by Flink internally. BLOBs are recreated as well. New
>> checkpoints are going to be created and old ones (that are not needed
>> anymore) are cleaned up.
>>
>> Just to recap what I said before (to make it more explicit to
>> differentiate what the k8s operator does and what Flink does internally):
>> Removing the artifacts you were talking about in your previous post would
>> harm Flink's internal recovery mechanism. That's probably not what you want.
>>
>> @Gyula: Please correct me if I misunderstood something here.
>>
>> I hope that helped.
>> Matthias
>>
>> On Wed, Dec 7, 2022 at 4:19 PM Alexis Sarda-Espinosa <
>> sarda.espin...@gmail.com> wrote:
>>
>>> I see, thanks for the details.
>>>
>>> I do mean replacing the job without stopping it terminally.
>>> Specifically, I mean updating the container image with one that contains
>>> an updated job jar. Naturally, the new version must not break state
>>> compatibility, but as long as that is fulfilled, the job should be able to
>>> use the last checkpoint as starting point. It's my understanding that this
>>> is how the Kubernetes operator's "last-state" upgrade mode works [1].
>>>
>>> [1]
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/custom-resource/job-management/#stateful-and-stateless-application-upgrades
>>>
>>> Regards,
>>> Alexis.
>>>
>>> Am Mi., 7. Dez. 2022 um 15:54 Uhr schrieb Matthias Pohl <
>>> matthias.p...@aiven.io>:
>>>
 > - job_name/submittedJobGraphX
 submittedJobGraph* is the persisted JobGraph that would be picked up in
 case of a failover. Deleting this file would result in Flink's failure
 recovery not working properly anymore if the job is still executed but
 needs to be restarted because the actual job definition is gone.

 > completedCheckpointXYZ
 This is the persisted CompletedCheckpoint with a reference to the
 actual Checkpoint directory. Deleting this file would be problematic if the
 state recovery relies in some way on this specific checkpoint. The HA data
 relies on this file to be present. Failover only fails if there's no newer
 checkpoint or the HA data still refers to this checkpoint in some way.

 > - job_name/blob/job_uuid/blob_...
 Artifacts of the BlobServer containing runtime artifacts of the jobs
 (e.g. logs, libraries, ...)

 In general, you don't want to clean HA artifacts if the job hasn't
 reached a terminal state, yet, as it harms Flink's ability to recover the
 job. Additionally, these files are connected with the HA backend, i.e. the
 file path is stored in the HA backend. Removing the artifacts will likely
 result in metadata becoming invalid.

Re: How's JobManager bring up TaskManager in Application Mode or Session Mode?

2022-11-28 Thread Matthias Pohl via user
Hi Mark,
the JobManager is not necessarily in charge of spinning up TaskManager
instances. It depends on the resource provider configuration you choose.
Flink differentiates between active and passive Resource Management (see
the two available implementations of ResourceManager [1]).

Active Resource Management actually takes care of spinning up new
TaskManager instances if needed (i.e. when Flink runs out of free task
slots). This is handled by the corresponding AbstractResourceManagerDriver
implementations [2].

In contrast, passive Resource Management (i.e. through the standalone
resource provider configurations [3]) doesn't do anything like that. Here,
Flink works with the TaskManagers that were instantiated by an external
process. Each TaskManager instance registers itself to the JobManager that
is specified in the Flink configuration which is provided to the
corresponding TaskManager instance.
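
As a minimal sketch of the passive setup (the host name is a placeholder),
each externally started TaskManager just points at the JobManager in its
flink-conf.yaml and registers itself on startup:

  jobmanager.rpc.address: jobmanager-host
  jobmanager.rpc.port: 6123
  taskmanager.numberOfTaskSlots: 4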

I hope that helps. For future posts, please solely use the user mailing
list for questions around understanding Flink or troubleshooting. The dev
mailing list is reserved for development-related questions [4].

Matthias

[1]
https://github.com/apache/flink/blob/55a8d1a76067204e00839f1b6a2c09965434eaa4/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java#L117
[2]
https://github.com/apache/flink/blob/9815caad271a561640ffe0df7193c04270d53a25/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/AbstractResourceManagerDriver.java#L33
[3]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/resource-providers/standalone/overview/
[4] https://flink.apache.org/community.html#mailing-lists

On Tue, Nov 29, 2022 at 5:23 AM 李  wrote:

> Hi,
>
> How does the JobManager bring up TaskManagers in Application Mode or Session
> Mode? I can't figure it out, even after reading the source code of the flink
> operator.
>
> Any help will be appreciated. Thank you.
>
>  Mark
>
>
>


Re: [Security] - Critical OpenSSL Vulnerability

2022-11-01 Thread Matthias Pohl via user
The Docker image for Flink 1.12.7 uses an older base image which comes with
openssl 1.1.1k. There was a previous post on the OpenSSL mailing list
reporting a low-severity vulnerability being fixed with 3.0.6 and 1.1.1r
(both versions being explicitly mentioned) [1]. Therefore, I understand the
post to mean that only 3.0.x is affected and, as a consequence, Docker
images for 1.13 and below would be fine.

I verified Mason's finding that only the 1.14+ Docker images are affected.
No full release is necessary as far as I understand. Theoretically, we would
only have to push newer Docker images to the registry. I'm not sure what
the right approach is when it comes to versioning. I'm curious about
Chesnay's opinion on that one (CC'd).

[1]
https://mta.openssl.org/pipermail/openssl-announce/2022-October/000233.html

On Tue, Nov 1, 2022 at 7:06 AM Prasanna kumar 
wrote:

> Could we also get an emergency patch for version 1.12 as well? Upgrading
> flink to a newer version in production on short notice would take
> significant effort and time.
>
> Thanks,
> Prasanna
>
> On Tue, Nov 1, 2022 at 11:30 AM Prasanna kumar <
> prasannakumarram...@gmail.com> wrote:
>
>> Is flink version 1.12 also affected?
>>
>> Thanks,
>> Prasanna.
>>
>> On Tue, Nov 1, 2022 at 10:40 AM Mason Chen 
>> wrote:
>>
>>> Hi Tamir and Martjin,
>>>
>>> We have also noticed this internally. So far, we have found that the
>>> *latest* Flink Java 11/Scala 2.12 docker images *1.14, 1.15, and 1.16*
>>> are affected, which all have the *openssl 3.0.2 *dependency. It would
>>> be good to discuss an emergency release when this patch comes out
>>> tomorrow, as it is the highest priority level from their severity rating.
>>>
>>> Best,
>>> Mason
>>>
>>> On Mon, Oct 31, 2022 at 1:10 PM Martijn Visser 
>>> wrote:
>>>
 Hi Tamir,

 That depends on a) if Flink is vulnerable and b) if yes, how vulnerable
 that would be.

 Best regards,

 Martijn

 Op ma 31 okt. 2022 om 19:22 schreef Tamir Sagi <
 tamir.s...@niceactimize.com>

> Hey all,
>
> Following that link
> https://eu01.z.antigena.com/l/CjXA7qEmnn79gc24BA2Hb6K2OVR-yGlLfMyp4smo5aXj5Z6WC0dSiHCRPqjSz972DkRNssUoTbxKmp5Pi3IaaVB983yfLJ9MUZY9LYtnBMEKJP5DcQqmhR3SktltkbVG8b7nSRa84kWSnwNJFuXFLA2GrMLTVG7mXdy59-ykolsAWAVAJSDgRdWCv6xN0iczvQ
>
>
> due to critical vulnerability , there will be an important release of
> OpenSSl v3.0.7 tomorrow November 1st.
>
> Is there any plan to update Flink with the newest version?
>
> Thanks.
> Tamir
>
>
> Confidentiality: This communication and any attachments are intended
> for the above-named persons only and may be confidential and/or legally
> privileged. Any opinions expressed in this communication are not
> necessarily those of NICE Actimize. If this communication has come to you
> in error you must take no action based on it, nor must you copy or show it
> to anyone; please delete/destroy and inform the sender by e-mail
> immediately.
> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
> Viruses: Although we have taken steps toward ensuring that this e-mail
> and attachments are free from any virus, we advise that in keeping with
> good computing practice the recipient should ensure they are actually 
> virus
> free.
>
 --
 Martijn
 https://twitter.com/MartijnVisser82
 https://github.com/MartijnVisser

>>>


Re: Watermark generating mechanism in Flink SQL

2022-10-17 Thread Matthias Pohl via user
Hi Hunk,
there is documentation about watermarking in FlinkSQL [1]. There is also a
FlinkSQL cookbook entry about watermarking [2]. Essentially, you define the
watermark strategy in your CREATE TABLE statement and specify the lateness
for a given event (not the period in which watermarks are automatically
generated!). You have to apply the `WATERMARK FOR` phrase on a column that
is declared as a time attribute [3]. Watermarks are based on event time,
i.e. based on an event being processed that provides the event time. Your
idea of generating them "every 5 seconds" does not work out of the box
because a watermark wouldn't be generated if the source idles for more than
5 seconds (in case of your specific example). Sending periodic dummy events
extrapolating the current event time would be a way to work around this
issue. Keep in mind that mixing processing time (what you would do if you
create a watermark based on the system's current time rather than relying
on events) and event time is usually not advised. I hope that helps.
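
As a minimal sketch (the table name and connector are chosen just for
illustration):

CREATE TABLE orders (
  order_id STRING,
  order_time TIMESTAMP(3),
  -- the watermark trails the max. observed order_time by 5 seconds,
  -- i.e. events arriving up to 5 seconds late are still considered on time
  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
) WITH (
  'connector' = 'datagen'
);

Note that the INTERVAL defines the tolerated lateness relative to the
observed event times; how often watermarks are emitted is controlled
separately via pipeline.auto-watermark-interval (200 ms by default).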

Best,
Matthias

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/sql/create/#watermark
[2]
https://github.com/ververica/flink-sql-cookbook/blob/main/aggregations-and-analytics/02_watermarks/02_watermarks.md
[3]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/table/concepts/time_attributes/#event-time

On Tue, Oct 18, 2022 at 5:32 AM wang <24248...@163.com> wrote:

> Hi dear engineers,
>
> I have one question about the watermark generating mechanism in Flink SQL.
> There are two mechanisms called *Periodic Watermarks* and *Punctuated
> Watermarks*. I want to use *Periodic Watermarks* with an interval of 5
> seconds (meaning watermarks will be generated every 5 seconds). How should I
> set this in Flink sql? thanks in advance!
>
> Regards,
> Hunk
>


Re: jobmaster's fatal error will kill the session cluster

2022-10-17 Thread Matthias Pohl via user
; ~[?:?]
> at
> org.apache.hudi.sink.StreamWriteOperatorCoordinator.start(StreamWriteOperatorCoordinator.java:179)
> ~[?:?]
> at
> org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder.start(OperatorCoordinatorHolder.java:194)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:164)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.scheduler.DefaultOperatorCoordinatorHandler.startAllOperatorCoordinators(DefaultOperatorCoordinatorHandler.java:82)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.scheduler.SchedulerBase.startScheduling(SchedulerBase.java:624)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.startScheduling(JobMaster.java:1010)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:927)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:388)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.lambda$start$0(AkkaRpcActor.java:612)
> ~[flink-rpc-akka_db70a2fa-991e-4392-9447-5d060aeb156e.jar:1.15.0]
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
> ~[flink-rpc-akka_db70a2fa-991e-4392-9447-5d060aeb156e.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:611)
> ~[flink-rpc-akka_db70a2fa-991e-4392-9447-5d060aeb156e.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:185)
> ~[flink-rpc-akka_db70a2fa-991e-4392-9447-5d060aeb156e.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) ~[?:?]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) ~[?:?]
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> ~[flink-scala_2.12-1.15.0.jar:1.15.0]
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> ~[flink-scala_2.12-1.15.0.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
> ~[?:?]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> ~[flink-scala_2.12-1.15.0.jar:1.15.0]
>
> I’m not sure whether it’s proper to kill the cluster just because of using
> a wrong job configuration (set a relative path).
>
>
> On Oct 14, 2022, at 19:53, Matthias Pohl via user wrote:
>
> Hi Jie Han,
> welcome to the community. Just a little side note: These kinds of
> questions are more suitable to be asked in the user mailing list. The dev
> mailing list is rather used for discussing feature development or
> project-related topics. See [1] for further details.
>
> About your question: The stacktrace you're providing indicates that
> something went wrong while initiating the job execution. Unfortunately, the
> actual reason is not clear because that's not included in your stacktrace
> (it should be listed as a cause for the JobMasterException in your logs).
> You're right in assuming that Flink is able to handle certain kinds of user
> code and infrastructure-related errors by restarting the job. But there
> might be other Flink cluster internal errors that could cause a Flink
> cluster shutdown. It's hard to tell from the logs you provided. Usually,
> it's a good habit to share a reasonable amount of logs to make
> investigating the issue easier right away.
>
> Let's move the discussion into the user mailing list in case you have
> further questions.
>
> Best,
> Matthias
>
> [1] https://flink.apache.org/community.html#mailing-lists
>
> On Fri, Oct 14, 2022 at 10:13 AM Jie Han  wrote:
>
>> Hi guys, I’m new to Apache Flink. It’s exciting to join the community!
>>
>> When I tried out Flink 1.15.0, I ran into some confusing problems; here is
>> the streamlined log:
>>
>> org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Could not
>> start RpcEndpoint jobmanager_2.
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:617)
>> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:185)
>> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)

Re: Sometimes checkpoints to s3 fail

2022-10-14 Thread Matthias Pohl via user
Hi Evgeniy,
is it Ceph that you're using as an S3 server? All the Google search entries
point to Ceph when looking for the error message. Could it be that there's
a problem with the version of the underlying system? The stacktrace you
provided suggests that Flink struggles to close the file and, therefore, fails
to create the checkpoint.

Best,
Matthias
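
For reference, the s3p:// scheme in the stacktrace below maps to Flink's
Presto-based S3 filesystem, which must be available as a plugin. A minimal
sketch of the relevant setup (the bucket name and jar version are illustrative,
taken from the logs below rather than verified):

# the plugin jar must sit under plugins/, e.g.:
#   plugins/s3-fs-presto/flink-s3-fs-presto-1.14.4.jar
state.checkpoints.dir: s3p://flink-checkpoints/k8s-checkpoint-job-name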

On Thu, Oct 6, 2022 at 11:25 AM Evgeniy Lyutikov 
wrote:

> Hello all.
> I can’t pin down this intermittent problem: sometimes checkpoints stop
> completing, sometimes they only complete every other attempt.
> Flink 1.14.4 in kubernetes application mode.
>
>
> 2022-10-06 09:08:04,731 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] -
> Triggering checkpoint 18314 (type=CHECKPOINT) @ 1665047284716 for job
> .
> 2022-10-06 09:11:29,130 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Decline
> checkpoint 18314 by task 048169f0e3c2efd473d3cef9c9d2cd70 of job
>  at job-name-taskmanager-3-1 @ 10.109.0.168
> (dataPort=43795).
> org.apache.flink.util.SerializedThrowable: Asynchronous task checkpoint
> failed.
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:301)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:155)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> [?:?]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> [?:?]
> at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.flink.util.SerializedThrowable: Could not
> materialize checkpoint 18314 for operator Process rec last clicks -> Cast
> rec last clicks type (30/44)#0.
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.handleExecutionException(AsyncCheckpointRunnable.java:279)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> ... 4 more
> Caused by: org.apache.flink.util.SerializedThrowable: java.io.IOException:
> Could not flush to file and close the file system output stream to
> s3p://flink-checkpoints/k8s-checkpoint-job-name//shared/7c09fcf1-49b9-4b72-b756-81cd7778e396
> in order to obtain the stream state handle
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> ~[?:?]
> at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[?:?]
> at
> org.apache.flink.util.concurrent.FutureUtils.runIfNotDoneAndGet(FutureUtils.java:645)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.streaming.api.operators.OperatorSnapshotFinalizer.(OperatorSnapshotFinalizer.java:54)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.finalizeNonFinishedSnapshots(AsyncCheckpointRunnable.java:177)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.streaming.runtime.tasks.AsyncCheckpointRunnable.run(AsyncCheckpointRunnable.java:124)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable: Could not flush to
> file and close the file system output stream to
> s3p://flink-checkpoints/k8s-checkpoint-job-name//shared/7c09fcf1-49b9-4b72-b756-81cd7778e396
> in order to obtain the stream state handle
> at
> org.apache.flink.runtime.state.filesystem.FsCheckpointStreamFactory$FsCheckpointStateOutputStream.closeAndGetHandle(FsCheckpointStreamFactory.java:373)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.uploadLocalFileToCheckpointFs(RocksDBStateUploader.java:143)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.contrib.streaming.state.RocksDBStateUploader.lambda$createUploadFutures$0(RocksDBStateUploader.java:101)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:32)
> ~[flink-dist_2.12-1.14.4.jar:1.14.4]
> at
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
> ~[?:?]
> ... 3 more
> Caused by: org.apache.flink.util.SerializedThrowable:
> com.amazonaws.services.s3.model.AmazonS3Exception: This multipart
> completion is already in progress (Service: Amazon S3; Status Code: 500;
> Error Code: InternalError; Request ID:
> tx000ced9f8-00633e9bc1-18489a52-default; S3 Extended Request
> ID: 18489a52-default-default; Proxy: null), S3 Extended Request ID:
> 18489a52-default-default
> at
> com.facebook.presto.hive.s3.PrestoS3FileSystem$PrestoS3OutputStream.uploadObject(PrestoS3FileSystem.java:1278)
> ~[?:?]
> at
> 

Re: jobmaster's fatal error will kill the session cluster

2022-10-14 Thread Matthias Pohl via user
Hi Jie Han,
welcome to the community. Just a little side note: These kinds of questions
are more suitable to be asked in the user mailing list. The dev mailing
list is rather used for discussing feature development or project-related
topics. See [1] for further details.

About your question: The stacktrace you're providing indicates that
something went wrong while initiating the job execution. Unfortunately, the
actual reason is not clear because that's not included in your stacktrace
(it should be listed as a cause for the JobMasterException in your logs).
You're right in assuming that Flink is able to handle certain kinds of user
code and infrastructure-related errors by restarting the job. But there
might be other Flink cluster internal errors that could cause a Flink
cluster shutdown. It's hard to tell from the logs you provided. Usually,
it's a good habit to share a reasonable amount of logs to make
investigating the issue easier right away.

Let's move the discussion into the user mailing list in case you have
further questions.

Best,
Matthias

[1] https://flink.apache.org/community.html#mailing-lists
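
As an illustration of such job-level restarts, a fixed-delay restart strategy
can be configured in flink-conf.yaml along these lines (a sketch; the values
are arbitrary placeholders):

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s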

On Fri, Oct 14, 2022 at 10:13 AM Jie Han  wrote:

> Hi guys, I’m new to Apache Flink. It’s exciting to join the community!
>
> When I tried out Flink 1.15.0, I ran into some confusing problems; here is
> the streamlined log:
>
> org.apache.flink.runtime.rpc.akka.exceptions.AkkaRpcException: Could not
> start RpcEndpoint jobmanager_2.
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.start(AkkaRpcActor.java:617)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:185)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.Actor.aroundReceive(Actor.scala:537)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.Actor.aroundReceive$(Actor.scala:535)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.actor.ActorCell.invoke(ActorCell.scala:548)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
> [flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.dispatch.Mailbox.run(Mailbox.scala:231)
> [flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
> [flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
> [?:1.8.0_301]
> at
> java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1067)
> [?:1.8.0_301]
> at
> java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1703)
> [?:1.8.0_301]
> at
> java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:172)
> [?:1.8.0_301]
> Caused by: org.apache.flink.runtime.jobmaster.JobMasterException: Could
> not start the JobMaster.
> at
> org.apache.flink.runtime.jobmaster.JobMaster.onStart(JobMaster.java:390)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStart(RpcEndpoint.java:181)
> ~[flink-dist-1.15.0.jar:1.15.0]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StoppedState.lambda$start$0(AkkaRpcActor.java:612)
> ~[flink-rpc-akka_65043be6-9dc5-4303-a760-61bd044fb53a.jar:1.15.0]
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
> 

Re: Cancel a job in status INITIALIZING

2022-09-26 Thread Matthias Pohl via user
Can you provide the JobManager logs for this case? It sounds odd that the
job was stuck in the INITIALIZING phase.

Matthias
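
As a side note, when the CLI cancel call times out, it can be worth trying the
REST API directly before resorting to deleting ZooKeeper nodes — a sketch,
assuming the default REST port and a placeholder job ID:

curl -X PATCH "http://<jobmanager-host>:8081/jobs/<job-id>?mode=cancel"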

On Wed, Sep 21, 2022 at 11:50 AM Christian Lorenz via user <
user@flink.apache.org> wrote:

> Hi,
>
>
>
> we’re running a Flink Cluster in standalone/session mode. During a restart
> of a jobmanager one job was stuck in status INITIALIZING.
>
> When trying to cancel the job via CLI the command failed with a
> java.util.concurrent.TimeoutException.
>
> The only way to get rid of this job for us was to stop the jobmanagers and
> delete the zookeeper root node.
>
> Is there a better way of handling this issue, as this seems very
> unclean to me?
>
>
>
> Kind regards,
>
> Christian
>


Re: Jobmanager fails to come up if the job has an issue

2022-09-26 Thread Matthias Pohl via user
Yes, the JobManager will failover in HA mode and all jobs would be
recovered.
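
For context, Kubernetes HA is typically enabled with settings along these lines
(a sketch; the cluster ID and storage path are placeholders):

kubernetes.cluster-id: my-session-cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: s3://my-bucket/flink/ha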

On Mon, Sep 26, 2022 at 2:06 PM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> Thanks @Matthias Pohl. This is informative. So, generally, in a session
> cluster, if I have more than one job and only one of them has this issue,
> will we still face the same problem?
>
> Regards
> Ram
>
> On Mon, Sep 26, 2022 at 4:32 PM Matthias Pohl 
> wrote:
>
>> I see. Thanks for sharing the logs. It's related to FLINK-9097 [1]. In
>> order for the job to not be cleaned up entirely after a failure while
>> submitting the job, the JobManager is failed fatally resulting in a
>> failover. That's what you're experiencing.
>>
>> One solution is to fix the permission issue to make the job recover
>> without problems. If that's not what you want to do, you could delete the
>> entry with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
>> JobGraphStore ConfigMap (based on your logs it should
>> be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map). This will
>> prevent the JobManager from recovering this specific job. Keep in mind that
>> you have to clean up any job-related data by yourself in that case.
>>
>> I hope that helps.
>> Matthias
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-9097
>>
>> On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <
>> ramvasu.fl...@gmail.com> wrote:
>>
> I got some logs and stack traces from our backend storage. This is not
> the entire log though. Can this be useful? With this set of log messages
> the job manager kept restarting.
>>>
>>> Regards
>>> Ram
>>>
>>> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <
>>> ramvasu.fl...@gmail.com> wrote:
>>>
 Thank you very much for the reply. I have lost the k8s cluster in this
 case before I could capture the logs. I will try to repro this and get back
 to you.

 Regards
 Ram

 On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl 
 wrote:

> Hi Ramkrishna,
> thanks for reaching out to the Flink community. Could you share the
> JobManager logs to get a better understanding of what's going on? I'm
> wondering why the JobManager is failing when the actual problem is that 
> the
> job is struggling to access a folder. It sounds like there are multiple
> problems here.
>
> Best,
> Matthias
>
> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
> ramvasu.fl...@gmail.com> wrote:
>
>> Hi all
>>
>> I have a simple job where we read from a given path in cloud storage
>> to watch for new files in a given folder. While I set up my job there was
>> a permission issue on the folder. The job is a STREAMING job.
>> The cluster is set up in session mode and is running on Kubernetes.
>> The job manager has since been failing to come back up, and every time
>> it fails with the permission issue. But the point is: how should I recover
>> my cluster in this case? Since the JM is not there, the UI is also not
>> working, and how do I remove the bad job from the JM?
>>
>> Regards
>> Ram
>>
>


Re: JobManager restarts on job failure

2022-09-26 Thread Matthias Pohl via user
That's a good point. I forgot about these options. You're right. Cleanup
wouldn't be done in that case. So, upgrading would be a viable option as
you suggested.

Matthias
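
For reference, the options Gyula mentions below correspond to the following
flink-conf.yaml keys in 1.15 (to the best of my knowledge; please verify
against the 1.15 documentation):

execution.shutdown-on-application-finish: false
execution.submit-failed-job-on-application-error: true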

On Mon, Sep 26, 2022 at 12:53 PM Gyula Fóra  wrote:

> Maybe it is a stupid question but in Flink 1.15 with the following configs
> enabled:
>
> SHUTDOWN_ON_APPLICATION_FINISH = false
> SUBMIT_FAILED_JOB_ON_APPLICATION_ERROR = true
>
> I think the jobmanager pod would not restart but simply go to a terminal
> failed state, right?
>
> Gyula
>
> On Mon, Sep 26, 2022 at 12:31 PM Matthias Pohl 
> wrote:
>
>> Thanks Evgeniy for reaching out to the community and Gyula for picking it
>> up. I haven't looked into the k8s operator in much detail, yet. So, help me
>> out if I miss something here. But I'm afraid that this is not something
>> that would be fixed by upgrading to 1.15.
>> The issue here is that we're recovering from an external checkpoint using
>> the same job ID (the default one used for any Flink cluster in Application
>> Mode) and the same cluster ID, if I understand correctly. Now, the job is
>> failing during initialization. Currently, this causes a global cleanup [1].
>> All HA data including the checkpoints are going to be deleted. I created
>> FLINK-29415 [2] to cover this.
>>
>> I'm wondering whether we could work around this problem by specifying a
>> random job ID through PipelineOptionsInternal [3] in the Kubernetes
>> Operator. But I haven't looked into all the consequences around that. And
>> it feels wrong to make this configuration parameter publicly usable.
>>
>> Another option might be to use ExecutionMode.RECOVERY in case of an
>> initialization failure when recovering from an external Checkpoint in
>> Application Mode (like we do it for internal recovery already).
>>
>> I'm looking forward to your opinion.
>> Matthias
>>
>> [1]
>> https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L663
>> [2] https://issues.apache.org/jira/browse/FLINK-29415
>> [3]
>> https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/configuration/PipelineOptionsInternal.java#L29
>>
>> On Tue, Sep 20, 2022 at 3:45 PM Gyula Fóra  wrote:
>>
>>> I see. I think we have seen this issue with others before; in Flink 1.15
>>> it is solved by the newly introduced JobResultStore. The operator also
>>> configures that automatically for 1.15 to avoid this.
>>>
>>> Gyula
>>>
>>> On Tue, Sep 20, 2022 at 3:27 PM Evgeniy Lyutikov 
>>> wrote:
>>>
 Thanks for the answer.
 I don't think this is an operator issue: the Kubernetes deployment
 just restarts the failed pod, and the restarted jobmanager, without HA
 metadata, starts the job from an empty state.

 I'm looking for a way to prevent it from exiting in case of a
 job error (we use an application-mode cluster).



 --
 *From:* Gyula Fóra 
 *Sent:* September 20, 2022, 19:49:37
 *To:* Evgeniy Lyutikov
 *Cc:* user@flink.apache.org
 *Subject:* Re: JobManager restarts on job failure

 The best thing for you to do would be to upgrade to Flink 1.15 and the
 latest operator version.
 In Flink 1.15 we have the option to interact with the Flink jobmanager
 even after the job FAILED and the operator leverages this for a much more
 robust behaviour.

 In any case the operator should not ever start the job from an empty
 state (even if it FAILED), if you think that's happening could you please
 open a JIRA ticket with the accompanying JM and Operator logs?

 Thanks
 Gyula

 On Tue, Sep 20, 2022 at 1:00 PM Evgeniy Lyutikov 
 wrote:

> Hi,
> We are using Flink 1.14.4 with the Flink Kubernetes operator.
>
> Sometimes when updating a job, it fails on startup and Flink removes
> all HA metadata and exits the jobmanager.
>
>
> 2022-09-14 14:54:44,534 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - 
> Restoring
> job  from Checkpoint 30829 @ 1663167158684
> for  located at
> s3p://flink-checkpoints/k8s-checkpoint-job-name//chk-30829.
> 2022-09-14 14:54:44,638 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
>  reached terminal state FAILED.
> org.apache.flink.runtime.client.JobInitializationException: Could not
> start the JobMaster.
> Caused by: java.util.concurrent.CompletionException:
> java.lang.IllegalStateException: There is no operator for the state
> 4e1d9dde287c33a35e7970cbe64a40fe
> 2022-09-14 14:54:44,930 ERROR
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal
> error occurred in the cluster entrypoint.

Re: Jobmanager fails to come up if the job has an issue

2022-09-26 Thread Matthias Pohl via user
I see. Thanks for sharing the logs. It's related to FLINK-9097 [1]. In
order for the job to not be cleaned up entirely after a failure while
submitting the job, the JobManager is failed fatally resulting in a
failover. That's what you're experiencing.

One solution is to fix the permission issue to make the job recover without
problems. If that's not what you want to do, you could delete the entry
with the key 'jobGraph-04ae99777ee2ed34c13fe8120e68436e' from the
JobGraphStore ConfigMap (based on your logs it should
be flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map). This will
prevent the JobManager from recovering this specific job. Keep in mind that
you have to clean up any job-related data by yourself in that case.

I hope that helps.
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-9097
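
For illustration, removing such an entry could look like the following sketch,
reusing the names from above (double-check the actual ConfigMap name in your
cluster first):

kubectl patch configmap flink-972ac3d8028e45fcafa9b8b7b7f1dafb-custer-config-map \
  --type=json \
  -p='[{"op": "remove", "path": "/data/jobGraph-04ae99777ee2ed34c13fe8120e68436e"}]'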

On Mon, Sep 26, 2022 at 12:26 PM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> I got some logs and stack traces from our backend storage. This is not the
> entire log though. Can this be useful? With this set of log messages the
> job manager kept restarting.
>
> Regards
> Ram
>
> On Mon, Sep 26, 2022 at 3:11 PM ramkrishna vasudevan <
> ramvasu.fl...@gmail.com> wrote:
>
>> Thank you very much for the reply. I have lost the k8s cluster in this
>> case before I could capture the logs. I will try to repro this and get back
>> to you.
>>
>> Regards
>> Ram
>>
>> On Mon, Sep 26, 2022 at 12:42 PM Matthias Pohl 
>> wrote:
>>
>>> Hi Ramkrishna,
>>> thanks for reaching out to the Flink community. Could you share the
>>> JobManager logs to get a better understanding of what's going on? I'm
>>> wondering why the JobManager is failing when the actual problem is that the
>>> job is struggling to access a folder. It sounds like there are multiple
>>> problems here.
>>>
>>> Best,
>>> Matthias
>>>
>>> On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
>>> ramvasu.fl...@gmail.com> wrote:
>>>
 Hi all

 I have a simple job where we read from a given path in cloud storage to
 watch for new files in a given folder. While I set up my job there was a
 permission issue on the folder. The job is a STREAMING job.
 The cluster is set up in session mode and is running on Kubernetes.
 The job manager has since been failing to come back up, and every time it
 fails with the permission issue. But the point is: how should I recover my
 cluster in this case? Since the JM is not there, the UI is also not working,
 and how do I remove the bad job from the JM?

 Regards
 Ram

>>>


Re: JobManager restarts on job failure

2022-09-26 Thread Matthias Pohl via user
Thanks Evgeniy for reaching out to the community and Gyula for picking it
up. I haven't looked into the k8s operator in much detail, yet. So, help me
out if I miss something here. But I'm afraid that this is not something
that would be fixed by upgrading to 1.15.
The issue here is that we're recovering from an external checkpoint using
the same job ID (the default one used for any Flink cluster in Application
Mode) and the same cluster ID, if I understand correctly. Now, the job is
failing during initialization. Currently, this causes a global cleanup [1].
All HA data including the checkpoints are going to be deleted. I created
FLINK-29415 [2] to cover this.

I'm wondering whether we could work around this problem by specifying a
random job ID through PipelineOptionsInternal [3] in the Kubernetes
Operator. But I haven't looked into all the consequences around that. And
it feels wrong to make this configuration parameter publicly usable.

Another option might be to use ExecutionMode.RECOVERY in case of an
initialization failure when recovering from an external Checkpoint in
Application Mode (like we do it for internal recovery already).

I'm looking forward to your opinion.
Matthias

[1]
https://github.com/apache/flink/blob/41ac1ba13679121f1ddf14b26a36f4f4a3cc73e4/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L663
[2] https://issues.apache.org/jira/browse/FLINK-29415
[3]
https://github.com/apache/flink/blob/c6997c97c575d334679915c328792b8a3067cfb5/flink-core/src/main/java/org/apache/flink/configuration/PipelineOptionsInternal.java#L29

On Tue, Sep 20, 2022 at 3:45 PM Gyula Fóra  wrote:

> I see. I think we have seen this issue with others before; in Flink 1.15 it
> is solved by the newly introduced JobResultStore. The operator also
> configures that automatically for 1.15 to avoid this.
>
> Gyula
>
> On Tue, Sep 20, 2022 at 3:27 PM Evgeniy Lyutikov 
> wrote:
>
>> Thanks for the answer.
>> I don't think this is an operator issue: the Kubernetes deployment just
>> restarts the failed pod, and the restarted jobmanager, without HA metadata,
>> starts the job from an empty state.
>>
>> I'm looking for a way to prevent it from exiting in case of a job error
>> (we use an application-mode cluster).
>>
>>
>>
>> --
>> *From:* Gyula Fóra 
>> *Sent:* September 20, 2022, 19:49:37
>> *To:* Evgeniy Lyutikov
>> *Cc:* user@flink.apache.org
>> *Subject:* Re: JobManager restarts on job failure
>>
>> The best thing for you to do would be to upgrade to Flink 1.15 and the
>> latest operator version.
>> In Flink 1.15 we have the option to interact with the Flink jobmanager
>> even after the job FAILED and the operator leverages this for a much more
>> robust behaviour.
>>
>> In any case the operator should not ever start the job from an empty
>> state (even if it FAILED), if you think that's happening could you please
>> open a JIRA ticket with the accompanying JM and Operator logs?
>>
>> Thanks
>> Gyula
>>
>> On Tue, Sep 20, 2022 at 1:00 PM Evgeniy Lyutikov 
>> wrote:
>>
>>> Hi,
>>> We are using Flink 1.14.4 with the Flink Kubernetes operator.
>>>
>>> Sometimes when updating a job, it fails on startup and Flink removes all
>>> HA metadata and exits the jobmanager.
>>>
>>>
>>> 2022-09-14 14:54:44,534 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator[] - Restoring
>>> job  from Checkpoint 30829 @ 1663167158684
>>> for  located at
>>> s3p://flink-checkpoints/k8s-checkpoint-job-name//chk-30829.
>>> 2022-09-14 14:54:44,638 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
>>>  reached terminal state FAILED.
>>> org.apache.flink.runtime.client.JobInitializationException: Could not
>>> start the JobMaster.
>>> Caused by: java.util.concurrent.CompletionException:
>>> java.lang.IllegalStateException: There is no operator for the state
>>> 4e1d9dde287c33a35e7970cbe64a40fe
>>> 2022-09-14 14:54:44,930 ERROR
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Fatal
>>> error occurred in the cluster entrypoint.
>>> 2022-09-14 14:54:45,020 INFO
>>> org.apache.flink.kubernetes.highavailability.KubernetesHaServices [] -
>>> Clean up the high availability data for job
>>> .
>>> 2022-09-14 14:54:45,020 INFO
>>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Shutting
>>> KubernetesApplicationClusterEntrypoint down with application status
>>> UNKNOWN. Diagnostics Cluster entrypoint has been closed externally..
>>> 2022-09-14 14:54:45,026 INFO
>>> org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - Shutting
>>> down rest endpoint.
>>> 2022-09-14 14:54:46,122 INFO
>>> akka.remote.RemoteActorRefProvider$RemotingTerminator[] - Shutting
>>> down remote daemon.
>>> 2022-09-14 14:54:46,321 INFO
>>> 

Re: Jobmanager fails to come up if the job has an issue

2022-09-26 Thread Matthias Pohl via user
Hi Ramkrishna,
thanks for reaching out to the Flink community. Could you share the
JobManager logs to get a better understanding of what's going on? I'm
wondering why the JobManager is failing when the actual problem is that the
job is struggling to access a folder. It sounds like there are multiple
problems here.

Best,
Matthias

On Mon, Sep 26, 2022 at 6:25 AM ramkrishna vasudevan <
ramvasu.fl...@gmail.com> wrote:

> Hi all
>
> I have a simple job where we read from a given path in cloud storage to
> watch for new files in a given folder. While I set up my job there was a
> permission issue on the folder. The job is a STREAMING job.
> The cluster is set up in session mode and is running on Kubernetes.
> The job manager has since been failing to come back up, and every time it
> fails with the permission issue. But the point is: how should I recover my
> cluster in this case? Since the JM is not there, the UI is also not working,
> and how do I remove the bad job from the JM?
>
> Regards
> Ram
>


Re: Classloading issues with Flink Operator / Kubernetes Native

2022-09-16 Thread Matthias Pohl via user
Are you deploying the job in session or application mode? Could you provide
the stacktrace? I'm wondering whether that would be helpful to pin down a code
location for further investigation.
So far, I couldn't come up with a definite answer about placing the jar in
the lib directory. Initially, I would have thought that it's fine
considering that all dependencies are included and the job jar itself ends
up on the user classpath. I'm curious whether Chesnay (CC'd) has an answer
to that one.
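
As a side note, the "No suitable driver" symptom often comes from
java.sql.DriverManager not seeing drivers that only live on the child-first
user classloader. A common workaround is to register the driver explicitly in
user code — a minimal sketch, assuming the standard PostgreSQL driver:

import java.sql.Connection;
import java.sql.DriverManager;

// Load the driver through the user-code classloader so that it registers
// itself with DriverManager before getConnection() is called.
Class.forName("org.postgresql.Driver");
Connection conn = DriverManager.getConnection(
    "jdbc:postgresql://host:5432/db", "user", "password");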

On Tue, Sep 13, 2022 at 1:40 AM Yaroslav Tkachenko 
wrote:

> Hey everyone,
>
> I’m migrating a Flink Kubernetes standalone job to the Flink operator
> (with Kubernetes native mode).
>
> I have a lot of classloading issues when trying to run with the operator
> in native mode. For example, I have a Postgres driver as a dependency (I
> can confirm the files are included in the uber jar), but I still get
> "java.sql.SQLException: No suitable driver found for jdbc:postgresql:..."
> exception.
>
> In the Kubernetes standalone setup my uber jar is placed in the
> /opt/flink/lib folder, this is what I specify as "jarURI" in the operator
> config. Is this supported? Should I only be using /opt/flink/usrlib?
>
> Thanks for any suggestions.
>


Re: New licensing for Akka

2022-09-09 Thread Matthias Pohl via user
Looks like there will be a bit of a grace period, until Sep 2023, for
vulnerability fixes in Akka 2.6.x [1].

[1] https://discuss.lightbend.com/t/2-6-x-maintenance-proposal/9949

On Wed, Sep 7, 2022 at 4:30 PM Robin Cassan via user 
wrote:

> Thanks a lot for your answers, this is reassuring!
>
> Cheers
>
> On Wed, Sep 7, 2022 at 13:12, Chesnay Schepler 
> wrote:
>
>> Just to squash concerns, we will make sure this license change will not
>> affect Flink users in any way.
>>
>> On 07/09/2022 11:14, Robin Cassan via user wrote:
>> > Hi all!
>> > It seems Akka have announced a licensing change
>> > https://www.lightbend.com/blog/why-we-are-changing-the-license-for-akka
>> > If I understand correctly, this could end up increasing costs a lot for
>> > companies using Flink in production. Do you know if the Flink
>> > developers have any initial reaction as to how this could be handled
>> > (using a Fork? moving out of akka, even though it's probably
>> > incredibly complex?)? Are we right to assume that this license applies
>> > when using akka through Flink?
>> >
>> > Thanks a lot!
>> > Robin
>>
>>
>>


Re: New licensing for Akka

2022-09-07 Thread Matthias Pohl via user
There is some more discussion going on in the related PR [1]. Based on the
current state of the discussion, Akka 2.6.20 will be the last version under
the Apache 2.0 license. But, I guess, we'll have to see where this discussion
is heading, considering that it's still quite fresh.

[1] https://github.com/akka/akka/pull/31561

On Wed, Sep 7, 2022 at 11:30 AM Chesnay Schepler  wrote:

> We'll have to look into it.
>
> The license would apply to usages of Flink.
> That said, I'm not sure if we'd even be allowed to use Akka under that
> license since it puts significant restrictions on the use of the software.
> If that is the case, then the options are either to use a fork created by
> another party or to switch to a different library.
>
> On 07/09/2022 11:14, Robin Cassan via user wrote:
> > Hi all!
> > It seems Akka have announced a licensing change
> > https://www.lightbend.com/blog/why-we-are-changing-the-license-for-akka
> > If I understand correctly, this could end up increasing costs a lot for
> > companies using Flink in production. Do you know if the Flink
> > developers have any initial reaction as to how this could be handled
> > (using a Fork? moving out of akka, even though it's probably
> > incredibly complex?)? Are we right to assume that this license applies
> > when using akka through Flink?
> >
> > Thanks a lot!
> > Robin
>
>
>


Re: Slow Tests in Flink 1.15

2022-09-06 Thread Matthias Pohl via user
Hi David,
I guess, you're referring to [1]. But as Chesnay already pointed out in the
previous thread: It would be helpful to get more insights into what exactly
your tests are executing (logs, code, ...). That would help identify the
cause.
> Can you give us a more complete stacktrace so we can see what call in
> Flink is waiting for something?
>
> Does this happen to all of your tests?
> Can you provide us with an example that we can try ourselves? If not,
> can you describe the test structure (e.g., is it using a
> MiniClusterResource).

Matthias

[1] https://lists.apache.org/thread/yhhprwyf29kgypzzqdmjgft4qs25yyhk
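
For readers hitting the same question, a typical test setup with this resource
looks roughly like the following sketch (sharing the cluster across tests via
@ClassRule avoids starting a fresh cluster per test):

import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
import org.apache.flink.test.util.MiniClusterWithClientResource;
import org.junit.ClassRule;

public class MyJobTest {
    @ClassRule
    public static final MiniClusterWithClientResource FLINK_CLUSTER =
        new MiniClusterWithClientResource(
            new MiniClusterResourceConfiguration.Builder()
                .setNumberTaskManagers(1)
                .setNumberSlotsPerTaskManager(2)
                .build());

    // Individual tests submit their jobs against this shared mini cluster.
}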

On Mon, Sep 5, 2022 at 4:59 PM David Jost  wrote:

> Hi,
>
> we were going to upgrade our application from Flink 1.14.4 to Flink
> 1.15.2, when we noticed, that all our job tests, using a
> MiniClusterWithClientResource, are multiple times slower in 1.15 than
> before in 1.14. I, unfortunately, have not found mentions in that regard in
> the changelog or documentation. The slowdown is rather extreme I hope to
> find a solution to this. I saw it mentioned once in the mailing list, but
> there was no (public) outcome to it.
>
> I would appreciate any help on this. Thank you in advance.
>
> Best
>  David


Re: flink ci build run longer than the maximum time of 310 minutes.

2022-09-05 Thread Matthias Pohl via user
Usually, it would be more helpful to provide a link to the PR to get a
better picture of the problem. I'm not 100% sure whether I grasp what's
wrong.

It looks like your branch is based on apache/flink:release-1.15 [1].
Therefore, you should fetch the most recent version from upstream and then
do a git rebase upstream/release-1.15. This will put your 4 commits which
you've added to your local branch so far "on top" of everything that is
already part of upstream/release-1.15. This should resolve your branch
being 11 commits behind the 1.15 release branch. Force-pushing the changes
in your local branch to your remote repo (your fork) will update the PR.

Keep in mind that you have to specify the right base branch in your Github
PR (pointing to the 1.15 release branch in your case) as well to have the
right diff.

I hope that helps. Best,
Matthias

[1] https://github.com/apache/flink/tree/release-1.15
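
Concretely, the rebase could look like this (a sketch, assuming the remote
'upstream' points at apache/flink and 'my-branch' is the PR branch):

git fetch upstream
git rebase upstream/release-1.15
git push --force-with-lease origin my-branch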

On Sat, Sep 3, 2022 at 10:18 AM hjw <1010445...@qq.com> wrote:

> Hi Matthias,
> The CI build error is in the e2e_1_ci job:
> Sep 02 11:01:51 ##[group]Top 15 biggest directories in terms of used disk
> space
> Sep 02 11:01:52 Searching for .dump, .dumpstream and related files in
> '/home/vsts/work/1/s'
> dmesg: read kernel buffer failed: Operation not permitted
> Sep 02 11:01:53 No taskexecutor daemon to stop on host fv-az158-417.
> Sep 02 11:01:53 No standalonesession daemon to stop on host fv-az158-417.
> Sep 02 11:10:27 The command 'docker build --no-cache --network=host -t
> test_docker_embedded_job dev/test_docker_embedded_job-debian' (pid: 188432)
> did not finish after 600 seconds.
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common.sh: line
> 900: kill: (188432) - No such process
> Sep 02 11:11:06 The command 'docker build --no-cache --network=host -t
> test_docker_embedded_job dev/test_docker_embedded_job-debian' (pid: 188484)
> did not finish after 600 seconds.
> /home/vsts/work/1/s/flink-end-to-end-tests/test-scripts/common.sh: line
> 900: kill: (188484) - No such process
>
> I think the issue applies to my case.
> However, I have submitted some commits to my fork repo and created a PR. The
> PR has not been merged into the Flink repo. My fork repo status: this
> branch is 4 commits ahead, 11 commits behind apache:release-1.15.
>
> When I rebase the branch from upstream and push to my fork repo, the 11
> commits behind apache:release-1.15 also appear in my PR's changed files.
> How can I handle this situation? Thanks.
>
> --
> Best,
> Hjw
>
>
>
> -- Original message --
> *From:* "Matthias Pohl" ;
> *Sent:* Friday, September 2, 2022, 7:29 PM
> *To:* "Martijn Visser";
> *Cc:* "hjw"<1010445...@qq.com>;"user";
> *Subject:* Re: flink ci build run longer than the maximum time of 310 minutes.
>
> Not sure whether that applies to your case, but there was a recent issue
> [1] where the e2e_1_ci job ran into a timeout. If that's what you were
> observing, rebasing your branch might help.
>
> Best,
> Matthias
>
> [1] https://issues.apache.org/jira/browse/FLINK-29161
>
> On Fri, Sep 2, 2022 at 10:51 AM Martijn Visser 
> wrote:
>
>> You can ask the Flinkbot to run again by typing as comment
>>
>> @flinkbot run azure
>>
>> Best regards,
>>
>> Martijn
>>
>> On Fri, Sep 2, 2022 at 08:40, hjw <1010445...@qq.com> wrote:
>>
>>> I committed a PR to the Flink GitHub repo.
>>> An error happened in the CI build.
>>> [error]The job running on agent Azure Pipelines 6 ran longer than the
>>> maximum time of 310 minutes. For more information, see
>>> https://go.microsoft.com/fwlink/?linkid=2077134
>>>
>>> How do I solve this problem?
>>> How do I trigger the CI build again?
>>> Thanks.
>>>
>>


Re: flink ci build run longer than the maximum time of 310 minutes.

2022-09-02 Thread Matthias Pohl via user
Not sure whether that applies to your case, but there was a recent issue
[1] where the e2e_1_ci job ran into a timeout. If that's what you were
observing, rebasing your branch might help.

Best,
Matthias

[1] https://issues.apache.org/jira/browse/FLINK-29161

On Fri, Sep 2, 2022 at 10:51 AM Martijn Visser 
wrote:

> You can ask the Flinkbot to run again by typing as comment
>
> @flinkbot run azure
>
> Best regards,
>
> Martijn
>
> On Fri, Sep 2, 2022 at 08:40, hjw <1010445...@qq.com> wrote:
>
>> I committed a PR to the Flink GitHub repo.
>> An error happened in the CI build.
>> [error]The job running on agent Azure Pipelines 6 ran longer than the
>> maximum time of 310 minutes. For more information, see
>> https://go.microsoft.com/fwlink/?linkid=2077134
>>
>> How do I solve this problem?
>> How do I trigger the CI build again?
>> Thanks.
>>
>


Re: Failing to maven compile install Flink 1.15

2022-08-22 Thread Matthias Pohl via user
Hi hjw,
it would be interesting to know the exact Maven commands you used for the
successful run (where you compiled the flink-client module individually)
and the failed run (where you tried to build everything at once) and
probably a more complete version of the Maven output.

The path
D:\learn\Code\Flink\FlinkSourceCode\Flink-1.15\flink\src\test\assembly\test-assembly.xml
appears
to be strange. The Flink sources have a test-assembly.xml file
configuration in two locations:
$ find . -name test-assembly.xml
./flink-clients/src/test/assembly/test-assembly.xml
./flink-formats/flink-avro/src/test/assembly/test-assembly.xml

There's also no src folder in Flink's root folder which indicates there's
something which (at least I) don't understand about your setup.

Best,
Matthias
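
For comparison, a full build from the repository root usually looks like the
following (a sketch; skipping tests keeps the build time manageable):

cd flink
mvn clean install -DskipTests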

On Fri, Aug 19, 2022 at 3:16 AM yuxia  wrote:

> Which mvn version do you use? It's recommended to use Maven 3.2.5.
>
> Best regards,
> Yuxia
>
> --
> *From: *"hjw" <1010445...@qq.com>
> *To: *"User" 
> *Sent: *Thursday, August 18, 2022, 10:48:57 PM
> *Subject: *Failing to maven compile install Flink 1.15
>
> I tried to mvn clean install the Flink 1.15 parent, but it failed.
> An error happened while compiling flink-clients.
> Error log:
> Failed to execute goal
> org.apache.maven.plugins:maven-assembly-plugin:2.4:single
> (create-test-dependency) on project flink-clients: Error reading
> assemblies: Error locating assembly descriptor:
> src/test/assembly/test-assembly.xml
>
> [1] [INFO] Searching for file location:
> D:\learn\Code\Flink\FlinkSourceCode\Flink-1.15\flink\flink-clients\target\src\test\assembly\test-assembly.xml
>
> [2] [INFO] File:
> D:\learn\Code\Flink\FlinkSourceCode\Flink-1.15\flink\flink-clients\target\src\test\assembly\test-assembly.xml
> does not exist.
>
> [3] [INFO] File:
> D:\learn\Code\Flink\FlinkSourceCode\Flink-1.15\flink\src\test\assembly\test-assembly.xml
> does not exist.
>
>
> However, mvn clean package on the Flink 1.15 parent and on flink-clients
> alone are both successful.
>
>
>