Re: Start Flink cluster, k8s pod behavior

2021-09-30 Thread Matthias Pohl
Hi Qihua,
I guess, looking into kubectl describe and the JobManager logs would help
in understanding what's going on.
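For reference, a minimal sketch of what is meant here (using the pod name from your output below; adjust to your namespace):

  kubectl describe pod job-manager-776dcf6dd-xzs8g
  # logs of the current container
  kubectl logs job-manager-776dcf6dd-xzs8g
  # logs of the previously crashed container, once the pod has restarted
  kubectl logs --previous job-manager-776dcf6dd-xzs8g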

Best,
Matthias

On Wed, Sep 29, 2021 at 8:37 PM Qihua Yang  wrote:

> Hi,
> I deployed Flink in session mode. I didn't run any jobs. I saw the logs below.
> That is normal, the same as the Flink manual shows.
>
> + /opt/flink/bin/run-job-manager.sh
> Starting HA cluster with 1 masters.
> Starting standalonesession daemon on host job-manager-776dcf6dd-xzs8g.
> Starting taskexecutor daemon on host job-manager-776dcf6dd-xzs8g.
>
> But when I check kubectl, it shows the status is Completed. After a while,
> the status changes to CrashLoopBackOff and the pod restarts.
> NAME  READY
> STATUS RESTARTS   AGE
> job-manager-776dcf6dd-xzs8g   0/1 Completed  5
>  5m27s
>
> NAME  READY
> STATUS RESTARTS   AGE
> job-manager-776dcf6dd-xzs8g   0/1 CrashLoopBackOff   5
>  7m35s
>
> Can anyone help me understand why?
> Why does Kubernetes regard this pod as completed and restart it? Should I
> configure something, either on the Flink side or the Kubernetes side? From the
> Flink manual, after the cluster is started, I can upload a jar to run the
> application.
>
> Thanks,
> Qihua
>


Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-30 Thread Matthias Pohl
Thanks for sharing. I was wondering why you don't use $PORT0 in your
command. Also: are the ports properly configured in the Marathon network
configuration [1]? The error seems to be unrelated to that setting, though.
Other than that, I cannot see any other issue with the configuration. Could
it be that the HOST IP is blocked?

[1] https://mesosphere.github.io/marathon/docs/ports.html#specifying-ports
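If the BindException in the log comes from the REST endpoint trying to bind the agent's $HOST address inside the Docker container (an assumption, not verified against this setup), one variation worth trying is to bind to all interfaces while still advertising $HOST to the outside; the remaining -D options stay as in the original command:

/opt/flink/bin/mesos-appmaster.sh \
  -Drest.address=$HOST \
  -Drest.bind-address=0.0.0.0 \
  -Drest.bind-port=8081 \
  -Drest.port=8081 \
  ...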

On Wed, Sep 29, 2021 at 7:07 PM Javier Vegas  wrote:

>
> Full appmaster log in debug mode is attached.
> My startup command was
> /opt/flink/bin/mesos-appmaster.sh \
>   -Drest.bind-port=8081 \
>   -Drest.port=8081 \
>   -Djobmanager.rpc.address=$HOST \
>   -Djobmanager.rpc.port=$PORT1 \
>   -Dmesos.resourcemanager.framework.user=flink \
>   -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
>   -Dmesos.master=10.0.18.246:5050 \
>   -Dmesos.resourcemanager.tasks.cpus=4 \
>   -Dmesos.resourcemanager.tasks.container.type=docker \
>   -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/timeline-populator2:jv-mesos \
>   -Dtaskmanager.numberOfTaskSlots=4 ;
>
> where $PORT1 refers to my second open host port, mapped to 6123 on the
> Docker container (the first port is mapped to 8081).
> I can see in the log that $HOST and $PORT1 resolve to the correct values,
> 10.0.20.25 and 31608.
>
> On Wed, Sep 29, 2021 at 9:41 AM Matthias Pohl 
> wrote:
>
>> ...and if possible, it would be helpful to provide debug logs as well.
>>
>> On Wed, Sep 29, 2021 at 6:33 PM Matthias Pohl 
>> wrote:
>>
>>> Could you provide the entire JobManager logs so that we can see what's
>>> going on?
>>>
>>> On Wed, Sep 29, 2021 at 12:42 PM Javier Vegas  wrote:
>>>
 Thanks again, Matthias!

 Putting -Djobmanager.rpc.address=$HOST and
 -Djobmanager.rpc.port=$PORT0 as params for appmaster.sh,
 I see in the log that they seem to resolve to the correct values

 -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009

 but a bit later the appmaster dies with this new error. It is unclear
 what address it is trying to bind; I added the explicit params
 -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port
 was somehow interfering, but that didn't help.

 2021-09-29 10:29:59.845 [main] INFO  
 org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting 
 MesosSessionClusterEntrypoint down with application status FAILED. 
 Diagnostics java.net.BindException: Cannot assign requested address
at java.base/sun.nio.ch.Net.bind0(Native Method)
at java.base/sun.nio.ch.Net.bind(Unknown Source)
at java.base/sun.nio.ch.Net.bind(Unknown Source)
at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
at 
 org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
at 
 org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
at 
 org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
at 
 org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
at 
 org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
at 
 org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
at 
 org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
at 
 org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
at 
 org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at 
 org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at 
 org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at 
 org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at 
 org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at 
 org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)

 On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl 
 wrote:

> The port has its own configuration parameter, jobmanager.rpc.port [1].
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>
>>

In flight records on Flink : Newbie question

2021-09-30 Thread Declan Harrison
Hi Guys

I've just recently started using Apache Flink to evaluate its suitability
for a project I'm working on.

First impressions are that the project is great: it is well documented and has
lots of examples and guidance showcasing the multitude of things it can do. It
can be challenging to know where to start at times, as there are many ways
to achieve the same result.

My pipeline is similar to an ETL: I have a continuous DataStream source
of Java records modelled as POJOs, and I transform each POJO into a
single JSON record before writing it to a streaming sink. This all works as
expected.

However, my question is in two parts, and I hope you can help; apologies in
advance if it highlights my lack of experience.

   - Can I get access to the in-flight records that are currently being
   processed within Flink? By in-flight I mean the records that are currently
   being processed but haven't yet been written to the sink.
   - Is the number of in-flight records deterministic? How many records does
   Flink process per subtask/thread? For example, is it one record at a
   time per subtask?

Thanks
Declan


Re: Start Flink cluster, k8s pod behavior

2021-09-30 Thread Chesnay Schepler

Is the run-job-manager.sh script actually blocking?
Since you (apparently) use that as an entrypoint, if that script exits 
after starting the JM, then from the perspective of Kubernetes everything 
is done.
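A minimal sketch of a blocking entrypoint, assuming the standard scripts shipped with the Flink distribution are available in the image (run-job-manager.sh itself is a custom script, so this only illustrates the idea):

  #!/usr/bin/env bash
  # start the TaskManager as a background daemon...
  /opt/flink/bin/taskmanager.sh start
  # ...and keep the JobManager in the foreground so the container never "completes"
  exec /opt/flink/bin/jobmanager.sh start-foreground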


On 30/09/2021 08:59, Matthias Pohl wrote:

Hi Qihua,
I guess, looking into kubectl describe and the JobManager logs would 
help in understanding what's going on.


Best,
Matthias

On Wed, Sep 29, 2021 at 8:37 PM Qihua Yang wrote:


Hi,
I deployed Flink in session mode. I didn't run any jobs. I saw the
logs below. That is normal, the same as the Flink manual shows.

+ /opt/flink/bin/run-job-manager.sh
Starting HA cluster with 1 masters.
Starting standalonesession daemon on host job-manager-776dcf6dd-xzs8g.
Starting taskexecutor daemon on host job-manager-776dcf6dd-xzs8g.

But when I check kubectl, it shows the status is Completed. After a
while, the status changes to CrashLoopBackOff and the pod restarts.
NAME              READY   STATUS             RESTARTS   AGE
job-manager-776dcf6dd-xzs8g       0/1     Completed      5        
 5m27s

NAME              READY   STATUS             RESTARTS   AGE
job-manager-776dcf6dd-xzs8g       0/1 CrashLoopBackOff   5        
 7m35s

Can anyone help me understand why?
Why does Kubernetes regard this pod as completed and restart it? Should
I configure something, either on the Flink side or the Kubernetes side? From the
Flink manual, after the cluster is started, I can upload a jar to
run the application.

Thanks,
Qihua





Re: Flink application mode with no ui , how to start job using k8s ?

2021-09-30 Thread Denis Nutiu
Hi,

If you're new to k8s you can try Flink's native Kubernetes integration [1]. It
lets you deploy Flink in application mode or session mode straight from the
Flink CLI, but note that Reactive Mode is not supported with the native
Kubernetes deployment.

To answer your questions:
a) You need to bundle your jar with the Flink image or mount it in a volume.
b) Yes, you should be able to.

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/native_kubernetes/
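For illustration, a minimal application-mode submission along the lines of [1] (cluster id, image name, and jar path below are placeholders to replace with your own):

./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=my-application-cluster \
    -Dkubernetes.container.image=my-custom-flink-image \
    local:///opt/flink/usrlib/my-flink-job.jar

The local:// scheme is what point a) refers to: the jar must already be inside the image (or a mounted volume) at that path.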

-- 
Best,
Denis Nutiu


Re: [ANNOUNCE] Flink mailing lists archive service has migrated to Apache Archive service

2021-09-30 Thread Till Rohrmann
Thanks for the hint with the managed search engines Matthias. I think this
is quite helpful.

Cheers,
Till

On Wed, Sep 15, 2021 at 4:27 PM Matthias Pohl 
wrote:

> Thanks Leonard for the announcement. I guess that is helpful.
>
> @Robert is there any way we can change the default setting to something
> else (e.g. greater than 0 days)? Only having the last month available as a
> default is kind of annoying considering that the time setting is quite
> hidden.
>
> Matthias
>
> PS: As a workaround, one could use the gte=0d parameter which is encoded in
> the URL (e.g. if you use managed search engines in Chrome or Firefox's
> bookmark keywords:
> https://lists.apache.org/x/list.html?user@flink.apache.org:gte=0d:%s).
> That
> will make all posts available right-away.
>
> On Mon, Sep 6, 2021 at 3:16 PM JING ZHANG  wrote:
>
> > Thanks Leonard for driving this.
> > The information is helpful.
> >
> > Best,
> > JING ZHANG
> >
> >> On Mon, 6 Sept 2021 at 16:59, Jark Wu wrote:
> >
> >> Thanks Leonard,
> >>
> >> I have seen many users complaining that the Flink mailing list doesn't
> >> work (they were using Nabble).
> >> I think this information would be very helpful.
> >>
> >> Best,
> >> Jark
> >>
> >> On Mon, 6 Sept 2021 at 16:39, Leonard Xu  wrote:
> >>
> >>> Hi, all
> >>>
> >>> The mailing list archive service Nabble Archive broke at the end of June.
> >>> The Flink community has migrated the mailing list archives [1] to the
> >>> Apache Archive service via commit [2]; you can refer to [3] to learn more
> >>> about Flink's mailing list archives.
> >>>
> >>> The Apache Archive service is maintained by the ASF, so its stability is
> >>> guaranteed. It is a web-based mail archive service that allows you to
> >>> browse, search, interact with, subscribe to, and unsubscribe from mailing
> >>> lists.
> >>>
> >>> The Apache Archive service shows mails from the last month by default; you
> >>> can specify a date range to browse and search older mails.
> >>>
> >>>
> >>> Hope it would be helpful.
> >>>
> >>> Best,
> >>> Leonard
> >>>
> >>> [1] The Flink mailing lists in Apache archive service
> >>> dev mailing list archives:
> >>> https://lists.apache.org/list.html?d...@flink.apache.org
> >>> user mailing list archives :
> >>> https://lists.apache.org/list.html?user@flink.apache.org
> >>> user-zh mailing list archives :
> >>> https://lists.apache.org/list.html?user...@flink.apache.org
> >>> [2]
> >>>
> https://github.com/apache/flink-web/commit/9194dda862da00d93f627fd315056471657655d1
> >>> [3] https://flink.apache.org/community.html#mailing-lists
> >>
> >>
>


Does Flink 1.12.2 support Zookeeper version 3.6+

2021-09-30 Thread Prasanna kumar
Hi,

Does Flink 1.12.2 support ZooKeeper version 3.6+?

If we add the ZooKeeper 3.6 jar to the Flink image, would it be able
to connect?

The following link mentions only ZooKeeper 3.5 or 3.4:
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/zookeeper_ha/#zookeeper-versions


Thanks,
Prasanna.


Re: [ANNOUNCE] Flink mailing lists archive service has migrated to Apache Archive service

2021-09-30 Thread Robert Metzger
@Matthias Pohl: I've also been annoyed by this 30-day
limit, but I'm not aware of a way to globally change the default. I
would ask in #asfinfra on the ASF Slack.

On Thu, Sep 30, 2021 at 12:19 PM Till Rohrmann  wrote:

> Thanks for the hint with the managed search engines Matthias. I think this
> is quite helpful.
>
> Cheers,
> Till
>
> On Wed, Sep 15, 2021 at 4:27 PM Matthias Pohl 
> wrote:
>
> > Thanks Leonard for the announcement. I guess that is helpful.
> >
> > @Robert is there any way we can change the default setting to something
> > else (e.g. greater than 0 days)? Only having the last month available as
> a
> > default is kind of annoying considering that the time setting is quite
> > hidden.
> >
> > Matthias
> >
> > PS: As a workaround, one could use the gte=0d parameter which is encoded
> in
> > the URL (e.g. if you use managed search engines in Chrome or Firefox's
> > bookmark keywords:
> > https://lists.apache.org/x/list.html?user@flink.apache.org:gte=0d:%s).
> > That
> > will make all posts available right-away.
> >
> > On Mon, Sep 6, 2021 at 3:16 PM JING ZHANG  wrote:
> >
> > > Thanks Leonard for driving this.
> > > The information is helpful.
> > >
> > > Best,
> > > JING ZHANG
> > >
> > >> On Mon, 6 Sept 2021 at 16:59, Jark Wu wrote:
> > >
> > >> Thanks Leonard,
> > >>
> > >> I have seen many users complaining that the Flink mailing list doesn't
> > >> work (they were using Nabble).
> > >> I think this information would be very helpful.
> > >>
> > >> Best,
> > >> Jark
> > >>
> > >> On Mon, 6 Sept 2021 at 16:39, Leonard Xu  wrote:
> > >>
> > >>> Hi, all
> > >>>
> > >>> The mailing list archive service Nabble Archive broke at the end of June.
> > >>> The Flink community has migrated the mailing list archives [1] to the
> > >>> Apache Archive service via commit [2]; you can refer to [3] to learn more
> > >>> about Flink's mailing list archives.
> > >>>
> > >>> The Apache Archive service is maintained by the ASF, so its stability is
> > >>> guaranteed. It is a web-based mail archive service that allows you to
> > >>> browse, search, interact with, subscribe to, and unsubscribe from mailing
> > >>> lists.
> > >>>
> > >>> The Apache Archive service shows mails from the last month by default; you
> > >>> can specify a date range to browse and search older mails.
> > >>>
> > >>>
> > >>> Hope it would be helpful.
> > >>>
> > >>> Best,
> > >>> Leonard
> > >>>
> > >>> [1] The Flink mailing lists in Apache archive service
> > >>> dev mailing list archives:
> > >>> https://lists.apache.org/list.html?d...@flink.apache.org
> > >>> user mailing list archives :
> > >>> https://lists.apache.org/list.html?user@flink.apache.org
> > >>> user-zh mailing list archives :
> > >>> https://lists.apache.org/list.html?user...@flink.apache.org
> > >>> [2]
> > >>>
> >
> https://github.com/apache/flink-web/commit/9194dda862da00d93f627fd315056471657655d1
> > >>> [3] https://flink.apache.org/community.html#mailing-lists
> > >>
> > >>
> >
>


RE: FlinkJobNotFoundException

2021-09-30 Thread Hailu, Andreas
Hi Matthias, the log file is quite large (21MB), so mailing it over in its 
entirety would have been a challenge. The file is available here [1], and we're 
of course happy to share any relevant parts of it on the mailing list.

I think since we've shared logs with you in the past, you weren't sent an 
additional welcome email ☺


[1] https://lockbox.gs.com/lockbox/folders/dc2ccacc-f2d2-4d66-a098-461b43e8b65f/

// ah

From: Matthias Pohl 
Sent: Thursday, September 30, 2021 2:57 AM
To: Gusick, Doug S [Engineering] 
Cc: user@flink.apache.org; Erai, Rahul [Engineering] 

Subject: Re: FlinkJobNotFoundException

I didn't receive any email. But we'd rather not do individual support. Please 
share the logs on the mailing list; this way, anyone is able to participate in 
the discussion.

Best,
Matthias

On Wed, Sep 29, 2021 at 8:12 PM Gusick, Doug S wrote:
Hi Matthias,

Thank you for getting back. We have been looking into upgrading to a newer 
version, but have not completed full testing just yet.

I was unable to find an earlier error in the JM logs. You should have received 
an email with details for a “lockbox”. I have uploaded the JobManager logs 
there. Please let me know if you need any more information.

Thank you,
Doug

From: Matthias Pohl
Sent: Wednesday, September 29, 2021 12:00 PM
To: Gusick, Doug S [Engineering]
Cc: user@flink.apache.org; Erai, Rahul [Engineering]
Subject: Re: FlinkJobNotFoundException

Hi Doug,
thanks for reaching out to the community. First of all, 1.9.2 is quite an old 
Flink version. You might want to consider upgrading to a newer version. The 
community only offers support for the two most recent Flink versions, and newer 
versions might include fixes for your issue.

But back to your actual problem: The logs you're providing only show that some 
job switched into FINISHED state. Is there some error showing up earlier in the 
logs which you might have missed? It would be helpful if you could share the 
complete JobManager logs to get a better understanding of what's going on.

Best,
Matthias

On Wed, Sep 29, 2021 at 3:47 PM Gusick, Doug S wrote:
Hello,

We are facing an issue with some of our applications that are submitting a high 
volume of jobs to Flink (we are using v1.9.2). We are observing that numerous 
jobs (in this case 44 out of 350+) fail with the same FlinkJobNotFoundException 
within a 45 second timeframe.

From our client logs, this is the exception we can see:


Calc Engine: Caused by: org.apache.flink.runtime.rest.util.RestClientException: 
[org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
Flink job (d0991f0ae712a9df710aa03311a32c8c)]

Calc Engine:   at 
org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)

Calc Engine:   at 
org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)

Calc Engine:   at 
java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)

Calc Engine:   at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)

Calc Engine:   at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)

Calc Engine:   ... 3 more


This is the first job to fail with the above exception. From the JobManager 
logs, we can see that the job goes to FINISHED State, and then we see the 
following exception:

2021-09-28 04:54:16,936 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.executiongraph.ExecutionGraph- Job Flink Java 
Job at Tue Sep 28 04:48:21 EDT 2021 (d0991f0ae712a9df710aa03311a32c8c) switched 
from state RUNNING to FINISHED.
2021-09-28 04:54:16,937 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job 
d0991f0ae712a9df710aa03311a32c8c reached globally terminal state FINISHED.
2021-09-28 04:54:16,939 INFO  [flink-akka.actor.default-dispatcher-28] 
org.apache.flink.runtime.jobmaster.JobMaster  - Stopping the 
JobMaster for job Flink Java Job at Tue Sep 28 04:48:21 EDT 
2021(d0991f0ae712a9df710aa03311a32c8c).
2021-09-28 04:54:16,940 INFO  [flink-akka.actor.default-dispatcher-39] 
org.apache.flink.yarn.YarnResourceManager - Disconnect job 
manager 
0...@akka.tcp://fl...@d43723-714.dc.gs.com:44887/user/jobmanager_392
 for job d0991f0ae712a9df710aa03311a32c8c from the resource manager.
2021-09-28 04:54:18,256 ERROR [flink-akka.actor.default-dispatcher-91] 
org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler  - 
Exception occurred in REST handler: 
org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find 
Flink job (d0991

Re: Does Flink 1.12.2 support Zookeeper version 3.6+

2021-09-30 Thread Chesnay Schepler

We only support ZooKeeper 3.4/3.5.

To try another ZK version you will need to create a 
flink-shaded-zookeeper artifact, similar to the 3.4/3.5 version that you 
can find here: 
https://github.com/apache/flink-shaded/tree/master/flink-shaded-zookeeper-parent


Once you have that, it is theoretically as simple as replacing the 
flink-shaded-zookeeper jar in the lib/ directory.
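A rough sketch of that swap, assuming you have added a hypothetical flink-shaded-zookeeper-36 module to flink-shaded modeled on the existing 3.5 one (module and jar names below are placeholders):

# build only the new module from a flink-shaded checkout
cd flink-shaded
mvn clean package -pl flink-shaded-zookeeper-parent/flink-shaded-zookeeper-36 -am

# swap the jar in the Flink distribution that goes into your image
rm $FLINK_HOME/lib/flink-shaded-zookeeper-*.jar
cp flink-shaded-zookeeper-parent/flink-shaded-zookeeper-36/target/flink-shaded-zookeeper-3-3.6.*.jar $FLINK_HOME/lib/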


On 30/09/2021 14:26, Prasanna kumar wrote:

Hi,

Does Flink 1.12.2 support ZooKeeper version 3.6+?

If we add the ZooKeeper 3.6 jar to the Flink image, would it be
able to connect?


The following link mentions only ZooKeeper 3.5 or 3.4:
https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/zookeeper_ha/#zookeeper-versions



Thanks,
Prasanna.





RocksDB: Spike in Memory Usage Post Restart

2021-09-30 Thread Kevin Lam
Hi all,

We're debugging an issue with OOMs that occur on our jobs shortly after a
restore from checkpoint. Our application is running on Kubernetes and uses
RocksDB as its state backend.

We reproduced the issue on a small cluster of 2 task managers. If we killed
a single task manager, we noticed that after restoring from checkpoint, the
untouched task manager has an elevated memory footprint (see the blue line
for the surviving task manager):

[image: image.png]
If we kill the newest TM (yellow line) again, after restoring the surviving
task manager gets OOM killed.

We looked at the OOMKiller Report and it seems that the memory is not
coming from the JVM but we're unsure of the source. It seems like something
is allocating native memory that the JVM is not aware of.

We're suspicious of RocksDB. Has anyone seen this kind of issue before? Is
it possible there's some kind of memory pressure or memory leak coming from
RocksDB that only presents itself when a job is restarted? Perhaps
something isn't cleaned up?

Any help would be appreciated.
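A minimal sketch of the settings that bound RocksDB's native allocations, in case they are worth double-checking (key names are valid for recent Flink releases; the values shown are the defaults, used here only as placeholders, not a recommendation for this workload):

# append to the TaskManager configuration used by the pods
cat >> "$FLINK_HOME/conf/flink-conf.yaml" <<'EOF'
state.backend.rocksdb.memory.managed: true
taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.jvm-overhead.max: 1g
EOF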


Re: Start Flink cluster, k8s pod behavior

2021-09-30 Thread Qihua Yang
I did check kubectl describe; it shows the info below. The reason is Completed.

Ports: 8081/TCP, 6123/TCP, 6124/TCP, 6125/TCP
Host Ports:0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
  /opt/flink/bin/entrypoint.sh
Args:
  /opt/flink/bin/run-job-manager.sh
State:  Waiting
  Reason:   CrashLoopBackOff
Last State: Terminated
  Reason:   Completed
  Exit Code:0
  Started:  Wed, 29 Sep 2021 20:12:30 -0700
  Finished: Wed, 29 Sep 2021 20:12:45 -0700
Ready:  False
Restart Count:  131


On Wed, Sep 29, 2021 at 11:59 PM Matthias Pohl 
wrote:

> Hi Qihua,
> I guess, looking into kubectl describe and the JobManager logs would help
> in understanding what's going on.
>
> Best,
> Matthias
>
> On Wed, Sep 29, 2021 at 8:37 PM Qihua Yang  wrote:
>
>> Hi,
>> I deployed flink in session mode. I didn't run any jobs. I saw below
>> logs. That is normal, same as Flink menual shows.
>>
>> + /opt/flink/bin/run-job-manager.sh
>> Starting HA cluster with 1 masters.
>> Starting standalonesession daemon on host job-manager-776dcf6dd-xzs8g.
>> Starting taskexecutor daemon on host job-manager-776dcf6dd-xzs8g.
>>
>> But when I check kubectl, it shows the status is Completed. After a while,
>> the status changes to CrashLoopBackOff and the pod restarts.
>> NAME  READY
>> STATUS RESTARTS   AGE
>> job-manager-776dcf6dd-xzs8g   0/1 Completed  5
>>  5m27s
>>
>> NAME  READY
>> STATUS RESTARTS   AGE
>> job-manager-776dcf6dd-xzs8g   0/1 CrashLoopBackOff   5
>>  7m35s
>>
>> Can anyone help me understand why?
>> Why does Kubernetes regard this pod as completed and restart it? Should I
>> configure something, either on the Flink side or the Kubernetes side? From the
>> Flink manual, after the cluster is started, I can upload a jar to run the
>> application.
>>
>> Thanks,
>> Qihua
>>
>


Re: Start Flink cluster, k8s pod behavior

2021-09-30 Thread Qihua Yang
Thank you for your reply.
From the log, the exit code is 0 and the reason is Completed.
It looks like the cluster is fine. But why does Kubernetes restart the pod? As you
said, from the perspective of Kubernetes everything is done. Then how do I
prevent the restart?
It didn't even give me a chance to upload and run a jar.

Ports: 8081/TCP, 6123/TCP, 6124/TCP, 6125/TCP
Host Ports:0/TCP, 0/TCP, 0/TCP, 0/TCP
Command:
  /opt/flink/bin/entrypoint.sh
Args:
  /opt/flink/bin/run-job-manager.sh
State:  Waiting
  Reason:   CrashLoopBackOff
Last State: Terminated
  Reason:   Completed
  Exit Code:0
  Started:  Wed, 29 Sep 2021 20:12:30 -0700
  Finished: Wed, 29 Sep 2021 20:12:45 -0700
Ready:  False
Restart Count:  131

Thanks,
Qihua

On Thu, Sep 30, 2021 at 1:00 AM Chesnay Schepler  wrote:

> Is the run-job-manager.sh script actually blocking?
> Since you (apparently) use that as an entrypoint, if that script exits
> after starting the JM, then from the perspective of Kubernetes everything is
> done.
>
> On 30/09/2021 08:59, Matthias Pohl wrote:
>
> Hi Qihua,
> I guess, looking into kubectl describe and the JobManager logs would help
> in understanding what's going on.
>
> Best,
> Matthias
>
> On Wed, Sep 29, 2021 at 8:37 PM Qihua Yang  wrote:
>
>> Hi,
>> I deployed Flink in session mode. I didn't run any jobs. I saw the
>> logs below. That is normal, the same as the Flink manual shows.
>>
>> + /opt/flink/bin/run-job-manager.sh
>> Starting HA cluster with 1 masters.
>> Starting standalonesession daemon on host job-manager-776dcf6dd-xzs8g.
>> Starting taskexecutor daemon on host job-manager-776dcf6dd-xzs8g.
>>
>>
>> But when I check kubectl, it shows the status is Completed. After a while,
>> the status changes to CrashLoopBackOff and the pod restarts.
>> NAME  READY
>> STATUS RESTARTS   AGE
>> job-manager-776dcf6dd-xzs8g   0/1 Completed  5
>>  5m27s
>>
>> NAME  READY
>> STATUS RESTARTS   AGE
>> job-manager-776dcf6dd-xzs8g   0/1 CrashLoopBackOff   5
>>  7m35s
>>
>> Can anyone help me understand why?
>> Why does Kubernetes regard this pod as completed and restart it? Should I
>> configure something, either on the Flink side or the Kubernetes side? From the
>> Flink manual, after the cluster is started, I can upload a jar to run the
>> application.
>>
>> Thanks,
>> Qihua
>>
>
>


Re: Start Flink cluster, k8s pod behavior

2021-09-30 Thread Qihua Yang
It looks like after the *flink-daemon.sh* script completes, it returns exit 0,
and Kubernetes regards the pod as done. Is that expected?

Thanks,
Qihua

On Thu, Sep 30, 2021 at 11:11 AM Qihua Yang  wrote:

> Thank you for your reply.
> From the log, the exit code is 0 and the reason is Completed.
> It looks like the cluster is fine. But why does Kubernetes restart the pod? As you
> said, from the perspective of Kubernetes everything is done. Then how do I
> prevent the restart?
> It didn't even give me a chance to upload and run a jar.
>
> Ports: 8081/TCP, 6123/TCP, 6124/TCP, 6125/TCP
> Host Ports:0/TCP, 0/TCP, 0/TCP, 0/TCP
> Command:
>   /opt/flink/bin/entrypoint.sh
> Args:
>   /opt/flink/bin/run-job-manager.sh
> State:  Waiting
>   Reason:   CrashLoopBackOff
> Last State: Terminated
>   Reason:   Completed
>   Exit Code:0
>   Started:  Wed, 29 Sep 2021 20:12:30 -0700
>   Finished: Wed, 29 Sep 2021 20:12:45 -0700
> Ready:  False
> Restart Count:  131
>
> Thanks,
> Qihua
>
> On Thu, Sep 30, 2021 at 1:00 AM Chesnay Schepler 
> wrote:
>
>> Is the run-job-manager.sh script actually blocking?
>> Since you (apparently) use that as an entrypoint, if that script exits
>> after starting the JM, then from the perspective of Kubernetes everything is
>> done.
>>
>> On 30/09/2021 08:59, Matthias Pohl wrote:
>>
>> Hi Qihua,
>> I guess, looking into kubectl describe and the JobManager logs would help
>> in understanding what's going on.
>>
>> Best,
>> Matthias
>>
>> On Wed, Sep 29, 2021 at 8:37 PM Qihua Yang  wrote:
>>
>>> Hi,
>>> I deployed Flink in session mode. I didn't run any jobs. I saw the
>>> logs below. That is normal, the same as the Flink manual shows.
>>>
>>> + /opt/flink/bin/run-job-manager.sh
>>> Starting HA cluster with 1 masters.
>>> Starting standalonesession daemon on host job-manager-776dcf6dd-xzs8g.
>>> Starting taskexecutor daemon on host job-manager-776dcf6dd-xzs8g.
>>>
>>>
>>> But when I check kubectl, it shows the status is Completed. After a while,
>>> the status changes to CrashLoopBackOff and the pod restarts.
>>> NAME  READY
>>>   STATUS RESTARTS   AGE
>>> job-manager-776dcf6dd-xzs8g   0/1 Completed  5
>>>  5m27s
>>>
>>> NAME  READY
>>>   STATUS RESTARTS   AGE
>>> job-manager-776dcf6dd-xzs8g   0/1 CrashLoopBackOff   5
>>>  7m35s
>>>
>>> Can anyone help me understand why?
>>> Why does Kubernetes regard this pod as completed and restart it? Should I
>>> configure something, either on the Flink side or the Kubernetes side? From the
>>> Flink manual, after the cluster is started, I can upload a jar to run the
>>> application.
>>>
>>> Thanks,
>>> Qihua
>>>
>>
>>


Exception thrown during batch job execution on YARN even though job succeeded

2021-09-30 Thread Ken Krugler
Hi all,

We’ve upgraded from Flink 1.11 to 1.13, and our workflows are now sometimes 
failing with an exception, even though the job has succeeded.

The stack trace for this bit of the exception is:

java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not 
complete the operation. Number of retries has been exhausted.
at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at 
org.apache.flink.client.program.ContextEnvironment.getJobExecutionResult(ContextEnvironment.java:117)
at 
org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:74)
at my.program.execute.workflow...

The root cause is "java.net.ConnectException: Connection refused", returned 
from the YARN node where the Job Manager is (was) running.

ContextEnvironment.java line 117 is:

jobExecutionResult = jobExecutionResultFuture.get();

This looks like a race condition: YARN is terminating the Job Manager, and that 
termination sometimes completes before the main program has retrieved all of the 
job status information.

I’m wondering if this is a side effect of recent changes to make execution 
async/non-blocking.

Is this a known issue? Anything we can do to work around it?
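If the retry budget of the REST client is what runs out here (an assumption; rest.retry.* are standard client options, but whether they help with this particular race is untested), one knob to experiment with at submission time is raising the client-side retry settings, with the usual target/class/jar arguments elided below:

bin/flink run \
  -Drest.retry.max-attempts=40 \
  -Drest.retry.delay=2000 \
  ...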

Thanks,

— Ken

PS - The last two people working on this area code were Aljoscha and Robert 
(really wish git blame didn’t show most lines as being modified by “Rufus 
Refactor”…sigh)

--
Ken Krugler
http://www.scaleunlimited.com
Custom big data solutions
Flink, Pinot, Solr, Elasticsearch