FLINK JOB

2020-04-20 Thread Som Lima
Hi,

The FLINK JOB URL defaults to localhost,

i.e. localhost:8081.

I have to manually change it to the server's address (server:8081) to get the
Apache Flink Web Dashboard to display.


Flink Job Exception

2017-02-15 Thread Govindarajan Srinivasaraghavan
Hi All,

I'm trying to run a streaming job with Flink 1.2, and there are 3
task managers with 12 task slots. Irrespective of the parallelism that I
give, it always fails with the error below. I found a JIRA issue
corresponding to this problem. Can I know when it will be resolved, since
I'm not able to run any job in my current environment? Thanks.

https://issues.apache.org/jira/browse/FLINK-5773

java.lang.ClassCastException: Cannot cast scala.util.Failure to org.apache.flink.runtime.messages.Acknowledge
at java.lang.Class.cast(Class.java:3369)
at scala.concurrent.Future$$anonfun$mapTo$1.apply(Future.scala:405)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Success.map(Try.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:967)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:437)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Re: FLINK JOB

2020-04-20 Thread Jeff Zhang
How do you run the Flink job? It should not always be localhost:8081.

Som Lima wrote on Mon, Apr 20, 2020 at 4:33 PM:

> Hi,
>
> FLINK JOB  url  defaults to localhost
>
> i.e. localhost:8081.
>
> I have to manually change it to server :8081 to get Apache  flink  Web
> Dashboard to display.
>
>
>
>
>

-- 
Best Regards

Jeff Zhang


Re: FLINK JOB

2020-04-20 Thread Som Lima
I am only running the Zeppelin word count example by clicking the Zeppelin
run arrow.


On Mon, 20 Apr 2020, 09:42 Jeff Zhang,  wrote:

> How do you run flink job ? It should not always be localhost:8081
>
> Som Lima wrote on Mon, Apr 20, 2020 at 4:33 PM:
>
>> Hi,
>>
>> FLINK JOB  url  defaults to localhost
>>
>> i.e. localhost:8081.
>>
>> I have to manually change it to server :8081 to get Apache  flink
>> Web Dashboard to display.
>>
>>
>>
>>
>>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: FLINK JOB

2020-04-20 Thread Jeff Zhang
I see, so you are running the Flink interpreter in local mode, but you access
Zeppelin from a remote machine, right? Do you mean you can access it
after changing localhost to the IP? If so, I can add a configuration on the
Zeppelin side to replace localhost with the real IP.

Som Lima wrote on Mon, Apr 20, 2020 at 4:44 PM:

> I am only running the zeppelin  word count example by clicking the
> zeppelin run arrow.
>
>
> On Mon, 20 Apr 2020, 09:42 Jeff Zhang,  wrote:
>
>> How do you run flink job ? It should not always be localhost:8081
>>
>> Som Lima wrote on Mon, Apr 20, 2020 at 4:33 PM:
>>
>>> Hi,
>>>
>>> FLINK JOB  url  defaults to localhost
>>>
>>> i.e. localhost:8081.
>>>
>>> I have to manually change it to server :8081 to get Apache  flink
>>> Web Dashboard to display.
>>>
>>>
>>>
>>>
>>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>

-- 
Best Regards

Jeff Zhang


Re: FLINK JOB

2020-04-20 Thread Som Lima
Yes, exactly, that is the change I am having to make: changing the FLINK JOB
default localhost to the IP of the server computer in the browser.

I followed the instructions as per your link:
https://medium.com/@zjffdu/flink-on-zeppelin-part-1-get-started-2591aaa6aa47

i.e. setting zeppelin.server.addr to 0.0.0.0 for remote access.



On Mon, 20 Apr 2020, 10:30 Jeff Zhang,  wrote:

> I see, so you are running flink interpreter in local mode. But you access
> zeppelin from a remote machine, right ?  Do you mean you can access it
> after changing localhost to ip ? If so, then I can add one configuration in
> zeppelin side to replace the localhost to real ip.
>
> Som Lima wrote on Mon, Apr 20, 2020 at 4:44 PM:
>
>> I am only running the zeppelin  word count example by clicking the
>> zeppelin run arrow.
>>
>>
>> On Mon, 20 Apr 2020, 09:42 Jeff Zhang,  wrote:
>>
>>> How do you run flink job ? It should not always be localhost:8081
>>>
>>> Som Lima wrote on Mon, Apr 20, 2020 at 4:33 PM:
>>>
>>>> Hi,
>>>>
>>>> FLINK JOB  url  defaults to localhost
>>>>
>>>> i.e. localhost:8081.
>>>>
>>>> I have to manually change it to server :8081 to get Apache  flink
>>>> Web Dashboard to display.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>
> --
> Best Regards
>
> Jeff Zhang
>


Flink job percentage

2020-08-07 Thread Flavio Pompermaier
Hi to all,
one of our customers asked us to show a completion percentage for a Flink
batch job. Is there an already-implemented heuristic I can use to compute
it? Will this also be possible when the DataSet API migrates to
DataStream?

Thanks in advance,
Flavio


Flink job parallelism

2019-08-15 Thread Vishwas Siravara
Hi guys,
I have a flink job which I want to run with a parallelism of 2.

I run it from command line like : flink run -p 2 -C
file:///home/was/classpathconfig/ -c com.visa.flink.cli.Main
flink-job-assembly-0.1-SNAPSHOT.jar flink druid

My cluster has two task managers with only 1 task slot each.
However, when I look at the Web UI for my job, I see that one of the task
managers is still available. But when I submit via the Web UI, both
task managers are used for the job and I get a parallelism of 2.

Can you help me understand why this happens?

Thank you
Vishwas


Flink Job Deployment

2017-09-04 Thread Rinat
Hi folks!
I've got a question about running a Flink job on top of YARN.
Is there any possibility of storing job sources in HDFS, for example

/app/flink/job-name/
  - /lib/*.jar
  - /etc/*.properties

and specifying directories that should be added to the job classpath?

Thx.





Flink Job History Dump

2016-04-05 Thread Robert Schmidtke
Hi everyone,

I'm using Flink 0.10.2 to run some benchmarks on my cluster and I would
like to compare it to Spark 1.6.0. Spark has an eventLog property that I
can use to have the history written to HDFS, and then later view it offline
on the History Server.

Does Flink have a similar feature, especially for offline analysis of a
job's history/events? I know of the Web UI, but I would like to be able to
run my own analysis on top of the data. There is the Monitoring REST API,
and I'm wondering if it's possible to gain access to the raw data this API
exposes, and possibly view it in a locally running web UI.

Thanks a lot in advance
Robert


-- 
My GPG Key ID: 336E2680


Re: Flink Job Exception

2017-02-16 Thread Till Rohrmann
Hi Govindarajan,

there is a pending PR for this issue. I think I can merge it today.

Cheers,
Till

On Thu, Feb 16, 2017 at 12:50 AM, Govindarajan Srinivasaraghavan <
govindragh...@gmail.com> wrote:

> Hi All,
>
> I'm trying to run a streaming job with flink 1.2 version and there are 3
> task managers with 12 task slots. Irrespective of the parallelism that I
> give it always fails with the below error and I found a JIRA link
> corresponding to this issue. Can I know by when this will be resolved since
> I'm not able to run any job in my current environment. Thanks.
>
> https://issues.apache.org/jira/browse/FLINK-5773
>
> java.lang.ClassCastException: Cannot cast scala.util.Failure to 
> org.apache.flink.runtime.messages.Acknowledge
>   at java.lang.Class.cast(Class.java:3369)
>   at scala.concurrent.Future$$anonfun$mapTo$1.apply(Future.scala:405)
>   at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
>   at scala.util.Try$.apply(Try.scala:192)
>   at scala.util.Success.map(Try.scala:237)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
>   at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
>   at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>   at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
>   at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
>   at 
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:967)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>   at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:437)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>


Re: Flink Job Exception

2017-02-16 Thread Aljoscha Krettek
Hi Govindarajan,
the Jira issue that you linked to, and which Till is currently fixing, will
only fix the obvious type mismatch in the Akka messages. There is also an
underlying problem that causes this message to be sent in the first place.
In the case of the user who originally created the Jira issue, the reason
was that the max parallelism was set to a value smaller than the
parallelism. Can you look in the JobManager/TaskManager logs and see
if you find the original cause there?

Cheers,
Aljoscha

On Thu, 16 Feb 2017 at 09:36 Till Rohrmann  wrote:

> Hi Govindarajan,
>
> there is a pending PR for this issue. I think I can merge it today.
>
> Cheers,
> Till
>
> On Thu, Feb 16, 2017 at 12:50 AM, Govindarajan Srinivasaraghavan <
> govindragh...@gmail.com> wrote:
>
> Hi All,
>
> I'm trying to run a streaming job with flink 1.2 version and there are 3
> task managers with 12 task slots. Irrespective of the parallelism that I
> give it always fails with the below error and I found a JIRA link
> corresponding to this issue. Can I know by when this will be resolved since
> I'm not able to run any job in my current environment. Thanks.
>
> https://issues.apache.org/jira/browse/FLINK-5773
>
> java.lang.ClassCastException: Cannot cast scala.util.Failure to 
> org.apache.flink.runtime.messages.Acknowledge
>   at java.lang.Class.cast(Class.java:3369)
>   at scala.concurrent.Future$$anonfun$mapTo$1.apply(Future.scala:405)
>   at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
>   at scala.util.Try$.apply(Try.scala:192)
>   at scala.util.Success.map(Try.scala:237)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
>   at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
>   at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at 
> scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>   at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
>   at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:266)
>   at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:89)
>   at 
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:967)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
>   at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:437)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
>


Submit Flink job programmatically

2017-04-06 Thread Jins George

Hello Community,

I need to submit a Flink job to a remote YARN cluster programmatically. I
tried to use YarnClusterDescriptor.deploy(), but I get the message
'RMProxy.java:92:main] - Connecting to ResourceManager at /0.0.0.0:8032'.
It is trying to connect to the ResourceManager on the client machine. I
have set YARN_CONF_DIR on the client machine and placed yarn-site.xml,
core-site.xml, etc. there. However, it does not seem to be picking up
these files.


Is this the right way to submit to a remote YARN cluster?
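
One thing worth checking (a hedged sketch, not a confirmed fix for this
setup): YARN_CONF_DIR is read from the environment of the JVM that runs the
client code, so it may not be visible if it is only set in a shell profile.
The Hadoop configuration can also be pointed at the files explicitly; the
paths below are illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnConfCheck {
    public static void main(String[] args) {
        // Explicitly load the remote cluster's config files instead of
        // relying on YARN_CONF_DIR being visible to this JVM.
        YarnConfiguration yarnConf = new YarnConfiguration();
        yarnConf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
        yarnConf.addResource(new Path("/etc/hadoop/conf/yarn-site.xml"));

        // If this still prints 0.0.0.0:8032, the files were not picked up.
        System.out.println(yarnConf.get(YarnConfiguration.RM_ADDRESS));
    }
}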


Thanks,
Jins George


Flink job finished unexpectedly

2021-02-20 Thread Rainie Li
Hello,

I launched a job with a larger load on a Hadoop YARN cluster.
The job finished after running 5 hours. I didn't find any error in the
JobManager log besides this connection exception.





2021-02-20 13:20:14,110 WARN  akka.remote.transport.netty.NettyTransport -
Remote connection to [/10.1.57.146:48368] failed with java.io.IOException:
Connection reset by peer
2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor -
Association with remote system [akka.tcp://flink-metrics@host:35241] has
failed, address is now gated for [50] ms. Reason: [Disassociated]
2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor -
Association with remote system [akka.tcp://flink@host:39493] has failed,
address is now gated for [50] ms. Reason: [Disassociated]
2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor -
Association with remote system [akka.tcp://flink-metrics@host:38481] has
failed, address is now gated for [50] ms. Reason: [Disassociated]

Any idea what caused the job to finish and how to resolve it?
Any suggestions are appreciated.

Thanks
Best regards
Rainie


Flink Job Cleanup Sequence

2020-02-06 Thread Abdul Qadeer
Hi!

Is there a FLIP or doc describing how Flink (1.8+) cleans up the state of
a job upon cancellation in ZooKeeper HA mode? I am trying to find the
cleanup sequence between SubmittedJobGraphs in ZooKeeper, blobs in the FS
maintained by the BlobStore, etc.


Flink job getting killed

2020-04-05 Thread Giriraj Chauhan
Hi,

We are submitting a Flink (1.9.1) job for data processing. It runs fine and
processes data for some time, i.e. ~30 minutes, and then throws the following
exception and the job gets killed.
2020-04-02 14:15:43,371 INFO  org.apache.flink.runtime.taskmanager.Task -
Sink: Unnamed (2/4) (45d01514f0fb99602883ca43e997e8f3) switched from RUNNING
to FAILED.
java.io.EOFException
at org.apache.flink.core.memory.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:321)
at org.apache.flink.types.StringValue.readString(StringValue.java:769)
at org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:75)
at org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:33)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:205)
at org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:46)
at org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
at org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:141)
at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:91)
at org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:47)
at org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:135)
at org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:279)
at org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:301)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:406)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.base/java.lang.Thread.run(Unknown Source)


Once the above exception occurs, we see the following runtime exception:

java.lang.RuntimeException: Buffer pool is destroyed.
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:110)
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
at org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:727)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:705)
at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
at com.dell.emc.mars.topology.bl.DiffGenerator.handleCreate(DiffGenerator.java:519)
at com.dell.emc.mars.topology.bl.DiffGenerator.populateHTable(DiffGenerator.java:294)
at com.dell.emc.mars.topology.bl.DiffGenerator.compare(DiffGenerator.java:58)
at com.dell.emc.mars.topology.bl.CompareWithDatabase.flatMap(CompareWithDatabase.java:146)
at com.dell.emc.mars.topology.bl.CompareWithDatabase.flatMap(CompareWithDatabase.java:22)
at org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:637)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:612)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:592)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:727)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:705)
at org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
at com.dell.emc.mars.topology.bl.CompareWithModel.flatMap(CompareWithModel.java:110)
at com.dell.emc.mars.topology.bl.CompareWithModel.flatMap(CompareWithModel.java:24)
at org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:637)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:612)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:592)
at o

Re: FLINK JOB solved

2020-04-20 Thread Som Lima
I found the problem.

In flink1.0.0/conf there are two files:
masters
and slaves.

The masters file contains localhost:8081;
the slaves file contains just localhost.

I changed them both to the server IP address.

Now the FLINK JOB link has the full <server-ip>:8081 URL and displays the
Apache Flink Dashboard in the browser.
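
For reference, a minimal sketch of what the two files look like after the
change (the IP address is illustrative):

# conf/masters: JobManager host and web UI port, one entry per line
192.168.1.10:8081

# conf/slaves: TaskManager hosts, one per line
192.168.1.10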




On Mon, 20 Apr 2020, 12:07 Som Lima,  wrote:

> Yes exactly that is the change I am having to make.  Changing FLINK JOB
> default localhost to ip of server computer in the browser.
>
> I followed the instructions as per your
> link.
>
> https://medium.com/@zjffdu/flink-on-zeppelin-part-1-get-started-2591aaa6aa47
>
> i.e. 0.0.0.0  of zeppelin.server.addr. for remote access.
>
>
>
> On Mon, 20 Apr 2020, 10:30 Jeff Zhang,  wrote:
>
>> I see, so you are running flink interpreter in local mode. But you access
>> zeppelin from a remote machine, right ?  Do you mean you can access it
>> after changing localhost to ip ? If so, then I can add one configuration in
>> zeppelin side to replace the localhost to real ip.
>>
>> Som Lima wrote on Mon, Apr 20, 2020 at 4:44 PM:
>>
>>> I am only running the zeppelin  word count example by clicking the
>>> zeppelin run arrow.
>>>
>>>
>>> On Mon, 20 Apr 2020, 09:42 Jeff Zhang,  wrote:
>>>
>>>> How do you run flink job ? It should not always be localhost:8081
>>>>
>>>> Som Lima wrote on Mon, Apr 20, 2020 at 4:33 PM:
>>>>
>>>>> Hi,
>>>>>
>>>>> FLINK JOB  url  defaults to localhost
>>>>>
>>>>> i.e. localhost:8081.
>>>>>
>>>>> I have to manually change it to server :8081 to get Apache  flink
>>>>> Web Dashboard to display.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Best Regards
>>>>
>>>> Jeff Zhang
>>>>
>>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>


Re: Flink job percentage

2020-08-11 Thread Robert Metzger
Hi Flavio,

I'm not aware of such a heuristic being implemented anywhere. You need to
come up with something yourself.

On Fri, Aug 7, 2020 at 12:55 PM Flavio Pompermaier 
wrote:

> Hi to all,
> one of our customers asked us to see a percentage of completion of a Flink
> Batch job. Is there any already implemented heuristic I can use to compute
> it? Will this be possible also when DataSet api will migrate to
> DataStream..?
>
> Thanks in advance,
> Flavio
>


Re: Flink job percentage

2020-08-13 Thread Arvid Heise
Hi Flavio,

This is a daunting task to implement properly. There is an easy fix in
related workflow systems, though. Assuming that it's a recurring job, you
simply store the run times of past runs, apply some kind of low-pass filter
(a decaying average), and compare the current runtime with the expected
runtime. Even if Flink had some estimation, it would probably not be more
accurate than this.
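
A minimal sketch of that idea in Java (all names here are illustrative;
nothing below is Flink API):

public final class ProgressEstimator {
    private static final double ALPHA = 0.3; // smoothing factor of the low-pass filter
    private double expectedRuntimeMs = -1;   // decaying average over past runs

    // Fold the duration of a finished run into the expected runtime.
    public void recordFinishedRun(long runtimeMs) {
        expectedRuntimeMs = expectedRuntimeMs < 0
                ? runtimeMs
                : ALPHA * runtimeMs + (1 - ALPHA) * expectedRuntimeMs;
    }

    // Rough completion fraction of the currently running job, capped below 1
    // so it never claims "done" before the job actually finishes.
    public double estimateProgress(long elapsedMs) {
        if (expectedRuntimeMs <= 0) {
            return 0.0; // no history yet
        }
        return Math.min(0.99, elapsedMs / expectedRuntimeMs);
    }
}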

Best,

Arvid

On Tue, Aug 11, 2020 at 10:26 AM Robert Metzger  wrote:

> Hi Flavio,
>
> I'm not aware of such a heuristic being implemented anywhere. You need to
> come up with something yourself.
>
> On Fri, Aug 7, 2020 at 12:55 PM Flavio Pompermaier 
> wrote:
>
>> Hi to all,
>> one of our customers asked us to see a percentage of completion of a
>> Flink Batch job. Is there any already implemented heuristic I can use to
>> compute it? Will this be possible also when DataSet api will migrate to
>> DataStream..?
>>
>> Thanks in advance,
>> Flavio
>>
>

-- 

Arvid Heise | Senior Java Developer



Follow us @VervericaData

--

Join Flink Forward  - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
(Toni) Cheng


Re: Flink job percentage

2020-11-05 Thread Flavio Pompermaier
What do you think about this very rough heuristic (obviously it makes
sense only for batch jobs)?
It's far from perfect, but at least it gives an idea of something going on.
PS: I found some mismatch between the states documented in [1] and the ones I
found in the ExecutionState enum.
[1]
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid

Map<ExecutionState, Integer> statusCount = jobDetails.getJobVerticesPerState();
int uncompleted = statusCount.getOrDefault(ExecutionState.CREATED, 0)
    + statusCount.getOrDefault(ExecutionState.RUNNING, 0)
    + statusCount.getOrDefault(ExecutionState.CANCELING, 0)
    + statusCount.getOrDefault(ExecutionState.DEPLOYING, 0)
    // ExecutionState.FAILING, SUSPENDED and RESTARTING not found in Flink 1.11.0
    + statusCount.getOrDefault(ExecutionState.RECONCILING, 0)
    + statusCount.getOrDefault(ExecutionState.SCHEDULED, 0);
int completed = statusCount.getOrDefault(ExecutionState.FINISHED, 0)
    + statusCount.getOrDefault(ExecutionState.FAILED, 0)
    + statusCount.getOrDefault(ExecutionState.CANCELED, 0);
// Scale before dividing: Math.floorDiv(completed, completed + uncompleted)
// would always yield 0 or 1 because of integer division.
final int total = completed + uncompleted;
final int completionPercentage = total == 0 ? 0 : (100 * completed) / total;

Thanks in advance,
Flavio

On Thu, Aug 13, 2020 at 4:17 PM Arvid Heise  wrote:

> Hi Flavio,
>
> This is a daunting task to implement properly. There is an easy fix in
> related workflow systems though. Assuming that it's a rerunning task, then
> you simply store the run times of the last run, use some kind of low-pass
> filter (=decaying average) and compare the current runtime with the
> expected runtime. Even if Flink would have some estimation, it's probably
> not more accurate than this.
>
> Best,
>
> Arvid
>
> On Tue, Aug 11, 2020 at 10:26 AM Robert Metzger 
> wrote:
>
>> Hi Flavio,
>>
>> I'm not aware of such a heuristic being implemented anywhere. You need to
>> come up with something yourself.
>>
>> On Fri, Aug 7, 2020 at 12:55 PM Flavio Pompermaier 
>> wrote:
>>
>>> Hi to all,
>>> one of our customers asked us to see a percentage of completion of a
>>> Flink Batch job. Is there any already implemented heuristic I can use to
>>> compute it? Will this be possible also when DataSet api will migrate to
>>> DataStream..?
>>>
>>> Thanks in advance,
>>> Flavio
>>>
>>
>
> --
>
> Arvid Heise | Senior Java Developer
>
> 
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward  - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Ververica GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji
> (Toni) Cheng
>


Re: Flink job percentage

2020-11-05 Thread Chesnay Schepler

The "mismatch" is due to you mixing job and vertex states.

These are the states a job can be in (based on
org.apache.flink.api.common.JobStatus):

   [ "CREATED", "RUNNING", "FAILING", "FAILED", "CANCELLING",
   "CANCELED", "FINISHED", "RESTARTING", "SUSPENDED", "RECONCILING" ]

These are the states a vertex can be in (based on
org.apache.flink.runtime.execution.ExecutionState):

   [ "CREATED", "SCHEDULED", "DEPLOYING", "RUNNING", "FINISHED",
   "CANCELING", "CANCELED", "FAILED", "RECONCILING" ]

Naturally, for your code you only want to check for the latter.

The documentation is hence correct. FYI, we directly access the
corresponding enums to generate this list, so it _cannot_ be out-of-sync.




On 11/5/2020 11:16 AM, Flavio Pompermaier wrote:
What do you thinkin about this very rough heuristic (obviously it 
makes sense only for batch jobs)?
It's far from perfect but at least it gives an idea of something going 
on..
PS: I found some mismatch from the states documented in [1] and the 
ones I found in the ExecutionState enum..
[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid


    Map statusCount = 
jobDetails.getJobVerticesPerState();
    int uncompleted = statusCount.getOrDefault(ExecutionState.CREATED, 
0) + //

statusCount.getOrDefault(ExecutionState.RUNNING, 0) + ///
statusCount.getOrDefault(ExecutionState.CANCELING, 0) + //
statusCount.getOrDefault(ExecutionState.DEPLOYING, 0) + //
        // statusCount.getOrDefault(ExecutionState.FAILING,0)+ // not 
found in Flink 1.11.0
        // statusCount.getOrDefault(ExecutionState.SUSPENDED,0)+ /// 
not found in Flink 1.11.0

statusCount.getOrDefault(ExecutionState.RECONCILING, 0) + //
        // statusCount.getOrDefault(ExecutionState.RESTARTING,0) + /// 
not found in Flink 1.11.0

statusCount.getOrDefault(ExecutionState.RUNNING, 0) + //
statusCount.getOrDefault(ExecutionState.SCHEDULED, 0);
    int completed = statusCount.getOrDefault(ExecutionState.FINISHED, 
0) + //

statusCount.getOrDefault(ExecutionState.FAILED, 0) + //
statusCount.getOrDefault(ExecutionState.CANCELED, 0);
    final Integer completionPercentage = Math.floorDiv(completed, 
completed + uncompleted);


Thanks in advance,
Flavio

On Thu, Aug 13, 2020 at 4:17 PM Arvid Heise wrote:


Hi Flavio,

This is a daunting task to implement properly. There is an easy
fix in related workflow systems though. Assuming that it's a
rerunning task, then you simply store the run times of the last
run, use some kind of low-pass filter (=decaying average) and
compare the current runtime with the expected runtime. Even if
Flink would have some estimation, it's probably not more accurate
than this.

Best,

Arvid

On Tue, Aug 11, 2020 at 10:26 AM Robert Metzger <rmetz...@apache.org> wrote:

Hi Flavio,

I'm not aware of such a heuristic being implemented anywhere.
You need to come up with something yourself.

On Fri, Aug 7, 2020 at 12:55 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

Hi to all,
one of our customers asked us to see a percentage of
completion of a Flink Batch job. Is there any already
implemented heuristic I can use to compute it? Will this
be possible also when DataSet api will migrate to
DataStream..?

Thanks in advance,
Flavio



-- 


Arvid Heise | Senior Java Developer

Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng





Re: Flink job percentage

2020-11-05 Thread Chesnay Schepler
Admittedly, it can be out-of-sync if someone forgets to regenerate the 
documentation, but they cannot be mixed up.


On 11/5/2020 11:31 AM, Chesnay Schepler wrote:

The "mismatch" is due to you mixing job and vertex states.

These are the states a job can be in (based on
org.apache.flink.api.common.JobStatus):

   [ "CREATED", "RUNNING", "FAILING", "FAILED", "CANCELLING",
   "CANCELED", "FINISHED", "RESTARTING", "SUSPENDED", "RECONCILING" ]

These are the states a vertex can be in (based on
org.apache.flink.runtime.execution.ExecutionState):

   [ "CREATED", "SCHEDULED", "DEPLOYING", "RUNNING", "FINISHED",
   "CANCELING", "CANCELED", "FAILED", "RECONCILING" ]

Naturally, for your code you only want to check for the latter.

The documentation is hence correct. FYI, we directly access the
corresponding enums to generate this list, so it _cannot_ be out-of-sync.




On 11/5/2020 11:16 AM, Flavio Pompermaier wrote:
What do you thinkin about this very rough heuristic (obviously it 
makes sense only for batch jobs)?
It's far from perfect but at least it gives an idea of something 
going on..
PS: I found some mismatch from the states documented in [1] and the 
ones I found in the ExecutionState enum..
[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid


    Map statusCount = 
jobDetails.getJobVerticesPerState();
    int uncompleted = 
statusCount.getOrDefault(ExecutionState.CREATED, 0) + //

statusCount.getOrDefault(ExecutionState.RUNNING, 0) + ///
statusCount.getOrDefault(ExecutionState.CANCELING, 0) + //
statusCount.getOrDefault(ExecutionState.DEPLOYING, 0) + //
        // statusCount.getOrDefault(ExecutionState.FAILING,0)+ // not 
found in Flink 1.11.0
        // statusCount.getOrDefault(ExecutionState.SUSPENDED,0)+ /// 
not found in Flink 1.11.0

statusCount.getOrDefault(ExecutionState.RECONCILING, 0) + //
        // statusCount.getOrDefault(ExecutionState.RESTARTING,0) + 
/// not found in Flink 1.11.0

statusCount.getOrDefault(ExecutionState.RUNNING, 0) + //
statusCount.getOrDefault(ExecutionState.SCHEDULED, 0);
    int completed = statusCount.getOrDefault(ExecutionState.FINISHED, 
0) + //

statusCount.getOrDefault(ExecutionState.FAILED, 0) + //
statusCount.getOrDefault(ExecutionState.CANCELED, 0);
    final Integer completionPercentage = Math.floorDiv(completed, 
completed + uncompleted);


Thanks in advance,
Flavio

On Thu, Aug 13, 2020 at 4:17 PM Arvid Heise wrote:


Hi Flavio,

This is a daunting task to implement properly. There is an easy
fix in related workflow systems though. Assuming that it's a
rerunning task, then you simply store the run times of the last
run, use some kind of low-pass filter (=decaying average) and
compare the current runtime with the expected runtime. Even if
Flink would have some estimation, it's probably not more accurate
than this.

Best,

Arvid

On Tue, Aug 11, 2020 at 10:26 AM Robert Metzger <rmetz...@apache.org> wrote:

Hi Flavio,

I'm not aware of such a heuristic being implemented anywhere.
You need to come up with something yourself.

On Fri, Aug 7, 2020 at 12:55 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

Hi to all,
one of our customers asked us to see a percentage of
completion of a Flink Batch job. Is there any already
implemented heuristic I can use to compute it? Will this
be possible also when DataSet api will migrate to
DataStream..?

Thanks in advance,
Flavio



-- 


Arvid Heise | Senior Java Developer

Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng







Re: Flink job percentage

2020-11-05 Thread Flavio Pompermaier
Ok, I understood. Unfortunately the documentation does not show the map
type of status-count, which is Map<ExecutionState, Integer>, and I
thought that the job status and the execution status were equivalent.
And what about the heuristic? Could it make sense?

On Thu, Nov 5, 2020 at 11:33 AM Chesnay Schepler  wrote:

> Admittedly, it can be out-of-sync if someone forgets to regenerate the
> documentation, but they cannot be mixed up.
>
> On 11/5/2020 11:31 AM, Chesnay Schepler wrote:
>
> The "mismatch" is due to you mixing job and vertex states.
>
> These are the states a job can be in (based on
> org.apache.flink.api.common.JobStatus):
>
> [ "CREATED", "RUNNING", "FAILING", "FAILED", "CANCELLING", "CANCELED",
> "FINISHED", "RESTARTING", "SUSPENDED", "RECONCILING" ]
>
> These are the states a vertex can be in (based on
> org.apache.flink.runtime.execution.ExecutionState):
>
> [ "CREATED", "SCHEDULED", "DEPLOYING", "RUNNING", "FINISHED", "CANCELING",
> "CANCELED", "FAILED", "RECONCILING" ]
>
> Naturally, for your code you only want to check for the lattern.
>
> The documentation is hence correct. FYI, we directly access the
> corresponding enums to generate this list, so it _cannot_ be out-of-sync.
>
> On 11/5/2020 11:16 AM, Flavio Pompermaier wrote:
>
> What do you thinkin about this very rough heuristic (obviously it makes
> sense only for batch jobs)?
> It's far from perfect but at least it gives an idea of something going on..
> PS: I found some mismatch from the states documented in [1] and the ones I
> found in the ExecutionState enum..
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid
>
> Map statusCount =
> jobDetails.getJobVerticesPerState();
> int uncompleted = statusCount.getOrDefault(ExecutionState.CREATED, 0)
> + //
> statusCount.getOrDefault(ExecutionState.RUNNING, 0) + ///
> statusCount.getOrDefault(ExecutionState.CANCELING, 0) + //
> statusCount.getOrDefault(ExecutionState.DEPLOYING, 0) + //
> // statusCount.getOrDefault(ExecutionState.FAILING,0)+ // not
> found in Flink 1.11.0
> // statusCount.getOrDefault(ExecutionState.SUSPENDED,0)+ /// not
> found in Flink 1.11.0
> statusCount.getOrDefault(ExecutionState.RECONCILING, 0) + //
> // statusCount.getOrDefault(ExecutionState.RESTARTING,0) + /// not
> found in Flink 1.11.0
> statusCount.getOrDefault(ExecutionState.RUNNING, 0) + //
> statusCount.getOrDefault(ExecutionState.SCHEDULED, 0);
> int completed = statusCount.getOrDefault(ExecutionState.FINISHED, 0) +
> //
> statusCount.getOrDefault(ExecutionState.FAILED, 0) + //
> statusCount.getOrDefault(ExecutionState.CANCELED, 0);
> final Integer completionPercentage = Math.floorDiv(completed,
> completed + uncompleted);
>
> Thanks in advance,
> Flavio
>
> On Thu, Aug 13, 2020 at 4:17 PM Arvid Heise  wrote:
>
>> Hi Flavio,
>>
>> This is a daunting task to implement properly. There is an easy fix in
>> related workflow systems though. Assuming that it's a rerunning task, then
>> you simply store the run times of the last run, use some kind of low-pass
>> filter (=decaying average) and compare the current runtime with the
>> expected runtime. Even if Flink would have some estimation, it's probably
>> not more accurate than this.
>>
>> Best,
>>
>> Arvid
>>
>> On Tue, Aug 11, 2020 at 10:26 AM Robert Metzger 
>> wrote:
>>
>>> Hi Flavio,
>>>
>>> I'm not aware of such a heuristic being implemented anywhere. You need
>>> to come up with something yourself.
>>>
>>> On Fri, Aug 7, 2020 at 12:55 PM Flavio Pompermaier 
>>> wrote:
>>>
 Hi to all,
 one of our customers asked us to see a percentage of completion of a
 Flink Batch job. Is there any already implemented heuristic I can use to
 compute it? Will this be possible also when DataSet api will migrate to
 DataStream..?

 Thanks in advance,
 Flavio

>>>
>>
>> --
>>
>> Arvid Heise | Senior Java Developer
>>
>> 
>>
>> Follow us @VervericaData
>>
>> --
>>
>> Join Flink Forward  - The Apache Flink
>> Conference
>>
>> Stream Processing | Event Driven | Real Time
>>
>> --
>>
>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>>
>> --
>> Ververica GmbH Registered at Amtsgericht Charlottenburg: HRB 158244 B 
>> Managing
>> Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng
>>
>>
>
>
>


Re: Flink job percentage

2020-11-05 Thread Flavio Pompermaier
Just another question: should I open a JIRA to rename
ExecutionState.CANCELING to CANCELLING (indeed, the enum's Javadoc says
CANCELLING)?

On Thu, Nov 5, 2020 at 11:31 AM Chesnay Schepler  wrote:

> The "mismatch" is due to you mixing job and vertex states.
>
> These are the states a job can be in (based on
> org.apache.flink.api.common.JobStatus):
>
> [ "CREATED", "RUNNING", "FAILING", "FAILED", "CANCELLING", "CANCELED",
> "FINISHED", "RESTARTING", "SUSPENDED", "RECONCILING" ]
>
> These are the states a vertex can be in (based on
> org.apache.flink.runtime.execution.ExecutionState):
>
> [ "CREATED", "SCHEDULED", "DEPLOYING", "RUNNING", "FINISHED", "CANCELING",
> "CANCELED", "FAILED", "RECONCILING" ]
>
> Naturally, for your code you only want to check for the lattern.
>
> The documentation is hence correct. FYI, we directly access the
> corresponding enums to generate this list, so it _cannot_ be out-of-sync.
>
> On 11/5/2020 11:16 AM, Flavio Pompermaier wrote:
>
> What do you thinkin about this very rough heuristic (obviously it makes
> sense only for batch jobs)?
> It's far from perfect but at least it gives an idea of something going on..
> PS: I found some mismatch from the states documented in [1] and the ones I
> found in the ExecutionState enum..
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid
>
> Map statusCount =
> jobDetails.getJobVerticesPerState();
> int uncompleted = statusCount.getOrDefault(ExecutionState.CREATED, 0)
> + //
> statusCount.getOrDefault(ExecutionState.RUNNING, 0) + ///
> statusCount.getOrDefault(ExecutionState.CANCELING, 0) + //
> statusCount.getOrDefault(ExecutionState.DEPLOYING, 0) + //
> // statusCount.getOrDefault(ExecutionState.FAILING,0)+ // not
> found in Flink 1.11.0
> // statusCount.getOrDefault(ExecutionState.SUSPENDED,0)+ /// not
> found in Flink 1.11.0
> statusCount.getOrDefault(ExecutionState.RECONCILING, 0) + //
> // statusCount.getOrDefault(ExecutionState.RESTARTING,0) + /// not
> found in Flink 1.11.0
> statusCount.getOrDefault(ExecutionState.RUNNING, 0) + //
> statusCount.getOrDefault(ExecutionState.SCHEDULED, 0);
> int completed = statusCount.getOrDefault(ExecutionState.FINISHED, 0) +
> //
> statusCount.getOrDefault(ExecutionState.FAILED, 0) + //
> statusCount.getOrDefault(ExecutionState.CANCELED, 0);
> final Integer completionPercentage = Math.floorDiv(completed,
> completed + uncompleted);
>
> Thanks in advance,
> Flavio
>
> On Thu, Aug 13, 2020 at 4:17 PM Arvid Heise  wrote:
>
>> Hi Flavio,
>>
>> This is a daunting task to implement properly. There is an easy fix in
>> related workflow systems though. Assuming that it's a rerunning task, then
>> you simply store the run times of the last run, use some kind of low-pass
>> filter (=decaying average) and compare the current runtime with the
>> expected runtime. Even if Flink would have some estimation, it's probably
>> not more accurate than this.
>>
>> Best,
>>
>> Arvid
>>
>> On Tue, Aug 11, 2020 at 10:26 AM Robert Metzger 
>> wrote:
>>
>>> Hi Flavio,
>>>
>>> I'm not aware of such a heuristic being implemented anywhere. You need
>>> to come up with something yourself.
>>>
>>> On Fri, Aug 7, 2020 at 12:55 PM Flavio Pompermaier 
>>> wrote:
>>>
 Hi to all,
 one of our customers asked us to see a percentage of completion of a
 Flink Batch job. Is there any already implemented heuristic I can use to
 compute it? Will this be possible also when DataSet api will migrate to
 DataStream..?

 Thanks in advance,
 Flavio

>>>
>>
>> --
>>
>> Arvid Heise | Senior Java Developer
>>
>> 
>>
>> Follow us @VervericaData
>>
>> --
>>
>> Join Flink Forward  - The Apache Flink
>> Conference
>>
>> Stream Processing | Event Driven | Real Time
>>
>> --
>>
>> Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>>
>> --
>> Ververica GmbH Registered at Amtsgericht Charlottenburg: HRB 158244 B 
>> Managing
>> Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng
>>
>>
>
>


Re: Flink job percentage

2020-11-05 Thread Chesnay Schepler
No, because that would break the API and any log-parsing infrastructure 
relying on it.


On 11/5/2020 2:56 PM, Flavio Pompermaier wrote:
Just another question: should I open a JIRA to rename
ExecutionState.CANCELING to CANCELLING (indeed, the enum's Javadoc says
CANCELLING)?


On Thu, Nov 5, 2020 at 11:31 AM Chesnay Schepler wrote:


The "mismatch" is due to you mixing job and vertex states.

These are the states a job can be in (based on
org.apache.flink.api.common.JobStatus):

   [ "CREATED", "RUNNING", "FAILING", "FAILED", "CANCELLING",
   "CANCELED", "FINISHED", "RESTARTING", "SUSPENDED", "RECONCILING" ]

These are the states a vertex can be in (based on
org.apache.flink.runtime.execution.ExecutionState):

   [ "CREATED", "SCHEDULED", "DEPLOYING", "RUNNING", "FINISHED",
   "CANCELING", "CANCELED", "FAILED", "RECONCILING" ]

Naturally, for your code you only want to check for the latter.

The documentation is hence correct. FYI, we directly access the
corresponding enums to generate this list, so it _cannot_ be out-of-sync.



On 11/5/2020 11:16 AM, Flavio Pompermaier wrote:

What do you thinkin about this very rough heuristic (obviously it
makes sense only for batch jobs)?
It's far from perfect but at least it gives an idea of something
going on..
PS: I found some mismatch from the states documented in [1] and
the ones I found in the ExecutionState enum..
[1]

https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/rest_api.html#jobs-jobid

    Map statusCount =
jobDetails.getJobVerticesPerState();
    int uncompleted =
statusCount.getOrDefault(ExecutionState.CREATED, 0) + //
statusCount.getOrDefault(ExecutionState.RUNNING, 0) + ///
statusCount.getOrDefault(ExecutionState.CANCELING, 0) + //
statusCount.getOrDefault(ExecutionState.DEPLOYING, 0) + //
        // statusCount.getOrDefault(ExecutionState.FAILING,0)+ //
not found in Flink 1.11.0
        // statusCount.getOrDefault(ExecutionState.SUSPENDED,0)+
/// not found in Flink 1.11.0
statusCount.getOrDefault(ExecutionState.RECONCILING, 0) + //
        // statusCount.getOrDefault(ExecutionState.RESTARTING,0)
+ /// not found in Flink 1.11.0
statusCount.getOrDefault(ExecutionState.RUNNING, 0) + //
statusCount.getOrDefault(ExecutionState.SCHEDULED, 0);
    int completed =
statusCount.getOrDefault(ExecutionState.FINISHED, 0) + //
statusCount.getOrDefault(ExecutionState.FAILED, 0) + //
statusCount.getOrDefault(ExecutionState.CANCELED, 0);
    final Integer completionPercentage = Math.floorDiv(completed,
completed + uncompleted);

Thanks in advance,
Flavio

On Thu, Aug 13, 2020 at 4:17 PM Arvid Heise <ar...@ververica.com> wrote:

Hi Flavio,

This is a daunting task to implement properly. There is an
easy fix in related workflow systems though. Assuming that
it's a rerunning task, then you simply store the run times of
the last run, use some kind of low-pass filter (=decaying
average) and compare the current runtime with the expected
runtime. Even if Flink would have some estimation, it's
probably not more accurate than this.

Best,

Arvid

On Tue, Aug 11, 2020 at 10:26 AM Robert Metzger <rmetz...@apache.org> wrote:

Hi Flavio,

I'm not aware of such a heuristic being implemented
anywhere. You need to come up with something yourself.

On Fri, Aug 7, 2020 at 12:55 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

Hi to all,
one of our customers asked us to see a percentage of
completion of a Flink Batch job. Is there any already
implemented heuristic I can use to compute it? Will
this be possible also when DataSet api will migrate
to DataStream..?

Thanks in advance,
Flavio



-- 


Arvid Heise | Senior Java Developer

Follow us @VervericaData

--

Join Flink Forward - The Apache Flink Conference

Stream Processing | Event Driven | Real Time

--

Ververica GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--

Ververica GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Timothy Alexander Steinert, Yip Park Tung Jason, Ji (Toni) Cheng







Flink Job state debug

2018-10-18 Thread jia yichao
Hi,
I want to debug a Flink job locally in IntelliJ IDEA and restore state in
initializeState from a snapshot of a previous execution. How do I set the
parameters so that the program reads checkpoint or savepoint data from HDFS
to restore the state of the program?
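
In case it is useful to later readers: in more recent Flink versions (roughly
1.10 and later), a local environment can be pointed at an existing snapshot
via the execution.savepoint.path option. A minimal sketch (the savepoint path
is illustrative, and whether local execution honors this option depends on
the Flink version, so treat it as an assumption to verify):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

Configuration conf = new Configuration();
// Point the local job at an existing snapshot (illustrative HDFS path).
conf.setString("execution.savepoint.path", "hdfs:///flink/savepoints/savepoint-abc123");
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.createLocalEnvironment(1, conf);
// Build the job on env as usual; initializeState() is then called with the
// restored state when the job starts.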



Flink Job and Watermarking

2019-02-07 Thread Kaustubh Rudrawar
Hi,

I'm writing a job that wants to make an HTTP request once a watermark has
reached all tasks of an operator. It would be great if this could be
determined from outside the Flink job, but I don't think it's possible to
access watermark information for the job as a whole. Below is a workaround
I've come up with:

   1. Read messages from Kafka using the provided KafkaSource. Event time
   will be defined as a timestamp within the message.
   2. Key the stream based on an id from the message.
   3. DedupOperator that dedupes messages. This operator will run with a
   parallelism of N.
   4. An operator that persists the messages to S3. It doesn't need to
   output anything - it should ideally be a Sink (if it were a sink we could
   use the StreamingFileSink).
   5. Implement an operator that will make an HTTP request once
   processWatermark is called for time T (see the sketch after this list). A
   parallelism of 1 will be used for this operator as it will do very little
   work. Because it has a parallelism of 1, the operator in step 4 cannot
   send anything to it, as it could become a throughput bottleneck.
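
A minimal sketch of the operator in step 5, assuming event-time watermarks
are flowing (class and field names are illustrative; the HTTP call itself is
elided, and note the 'fired' flag is not checkpointed):

import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class WatermarkHttpTrigger<T> extends AbstractStreamOperator<T>
        implements OneInputStreamOperator<T, T> {

    private final long targetTimestamp; // the time T to wait for
    private boolean fired;

    public WatermarkHttpTrigger(long targetTimestamp) {
        this.targetTimestamp = targetTimestamp;
    }

    @Override
    public void processElement(StreamRecord<T> element) {
        output.collect(element); // pass records through unchanged
    }

    @Override
    public void processWatermark(Watermark mark) throws Exception {
        super.processWatermark(mark); // keep forwarding watermarks downstream
        if (!fired && mark.getTimestamp() >= targetTimestamp) {
            fired = true;
            // make the HTTP request here; keep it short or hand it off to
            // another thread so the operator is not blocked for long
        }
    }
}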

Does this implementation seem like a valid workaround? Any other
alternatives I should consider?

Thanks for your help,
Kaustubh


Flink job testing with

2018-04-17 Thread Chauvet, Thomas
Hi everybody,

I would like to test a Kafka / Flink process in Scala. I would like to proceed
as in the integration testing documentation:
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/testing.html#integration-testing
with Kafka as source and sink.

For example, I have a Kafka topic as the source for Flink (I use
FlinkKafkaConsumer011), then I do some processing with Flink, then I send the
stream to Kafka (FlinkKafkaProducer011).

Any idea on how to do that? Or better, any example?
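
Not a full answer, but one common approach from the linked testing docs is to
assert on the transformation itself and keep Kafka out of the test: swap the
producer for a collecting test sink. A minimal Java sketch of that pattern
(translating it to Scala is mechanical; the element type is illustrative):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Collects results into a static list so the test can assert on them.
// The list must be static because Flink serializes the sink instance.
public class CollectSink implements SinkFunction<String> {
    public static final List<String> VALUES =
            Collections.synchronizedList(new ArrayList<>());

    @Override
    public void invoke(String value) throws Exception { // older SinkFunction signature
        VALUES.add(value);
    }
}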

Thanks


Re: Flink job parallelism

2019-08-16 Thread Biao Liu
Hi Vishwas,

Regardless of the available task managers, what does your job actually look
like? Is it running with a parallelism of 2 or 1?

It's hard to say what happened based on your description. Could you
reproduce the scenario?
If the answer is yes, could you provide more details, like a
screenshot, logs, or the code of your example?
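
One hedged guess worth ruling out: a parallelism set inside the program
overrides the -p flag passed to the CLI. For example:

// If the job itself pins the parallelism, "flink run -p 2" has no effect:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // this wins over -p 2 from the command line

// Leaving setParallelism() out lets the -p flag (or the cluster default) apply.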

Thanks,
Biao /'bɪ.aʊ/



On Thu, 15 Aug 2019 at 23:54, Vishwas Siravara  wrote:

> Hi guys,
> I have a flink job which I want to run with a parallelism of 2.
>
> I run it from command line like : flink run -p 2 -C
> file:///home/was/classpathconfig/ -c com.visa.flink.cli.Main
> flink-job-assembly-0.1-SNAPSHOT.jar flink druid
>
> My cluster has two task managers with only 1 task slot each.
> However when I look at the Web UI for my job , I see that one of the task
> managers is still available. But when I submit with the web UI , both the
> task managers are used for this job and I get a parallelism of 2.
>
> Can you help me with understanding as to why this happens ?
>
> Thank you
> Vishwas
>


High availability flink job

2019-09-15 Thread Vishwas Siravara
Hi guys,
I have a Flink job running in standalone mode with a parallelism of >1 that
produces data to a Kafka sink. My topic is replicated with a replication
factor of 2. Now suppose one of the Kafka brokers goes down; will my
streaming job fail? Is there a way I can continue processing until that
broker node comes back up? Also, the sink is partitioned by
a key.
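
Not an authoritative answer, but broker-failover tolerance is largely
governed by standard Kafka producer settings passed to the Flink producer.
A sketch of the properties commonly tuned for this (values illustrative):

import java.util.Properties;

Properties props = new Properties();
// List more than one broker so the client can bootstrap if one is down.
props.setProperty("bootstrap.servers", "broker1:9092,broker2:9092");
// Wait for all in-sync replicas, so a single broker loss does not drop data.
props.setProperty("acks", "all");
// Keep retrying while a new partition leader is elected.
props.setProperty("retries", Integer.toString(Integer.MAX_VALUE));

Note that with a replication factor of 2 and acks=all, the topic's
min.insync.replicas setting determines whether writes can still proceed
while one broker is down.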

Thanks,
Vishwas


Flink Job cluster scalability

2020-01-08 Thread KristoffSC
Hi all,
I must say I'm very impressed by Flink and what it can do.

I was trying to play around with Flink operator parallelism and scalability,
and I have a few questions regarding this subject.

My setup is:
1. Flink 1.9.1
2. Docker Job Cluster, where each Task manager has only one task slot. I'm
following [1]
3. env setup:
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1000, 1000));
env.setParallelism(1);
env.setMaxParallelism(128);
env.enableCheckpointing(10 * 60 * 1000);

Please mind that I am using operator chaining here. 

My pipeline setup:

[job-graph screenshot not preserved in the archive]

As you can see, I have 7 operators (a few of them were actually chained,
which is OK) with different parallelism levels. This gives me 23 tasks in
total.


I've noticed that with the "one task manager = one task slot" approach I have
to have 6 task slots/task managers to be able to start this pipeline.

[job-graph screenshot not preserved in the archive]

If the number of task slots is lower than 6, the job is scheduled but not
started.

With 6 task slots everything works fine, and I must say that I'm very
impressed with the way Flink balanced data between task slots. Data was
distributed very evenly between operator instances/tasks.

In this setup (7 operators, 23 tasks and 6 task slots), some task slots have
to be reused by more than one operator. While inspecting the UI I found
examples of such operators. This is what I was expecting, though.

However, I was a little surprised after I added one additional task
manager (hence one new task slot).

[screenshot not preserved in the archive]

After adding the new resources, Flink did not rebalance/redistribute the
graph. So this host was sitting there doing nothing. Even after putting
some load on the cluster, this node was still not used.

*After doing this exercise I have a few questions:*

1. It seems that the number of task slots must be equal to or greater than
the maximum parallelism used in the pipeline. In my case it was 6. When I
changed the parallelism of one of the operators to 7, I had to have 7 task
slots (task managers in my setup) to even be able to start the job.
Is this the case?

2. What can I do to use the extra node that was spawned while the job was
running?
In other words, if I see that one of my nodes has too much load, what can I
do? Please mind that I'm using a keyBy/hashing function in my pipeline, and
in my tests I had around 5000 unique keys.

I've tried to use the REST API to call "rescale" but I got this response:
302 {"errors":["Rescaling is temporarily disabled. See FLINK-12312."]}
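
Since the REST rescale call is disabled in 1.9, the usual workaround is to
stop with a savepoint and resubmit with the new parallelism. A sketch (paths,
job id and jar name are illustrative):

# Take a savepoint and cancel the running job
flink cancel -s hdfs:///flink/savepoints <jobId>

# Resubmit from that savepoint with a higher parallelism
flink run -s hdfs:///flink/savepoints/savepoint-xyz -p 7 -d my-job.jar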

Thanks.

[1]
https://github.com/apache/flink/blob/release-1.9/flink-container/docker/README.md



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Error submitting flink job

2017-09-01 Thread Krishnanand Khambadkone
I am trying to submit a Flink job from the command line and am seeing this
error. Any idea what could be happening?

java.io.IOException: File or directory already exists. Existing files and 
directories are not overwritten in NO_OVERWRITE mode. Use OVERWRITE mode to 
overwrite existing files and directories.



Re: Flink Job Deployment

2017-09-08 Thread Fabian Hueske
Hi Rinat,

no, this is unfortunately not possible.
When a job is submitted, all required JARs are copied into an HDFS location
that's job-specific.

Best, Fabian

2017-09-04 13:11 GMT+02:00 Rinat :

> Hi folks !
> I’ve got a question about running flink job on the top of YARN.
> Is there any possibility to store job sources in hdfs, for example
>
> /app/flink/job-name/
>   - /lib/*.jar
>   - /etc/*.properties
>
> and specify directories, that should be added to the job classpath ?
>
> Thx.
>
>
>
>


Re: Flink job performance

2024-04-15 Thread Zhanghao Chen
Hi Oscar,

The rebalance operation will go over the network stack, but does not necessarily 
involve remote data shuffling. For data shuffling between tasks on the same 
node, the local channel is used, but compared to chained operators, it still 
introduces extra data serialization overhead. For data shuffling between tasks 
on different nodes, remote network shuffling is involved.
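
(A minimal sketch of such an explicit redistribution in code; ParseFn and
HeavyFn are hypothetical functions:)

DataStream<String> parsed = source.map(new ParseFn()); // e.g. parallelism 6, chainable with the source
parsed.rebalance()        // breaks the chain; records go round-robin through the network stack
      .map(new HeavyFn()) // e.g. parallelism 12
      .print();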

Therefore, breaking the chain with an extra rebalance operation will definitely 
add extra overhead. But usually, it is negligible under a small parallelism 
setting like yours. Could you share the details of the exceptions thrown after 
the change?

From: Oscar Perez via user 
Sent: Monday, April 15, 2024 15:57
To: Oscar Perez via user ; pi-team ; 
Hermes Team 
Subject: Flink job performance

Hi community!

We have an interesting problem with Flink after increasing parallelism in a 
certain way. Here is the summary:

1) We identified that our job bottleneck was some co-keyed process operators 
that were affecting previous operators, causing backpressure.
2) What we did was to increase the parallelism of all the operators from 6 to 
12, but keep 6 for the operators that read from Kafka. The main reason was that 
all our topics have 6 partitions, so increasing the parallelism there would not 
yield better performance.

See the attached job layout prior to and after the changes:
What happened was that some operations that were chained in the same operator, 
like reading - filter - map - filter, are now rebalanced, and the overall 
performance of the job is suffering (it keeps throwing exceptions now and then).

Is the rebalance operation going over the network, or does this happen on the 
same node? How can we effectively improve the performance of this job with the 
given resources?

Thanks for the input!
Regards,
Oscar




Re: Flink job performance

2024-04-15 Thread Zhanghao Chen
Hi, there seems to be something wrong with the two images attached in the latest 
email. I cannot open them.

Best,
Zhanghao Chen

From: Oscar Perez via user 
Sent: Monday, April 15, 2024 15:57
To: Oscar Perez via user ; pi-team ; 
Hermes Team 
Subject: Flink job performance

Hi community!

We have an interesting problem with Flink after increasing parallelism in a 
certain way. Here is the summary:

1) We identified that our job bottleneck was some co-keyed process operators 
that were affecting previous operators, causing backpressure.
2) What we did was to increase the parallelism of all the operators from 6 to 
12, but keep 6 for the operators that read from Kafka. The main reason was that 
all our topics have 6 partitions, so increasing the parallelism there would not 
yield better performance.

See the attached job layout prior to and after the changes:
What happened was that some operations that were chained in the same operator, 
like reading - filter - map - filter, are now rebalanced, and the overall 
performance of the job is suffering (it keeps throwing exceptions now and then).

Is the rebalance operation going over the network, or does this happen on the 
same node? How can we effectively improve the performance of this job with the 
given resources?

Thanks for the input!
Regards,
Oscar




Re: Flink job performance

2024-04-15 Thread Zhanghao Chen
The exception basically says the remote TM is unreachable, probably terminated 
for some other reason. This may not be the root cause. Are there any other 
exceptions in the log? Also, since the overall resource usage is almost full, 
could you try allocating more CPUs and see if the instability persists?

Best,
Zhanghao Chen

From: Oscar Perez 
Sent: Monday, April 15, 2024 19:24
To: Zhanghao Chen 
Cc: Oscar Perez via user 
Subject: Re: Flink job performance

Hei, ok that is weird. Let me resend them.

Regards,
Oscar

On Mon, 15 Apr 2024 at 14:00, Zhanghao Chen 
mailto:zhanghao.c...@outlook.com>> wrote:
Hi, there seems to be something wrong with the two images attached in the latest 
email. I cannot open them.

Best,
Zhanghao Chen

From: Oscar Perez via user mailto:user@flink.apache.org>>
Sent: Monday, April 15, 2024 15:57
To: Oscar Perez via user mailto:user@flink.apache.org>>; 
pi-team mailto:pi-t...@n26.com>>; Hermes Team 
mailto:hermes-t...@n26.com>>
Subject: Flink job performance

Hi community!

We have an interesting problem with Flink after increasing parallelism in a 
certain way. Here is the summary:

1) We identified that our job bottleneck was some co-keyed process operators 
that were affecting previous operators, causing backpressure.
2) What we did was to increase the parallelism of all the operators from 6 to 
12, but keep 6 for the operators that read from Kafka. The main reason was that 
all our topics have 6 partitions, so increasing the parallelism there would not 
yield better performance.

See the attached job layout prior to and after the changes:
What happened was that some operations that were chained in the same operator, 
like reading - filter - map - filter, are now rebalanced, and the overall 
performance of the job is suffering (it keeps throwing exceptions now and then).

Is the rebalance operation going over the network, or does this happen on the 
same node? How can we effectively improve the performance of this job with the 
given resources?

Thanks for the input!
Regards,
Oscar




Re: Flink job performance

2024-04-15 Thread Oscar Perez via user
Hi,
I appreciate your comments and thank you for that. My original question
still remains, though: why did the very same job, just by changing the
aforementioned settings, show this increase in CPU usage and performance
degradation, when we should have expected the opposite behaviour?

thanks again,
Oscar

On Mon, 15 Apr 2024 at 15:11, Zhanghao Chen 
wrote:

> The exception basically says the remote TM is unreachable, probably
> terminated for some other reason. This may not be the root cause. Are
> there any other exceptions in the log? Also, since the overall resource
> usage is almost full, could you try allocating more CPUs and see if the
> instability persists?
>
> Best,
> Zhanghao Chen
> --
> *From:* Oscar Perez 
> *Sent:* Monday, April 15, 2024 19:24
> *To:* Zhanghao Chen 
> *Cc:* Oscar Perez via user 
> *Subject:* Re: Flink job performance
>
> Hei, ok that is weird. Let me resend them.
>
> Regards,
> Oscar
>
> On Mon, 15 Apr 2024 at 14:00, Zhanghao Chen 
> wrote:
>
> Hi, there seems to be something wrong with the two images attached in the latest
> email. I cannot open them.
>
> Best,
> Zhanghao Chen
> --
> *From:* Oscar Perez via user 
> *Sent:* Monday, April 15, 2024 15:57
> *To:* Oscar Perez via user ; pi-team <
> pi-t...@n26.com>; Hermes Team 
> *Subject:* Flink job performance
>
> Hi community!
>
> We have an interesting problem with Flink after increasing parallelism in
> a certain way. Here is the summary:
>
> 1) We identified that our job bottleneck was some co-keyed process
> operators that were affecting previous operators, causing backpressure.
> 2) What we did was to increase the parallelism of all the operators from 6
> to 12, but keep 6 for the operators that read from Kafka. The main reason
> was that all our topics have 6 partitions, so increasing the parallelism
> there would not yield better performance.
>
> See the attached job layout prior to and after the changes:
> What happened was that some operations that were chained in the same
> operator, like reading - filter - map - filter, are now rebalanced, and the
> overall performance of the job is suffering (it keeps throwing exceptions
> now and then)
>
> Is the rebalance operation going over the network, or does this happen on
> the same node? How can we effectively improve the performance of this job
> with the given resources?
>
> Thanks for the input!
> Regards,
> Oscar
>
>
>


Re: Flink job performance

2024-04-15 Thread Kenan Kılıçtepe
How many task managers and servers do you have?
Can you also share the task managers page of the Flink dashboard?


On Mon, Apr 15, 2024 at 10:58 AM Oscar Perez via user 
wrote:

> Hi community!
>
> We have an interesting problem with Flink after increasing parallelism in
> a certain way. Here is the summary:
>
> 1) We identified that our job bottleneck was some co-keyed process
> operators that were affecting previous operators, causing backpressure.
> 2) What we did was to increase the parallelism of all the operators from 6
> to 12, but keep 6 for the operators that read from Kafka. The main reason
> was that all our topics have 6 partitions, so increasing the parallelism
> there would not yield better performance.
>
> See the attached job layout prior to and after the changes:
> What happened was that some operations that were chained in the same
> operator, like reading - filter - map - filter, are now rebalanced, and the
> overall performance of the job is suffering (it keeps throwing exceptions
> now and then)
>
> Is the rebalance operation going over the network, or does this happen on
> the same node? How can we effectively improve the performance of this job
> with the given resources?
>
> Thanks for the input!
> Regards,
> Oscar
>
>
>


Flink job Deployment problem

2024-06-05 Thread Fokou Toukam, Thierry
Hi, I'm trying to deploy a Flink job but I get this error. How can I solve it, 
please?
[error screenshot omitted from the archive]

Thierry FOKOU |  IT M.A.Sc Student

Département de génie logiciel et TI

École de technologie supérieure  |  Université du Québec

1100, rue Notre-Dame Ouest

Montréal (Québec)  H3C 1K3




Re: Flink Job History Dump

2016-04-05 Thread Ufuk Celebi
Hey Robert!

This is currently not possible :-(, but this is a feature that is on
Flink's road map.

A very inconvenient workaround could be to manually query the REST
APIs [1], dump the responses somewhere, and run your queries there.
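
(A hedged sketch of such a dump in Java; the endpoint name follows the
1.0-era monitoring REST docs linked below, and the host and port are
placeholders:)

// uses java.net.URL and java.nio.file.{Files, Paths}
URL overview = new URL("http://jobmanager:8081/joboverview");
try (InputStream in = overview.openStream()) {
    // dump the raw JSON response for later offline analysis
    Files.copy(in, Paths.get("joboverview-" + System.currentTimeMillis() + ".json"));
}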

– Ufuk

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.0/internals/monitoring_rest_api.html

On Tue, Apr 5, 2016 at 4:00 PM, Robert Schmidtke  wrote:
> Hi everyone,
>
> I'm using Flink 0.10.2 to run some benchmarks on my cluster and I would like
> to compare it to Spark 1.6.0. Spark has an eventLog property that I can use
> to have the history written to HDFS, and then later view it offline on the
> History Server.
>
> Does Flink have a similar feature, especially for offline analysis of a
> job's history/events? I know of the Web UI, but I would like to be able to
> run my own analysis on top of the data. There is the Monitoring REST API and
> I'm wondering if it's possible to gain access to the raw data this API
> exposes, and possibly view it on a locally running web UI.
>
> Thanks a lot in advance
> Robert
>
>
> --
> My GPG Key ID: 336E2680


Visualize result of Flink job

2016-05-29 Thread Palle
Hi there

I am using Flink to analyse a lot of incoming data. Every 10 seconds it makes 
sense to present the analysis so far as some form of visualization. Every 10 
seconds I therefore will replace the current contents of the 
visualization/presentation with the analysis result of the most recent 10 
seconds.

I was first thinking of using ElasticSearch/Kibana for this because I know it 
should be easy to set up, but I am thinking it may not be the best fit: Elastic 
is by nature a search engine that is good for trending and similar use cases, 
not for wholesale replacement of the current view, and I may therefore 
experience difficulties implementing the view in Elastic.

Does anyone know of any other visualization tools that work well with Flink? 
...where it is easy to export the result of a Flink job to a user interface 
(web).

Thanks
Palle


Re: Submit Flink job programmatically

2017-04-07 Thread Kamil Dziublinski
Hey,

I had a similar problem when I tried to list the jobs and kill one by name
in a YARN cluster. Initially I also tried to set YARN_CONF_DIR but it didn't
work.
What helped, though, was passing the Hadoop conf dir to my application when
starting it, like this:
java -cp application.jar:/etc/hadoop/conf

The reason was that my application was finding the default configuration coming
from the Hadoop dependency in the fat jar and was not even trying to look for
anything in the environment variable.
When I passed the Hadoop conf dir to it, it started working properly.

Hope it helps,

Cheers,
Kamil.

On Fri, Apr 7, 2017 at 8:04 AM, Jins George  wrote:

> Hello Community,
>
> I have a need to submit a Flink job to a remote YARN cluster
> programmatically. I tried to use YarnClusterDescriptor.deploy(), but I get the
> message
> *RMProxy.java:92:main] - Connecting to ResourceManager at /0.0.0.0:8032
> <http://0.0.0.0:8032>. *It is trying to connect to the resource manager on
> the client machine.  I have set the YARN_CONF_DIR on the client machine
> and placed yarn-site.xml, core-site.xml etc.  However it does not seem to
> be picking up these files.
>
> Is this the right way to submit to a remote YARN cluster?
>
>
> Thanks,
> Jins George
>


EOFException when running Flink job

2015-04-17 Thread Stefan Bunk
Hi Squirrels,

I have some trouble with a delta-iteration transitive closure program [1].
When I run the program, I get the following error:

java.io.EOFException
at
org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333)
at
org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140)
at
org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223)
at
org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeLong(AbstractPagedOutputView.java:291)
at
org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeDouble(AbstractPagedOutputView.java:307)
at
org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:62)
at
org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:26)
at
org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:89)
at
org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:29)
at
org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:219)
at
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536)
at
org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
at
org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
at
org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
at
org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at
org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
at java.lang.Thread.run(Thread.java:745)

Both input files have been generated and written to HDFS by Flink jobs. I
already ran the Flink program that generated them several times: the error
persists.
You can find the logs at [2] and [3].

I am using the 0.9.0-milestone-1 release.

Best,
Stefan


[1] Flink job: https://gist.github.com/knub/0bfec859a563009c1d57
[2] Job manager logs: https://gist.github.com/knub/01e3a4b0edb8cde66ff4
[3] One task manager's logs:
https://gist.github.com/knub/8f2f953da95c8d7adefc


Re: Flink job finished unexpected

2021-02-24 Thread Arvid Heise
Hi Rainie,

there are two probably causes:
* Network instabilities
* Taskmanager died, then you can further dig in the taskmanager logs for
errors right before that time.

In both cases, Flink should restart the job with the correct restart
policies if configured.
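
(For reference, a sketch of configuring such a restart policy in code; the
values are placeholders:)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        3,                                 // max number of restart attempts
        Time.of(10, TimeUnit.SECONDS)));   // delay between attempts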

On Sat, Feb 20, 2021 at 10:07 PM Rainie Li  wrote:

> Hello,
>
> I launched a job with a larger load on a Hadoop YARN cluster.
> The job finished after running 5 hours; I didn't find any error in the
> JobManager log besides this connection exception.
>
>
>
>
>
> *2021-02-20 13:20:14,110 WARN  akka.remote.transport.netty.NettyTransport
>- Remote connection to [/10.1.57.146:48368
> ] failed with java.io.IOException: Connection
> reset by peer2021-02-20 13:20:14,110 WARN
>  akka.remote.ReliableDeliverySupervisor-
> Association with remote system [akka.tcp://flink-metrics@host:35241] has
> failed, address is now gated for [50] ms. Reason: [Disassociated]
> 2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor
>  - Association with remote system
> [akka.tcp://flink@host:39493] has failed, address is now gated for [50] ms.
> Reason: [Disassociated] 2021-02-20 13:20:14,110 WARN
>  akka.remote.ReliableDeliverySupervisor-
> Association with remote system [akka.tcp://flink-metrics@host:38481] has
> failed, address is now gated for [50] ms. Reason: [Disassociated] *
>
> Any idea what caused the job to be finished and how to resolve it?
> Any suggestions are appreciated.
>
> Thanks
> Best regards
> Rainie
>


Re: Flink job finished unexpected

2021-02-24 Thread Rainie Li
I see, I will check tm log.
Thank you Arvid.

Best regards
Rainie

On Wed, Feb 24, 2021 at 5:27 AM Arvid Heise  wrote:

> Hi Rainie,
>
> there are two probable causes:
> * Network instabilities
> * The TaskManager died; in that case you can dig further into the taskmanager
> logs for errors right before that time.
>
> In both cases, Flink should restart the job with the correct restart
> policies if configured.
>
> On Sat, Feb 20, 2021 at 10:07 PM Rainie Li  wrote:
>
>> Hello,
>>
>> I launched a job with a larger load on a Hadoop YARN cluster.
>> The job finished after running 5 hours; I didn't find any error in the
>> JobManager log besides this connection exception.
>>
>>
>>
>>
>>
>> *2021-02-20 13:20:14,110 WARN  akka.remote.transport.netty.NettyTransport
>>- Remote connection to [/10.1.57.146:48368
>> ] failed with java.io.IOException: Connection
>> reset by peer2021-02-20 13:20:14,110 WARN
>>  akka.remote.ReliableDeliverySupervisor-
>> Association with remote system [akka.tcp://flink-metrics@host:35241] has
>> failed, address is now gated for [50] ms. Reason: [Disassociated]
>> 2021-02-20 13:20:14,110 WARN  akka.remote.ReliableDeliverySupervisor
>>  - Association with remote system
>> [akka.tcp://flink@host:39493] has failed, address is now gated for [50] ms.
>> Reason: [Disassociated] 2021-02-20 13:20:14,110 WARN
>>  akka.remote.ReliableDeliverySupervisor-
>> Association with remote system [akka.tcp://flink-metrics@host:38481] has
>> failed, address is now gated for [50] ms. Reason: [Disassociated] *
>>
>> Any idea what caused the job to be finished and how to resolve it?
>> Any suggestions are appreciated.
>>
>> Thanks
>> Best regards
>> Rainie
>>
>


Flink job repeated restart failure

2021-03-24 Thread VINAYA KUMAR BENDI
Dear all,

One of our Flink jobs threw the exception below and failed. Several attempts to 
restart the job resulted in the same exception, and the job failed each time. 
The job started successfully only after changing the file name.

Flink Version: 1.11.2

Exception
2021-03-24 20:13:09,288 INFO  org.apache.kafka.clients.producer.KafkaProducer   
   [] - [Producer clientId=producer-2] Closing the Kafka producer with 
timeoutMillis = 0 ms.
2021-03-24 20:13:09,288 INFO  org.apache.kafka.clients.producer.KafkaProducer   
   [] - [Producer clientId=producer-2] Proceeding to force close the 
producer since pending requests could not be completed within timeout 0 ms.
2021-03-24 20:13:09,304 WARN  org.apache.flink.runtime.taskmanager.Task 
   [] - Flat Map -> async wait operator -> Process -> Sink: Unnamed 
(1/1) (8905142514cb25adbd42980680562d31) switched from RUNNING to FAILED.
java.io.IOException: No such file or directory
at java.io.UnixFileSystem.createFileExclusively(Native Method) 
~[?:1.8.0_252]
at java.io.File.createNewFile(File.java:1012) ~[?:1.8.0_252]
at 
org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.createSpillingChannel(SpanningWrapper.java:291)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.updateLength(SpanningWrapper.java:178)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.transferFrom(SpanningWrapper.java:111)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.readNonSpanningRecord(SpillingAdaptiveSpanningRecordDeserializer.java:110)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:85)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:146)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:351)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:566)
 ~[flink-dist_2.12-1.11.2.jar:1.11.2]
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536) 
~[flink-dist_2.12-1.11.2.jar:1.11.2]
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546) 
[flink-dist_2.12-1.11.2.jar:1.11.2]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
2021-03-24 20:13:09,305 INFO  org.apache.flink.runtime.taskmanager.Task 
   [] - Freeing task resources for Flat Map -> async wait operator -> 
Process -> Sink: Unnamed (1/1) (8905142514cb25adbd42980680562d31).
2021-03-24 20:13:09,311 INFO  
org.apache.flink.runtime.taskexecutor.TaskExecutor   [] - 
Un-registering task and sending final execution state FAILED to JobManager for 
task Flat Map -> async wait operator -> Process -> Sink: Unnamed (1/1) 
8905142514cb25adbd42980680562d31.

File: 
https://github.com/apache/flink/blob/release-1.11.2/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/api/serialization/SpanningWrapper.java

Related Jira ID: https://issues.apache.org/jira/browse/FLINK-18811

A similar exception is mentioned in FLINK-18811, which has a fix in 1.12.0. In 
our case, though, we didn't notice any disk failure. Are there any other 
reason(s) for the above-mentioned IOException?

While we are planning to upgrade to the latest Flink version, are there any 
other workaround(s) besides deploying the job again with a different file 
name?

Kind regards,
Vinaya


Re: Flink job getting killed

2020-04-06 Thread Fabian Hueske
Hi Giriraj,

This looks like the deserialization of a String failed.
Can you isolate the problem to a pair of sending and receiving tasks?

Best, Fabian

Am So., 5. Apr. 2020 um 20:18 Uhr schrieb Giriraj Chauhan <
graj.chau...@gmail.com>:

> Hi,
>
> We are submitting a Flink (1.9.1) job for data processing. It runs fine and
> processes data for some time, i.e. ~30 mins, and later it throws the
> following exception and the job gets killed.
>  2020-04-02 14:15:43,371 INFO  org.apache.flink.runtime.taskmanager.Task
>   - Sink: Unnamed (2/4) (45d01514f0fb99602883ca43e997e8f3)
> switched from RUNNING to FAILED.
> java.io.EOFException
> at
> org.apache.flink.core.memory.DataInputDeserializer.readUnsignedByte(DataInputDeserializer.java:321)
> at
> org.apache.flink.types.StringValue.readString(StringValue.java:769)
> at
> org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:75)
> at
> org.apache.flink.api.common.typeutils.base.StringSerializer.deserialize(StringSerializer.java:33)
> at
> org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:205)
> at
> org.apache.flink.streaming.runtime.streamrecord.StreamElementSerializer.deserialize(StreamElementSerializer.java:46)
> at
> org.apache.flink.runtime.plugable.NonReusingDeserializationDelegate.read(NonReusingDeserializationDelegate.java:55)
> at
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:141)
> at
> org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:91)
> at
> org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.pollNextNullable(StreamTaskNetworkInput.java:47)
> at
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:135)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:279)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.run(StreamTask.java:301)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:406)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> at java.base/java.lang.Thread.run(Unknown Source)
>
>
> Once the above exception occur, we do see following runtime exception
>
> java.lang.RuntimeException: Buffer pool is destroyed.
> at
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.pushToRecordWriter(RecordWriterOutput.java:110)
> at
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:89)
> at
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.collect(RecordWriterOutput.java:45)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:727)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:705)
> at
> org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
> at
> com.dell.emc.mars.topology.bl.DiffGenerator.handleCreate(DiffGenerator.java:519)
> at
> com.dell.emc.mars.topology.bl.DiffGenerator.populateHTable(DiffGenerator.java:294)
> at
> com.dell.emc.mars.topology.bl.DiffGenerator.compare(DiffGenerator.java:58)
> at
> com.dell.emc.mars.topology.bl.CompareWithDatabase.flatMap(CompareWithDatabase.java:146)
> at
> com.dell.emc.mars.topology.bl.CompareWithDatabase.flatMap(CompareWithDatabase.java:22)
> at
> org.apache.flink.streaming.api.operators.StreamFlatMap.processElement(StreamFlatMap.java:50)
> at
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:637)
> at
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:612)
> at
> org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:592)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:727)
> at
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:705)
> at
> org.apache.flink.streaming.api.operators.TimestampedCollector.collect(TimestampedCollector.java:51)
> at
> com.dell.emc.mars.topology.bl.CompareWithModel.flatMap(CompareWithModel.java:110)
> at
> com.dell.emc.mars.topology.bl.CompareWithModel.flatMap(CompareWithModel.java:24)
> at
> org.apache.flink.streaming.api.operators.StreamFlatMap.process

ERROR submitting a flink job

2020-07-14 Thread Aissa Elaffani
Hello Guys,
I am trying to launch a Flink app on a remote server, but I get this
error message.
org.apache.flink.client.program.ProgramInvocationException: The main method
caused an error:
org.apache.flink.client.program.ProgramInvocationException: Job failed
(JobID: 8bf7f299746e051ea7b94afd07e29d3d)
at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
at
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
at
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:662)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
at
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:893)
at
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:966)
at
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at
org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:966)
Caused by: java.util.concurrent.ExecutionException:
org.apache.flink.client.program.ProgramInvocationException: Job failed
(JobID: 8bf7f299746e051ea7b94afd07e29d3d)
at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at
org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:83)
at
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1620)
at sensors.StreamingJob.main(StreamingJob.java:145)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
... 8 more
Caused by: org.apache.flink.client.program.ProgramInvocationException: Job
failed (JobID: 8bf7f299746e051ea7b94afd07e29d3d)
at
org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:112)
at
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at
org.apache.flink.client.program.rest.RestClusterClient.lambda$pollResourceAsync$21(RestClusterClient.java:565)
at
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:291)
at
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943)
at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job
execution failed.
at
org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:147)
at
org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:110)
... 19 more
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by
NoRestartBackoffTimeStrategy
at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:110)
at
org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:76)
at
org.apache.flink.runtime.sch

Cancel flink job occur exception.

2018-09-04 Thread 段丁瑞
Hi all,
  I submit a Flink job in yarn-cluster mode and cancel the job with the 
savepoint option immediately after the job status changes to deployed. Sometimes 
I hit this error:

org.apache.flink.util.FlinkException: Could not cancel job .
at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
at 
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not 
complete the operation. Number of retries has been exhausted.
at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at 
org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
... 6 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: 
Could not complete the operation. Number of retries has been exhausted.
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at 
org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:274)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
... 1 more
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: 
Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 16 more
Caused by: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
... 7 more

I checked the jobmanager log; no error was found. The savepoint is correctly 
saved in HDFS. The YARN application status changed to FINISHED and the 
FinalStatus changed to KILLED.
I think this issue occurs because the RestClusterClient cannot find the 
jobmanager address after the JobManager (AM) has shut down.
My Flink version is 1.5.3.
Could anyone help me resolve this issue? Thanks!

devin.


Cancel flink job occur exception

2018-09-04 Thread 李瑞亮
Hi all,
  I submit a Flink job in yarn-cluster mode and cancel the job with the 
savepoint option immediately after the job status changes to deployed. Sometimes 
I hit this error:

org.apache.flink.util.FlinkException: Could not cancel job .
at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:585)
at 
org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:577)
at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1034)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.ExecutionException: 
org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not 
complete the operation. Number of retries has been exhausted.
at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
at 
org.apache.flink.client.program.rest.RestClusterClient.cancelWithSavepoint(RestClusterClient.java:398)
at 
org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(CliFrontend.java:583)
... 6 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: 
Could not complete the operation. Number of retries has been exhausted.
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
... 1 more
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: 
Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at 
java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 16 more
Caused by: java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
... 7 more

I checked the jobmanager log; no error was found. The savepoint is correctly 
saved in HDFS. The YARN application status changed to FINISHED and the 
FinalStatus changed to KILLED.
I think this issue occurs because the RestClusterClient cannot find the 
jobmanager address after the JobManager (AM) has shut down.
My Flink version is 1.5.3.
Could anyone help me resolve this issue? Thanks!

Best Regards!


Re: Flink Job state debug

2018-10-18 Thread Hequn Cheng
Hi yichao,
A similar question[1] was asked a few days ago. I think it would be helpful
to answer this question.

Best, Hequn

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Window-State-is-not-being-store-on-check-pointing-td23913.html

On Fri, Oct 19, 2018 at 1:18 AM jia yichao  wrote:

> Hi,
> I want to debug a Flink job locally in IntelliJ IDEA so that it restores
> state in initializeState from the snapshot of a previous execution. How do I
> set the parameters to allow the program to read the checkpoint or savepoint
> data from HDFS to restore the state of the program?
>
>


Re: Flink Job state debug

2018-10-18 Thread jia yichao
Hi Hequn,

Thank you for your help. My real purpose is to debug the following code in 
IntelliJ IDEA: 
I want to start the Flink job in IDEA so that the 54th line of code can get the 
states from the previously stored savepoint or checkpoint, and at line 56 
context.isRestored() can return true. Do you have any idea how to do this in IDEA? 
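
(The inline code screenshot did not survive the archive; for orientation, a
minimal sketch of the kind of function being discussed, with hypothetical names,
assuming the standard CheckpointedFunction interface:)

public class DebuggableFn extends RichFlatMapFunction<String, String>
        implements CheckpointedFunction {

    private transient ListState<Long> state;

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        state = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("my-state", Long.class));
        if (context.isRestored()) {
            // reached only when the job starts from an existing checkpoint/savepoint
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // registered state is snapshotted by the backend; nothing extra needed here
    }

    @Override
    public void flatMap(String value, Collector<String> out) {
        out.collect(value);
    }
}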




> On Oct 19, 2018, at 9:51 AM, Hequn Cheng  wrote:
> 
> Hi yichao,
> A similar question[1] was asked a few days ago. I think it would be helpful 
> to answer this question. 
> 
> Best, Hequn
> 
> [1] 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Window-State-is-not-being-store-on-check-pointing-td23913.html
>  
> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Window-State-is-not-being-store-on-check-pointing-td23913.html>
> On Fri, Oct 19, 2018 at 1:18 AM jia yichao  <mailto:yc_...@qq.com>> wrote:
> Hi,
> I want to debug the flink job in the local IntelliJ IDEA to restore the 
> initializeState from the snapshot of a previous execution . How do I set the 
> parameters to allow the program read the checkpoint or savepoint data from 
> hdfs to restore the state of the program?
> 



Re: Flink Job state debug

2018-10-21 Thread jia yichao
Hi Hequn,

That is what I want, thank you for your help.

Best, yichao


> On Oct 20, 2018, at 9:30 AM, Hequn Cheng  wrote:
> 
> Hi yichao,
> 
> There is a checkpointing IT test here [1]. You can run it locally and set 
> breakpoints to debug.
> Furthermore, you can search for other checkpointing test cases in the Flink 
> git repository. 
> 
> Best, Hequn
> 
> [1] 
> https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/checkpointing/KeyedStateCheckpointingITCase.java
>  
> <https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/checkpointing/KeyedStateCheckpointingITCase.java>
> On Fri, Oct 19, 2018 at 11:12 AM jia yichao  <mailto:yc_...@qq.com>> wrote:
> Hi Hequn,
> 
> Thank you for your help. My real purpose is to debug the following code in 
> IntelliJ IDEA: 
> I want to start the Flink job in IDEA so that the 54th line of code can get 
> the states from the previously stored savepoint or checkpoint, and at line 56 
> context.isRestored() can return true. Do you have any idea how to do this in 
> IDEA? 
> 
> 
> 
> 
>> On Oct 19, 2018, at 9:51 AM, Hequn Cheng > <mailto:chenghe...@gmail.com>> wrote:
>> 
>> Hi yichao,
>> A similar question[1] was asked a few days ago. I think it would be helpful 
>> to answer this question. 
>> 
>> Best, Hequn
>> 
>> [1] 
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Window-State-is-not-being-store-on-check-pointing-td23913.html
>>  
>> <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Window-State-is-not-being-store-on-check-pointing-td23913.html>
>> On Fri, Oct 19, 2018 at 1:18 AM jia yichao > <mailto:yc_...@qq.com>> wrote:
>> Hi,
>> I want to debug a Flink job locally in IntelliJ IDEA so that it restores 
>> state in initializeState from the snapshot of a previous execution. How do I 
>> set the parameters to allow the program to read the checkpoint or savepoint 
>> data from HDFS to restore the state of the program?
>> 
> 
> 



Flink job execution graph hints

2018-12-09 Thread Taher Koitawala
Hi All,
  Is there a way to send hints to the job graph builder, like
specifically disabling or enabling chaining?
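
(For what it's worth, the DataStream API exposes a global switch plus
per-operator hints; a minimal sketch with hypothetical functions:)

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.disableOperatorChaining();               // global hint: chain nothing in this job

// alternatively, keep chaining on globally and hint per operator:
env.addSource(new MySource())
   .map(new ParseFn()).startNewChain()       // start a fresh chain at this operator
   .filter(new FilterFn()).disableChaining() // keep this operator out of any chain
   .print();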


Re: Flink Job and Watermarking

2019-02-08 Thread Chesnay Schepler
Have you considered using the metric system to access the current 
watermarks for each operator? (see 
https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#io)


On 08.02.2019 03:19, Kaustubh Rudrawar wrote:

Hi,

I'm writing a job that wants to make an HTTP request once a watermark 
has reached all tasks of an operator. It would be great if this could 
be determined from outside the Flink job, but I don't think it's 
possible to access watermark information for the job as a whole. Below 
is a workaround I've come up with:


 1. Read messages from Kafka using the provided KafkaSource. Event time will
    be defined as a timestamp within the message.
 2. Key the stream based on an id from the message.
 3. DedupOperator that dedupes messages. This operator will run with a
    parallelism of N.
 4. An operator that persists the messages to S3. It doesn't need to output
    anything - it should ideally be a Sink (if it were a sink we could use
    the StreamingFileSink).
 5. Implement an operator that will make an HTTP request once processWatermark
    is called for time T (a sketch follows this list). A parallelism of 1 will
    be used for this operator as it will do very little work. Because it has
    a parallelism of 1, the operator in step 4 cannot send anything to it as
    it could become a throughput bottleneck.
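
(A minimal sketch of the step-5 operator, assuming the AbstractStreamOperator
API; Message and the request logic are hypothetical:)

public class WatermarkHttpNotifier extends AbstractStreamOperator<Void>
        implements OneInputStreamOperator<Message, Void> {

    private final long targetTimestamp; // "time T"
    private boolean fired;              // fire the request only once

    public WatermarkHttpNotifier(long targetTimestamp) {
        this.targetTimestamp = targetTimestamp;
    }

    @Override
    public void processElement(StreamRecord<Message> element) {
        // elements are ignored; only watermark progress matters here
    }

    @Override
    public void processWatermark(Watermark mark) throws Exception {
        super.processWatermark(mark); // keep forwarding watermarks downstream
        if (!fired && mark.getTimestamp() >= targetTimestamp) {
            fired = true;
            // make the HTTP request here
        }
    }
}

It would be applied via stream.transform(...) with setParallelism(1), per step 5.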

Does this implementation seem like a valid workaround? Any other 
alternatives I should consider?


Thanks for your help,
Kaustubh






Re: Flink job testing with

2018-04-18 Thread Tzu-Li (Gordon) Tai
Hi,

The docs here [1] provide some example snippets of using the Kafka connector
to consume from / write to Kafka topics.

Once you consumed a `DataStream` from a Kafka topic using the Kafka
consumer, you can use Flink transformations such as map, flatMap, etc. to
perform processing on the records.
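
(A minimal sketch against the 1.4-era 0.11 connector; topic names, addresses
and group id are placeholders:)

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "my-group");

DataStream<String> stream = env.addSource(
        new FlinkKafkaConsumer011<>("input-topic", new SimpleStringSchema(), props));

stream.map(String::toUpperCase)   // any transformation goes here
      .addSink(new FlinkKafkaProducer011<>(
              "localhost:9092", "output-topic", new SimpleStringSchema()));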

Hope this helps!

Cheers,
Gordon

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/connectors/kafka.html



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Constant backpressure on flink job

2019-04-22 Thread Monika Hristova
Hello,

We are experiencing regular backpressure (at least once a week). I would like 
to ask how we can fix it.

Currently our configurations are:
vi /usr/lib/flink/conf/flink-conf.yaml
# Settings applied by Cloud Dataproc initialization action
jobmanager.rpc.address: bonusengine-prod-m-0
jobmanager.heap.mb: 4096
jobmanager.rpc.port: 6123
taskmanager.heap.mb: 4096
taskmanager.memory.preallocate: false
taskmanager.numberOfTaskSlots: 8
#taskmanager.network.numberOfBuffers: 21952 # legacy deprecated
taskmanager.network.memory.fraction: 0.9
taskmanager.network.memory.min: 67108864
taskmanager.network.memory.max: 1073741824
taskmanager.memory.segment-size: 65536
parallelism.default: 52
web.port: 8081
web.timeout: 12
heartbeat.interval: 1
heartbeat.timeout: 10
yarn.application-attempts: 10
high-availability: zookeeper
high-availability.zookeeper.quorum: 
bonusengine-prod-m-0:2181,bonusengine-prod-m-1:2181,bonusengine-prod-m-2:2181
high-availability.zookeeper.path.root: /flink
#high-availability.zookeeper.storageDir: hdfs:///flink/recovery # legacy 
deprecated
high-availability.storageDir: hdfs:///flink/recovery
flink.partition-discovery.interval-millis=6
fs.hdfs.hadoopconf: /etc/hadoop/conf
state.backend: rocksdb
state.checkpoints.dir: hdfs:///bonusengine/checkpoints/
state.savepoints.dir: hdfs:///bonusengine/savepoints
metrics.reporters: stsd
metrics.reporter.stsd.class: org.apache.flink.metrics.statsd.StatsDReporter
metrics.reporter.stsd.host: 127.0.0.1
metrics.reporter.stsd.port: 8125
zookeeper.sasl.disable: true
yarn.reallocate-failed: true
yarn.maximum-failed-containers: 32
web.backpressure.refresh-interval: 6
web.backpressure.num-samples: 100
web.backpressure.delay-between-samples: 50

with Hadoop HA cluster: masters -> 8 vCPUs, 7.2 GB and slaves -> 16 vCPUs, 60 
GB with yarn configuration(see attached file)

We have one YARN session started, in which 8 jobs run. Three of them consume 
the same source (Kafka), which is causing the backpressure, but only one of 
them experiences backpressure. The state of the job is 20-something MB and 
checkpointing is configured as follows:
checkpointing.interval=30
# makes sure value in ms of progress happens between checkpoints
checkpointing.pause_between_checkpointing=24
# checkpoints have to complete within value in ms or are discarded
checkpointing.timeout=6
# allows given number of checkpoints to be in progress at the same time
checkpointing.max_concurrent_checkpoints=1
# enables/disables externalized checkpoints which are retained after job cancellation
checkpointing.externalized_checkpoints.enabled=true

The checkpointing pause was increased and the timeout was reduced on multiple 
occasions, as the job kept failing, unable to recover from backpressure. RocksDB 
is configured as the state backend. The problem keeps reproducing even with a 
one-minute timeout. I would also like to point out that the ideal checkpointing 
interval for that job would be 2 min.
I would like to ask what might be the problem, or at least how to trace it?

Best Regards,
Monika Hristova




[attachment: yarn-site.xml]


Flink job server with HA

2019-06-03 Thread Boris Lublinsky
I am trying to experiment with the Flink job server with HA, and I am noticing 
that in this case the 
method putJobGraph in the class SubmittedJobGraphStore is never invoked. (I can 
see that it is invoked in the case of a session cluster when a job is added.)
As a result, when I try to restart a Job Master, it finds no running jobs 
and does not try to restore the job.
Am I missing something?

 

Boris Lublinsky
FDP Architect
boris.lublin...@lightbend.com
https://www.lightbend.com/



Collections as Flink job parameters

2019-11-18 Thread Протченко Алексей

 
Hello all.
 
I have a question about providing complex configuration to a Flink job. We are 
working on some kind of platform for running user-defined packages which 
contain the main business logic. All the parameters are provided 
via the command line and parsed with ParameterTool. That's fine as long as we 
have parameters of simple types like String, int etc. But the problem is that 
we need to add a Map of custom parameters so that users can provide 
configuration variables specific to their code. 
 
Reading the documentation and code of ParameterTool, I do not see a clear way 
to do it. Is using a third-party argument parser the only option?
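
(For illustration, a minimal sketch of a prefix-key workaround; the "user."
prefix is hypothetical, and it assumes ParameterTool.toMap() exposes all parsed
arguments:)

// invoked e.g. with: --user.threshold 10 --user.mode fast
ParameterTool params = ParameterTool.fromArgs(args);

Map<String, String> userConf = new HashMap<>();
for (Map.Entry<String, String> e : params.toMap().entrySet()) {
    if (e.getKey().startsWith("user.")) {
        userConf.put(e.getKey().substring("user.".length()), e.getValue());
    }
}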
 
Best regards,
Alex
 
 
--
Алексей Протченко

Re: Flink Job cluster scalability

2020-01-08 Thread Zhu Zhu
Hi KristoffSC,

Each task needs a slot to run. However, Flink enables slot sharing[1] by
default so that one slot can host one parallel instance of each task in a
job. That's why your job can start with 6 slots.
However, different parallel instances of the same task cannot share a slot.
That's why you need at least 6 slots to run your job.

You can put tasks into different slot sharing groups via
'.slotSharingGroup(xxx)' to force certain tasks to not share slots. This
allows the tasks to not burden each other. However, in this way the job
will need more slots to start.
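
(A minimal sketch; LightFn and HeavyKeyedFn are hypothetical. Note that
downstream operators inherit the group unless it is set again:)

source
    .map(new LightFn())            // stays in the "default" slot sharing group
    .keyBy(e -> e.getKey())
    .process(new HeavyKeyedFn())
    .slotSharingGroup("heavy")     // instances of this task get slots from their own group
    .print();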

So for your questions:
#1 yes
#2 At the moment, you will need to resubmit your job with the adjusted
parallelism. The rescale CLI was experimental and has been temporarily
removed [2].

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.9/concepts/runtime.html#task-slots-and-resources
[2]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/DISCUSS-Temporarily-remove-support-for-job-rescaling-via-CLI-action-quot-modify-quot-td27447.html

Thanks,
Zhu Zhu

KristoffSC  wrote on Thu, Jan 9, 2020 at 1:05 AM:

> Hi all,
> I must say I'm very impressed by Flink and what it can do.
>
> I was trying to play around with Flink operator parallelism and scalability
> and I have few questions regarding this subject.
>
> My setup is:
> 1. Flink 1.9.1
> 2. Docker Job Cluster, where each Task manager has only one task slot. I'm
> following [1]
> 3. env setup:
> env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1000, 1000));
> env.setParallelism(1);
> env.setMaxParallelism(128);
> env.enableCheckpointing(10 * 60 * 1000);
>
> Please mind that I am using operator chaining here.
>
> My pipeline setup:
> <
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2311/Capture2.png>
>
>
>
> As you can see I have 7 operators (few of them were actually chained and
> this is ok), with different parallelism level. This all gives me 23 tasks
> total.
>
>
> I've noticed that with "one task manager = one task slot" approach I have
> to
> have 6 task slots/task managers to be able to start this pipeline.
>
> <
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2311/Capture1.png>
>
>
> If number of task slots is lower than 6, job is scheduled but not started.
>
> With 6 task slots everything is working fine and I must say that I'm very
> impressed with the way that Flink balanced data between task slots. Data was
> distributed very evenly between operator instances/tasks.
>
> In this setup (7 operators, 23 tasks and 6 task slots), some task slots have
> to be reused by more than one operator. While inspecting the UI I've found
> examples of such operators. This is what I was expecting, though.
>
> However I was surprised a little bit after I added one additional task
> manager (hence one new task slot)
>
> <
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2311/Capture3.png>
>
>
> After adding new resources, Flink did not rebalance/redistribute the
> graph. So this host was sitting there doing nothing. Even after putting
> some load on the cluster, this node was still not used.
>
>
> *After doing this exercise I have a few questions:*
>
> 1. It seems that the number of task slots must be equal to or greater than
> the max parallelism used in the pipeline. In my case it was 6. When I
> changed the parallelism of one of the operators to 7, I had to have 7 task
> slots (task managers in my setup) to be able to even start the job.
> Is this the case?
>
> 2. What can I do to use the extra node that was spawned while the job was
> running?
> In other words, if I see that one of my nodes has too much load, what can I
> do? Please mind that I'm using a keyBy/hashing function in my pipeline and
> in my tests I had around 5000 unique keys.
>
> I've tried to use the REST API to call "rescale" but I got this response:
> /302{"errors":["Rescaling is temporarily disabled. See FLINK-12312."]}/
>
> Thanks.
>
> [1]
>
> https://github.com/apache/flink/blob/release-1.9/flink-container/docker/README.md
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


Re: Flink Job cluster scalability

2020-01-09 Thread David Maddison
Hi KristoffSC,

As Zhu Zhu explained, Flink does not currently auto-scale a Job as new
resources become available. Instead the Job must be stopped via a savepoint
and restarted with a new parallelism (the old rescale CLI experiment used to
perform this).

Making Flink reactive to new resources and auto scaling jobs is something
I'm currently very interested in. An approach on how to change Flink to
support this has been previously outlined/discussed in FLINK-10407 (
https://issues.apache.org/jira/browse/FLINK-10407)

/David/

On Thu, Jan 9, 2020 at 7:38 AM Zhu Zhu  wrote:

> Hi KristoffSC,
>
> Each task needs a slot to run. However, Flink enables slot sharing[1] by
> default so that one slot can host one parallel instance of each task in a
> job. That's why your job can start with 6 slots.
> However, different parallel instances of the same task cannot share a
> slot. That's why you need at least 6 slots to run your job.
>
> You can set tasks to be in different slot sharing group via
> '.slotSharingGroup(xxx)' to force certain tasks to not share slots. This
> allows the tasks to not burden each other. However, in this way the job
> will need more slots to start.
>
> So for your questions:
> #1 yes
> #2 ATM, you will need to resubmit your job with the adjusted parallelism.
> The rescale cli was experimental and was temporarily removed [2]
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/concepts/runtime.html#task-slots-and-resources
> [2]
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/DISCUSS-Temporarily-remove-support-for-job-rescaling-via-CLI-action-quot-modify-quot-td27447.html
>
> Thanks,
> Zhu Zhu
>
> KristoffSC  wrote on Thu, Jan 9, 2020 at 1:05 AM:
>
>> Hi all,
>> I must say I'm very impressed by Flink and what it can do.
>>
>> I was trying to play around with Flink operator parallelism and
>> scalability
>> and I have few questions regarding this subject.
>>
>> My setup is:
>> 1. Flink 1.9.1
>> 2. Docker Job Cluster, where each Task manager has only one task slot. I'm
>> following [1]
>> 3. env setup:
>> env.setRestartStrategy(RestartStrategies.fixedDelayRestart(1000, 1000));
>> env.setParallelism(1);
>> env.setMaxParallelism(128);
>> env.enableCheckpointing(10 * 60 * 1000);
>>
>> Please mind that I am using operator chaining here.
>>
>> My pipeline setup:
>> <
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2311/Capture2.png>
>>
>>
>>
>> As you can see I have 7 operators (a few of them were actually chained,
>> which is OK), with different parallelism levels. This all gives me 23
>> tasks in total.
>>
>>
>> I've noticed that with "one task manager = one task slot" approach I have
>> to
>> have 6 task slots/task managers to be able to start this pipeline.
>>
>> <
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2311/Capture1.png>
>>
>>
>> If the number of task slots is lower than 6, the job is scheduled but not
>> started.
>>
>> With 6 task slots everything is working fine, and I must say that I'm very
>> impressed with the way that Flink balanced data between task slots. Data
>> was distributed very evenly between operator instances/tasks.
>>
>> In this setup (7 operators, 23 tasks and 6 task slots), some task slots
>> have to be reused by more than one operator. While inspecting the UI I
>> found examples of such operators. This is what I was expecting though.
>>
>> However I was surprised a little bit after I added one additional task
>> manager (hence one new task slot)
>>
>> <
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t2311/Capture3.png>
>>
>>
>> After adding new resources, Flink did not rebalance/redistribute the
>> graph. So this host was sitting there doing nothing. Even after putting
>> some load on the cluster, this node was still not used.
>>
>>
>> *After doing this exercise I have few questions:*
>>
>> 1. It seems that the number of task slots must be equal to or greater than
>> the maximum parallelism used in the pipeline. In my case it was 6. When I
>> changed the parallelism of one of the operators to 7, I had to have 7 task
>> slots (task managers in my setup) to even start the job.
>> Is this the case?
>>
>> 2. What can I do to use the extra node that was spawned while the job was
>> running?
>> In other words, if I see that one of my nodes is under too much load, what
>> can I do? Please note that I'm using a keyBy/hashing function in my
>> pipeline and in my tests I had around 5000 unique keys.
>>
>> I've tried to use the REST API to call "rescale" but I got this response:
>> /302{"errors":["Rescaling is temporarily disabled. See FLINK-12312."]}/
>>
>> Thanks.
>>
>> [1]
>>
>> https://github.com/apache/flink/blob/release-1.9/flink-container/docker/README.md
>>
>>
>>
>> --
>> Sent from:
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>>
>


Re: Flink Job cluster scalability

2020-01-09 Thread KristoffSC
Thank you David and Zhu Zhu,
this helps a lot.

I have follow up questions though.

Having this
/"Instead the Job must be stopped via a savepoint and restarted with a new
parallelism"/

and the slot sharing [1] feature, I got the impression that if I started my
cluster with more than 6 task slots, Flink would try to deploy tasks across
all resources, trying to use all available resources during job submission.

I did two tests with my original job.
1. I started a Job Cluster with 7 task slots (7 task managers, since in this
case 1 task manager has one task slot).
2. I started a Session cluster with 28 task slots in total. In this case I
had 7 task managers, with 4 task slots each.

For case 1, I used the "FLINK_JOB" variable as stated in [2]. For case 2, I
submitted my job from the UI after Flink became operational.


For both cases it used only 6 task slots, so it was still reusing task
slots. I had the impression that it would try to use as many of the
available resources as it could.

What do you think about this?


[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.9/concepts/runtime.html#task-slots-and-resources
[2]
https://github.com/apache/flink/blob/release-1.9/flink-container/docker/README.md








--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink Job cluster scalability

2020-01-09 Thread Zhu Zhu
Hi KristoffSC,

Did you increase the parallelism of the vertex that has the largest
parallelism?
Or did you explicitly set tasks to be in different slot sharing groups?
With the default slot sharing, the number of slots required/used equals
the max parallelism of a JobVertex, which is 6 in your case.

On Thu, Jan 9, 2020 at 9:26 PM, KristoffSC  wrote:

> Thank you David and Zhu Zhu,
> this helps a lot.
>
> I have follow up questions though.
>
> Having this
> /"Instead the Job must be stopped via a savepoint and restarted with a new
> parallelism"/
>
> and the slot sharing [1] feature, I got the impression that if I started my
> cluster with more than 6 task slots, Flink would try to deploy tasks across
> all resources, trying to use all available resources during job submission.
>
> I did two tests with my original job.
> 1. I started a Job Cluster with 7 task slots (7 task managers, since in this
> case 1 task manager has one task slot).
> 2. I started a Session cluster with 28 task slots in total. In this case I
> had 7 task managers, with 4 task slots each.
>
> For case 1, I used the "FLINK_JOB" variable as stated in [2]. For case 2, I
> submitted my job from the UI after Flink became operational.
>
>
> For both cases it used only 6 task slots, so it was still reusing task
> slots. I had the impression that it would try to use as many of the
> available resources as it could.
>
> What do you think about this?
>
>
> [1]
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/concepts/runtime.html#task-slots-and-resources
> [2]
>
> https://github.com/apache/flink/blob/release-1.9/flink-container/docker/README.md
>
>
>
>
>
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


Re: Flink Job cluster scalability

2020-01-10 Thread KristoffSC
Hi Zhu Zhu,
Well, in my last test I did not change the job config: I did not change the
parallelism of any operator, and I did not change the policy regarding slot
sharing (it stays at the default). Operator chaining is set to true,
without any extra actions like "start new chain", "disable chain", etc.

What I assumed, however, is that Flink would try to find the most efficient
way to use the available resources during job submission.

In the first case, where I had only 6 task managers (which matches the max
parallelism of my JobVertex), Flink reused some task slots. Adding extra
task slots was not effective, for the reason described by David. This is
understandable.

However, I was assuming that if I submit my job on a cluster that has more
than 6 task managers, Flink will not share task slots by default. That did
not happen. Flink deployed the job in the same way, regardless of the extra
resources.


So the conclusion is that simply resubmitting the job will not work in this
case, and actually I can't be certain that it ever will, since in my case
Flink still reuses task slots.

If this were the production case, I would have to do a test job submission
on a testing environment and potentially change the job. Not the config,
but adding slot sharing groups etc.
So if this were the production case I would not be able to react fast; I
would have to deploy a new version of my app/job, which could be
problematic.




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink Job cluster scalability

2020-01-10 Thread Yangze Guo
Hi KristoffSC

As Zhu said, Flink enables slot sharing[1] by default. This feature has
nothing to do with the resources of your cluster; the benefit of this
feature is described in [1] as well. I mean, it will not detect how many
slots are in your cluster and adjust its behavior to that number. If
you want to make the best use of your cluster, you can increase the
parallelism of the vertex that has the largest parallelism, or
"disable" the slot sharing as described in [2]. IMO, the first way matches
your purpose (see the sketch after the links below).

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/concepts/runtime.html#task-slots-and-resources
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/#task-chaining-and-resource-groups
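
A minimal sketch of both options; the stream, operator names and the
parallelism value are placeholders, not taken from this thread:

// events is an existing DataStream<Event>
events
    .keyBy(e -> e.getKey())
    .map(new HeavyMapper())
    .setParallelism(7)             // option 1: raise the largest vertex parallelism
    .slotSharingGroup("heavy")     // option 2: isolate this operator in its own slots
    .addSink(new MySink());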

Best,
Yangze Guo

On Fri, Jan 10, 2020 at 6:49 PM KristoffSC
 wrote:
>
> Hi Zhu Zhu,
> Well, in my last test I did not change the job config: I did not change the
> parallelism of any operator, and I did not change the policy regarding slot
> sharing (it stays at the default). Operator chaining is set to true,
> without any extra actions like "start new chain", "disable chain", etc.
>
> What I assumed, however, is that Flink would try to find the most efficient
> way to use the available resources during job submission.
>
> In the first case, where I had only 6 task managers (which matches the max
> parallelism of my JobVertex), Flink reused some task slots. Adding extra
> task slots was not effective, for the reason described by David. This is
> understandable.
>
> However, I was assuming that if I submit my job on a cluster that has more
> than 6 task managers, Flink will not share task slots by default. That did
> not happen. Flink deployed the job in the same way, regardless of the extra
> resources.
>
>
> So the conclusion is that simply resubmitting the job will not work in this
> case, and actually I can't be certain that it ever will, since in my case
> Flink still reuses task slots.
>
> If this were the production case, I would have to do a test job submission
> on a testing environment and potentially change the job. Not the config,
> but adding slot sharing groups etc.
> So if this were the production case I would not be able to react fast; I
> would have to deploy a new version of my app/job, which could be
> problematic.
>
>
>
>
> --
> Sent from: 
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Error submitting flink job

2017-09-01 Thread ant burton
Is this of any help

https://stackoverflow.com/questions/33890759/how-to-specify-overwrite-to-writeastext-in-apache-flink-streaming-0-10-0

fs.overwrite-files: true in your flink-conf.yaml
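
The write mode can also be set per sink in code; a minimal sketch, assuming an
existing DataStream<String> named results (the path is a placeholder):

import org.apache.flink.core.fs.FileSystem;

// overwrite the target instead of failing when it already exists
results.writeAsText("hdfs:///output/results", FileSystem.WriteMode.OVERWRITE);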


> On 1 Sep 2017, at 23:36, Krishnanand Khambadkone  
> wrote:
> 
> I am trying to submit a flink job from the command line and seeing this 
> error.  Any idea what could be happening
> 
> java.io.IOException: File or directory already exists. Existing files and 
> directories are not overwritten in NO_OVERWRITE mode. Use OVERWRITE mode to 
> overwrite existing files and directories.
> 



Re: Error submitting flink job

2017-09-01 Thread Krishnanand Khambadkone
I had to restart the Flink process. That fixed the issue.

Sent from my iPhone

> On Sep 1, 2017, at 3:39 PM, ant burton  wrote:
> 
> Is this of any help
> 
> https://stackoverflow.com/questions/33890759/how-to-specify-overwrite-to-writeastext-in-apache-flink-streaming-0-10-0
> 
> fs.overwrite-files: true in your flink-conf.yaml
> 
> 
>> On 1 Sep 2017, at 23:36, Krishnanand Khambadkone  
>> wrote:
>> 
>> I am trying to submit a flink job from the command line and seeing this 
>> error.  Any idea what could be happening
>> 
>> java.io.IOException: File or directory already exists. Existing files and 
>> directories are not overwritten in NO_OVERWRITE mode. Use OVERWRITE mode to 
>> overwrite existing files and directories.
>> 
> 


Integration testing my Flink job

2017-11-14 Thread Ron Crocker
Is there a good way to do integration testing for a Flink job - that is, I want 
to inject a set of events and see the proper behavior?
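
A minimal sketch of one common pattern, assuming a local environment and a
static collecting sink; all class and variable names are placeholders:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class PipelineIT {
    // static, so every parallel sink instance in the local mini-cluster can reach it
    static final List<String> RESULTS =
            Collections.synchronizedList(new ArrayList<String>());

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment();
        env.fromElements("a", "b", "a")               // inject a fixed set of events
           .map(new MapFunction<String, String>() {
               @Override
               public String map(String value) {
                   return value.toUpperCase();        // the logic under test
               }
           })
           .addSink(new SinkFunction<String>() {
               @Override
               public void invoke(String value) {
                   RESULTS.add(value);                // collect results for assertions
               }
           });
        env.execute();
        // assert on RESULTS here, e.g. that it contains "A", "B", "A"
    }
}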

Ron
—
Ron Crocker
Principal Engineer & Architect
( ( •)) New Relic
rcroc...@newrelic.com
M: +1 630 363 8835



Flink Job across Data Centers

2023-04-12 Thread Chirag Dewan via user
Hi,
Can anyone share any experience on running Flink jobs across data centers?
I am trying to create a multi-site/geo-replicated Kafka cluster. I want my 
Flink job to be closely colocated with my multi-site Kafka cluster. If the 
Flink job is bound to a single data center, I believe we will observe a lot of 
client latency when accessing brokers in another DC.
Rather, if I can make my Flink Kafka consumers rack aware and fetch data from 
the closest Kafka broker, I should get better results.
I will be deploying Flink 1.16 on Kubernetes with Strimzi managed Apache Kafka.
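
A minimal sketch of the consumer side, assuming Flink's KafkaSource and Kafka's
follower fetching (KIP-392); the topic, group and rack ids are placeholders, and
the brokers must additionally be configured with broker.rack and a rack-aware
replica.selector.class:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;

KafkaSource<String> source = KafkaSource.<String>builder()
    .setBootstrapServers("kafka:9092")
    .setTopics("events")
    .setGroupId("flink-dc1")
    .setValueOnlyDeserializer(new SimpleStringSchema())
    .setProperty("client.rack", "dc1") // match the rack/DC this consumer runs in
    .build();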
Thanks.


Re: Flink job Deployment problem

2024-06-05 Thread Xiqian YU
Hi Fokou,

It seems the `features` column was inferred to be of RAW type, which doesn't
carry any specific type information, and that causes the subsequent type cast
to fail.

This can happen when Flink can't infer the return type of a lambda expression
and no explicit return type information was provided[1]; it can be solved by
inserting an explicit .returns() call.[2]
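
A minimal sketch, assuming the upstream produces elements with an `id` and a
double-array `features` field (both names and types are placeholders):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.types.Row;

// input is an existing DataStream of a POJO with getId() and getFeatures()
DataStream<Row> rows = input
    .map(e -> Row.of(e.getId(), e.getFeatures()))
    // tell Flink the exact row type instead of letting it fall back to RAW
    .returns(Types.ROW_NAMED(
        new String[] {"id", "features"},
        Types.INT,
        Types.PRIMITIVE_ARRAY(Types.DOUBLE)));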

If that doesn’t solve your problem, could you please share more information 
like a code snippet or a minimum POC?

Regards,
yux

[1] 
https://nightlies.apache.org/flink/flink-docs-release-1.19/docs/dev/datastream/java_lambdas/
[2] 
https://stackoverflow.com/questions/70295970/flink-table-get-type-information


From: Fokou Toukam, Thierry 
Date: Thursday, June 6, 2024 at 09:04
To: user 
Subject: Flink job Deployment problem
Hi, I'm trying to deploy a Flink job but I get this error. How can I solve it, 
please?
[error screenshot attached]

Thierry FOKOU |  IT M.A.Sc Student

Département de génie logiciel et TI

École de technologie supérieure  |  Université du Québec

1100, rue Notre-Dame Ouest

Montréal (Québec)  H3C 1K3




Re: Flink job Deployment problem

2024-06-05 Thread Hang Ruan
Hi, Fokou Toukam.

This error occurs when the schema in the sink does not match the schema
provided from upstream.
You may need to check whether the type of the field `features` in the sink
is the same as the type provided upstream.

Best,
Hang

On Thu, Jun 6, 2024 at 10:22, Fokou Toukam, Thierry  wrote:

> Hi, I'm trying to deploy a Flink job but I get this error. How can I solve
> it, please?
>
> *Thierry FOKOU *| * IT M.A.Sc Student*
>
> Département de génie logiciel et TI
>
> École de technologie supérieure  |  Université du Québec
>
> 1100, rue Notre-Dame Ouest
>
> Montréal (Québec)  H3C 1K3
>
>
>


Re: Visualize result of Flink job

2016-05-29 Thread Kanstantsin Kamkou
>  I am thinking it may not be the best fit, because Elastic is by nature a 
> search engine that is good for trending and stuff like that - not entire 
> replacement of the current view.
Why u think that the elasticsearch is not the right tool? To visualise
something u have to find what to visualise  first, right?


On Sun, May 29, 2016 at 8:20 PM, Palle  wrote:
> Hi there
>
> I am using Flink to analyse a lot of incoming data. Every 10 seconds it makes 
> sense to present the analysis so far as some form of visualization. Every 10 
> seconds I therefore will replace the current contents of the 
> visualization/presentation with the analysis result of the most recent 10 
> seconds.
>
> I was first thinking of using ElasticSearch/Kibana for this because I know it 
> should be easy to set up, but I am thinking it may not be the best fit, 
> because Elastic is by nature a search engine that is good for trending and 
> stuff like that - not entire replacement of the current view. And therefore I 
> may also experience difficulties implementing the view in Elastic.
>
> Does anyone know of any other visualization tools that work well with Flink? 
> ...where it is easy to export the result of a Flink job to a user interface 
> (web).
>
> Thanks
> Palle


Re: Visualize result of Flink job

2016-05-29 Thread Palle
I know exactly what to visualize. As I wrote, it is the latest result of the 
Flink job I would like to visualize. There is no need to use elastic to find it 
first.

The data I have is of such a nature that it could be written into a file every 
10 seconds, meaning that the file would at all times contain the most recent 
results (at most 10 seconds old). I am not interested in the history, and 
therefore I should think Elastic is not the best fit. So my question is whether 
anyone knows of a component (Apache or other) that can make the visualization a 
little nicer than just the file :-)


- Original message -
> From: Kanstantsin Kamkou 
> To: user@flink.apache.org
> Date: Sun, 29 May 2016 22:42
> Subject: Re: Visualize result of Flink job
> 
> >  I am thinking it may not be the best fit, because Elastic is by
> nature a search engine that is good for trending and stuff like that
> - not entire replacement of the current view.
> Why u think that the elasticsearch is not the right tool? To
> visualise
> something u have to find what to visualise  first, right?
> 
> 
> On Sun, May 29, 2016 at 8:20 PM, Palle  wrote:
> > Hi there
> >
> > I am using Flink to analyse a lot of incoming data. Every 10
> seconds it makes sense to present the analysis so far as some form
> of visualization. Every 10 seconds I therefore will replace the
> current contents of the visualization/presentation with the analysis
> result of the most recent 10 seconds.
> >
> > I was first thinking of using ElasticSearch/Kibana for this
> because I know it should be easy to set up, but I am thinking it may
> not be the best fit, because Elastic is by nature a search engine
> that is good for trending and stuff like that - not entire
> replacement of the current view. And therefore I may also experience
> difficulties implementing the view in Elastic.
> >
> > Does anyone know of any other visualization tools that work well
> with Flink? ...where it is easy to export the result of a Flink job
> to a user interface (web).
> >
> > Thanks
> > Palle



Re: Visualize result of Flink job

2016-05-30 Thread Palle
OK, I found a product that seems to be what I am looking for: Apache Zeppelin. I 
will have a look into that one. If anyone can point me to an example (Git) 
outputting data from Flink to the Zeppelin Notebook, I would be happy.


- Original message -
> From: Palle 
> To: user@flink.apache.org
> Date: Mon, 30 May 2016 08:20
> Subject: Re: Visualize result of Flink job
> 
> I know exactly what to visualize. As I wrote, it is the latest
> result of the Flink job I would like to visualize. There is no need
> to use elastic to find it first.
> 
> The data I have is of such a  nature that  they every 10 seconds
> could be written into a file, meaning that the file at all times
> would contain the most recent results (at latest 10 seconds old). I
> am not interested in the history, and therefore I should think
> elastic is not the best fit. So my question is if anyone knows of a
> component (Apache or other) that can make the visualization a little
> nicer than just the file :-)
> 
> 
> - Original message -
> > From: Kanstantsin Kamkou 
> > To: user@flink.apache.org
> > Date: Sun, 29 May 2016 22:42
> > Subject: Re: Visualize result of Flink job
> > 
> > >  I am thinking it may not be the best fit, because Elastic is by
> > nature a search engine that is good for trending and stuff like
> that
> > - not entire replacement of the current view.
> > Why u think that the elasticsearch is not the right tool? To
> > visualise
> > something u have to find what to visualise  first, right?
> > 
> > 
> > On Sun, May 29, 2016 at 8:20 PM, Palle  wrote:
> > > Hi there
> > >
> > > I am using Flink to analyse a lot of incoming data. Every 10
> > seconds it makes sense to present the analysis so far as some form
> > of visualization. Every 10 seconds I therefore will replace the
> > current contents of the visualization/presentation with the
> analysis
> > result of the most recent 10 seconds.
> > >
> > > I was first thinking of using ElasticSearch/Kibana for this
> > because I know it should be easy to set up, but I am thinking it
> may
> > not be the best fit, because Elastic is by nature a search engine
> > that is good for trending and stuff like that - not entire
> > replacement of the current view. And therefore I may also
> experience
> > difficulties implementing the view in Elastic.
> > >
> > > Does anyone know of any other visualization tools that work well
> > with Flink? ...where it is easy to export the result of a Flink
> job
> > to a user interface (web).
> > >
> > > Thanks
> > > Palle
> 



Re: Visualize result of Flink job

2016-05-30 Thread Robert Metzger
Hi Palle,
I think there is currently no way of sending the data from a streaming
Flink job into Zeppelin.
What rate / amount of data do you expect to send every 10 seconds to the
visualization tool?

People have used Flink -> ES -> Kibana for this purpose in the past [1],
but I think you can not send millions of records per second into ES.
Something like 1000 - 5000 elements / second should easily work for a small
ES setup.


[1]
https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana

On Mon, May 30, 2016 at 10:00 AM, Palle  wrote:

> OK, I found a product that seems to be what I am looking for: Apache
> Zeppelin. I will have a look into that one. If anyone can point me to an
> example (Git) outputting data from Flink to the Zeppelin Notebook I would
> be happy.
>
>
> - Original message -
> > From: Palle 
> > To: user@flink.apache.org
> > Date: Mon, 30 May 2016 08:20
> > Subject: Re: Visualize result of Flink job
> >
> > I know exactly what to visualize. As I wrote, it is the latest
> > result of the Flink job I would like to visualize. There is no need
> > to use elastic to find it first.
> >
> > The data I have is of such a  nature that  they every 10 seconds
> > could be written into a file, meaning that the file at all times
> > would contain the most recent results (at latest 10 seconds old). I
> > am not interested in the history, and therefore I should think
> > elastic is not the best fit. So my question is if anyone knows of a
> > component (Apache or other) that can make the visualization a little
> > nicer than just the file :-)
> >
> >
> > - Original message -
> > > From: Kanstantsin Kamkou 
> > > To: user@flink.apache.org
> > > Date: Sun, 29 May 2016 22:42
> > > Subject: Re: Visualize result of Flink job
> > >
> > > >  I am thinking it may not be the best fit, because Elastic is by
> > > nature a search engine that is good for trending and stuff like
> > that
> > > - not entire replacement of the current view.
> > > Why u think that the elasticsearch is not the right tool? To
> > > visualise
> > > something u have to find what to visualise  first, right?
> > >
> > >
> > > On Sun, May 29, 2016 at 8:20 PM, Palle  wrote:
> > > > Hi there
> > > >
> > > > I am using Flink to analyse a lot of incoming data. Every 10
> > > seconds it makes sense to present the analysis so far as some form
> > > of visualization. Every 10 seconds I therefore will replace the
> > > current contents of the visualization/presentation with the
> > analysis
> > > result of the most recent 10 seconds.
> > > >
> > > > I was first thinking of using ElasticSearch/Kibana for this
> > > because I know it should be easy to set up, but I am thinking it
> > may
> > > not be the best fit, because Elastic is by nature a search engine
> > > that is good for trending and stuff like that - not entire
> > > replacement of the current view. And therefore I may also
> > experience
> > > difficulties implementing the view in Elastic.
> > > >
> > > > Does anyone know of any other visualization tools that work well
> > > with Flink? ...where it is easy to export the result of a Flink
> > job
> > > to a user interface (web).
> > > >
> > > > Thanks
> > > > Palle
> >
>
>


Re: Visualize result of Flink job

2016-05-30 Thread Palle
Hi Robert.

Thank you for the answer.

I am looking at a rate of max 10.000 elements / 10 seconds, so
Elastic/Kibana is probably the way to go. I'll find a way to model it.

Thanks.

/Palle

- Original message -

> From: Robert Metzger 
> To: user@flink.apache.org 
> Date: Mon, 30 May 2016 12:31
> Subject: Re: Visualize result of Flink job
> 
> Hi Palle, I think there is currently no way of sending the data from
> a streaming Flink job into Zeppelin. What rate / amount of data do you
> expect to send every 10 seconds to the visualization tool? People have
> used Flink -> ES -> Kibana for this purpose in the past [1], but I
> think you can not send millions of records per second into ES.
> Something like 1000 - 5000 elements / second should easily work for a
> small ES setup.
> [1]
> https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana
> On Mon, May 30, 2016 at 10:00 AM, Palle  wrote:
> 
>   OK, I found a product that seems to be what I am looking for:
>   Apache Zeppelin. I will have a look into that one. If anyone can
>   point me to an example (Git) outputting data from Flink to the
>   Zeppelin Notebook I would be happy.
> 
> 
>   - Original message -
>   > From: Palle 
>   > To: user@flink.apache.org
>   > Date: Mon, 30 May 2016 08:20
>   > Subject: Re: Visualize result of Flink job
>   >
>   > I know exactly what to visualize. As I wrote, it is the latest
>   > result of the Flink job I would like to visualize. There is no
>   need
>   > to use elastic to find it first.
>   >
>   > The data I have is of such a nature that they every 10 seconds
>   > could be written into a file, meaning that the file at all
>   times
>   > would contain the most recent results (at latest 10 seconds
>   old). I
>   > am not interested in the history, and therefore I should think
>   > elastic is not the best fit. So my question is if anyone knows
>   of a
>   > component (Apache or other) that can make the visualization a
>   little
>   > nicer than just the file :-)
>   >
>   >
>   > - Original message -
>   > > From: Kanstantsin Kamkou 
>   > > To: user@flink.apache.org
>   > > Date: Sun, 29 May 2016 22:42
>   > > Subject: Re: Visualize result of Flink job
>   > >
>   > > > I am thinking it may not be the best fit, because Elastic
>   is by
>   > > nature a search engine that is good for trending and stuff
>   like
>   > that
>   > > - not entire replacement of the current view.
>   > > Why u think that the elasticsearch is not the right tool? To
>   > > visualise
>   > > something u have to find what to visualise first, right?
>   > >
>   > >
>   > > On Sun, May 29, 2016 at 8:20 PM, Palle  wrote:
>   > > > Hi there
>   > > >
>   > > > I am using Flink to analyse a lot of incoming data. Every
>   10
>   > > seconds it makes sense to present the analysis so far as some
>   form
>   > > of visualization. Every 10 seconds I therefore will replace
>   the
>   > > current contents of the visualization/presentation with the
>   > analysis
>   > > result of the most recent 10 seconds.
>   > > >
>   > > > I was first thinking of using ElasticSearch/Kibana for this
>   > > because I know it should be easy to set up, but I am thinking
>   it
>   > may
>   > > not be the best fit, because Elastic is by nature a search
>   engine
>   > > that is good for trending and stuff like that - not entire
>   > > replacement of the current view. And therefore I may also
>   > experience
>   > > difficulties implementing the view in Elastic.
>   > > >
>   > > > Does anyone know of any other visualization tools that work
>   well
>   > > with Flink? ...where it is easy to export the result of a
>   Flink
>   > job
>   > > to a user interface (web).
>   > > >
>   > > > Thanks
>   > > > Palle
>   >



Re: Visualize result of Flink job

2016-05-30 Thread Robert Metzger
Hi,
sounds doable. I think it should be easy to set up a first proof of concept
for this.
Let us know if you need any further assistance.

Regards,
Robert

On Mon, May 30, 2016 at 2:29 PM, Palle  wrote:

> Hi Robert.
>
> Thank you for the answer.
>
> I am looking at a rate of max 10.000 elements / 10 seconds, so
> Elastic/Kibana is probably the way to go. I'll find a way to model it.
>
> Thanks.
>
> /Palle
>
> - Original message -
>
> *From:* Robert Metzger 
> *To:* user@flink.apache.org 
> *Date:* Mon, 30 May 2016 12:31
>
> *Subject:* Re: Visualize result of Flink job
>
> Hi Palle,
> I think there is currently no way of sending the data from a streaming
> Flink job into Zeppelin.
> What rate / amount of data do you expect to send every 10 seconds to the
> visualization tool?
> People have used Flink -> ES -> Kibana for this purpose in the past [1],
> but I think you can not send millions of records per second into ES.
> Something like 1000 - 5000 elements / second should easily work for a small
> ES setup.
> [1]
> https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana
>
> On Mon, May 30, 2016 at 10:00 AM, Palle  wrote:
>
>> OK, I found a product that seems to be what I am looking for: Apache
>> Zeppelin. I will have a look into that one. If anyone can point me to an
>> example (Git) outputting data from Flink to the Zeppelin Notebook I would
>> be happy.
>>
>>
>> - Original message -
>> > From: Palle 
>> > To: user@flink.apache.org
>> > Date: Mon, 30 May 2016 08:20
>> > Subject: Re: Visualize result of Flink job
>> >
>> > I know exactly what to visualize. As I wrote, it is the latest
>> > result of the Flink job I would like to visualize. There is no need
>> > to use elastic to find it first.
>> >
>> > The data I have is of such a  nature that  they every 10 seconds
>> > could be written into a file, meaning that the file at all times
>> > would contain the most recent results (at latest 10 seconds old). I
>> > am not interested in the history, and therefore I should think
>> > elastic is not the best fit. So my question is if anyone knows of a
>> > component (Apache or other) that can make the visualization a little
>> > nicer than just the file :-)
>> >
>> >
>> > - Original message -
>> > > From: Kanstantsin Kamkou 
>> > > To: user@flink.apache.org
>> > > Date: Sun, 29 May 2016 22:42
>> > > Subject: Re: Visualize result of Flink job
>> > >
>> > > >  I am thinking it may not be the best fit, because Elastic is by
>> > > nature a search engine that is good for trending and stuff like
>> > that
>> > > - not entire replacement of the current view.
>> > > Why u think that the elasticsearch is not the right tool? To
>> > > visualise
>> > > something u have to find what to visualise  first, right?
>> > >
>> > >
>> > > On Sun, May 29, 2016 at 8:20 PM, Palle  wrote:
>> > > > Hi there
>> > > >
>> > > > I am using Flink to analyse a lot of incoming data. Every 10
>> > > seconds it makes sense to present the analysis so far as some form
>> > > of visualization. Every 10 seconds I therefore will replace the
>> > > current contents of the visualization/presentation with the
>> > analysis
>> > > result of the most recent 10 seconds.
>> > > >
>> > > > I was first thinking of using ElasticSearch/Kibana for this
>> > > because I know it should be easy to set up, but I am thinking it
>> > may
>> > > not be the best fit, because Elastic is by nature a search engine
>> > > that is good for trending and stuff like that - not entire
>> > > replacement of the current view. And therefore I may also
>> > experience
>> > > difficulties implementing the view in Elastic.
>> > > >
>> > > > Does anyone know of any other visualization tools that work well
>> > > with Flink? ...where it is easy to export the result of a Flink
>> > job
>> > > to a user interface (web).
>> > > >
>> > > > Thanks
>> > > > Palle
>> >
>>
>
>
>


Flink job restart at checkpoint interval

2016-11-14 Thread Satish Chandra Gupta
Hi,

I am using ValueState, backed by FsStateBackend on HDFS, as follows:

env.setStateBackend(new FsStateBackend(stateBackendPath))
env.enableCheckpointing(checkpointInterval)


It is a non-iterative job running on Flink/YARN. The job restarts at the
checkpointInterval; I have tried intervals varying from 30 sec to 10 min.
Any idea why it could be restarting?

I see following exception in the log:

==

2016-11-14 09:24:28,787 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph-
Source: Custom Source -> Map -> Filter -> cell_users_update (1/1)
(fd72961bedbb0f18bffb5ae66b926313) switched from RUNNING to CANCELING
2016-11-14 09:24:28,788 INFO  org.apache.flink.yarn.YarnJobManager
 - Status of job 03a56958263a688dc34cc8d5069aac8f
(Processor) changed to FAILING.*java.lang.RuntimeException: Error
triggering a checkpoint as the result of receiving checkpoint barrier*
at 
org.apache.flink.streaming.runtime.tasks.StreamTask$2.onEvent(StreamTask.java:701)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask$2.onEvent(StreamTask.java:691)
at 
org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:203)
at 
org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:129)
at 
org.apache.flink.streaming.runtime.io.StreamTwoInputProcessor.processInput(StreamTwoInputProcessor.java:215)
at 
org.apache.flink.streaming.runtime.tasks.TwoInputStreamTask.run(TwoInputStreamTask.java:89)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:225)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
at java.lang.Thread.run(Thread.java:745)
Caused by: com.esotericsoftware.kryo.KryoException:
java.io.IOException: DataStreamer Exception:
at com.esotericsoftware.kryo.io.Output.flush(Output.java:165)
at 
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.serialize(KryoSerializer.java:200)
at 
org.apache.flink.runtime.state.filesystem.AbstractFsState.snapshot(AbstractFsState.java:85)
at 
org.apache.flink.runtime.state.AbstractStateBackend.snapshotPartitionedState(AbstractStateBackend.java:265)
at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotOperatorState(AbstractStreamOperator.java:176)
at 
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotOperatorState(AbstractUdfStreamOperator.java:121)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:498)
at 
org.apache.flink.streaming.runtime.tasks.StreamTask$2.onEvent(StreamTask.java:695)
... 8 more
Caused by: java.io.IOException: DataStreamer Exception:
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:563)
Caused by: java.lang.ExceptionInInitializerError
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1322)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1266)
at 
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
Caused by: java.lang.RuntimeException:
javax.xml.parsers.ParserConfigurationException: Feature
'http://apache.org/xml/features/xinclude' is not recognized.
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2648)
at 
org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2492)
at 
org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2405)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
at 
org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
at org.apache.hadoop.conf.Configuration.getInt(Configuration.java:1251)
at 
org.apache.hadoop.hdfs.protocol.HdfsConstants.(HdfsConstants.java:76)
... 3 more
Caused by: javax.xml.parsers.ParserConfigurationException: Feature
'http://apache.org/xml/features/xinclude' is not recognized.
at 
org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown
Source)
at 
org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2530)
... 9 more
2016-11-14 09:24:28,789 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph-
Source: Custom Source -> Map -> Filter -> device_status_update (1/1)
(9fe20e7a4336b3960b88febc89135d97) switched from RUNNING to CANCELING
2016-11-14 09:24:28,789 INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph-
Source: Custom Source -> Map -> Filter -> Map -> Filter ->
cab_position_update (1/1) (91ea224efa3ba7d130405fbd247f4a45) switched
from RUNNING to CANCELING

==

Thanks,
+satish


Re: EOFException when running Flink job

2015-04-17 Thread Stephan Ewen
Hi!

After a quick look over the code, this seems like a bug. One corner case of
the overflow handling code does not check for the "running out of memory"
condition.

I would like to wait and see if Robert Waury has some ideas about this; he is
the one most familiar with the code.

I would guess, though, that you should be able to work around it by
either setting the solution set as "unmanaged", or by slightly changing the
memory configuration. It seems to be a rare corner case, hit only if the
memory runs out in a very special situation. You may be able to avoid running
into it when using slightly more or less memory.
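
A minimal sketch of the first workaround, assuming a DataSet-API delta
iteration; the dataset names, iteration count and key position are placeholders:

import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;

// keep the solution set as a plain object map on the heap instead of
// Flink's managed-memory hash table
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
    initialSolutionSet.iterateDelta(initialWorkset, 100, 0);
iteration.setSolutionSetUnManaged(true);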

Greetings,
Stephan


On Fri, Apr 17, 2015 at 3:59 PM, Stefan Bunk 
wrote:

> Hi Squirrels,
>
> I have some trouble with a delta-iteration transitive closure program [1].
> When I run the program, I get the following error:
>
> java.io.EOFException
> at
> org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333)
> at
> org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140)
> at
> org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223)
> at
> org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeLong(AbstractPagedOutputView.java:291)
> at
> org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeDouble(AbstractPagedOutputView.java:307)
> at
> org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:62)
> at
> org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:26)
> at
> org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:89)
> at
> org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:29)
> at
> org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:219)
> at
> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536)
> at
> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
> at
> org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
> at
> org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
> at
> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
> at
> org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
> at java.lang.Thread.run(Thread.java:745)
>
> Both input files have been generated and written to HDFS by Flink jobs. I
> already ran the Flink program that generated them several times: the error
> persists.
> You can find the logs at [2] and [3].
>
> I am using the 0.9.0-milestone-1 release.
>
> Best,
> Stefan
>
>
> [1] Flink job: https://gist.github.com/knub/0bfec859a563009c1d57
> [2] Job manager logs: https://gist.github.com/knub/01e3a4b0edb8cde66ff4
> [3] One task manager's logs:
> https://gist.github.com/knub/8f2f953da95c8d7adefc
>


Re: EOFException when running Flink job

2015-04-19 Thread Stefan Bunk
Hi,

I tested three configurations:

taskmanager.heap.mb: 6144, taskmanager.memory.fraction: 0.5
taskmanager.heap.mb: 5544, taskmanager.memory.fraction: 0.6
taskmanager.heap.mb: 5144, taskmanager.memory.fraction: 0.7

The error occurs in all three configurations.
In the last configuration, I can even find another exception in the logs of
one of the taskmanagers:

19.Apr. 13:39:29 INFO  Task -
IterationHead(WorksetIteration (Resolved-Redirects)) (10/10) switched to
FAILED : java.lang.IndexOutOfBoundsException: Index: 161, Size: 161
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at
org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.resetTo(InMemoryPartition.java:352)
at
org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.access$100(InMemoryPartition.java:301)
at
org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:226)
at
org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536)
at
org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
at
org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
at
org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
at
org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
at
org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
at java.lang.Thread.run(Thread.java:745)

Greetings
Stefan

On 17 April 2015 at 16:05, Stephan Ewen  wrote:

> Hi!
>
> After a quick look over the code, this seems like a bug. One corner case of
> the overflow handling code does not check for the "running out of memory"
> condition.
>
> I would like to wait and see if Robert Waury has some ideas about this; he is
> the one most familiar with the code.
>
> I would guess, though, that you should be able to work around it by
> either setting the solution set as "unmanaged", or by slightly changing the
> memory configuration. It seems to be a rare corner case, hit only if the
> memory runs out in a very special situation. You may be able to avoid running
> into it when using slightly more or less memory.
>
> Greetings,
> Stephan
>
>
> On Fri, Apr 17, 2015 at 3:59 PM, Stefan Bunk 
> wrote:
>
>> Hi Squirrels,
>>
>> I have some trouble with a delta-iteration transitive closure program [1].
>> When I run the program, I get the following error:
>>
>> java.io.EOFException
>> at
>> org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333)
>> at
>> org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.advance(AbstractPagedOutputView.java:140)
>> at
>> org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeByte(AbstractPagedOutputView.java:223)
>> at
>> org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeLong(AbstractPagedOutputView.java:291)
>> at
>> org.apache.flink.runtime.memorymanager.AbstractPagedOutputView.writeDouble(AbstractPagedOutputView.java:307)
>> at
>> org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:62)
>> at
>> org.apache.flink.api.common.typeutils.base.DoubleSerializer.serialize(DoubleSerializer.java:26)
>> at
>> org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:89)
>> at
>> org.apache.flink.api.scala.typeutils.CaseClassSerializer.serialize(CaseClassSerializer.scala:29)
>> at
>> org.apache.flink.runtime.operators.hash.InMemoryPartition.appendRecord(InMemoryPartition.java:219)
>> at
>> org.apache.flink.runtime.operators.hash.CompactingHashTable.insertOrReplaceRecord(CompactingHashTable.java:536)
>> at
>> org.apache.flink.runtime.operators.hash.CompactingHashTable.buildTableWithUniqueKey(CompactingHashTable.java:347)
>> at
>> org.apache.flink.runtime.iterative.task.IterationHeadPactTask.readInitialSolutionSet(IterationHeadPactTask.java:209)
>> at
>> org.apache.flink.runtime.iterative.task.IterationHeadPactTask.run(IterationHeadPactTask.java:270)
>> at
>> org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
>> at
>> org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:217)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Both input files have been generated and written to HDFS by Flink jobs. I
>> already ran the Flink program that gener

Re: EOFException when running Flink job

2015-04-19 Thread Mihail Vieru







Re: EOFException when running Flink job

2015-04-20 Thread Stephan Ewen


JSON data source for Flink Job

2015-05-28 Thread Tamara Mendt
Hello,

I have a JSON file containing multiple JSON objects and wish to use this as
a data source for a Flink Job.

What is the best way to do this?
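
A minimal sketch, assuming one JSON object per line and Jackson on the
classpath; Event is a placeholder POJO and the path is a placeholder:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

DataSet<Event> events = env
    .readTextFile("file:///data/events.json")
    .map(new RichMapFunction<String, Event>() {
        private transient ObjectMapper mapper;

        @Override
        public void open(Configuration parameters) {
            mapper = new ObjectMapper(); // one mapper per parallel instance
        }

        @Override
        public Event map(String line) throws Exception {
            return mapper.readValue(line, Event.class);
        }
    });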

Cheers,

Tamara


Re: Flink job repeated restart failure

2021-03-25 Thread Arvid Heise
Hi Vinaya,

SpillingAdaptiveSpanningRecordDeserializer tries to create a spill file in
the temp directory, which you can configure by setting io.tmp.dirs. By
default, it's set to System.getProperty("java.io.tmpdir"), which seems to
be invalid in your case. (Note that the directory has to exist on the task
managers.)
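
For example, in flink-conf.yaml (the path is a placeholder and must exist on
every task manager):

io.tmp.dirs: /data/flink/tmp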

Best,

Arvid

On Thu, Mar 25, 2021 at 7:27 AM VINAYA KUMAR BENDI 
wrote:

> Dear all,
>
>
>
> One of the Flink jobs gave below exception and failed. Several attempts to
> restart the job resulted in the same exception and the job failed each
> time. The job started successfully only after changing the file name.
>
>
>
> *Flink Version*: 1.11.2
>
>
>
> *Exception*
>
> 2021-03-24 20:13:09,288 INFO
> org.apache.kafka.clients.producer.KafkaProducer  [] - [Producer
> clientId=producer-2] Closing the Kafka producer with timeoutMillis = 0 ms.
>
> 2021-03-24 20:13:09,288 INFO
> org.apache.kafka.clients.producer.KafkaProducer  [] - [Producer
> clientId=producer-2] Proceeding to force close the producer since pending
> requests could not be completed within timeout 0 ms.
>
> 2021-03-24 20:13:09,304 WARN
> org.apache.flink.runtime.taskmanager.Task[] - Flat Map
> -> async wait operator -> Process -> Sink: Unnamed (1/1)
> (8905142514cb25adbd42980680562d31) switched from RUNNING to FAILED.
>
> java.io.IOException: No such file or directory
>
> at java.io.UnixFileSystem.createFileExclusively(Native Method)
> ~[?:1.8.0_252]
>
> at java.io.File.createNewFile(File.java:1012) ~[?:1.8.0_252]
>
> at
> org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.createSpillingChannel(SpanningWrapper.java:291)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.updateLength(SpanningWrapper.java:178)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.runtime.io.network.api.serialization.SpanningWrapper.transferFrom(SpanningWrapper.java:111)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.readNonSpanningRecord(SpillingAdaptiveSpanningRecordDeserializer.java:110)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:85)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:146)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:351)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:566)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536)
> ~[flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
> [flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
> [flink-dist_2.12-1.11.2.jar:1.11.2]
>
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
>
> 2021-03-24 20:13:09,305 INFO
> org.apache.flink.runtime.taskmanager.Task[] - Freeing
> task resources for Flat Map -> async wait operator -> Process -> Sink:
> Unnamed (1/1) (8905142514cb25adbd42980680562d31).
>
> 2021-03-24 20:13:09,311 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor   [] -
> Un-registering task and sending final execution state FAILED to JobManager
> for task Flat Map -> async wait operator -> Process -> Sink: Unnamed (1/1)
> 8905142514cb25adbd42980680562d31.
>
>
>
> *File*:
> https://github.com/apache/flink/blob/release-1.11.2/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/api/serialization/SpanningWrapper.java
>
>
>
> *Related Jira ID*: https://issues.apache.org/jira/browse/FLINK-18811
>
>
>
> A similar exception is mentioned in FLINK-18811, which has a fix in 1.12.0.
> Though in our case, we didn’t notice any disk failure. Is there any other
> reason(s) for the above mentioned IOException?
>
>
>
> While we are planning to upgrade to the latest Flink version, are there
> any other workaround(s) instead of deploying the j

Re: Flink job repeated restart failure

2021-03-25 Thread vinaya
Hi Arvid,

Thank you for the suggestion.

Indeed, the specified setting was commented out in the Flink configuration
(flink-conf.yaml).

  # io.tmp.dirs: /tmp

Is there a fallback (e.g. /tmp) if io.tmp.dirs and
System.getProperty("java.io.tmpdir") are both not set?

Will configure this setting to a valid value as suggested.

Kind regards,
Vinaya



--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink job repeated restart failure

2021-03-26 Thread Arvid Heise
Hi Vinaya,

java.io.tmpdir is already the fallback and I'm not aware of another level
of fallback.

Ensuring java.io.tmpdir is valid is also relevant for some third-party
libraries that rely on it (e.g. FileSystem that cache local files). It's
good practice to set that appropriately.

On Fri, Mar 26, 2021 at 6:32 AM vinaya  wrote:

> Hi Arvid,
>
> Thank you for the suggestion.
>
> Indeed, the specified setting was commented out in the Flink configuration
> (flink-conf.yaml).
>
>   # io.tmp.dirs: /tmp
>
> Is there a fallback (e.g. /tmp) if io.tmp.dirs and
> System.getProperty("java.io.tmpdir") are both not set?
>
> Will configure this setting to a valid value as suggested.
>
> Kind regards,
> Vinaya
>
>
>
> --
> Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


Re: ERROR submitting a flink job

2020-07-15 Thread Yun Tang
Hi Aissa

The reason why the job exits is due to "Recovery is suppressed by 
NoRestartBackoffTimeStrategy" and this is because Flink use "no restart" 
strategy when checkpoint is not enabled [1]. That is to say, you should better 
look at why the job failed at the 1st time, once the job failed and you will 
meet the errors you pasted if the strategy is "no restart".

[1] 
https://ci.apache.org/projects/flink/flink-docs-stable/dev/task_failure_recovery.html#restart-strategies
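
A minimal sketch of enabling checkpointing together with an explicit restart
strategy; the interval and retry values are placeholders:

import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;

// enabling checkpointing also switches the default away from "no restart"
env.enableCheckpointing(60_000);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
        3,                               // number of restart attempts
        Time.of(10, TimeUnit.SECONDS))); // delay between attempts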

Best
Yun Tang

From: Aissa Elaffani 
Sent: Wednesday, July 15, 2020 7:29
To: user@flink.apache.org 
Subject: ERROR submmiting a flink job

Hello Guys,
I am trying to launch a Flink app on a remote server, but I get this error 
message.
org.apache.flink.client.program.ProgramInvocationException: The main method 
caused an error: org.apache.flink.client.program.ProgramInvocationException: 
Job failed (JobID: 8bf7f299746e051ea7b94afd07e29d3d)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:335)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:205)
at 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:138)
at 
org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:662)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:210)
at 
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:893)
at 
org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:966)
at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:966)
Caused by: java.util.concurrent.ExecutionException: 
org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 
8bf7f299746e051ea7b94afd07e29d3d)
at 
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
at 
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
at 
org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:83)
at 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1620)
at sensors.StreamingJob.main(StreamingJob.java:145)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:321)
... 8 more
Caused by: org.apache.flink.client.program.ProgramInvocationException: Job 
failed (JobID: 8bf7f299746e051ea7b94afd07e29d3d)
at 
org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:112)
at 
java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
at 
java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at 
org.apache.flink.client.program.rest.RestClusterClient.lambda$pollResourceAsync$21(RestClusterClient.java:565)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at 
java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
at 
org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$8(FutureUtils.java:291)
at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at 
java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:575)
at 
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:943)
at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.cli

Tools for Flink Job performance testing

2020-08-10 Thread narasimha
Hi,

I'm new to the streaming world and looking into performance testing tools. Are
there any recommended performance testing tools for Flink?

-- 
A.Narasimha Swamy


Running Flink job as a rest

2020-12-02 Thread dhurandar S
Can a Flink job run as a REST server, where the Apache Flink job listens on a
port (443)? When a user calls this URL with a payload, the data would go
directly to the Apache Flink windowing function.

Right now Flink can ingest data from Kafka or Kinesis, but we have a use case
where we would like to push data to Flink, where Flink is listening on a port
(a rough sketch of one option follows below).
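
One rough way to get this today, sketched under the assumption that a plain
TCP listener (rather than full HTTPS on 443) is acceptable: a custom source
that opens a ServerSocket and emits each received line as a record. The class
and names below are illustrative and not production-ready (single connection,
no TLS, no replay on failure):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.ServerSocket;
import java.net.Socket;

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class ListeningSocketSource implements SourceFunction<String> {

    private final int port;
    private volatile boolean running = true;

    public ListeningSocketSource(int port) {
        this.port = port;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // Listen on the given port and emit each received line downstream,
        // e.g. into a keyed window. Accepts one client at a time.
        try (ServerSocket server = new ServerSocket(port)) {
            while (running) {
                try (Socket client = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()))) {
                    String line;
                    while (running && (line = in.readLine()) != null) {
                        ctx.collect(line);
                    }
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

It would be attached with env.addSource(new ListeningSocketSource(8443)) and
typically run with parallelism 1, so that only one subtask binds the port.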

-- 
Thank you and regards,
Dhurandar


Re: Cancel flink job occur exception

2018-09-08 Thread Gary Yao
Hi all,

The question is being handled on the dev mailing list [1].

Best,
Gary

[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Cancel-flink-job-occur-exception-td24056.html

On Tue, Sep 4, 2018 at 2:21 PM, rileyli(李瑞亮)  wrote:

> Hi all,
>   I submit a Flink job in yarn-cluster mode and cancel the job with the
> savepoint option immediately after the job status changes to DEPLOYED.
> Sometimes I get this error:
>
> org.apache.flink.util.FlinkException: Could not cancel job .
> at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(
> CliFrontend.java:585)
> at org.apache.flink.client.cli.CliFrontend.runClusterAction(
> CliFrontend.java:960)
> at org.apache.flink.client.cli.CliFrontend.cancel(
> CliFrontend.java:577)
> at org.apache.flink.client.cli.CliFrontend.parseParameters(
> CliFrontend.java:1034)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.util.concurrent.ExecutionException:
> org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not
> complete the operation. Number of retries has been exhausted.
> at java.util.concurrent.CompletableFuture.reportGet(
> CompletableFuture.java:357)
> at java.util.concurrent.CompletableFuture.get(
> CompletableFuture.java:1895)
> at org.apache.flink.client.program.rest.RestClusterClient.
> cancelWithSavepoint(RestClusterClient.java:398)
> at org.apache.flink.client.cli.CliFrontend.lambda$cancel$4(
> CliFrontend.java:583)
> ... 6 more
> Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException:
> Could not complete the operation. Number of retries has been exhausted.
> at org.apache.flink.runtime.concurrent.FutureUtils.lambda$
> retryOperationWithDelay$5(FutureUtils.java:213)
> at java.util.concurrent.CompletableFuture.uniWhenComplete(
> CompletableFuture.java:760)
> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(
> CompletableFuture.java:736)
> ... 1 more
> Caused by: java.util.concurrent.CompletionException:
> java.net.ConnectException: Connect refuse: xxx/xxx.xxx.xxx.xxx:xxx
> at java.util.concurrent.CompletableFuture.encodeThrowable(
> CompletableFuture.java:292)
> at java.util.concurrent.CompletableFuture.completeThrowable(
> CompletableFuture.java:308)
> at java.util.concurrent.CompletableFuture.uniCompose(
> CompletableFuture.java:943)
> at java.util.concurrent.CompletableFuture$UniCompose.
> tryFire(CompletableFuture.java:926)
> ... 16 more
> Caused by: java.net.ConnectException: Connect refuse:
> xxx/xxx.xxx.xxx.xxx:xxx
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(
> SocketChannelImpl.java:717)
> at org.apache.flink.shaded.netty4.io.netty.channel.
> socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> at org.apache.flink.shaded.netty4.io.netty.channel.nio.
> AbstractNioChannel$AbstractNioUnsafe.finishConnect(
> AbstractNioChannel.java:281)
> ... 7 more
>
> I checked the jobmanager log and found no error. The savepoint is
> correctly saved in HDFS. The YARN application status changed to FINISHED and
> the final status changed to KILLED.
> I think this issue occurs because the RestClusterClient cannot find the
> jobmanager address after the JobManager (AM) has shut down.
> My Flink version is 1.5.3.
> Could anyone help me resolve this issue? Thanks!
>
> Best Regards!
>
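
A common way to avoid this race, sketched here as a workaround rather than a
fix from this thread, is to split the operation: trigger the savepoint while
the JobManager (AM) is still reachable, then cancel. Illustrative CLI, with
placeholders for the IDs:

# take the savepoint while the AM is still up
./bin/flink savepoint <jobId> hdfs:///flink/savepoints -yid <yarnApplicationId>

# then cancel the job
./bin/flink cancel <jobId> -yid <yarnApplicationId>

The single-shot "cancel -s" path keeps polling the cluster for the savepoint
result, which can fail exactly as above when the AM shuts down before the
client's retries are exhausted.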


Flink Job Cluster Deployment on K8s

2018-10-18 Thread Thad Truman
Hello,

I am trying to experiment with the new Flink job cluster on Kubernetes that is 
available with the Flink 1.6.x release.

I am using the instructions
here<https://github.com/apache/flink/blob/release-1.6/flink-container/docker/README.md>
to create the docker image, which is working great. This image then gets
pushed to our Artifactory.

I am able to create the job cluster service using
this<https://github.com/apache/flink/blob/release-1.6/flink-container/kubernetes/job-cluster-service.yaml>
helm chart.

However, when I try to deploy the job cluster job using the helm chart below
(based on
this<https://github.com/apache/flink/blob/release-1.6/flink-container/kubernetes/job-cluster-job.yaml.template>
one):

apiVersion: batch/v1
kind: Job
metadata:
  name: flink-job-cluster
spec:
  template:
    metadata:
      labels:
        app: flink
        component: job-cluster
    spec:
      imagePullSecrets:
      - name: artifactory-docker-registry
      restartPolicy: OnFailure
      containers:
      - name: flink-job-cluster
        image: {ImageOnArtifactory}
        args: ["job-cluster", "--job-classname", "job.jar",
               "-Djobmanager.rpc.address=flink-job-cluster",
               "-Dparallelism.default=1", "-Dblob.server.port=6124",
               "-Dquery.server.ports=6125"]
        ports:
        - containerPort: 6123
          name: rpc
        - containerPort: 6124
          name: blob
        - containerPort: 6125
          name: query
        - containerPort: 8081
          name: ui

I get this error on the pod:

%d [%thread] %-5level %logger - %msg%n org.apache.flink.util.FlinkException: 
Could not load the provied entrypoint class.
at 
org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.createPackagedProgram(StandaloneJobClusterEntryPoint.java:102)
 ~[flink-dist_2.11-1.6.1.jar:1.6.1]
at 
org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.retrieveJobGraph(StandaloneJobClusterEntryPoint.java:84)
 ~[flink-dist_2.11-1.6.1.jar:1.6.1]
at 
org.apache.flink.runtime.entrypoint.JobClusterEntrypoint.createDispatcher(JobClusterEntrypoint.java:107)
 ~[flink-dist_2.11-1.6.1.jar:1.6.1]
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startClusterComponents(ClusterEntrypoint.java:353)
 ~[flink-dist_2.11-1.6.1.jar:1.6.1]
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:232)
 ~[flink-dist_2.11-1.6.1.jar:1.6.1]
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:190)
 ~[flink-dist_2.11-1.6.1.jar:1.6.1]
at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
 ~[flink-dist_2.11-1.6.1.jar:1.6.1]
at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:189)
 [flink-dist_2.11-1.6.1.jar:1.6.1]
at 
org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.main(StandaloneJobClusterEntryPoint.java:170)
 [flink-dist_2.11-1.6.1.jar:1.6.1]
Caused by: java.lang.ClassNotFoundException: job.jar
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) 
~[?:1.8.0_111-internal]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_111-internal]
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) 
~[?:1.8.0_111-internal]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_111-internal]
at 
org.apache.flink.container.entrypoint.StandaloneJobClusterEntryPoint.createPackagedProgram(StandaloneJobClusterEntryPoint.java:99)
 ~[flink-dist_2.11-1.6.1.jar:1.6.1]
... 8 more


When I launch and attach to the docker container, job.jar exists in /opt/flink
and there is a symbolic link to it in /opt/flink/lib.

Any ideas as to why job.jar can't be found?

Our flink version is 1.6.1.  We are using the flink-1.6.1-bin-scala_2.11 
distribution.
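
Judging from the ClassNotFoundException above, a likely cause is that
--job-classname is being given the jar file name, while
StandaloneJobClusterEntryPoint expects the fully qualified name of the job's
entry class and therefore tries to load "job.jar" as a class. A sketch of the
corrected args, with a hypothetical class name:

        args: ["job-cluster",
               "--job-classname", "com.example.StreamingJob",
               "-Djobmanager.rpc.address=flink-job-cluster",
               "-Dparallelism.default=1",
               "-Dblob.server.port=6124",
               "-Dquery.server.ports=6125"]

The jar itself only needs to be on the classpath, which the symbolic link in
/opt/flink/lib already appears to provide.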

Any help would be much appreciated.

Thanks,

Thad Truman | Software Engineer | Neovest, Inc.
A: 1145 S 800 E, Ste 310 Orem, UT 84097
T: +1 801 900 2480
E: ttru...@neovest.com<mailto:ttru...@neovest.com>


Support Desk: supp...@neovest.com<mailto:supp...@neovest.com> / +1 800 433 4276




Problem with Flink job and Kafka.

2021-10-18 Thread Marco Villalobos
I have the simplest Flink job: it simply dequeues from one Kafka topic and
writes to another Kafka topic, but with headers, and manually copies the
event time into the Kafka sink.

It works as intended, but sometimes I am getting this error:

org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to
write a non-default producerId at version 1.

Does anybody know what this means and how to fix this?

Thank you.
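
This exception generally comes from the Kafka client when a producer that
needs a real producer ID (idempotent or transactional writes) talks to a
broker or topic message format older than 0.11. Flink's Kafka sink uses a
transactional producer for Semantic.EXACTLY_ONCE, so one hedged workaround,
assuming exactly-once transactions are not required for this copy job, is to
build the sink with AT_LEAST_ONCE; the topic, broker address and String
payload type below are illustrative:

import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AtLeastOnceSinkSketch {

    public static FlinkKafkaProducer<String> buildSink() {
        // Copy the record's event time into the outgoing ProducerRecord;
        // headers could be carried over via the 6-argument constructor.
        KafkaSerializationSchema<String> schema = (element, timestamp) ->
                new ProducerRecord<>("output-topic", null, timestamp, null,
                        element.getBytes(StandardCharsets.UTF_8));

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092");  // illustrative

        // AT_LEAST_ONCE writes without transactional/idempotent producer IDs.
        return new FlinkKafkaProducer<>("output-topic", schema, props,
                FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);
    }
}

If exactly-once is actually needed, the alternative is upgrading the brokers
and the topic's message format to at least 0.11.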


Trigerring Savepoint for the Flink Job

2018-05-31 Thread Anil
I am using Flink 1.4.2. I have forked Uber's AthenaX project
<https://github.com/uber/AthenaX> .

The Flink jobs are deployed in a YARN cluster. I need to trigger a savepoint
for all the jobs every day.

ClusterClient
<https://github.com/apache/flink/blob/release-1.4.2/flink-clients/src/main/java/org/apache/flink/client/program/ClusterClient.java#L672>
provides an implementation for triggering a savepoint using the Flink job ID.
YarnClusterClient
<https://github.com/apache/flink/blob/release-1.4.2/flink-yarn/src/main/java/org/apache/flink/yarn/YarnClusterClient.java>
is an implementation of ClusterClient.

My initial thought was to use a YarnClusterClient instance with the Flink job
ID (I save this when the Flink job is deployed to the YARN cluster) to
trigger the savepoint. So I created an instance of YarnClusterClient once and
saved it so that I could use it anytime in the application. But this doesn't
seem to work: it cannot cancel or trigger a savepoint even with a valid Flink
job ID. When I try to cancel a valid Flink job, it throws an error for an
invalid ID.

I would appreciate it if someone could help me out here.
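
One workaround under Flink 1.4, sketched here as an assumption rather than a
confirmed fix, is not to cache the client at all and instead invoke the CLI
(or construct a fresh client) per savepoint, so the current ApplicationMaster
address is re-resolved on every call; IDs and paths are placeholders:

./bin/flink savepoint <flinkJobId> hdfs:///flink/savepoints -yid <yarnApplicationId>

A YarnClusterClient created once keeps the AM endpoint it was built against,
so after any JobManager move its requests would go to a stale address, which
could explain the "invalid ID"-style failures described above.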





--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Constant backpressure on flink job

2019-04-25 Thread Dawid Wysakowicz
Hi Monika,

I would start with identifying the operator that causes the backpressure.
More information on how to monitor backpressure can be found in the
docs [1]. You might also be interested in Seth's (cc'ed) webinar [2],
where he also talks about how to debug backpressure.

Best,

Dawid

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.8/monitoring/back_pressure.html#monitoring-back-pressure

[2]
https://www.ververica.com/resources/webinar/webinar/debugging-flink-tutorial

On 22/04/2019 17:44, Monika Hristova wrote:
> Hello,
>  
> We are experiencing regular backpressure (at least once a week). I
> would like to ask how we can fix it.
>  
> Currently our configurations are:
> vi /usr/lib/flink/conf/flink-conf.yaml
> # Settings applied by Cloud Dataproc initialization action
> jobmanager.rpc.address: bonusengine-prod-m-0
> jobmanager.heap.mb: 4096
> jobmanager.rpc.port: 6123
> taskmanager.heap.mb: 4096
> taskmanager.memory.preallocate: false
> taskmanager.numberOfTaskSlots: 8
> #taskmanager.network.numberOfBuffers: 21952 # legacy deprecated
> taskmanager.network.memory.fraction: 0.9
> taskmanager.network.memory.min: 67108864
> taskmanager.network.memory.max: 1073741824
> taskmanager.memory.segment-size: 65536
> parallelism.default: 52
> web.port: 8081
> web.timeout: 12
> heartbeat.interval: 1
> heartbeat.timeout: 10
> yarn.application-attempts: 10
> high-availability: zookeeper
> high-availability.zookeeper.quorum:
> bonusengine-prod-m-0:2181,bonusengine-prod-m-1:2181,bonusengine-prod-m-2:2181
> high-availability.zookeeper.path.root: /flink
> #high-availability.zookeeper.storageDir: hdfs:///flink/recovery #
> legacy deprecated
> high-availability.storageDir: hdfs:///flink/recovery
> flink.partition-discovery.interval-millis=6
> fs.hdfs.hadoopconf: /etc/hadoop/conf
> state.backend: rocksdb
> state.checkpoints.dir: hdfs:///bonusengine/checkpoints/
> state.savepoints.dir: hdfs:///bonusengine/savepoints
> metrics.reporters: stsd
> metrics.reporter.stsd.class:
> org.apache.flink.metrics.statsd.StatsDReporter
> metrics.reporter.stsd.host: 127.0.0.1
> metrics.reporter.stsd.port: 8125
> zookeeper.sasl.disable: true
> yarn.reallocate-failed: true
> yarn.maximum-failed-containers: 32
> web.backpressure.refresh-interval: 6
> web.backpressure.num-samples: 100
> web.backpressure.delay-between-samples: 50
>  
> with a Hadoop HA cluster: masters -> 8 vCPUs, 7.2 GB; slaves -> 16
> vCPUs, 60 GB, with the YARN configuration (see attached file)
>  
> We have one yarn session started where 8 jobs run. Three of them
> consume the same source (Kafka), which is causing the backpressure, but
> only one of them experiences it. The state of the job is around 20 MB and
> checkpointing is configured as follows:
> checkpointing.interval=30
> checkpointing.pause_between_checkpointing=24 # makes sure value in ms of
> progress happens between checkpoints
> checkpointing.timeout=6 # checkpoints have to complete within value in
> ms or are discarded
> checkpointing.max_concurrent_checkpoints=1 # allows given number of
> checkpoints to be in progress at the same time
> checkpointing.externalized_checkpoints.enabled=true # enables/disables
> externalized checkpoints which are retained after job cancellation
>  
> The checkpointing pause was increased and the timeout reduced on
> multiple occasions, as the job kept failing, unable to recover from
> backpressure. RocksDB is configured as the state backend. The problem keeps
> reproducing even with a one-minute timeout. I would also like to point
> out that the ideal checkpointing interval for that job would be 2 min.
> I would like to ask what the problem might be, or at least how we can
> trace it?
>  
> Best Regards,
> Monika Hristova
>
> Get Outlook for Android 
>
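
For reference, the checkpoint settings quoted above map directly onto Flink's
CheckpointConfig. A sketch with illustrative values (the mapping between the
custom property names and the API calls is an assumption):

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuningSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // checkpointing.interval: trigger a checkpoint every 2 minutes,
        // the interval described as ideal for this job.
        env.enableCheckpointing(120_000);

        CheckpointConfig cc = env.getCheckpointConfig();
        cc.setMinPauseBetweenCheckpoints(60_000); // checkpointing.pause_between_checkpointing
        cc.setCheckpointTimeout(60_000);          // checkpointing.timeout
        cc.setMaxConcurrentCheckpoints(1);        // checkpointing.max_concurrent_checkpoints
        cc.enableExternalizedCheckpoints(         // checkpointing.externalized_checkpoints.enabled
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        env.fromElements(1, 2, 3).print();  // placeholder pipeline
        env.execute("checkpoint tuning sketch");
    }
}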




Approach to Auto Scaling Flink Job

2019-05-06 Thread Anil
I'm using Uber's open source project AthenaX. As mentioned in its docs [1], it
supports `auto scaling for AthenaX jobs`. I went through the source code on
GitHub but didn't find the auto scaling part. Can someone familiar with this
project please point me in the right direction?

I'm using Flink's Table API (Flink 1.4.2) and submit my jobs programmatically
to the YARN cluster. All the JM and TM metrics are stored in Prometheus. I am
thinking of using these metrics to develop an algorithm to rescale jobs (a
rough sketch of that loop follows below). I would also appreciate it if
someone could share how they developed their auto-scaling part.

[1]  https://athenax.readthedocs.io/en/latest/
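
Flink 1.4 has no in-place rescaling, so such an algorithm usually reduces to
a savepoint-cancel-resubmit loop driven by the metrics. A hedged sketch of
the mechanics only; thresholds, paths and IDs are placeholders:

# once the Prometheus metrics say the job needs more parallelism:
./bin/flink cancel -s hdfs:///flink/savepoints <jobId> -yid <yarnApplicationId>
./bin/flink run -s <savepointPath> -p <newParallelism> /path/to/job.jar

The savepoint path printed by the cancel command is passed to -s on
resubmission, and the new parallelism must stay within the max parallelism
the job was originally created with.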
  




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/


Re: Flink job server with HA

2019-06-03 Thread Xintong Song
Hi Boris,

I think the behavior you described, that putJobGraph is not invoked in a Flink
job cluster, is by design and should not cause job recovery to fail. For
a Flink job cluster, there is only one job graph to execute. Instead of
uploading the job graph to an already running cluster (as in a session
cluster), the job graph in a Flink job cluster is uploaded before the
cluster is started, together with the Flink framework jars. Please refer to
MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.

I think we need more information to find the root cause of your problem.
For example, can you explain the detailed steps you perform when you say
"trying to restart a Job Master"?

Thank you~

Xintong Song



On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <
boris.lublin...@lightbend.com> wrote:

> I am trying to experiment with Flink Job server with HA and I am noticing,
> that in this case
> method putJobGraph in the class SubmittedJobGraphStore Is never invoked.
> (I can see that it is invoked in the case of session cluster when a job is
> added)
> As a result, when I am trying to restart a Job Master, it finds no running
> jobs and is not trying to restore it.
> Am I missing something?
>
>
>
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>
>


Re: Flink job server with HA

2019-06-03 Thread Boris Lublinsky
Thanks,
That's what I thought initially.
The issue is that, because of this, during a restart it does not know which job
was running before (that is obtained from the submitted job graph store).
Because this store is empty, no jobs are restarted and the cluster does not even
try to restore the checkpoints.
I can see that checkpoints are stored correctly, but they are never accessed.

Boris Lublinsky
FDP Architect
boris.lublin...@lightbend.com
https://www.lightbend.com/

> On Jun 3, 2019, at 9:23 PM, Xintong Song  wrote:
> 
> Hi Boris,
> 
> I think what you described that putJobGraph is not invoked in Flink job 
> cluster is by design and should not cause a failure of job recovering. For a 
> Flink job cluster, there is only one job graph to execute. Instead of 
> uploading job graph to an already running cluster (like in a session 
> cluster), the job graph in a Flink job cluster is uploaded before the cluster 
> is started, together with the Flink framework jars. Please refer to 
> MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.
> 
> I think we need more information to find the root cause of your problem. For 
> example, can you explain what are the detailed operation steps do you perform 
> when you say "trying to restart a Job Master".
> 
> Thank you~
> Xintong Song
> 
> 
> On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky 
> mailto:boris.lublin...@lightbend.com>> wrote:
> I am trying to experiment with Flink Job server with HA and I am noticing, 
> that in this case
> method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I 
> can see that it is invoked in the case of session cluster when a job is added)
> As a result, when I am trying to restart a Job Master, it finds no running 
> jobs and is not trying to restore it.
> Am I missing something?
> 
>  
> 
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com>
> https://www.lightbend.com/ <https://www.lightbend.com/>



Re: Flink job server with HA

2019-06-03 Thread Xintong Song
So here are my questions:
1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?
2. How do you trigger "restart a Job Master"?

Thank you~

Xintong Song



On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky <
boris.lublin...@lightbend.com> wrote:

> Thanks,
> Thats what I thought initially.
> The issue is that because of this, during restart, it does not know which
> job was running before (it is obtained from submitted job graph store).
> Because this is empty, there is no restarted jobs and the cluster does not
> even try to restore checkpoints.
> I can see that checkpoints are stored correctly, but they are never
> accessed.
>
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>
> On Jun 3, 2019, at 9:23 PM, Xintong Song  wrote:
>
> Hi Boris,
>
> I think what you described that putJobGraph is not invoked in Flink job
> cluster is by design and should not cause a failure of job recovering. For
> a Flink job cluster, there is only one job graph to execute. Instead of
> uploading job graph to an already running cluster (like in a session
> cluster), the job graph in a Flink job cluster is uploaded before the
> cluster is started, together with the Flink framework jars. Please refer to
> MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.
>
> I think we need more information to find the root cause of your problem.
> For example, can you explain what are the detailed operation steps do you
> perform when you say "trying to restart a Job Master".
>
> Thank you~
> Xintong Song
>
>
>
> On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <
> boris.lublin...@lightbend.com> wrote:
>
>> I am trying to experiment with Flink Job server with HA and I am
>> noticing, that in this case
>> method putJobGraph in the class SubmittedJobGraphStore Is never invoked.
>> (I can see that it is invoked in the case of session cluster when a job is
>> added)
>> As a result, when I am trying to restart a Job Master, it finds no
>> running jobs and is not trying to restore it.
>> Am I missing something?
>>
>>
>>
>> Boris Lublinsky
>> FDP Architect
>> boris.lublin...@lightbend.com
>> https://www.lightbend.com/
>>
>>
>


Re: Flink job server with HA

2019-06-03 Thread Boris Lublinsky
I am running on K8s.
The Job Master runs as a deployment of 1, so just killing the pod restarts it.

Boris Lublinsky
FDP Architect
boris.lublin...@lightbend.com
https://www.lightbend.com/

> On Jun 3, 2019, at 9:46 PM, Xintong Song  wrote:
> 
> So here are my questions:
> 1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?
> 2. How do you trigger "restart a Job Master"?
> 
> Thank you~
> Xintong Song
> 
> 
> On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky 
> mailto:boris.lublin...@lightbend.com>> wrote:
> Thanks,
> Thats what I thought initially.
> The issue is that because of this, during restart, it does not know which job 
> was running before (it is obtained from submitted job graph store).
> Because this is empty, there is no restarted jobs and the cluster does not 
> even try to restore checkpoints.
> I can see that checkpoints are stored correctly, but they are never accessed.
> 
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com>
> https://www.lightbend.com/ <https://www.lightbend.com/>
>> On Jun 3, 2019, at 9:23 PM, Xintong Song > <mailto:tonysong...@gmail.com>> wrote:
>> 
>> Hi Boris,
>> 
>> I think what you described that putJobGraph is not invoked in Flink job 
>> cluster is by design and should not cause a failure of job recovering. For a 
>> Flink job cluster, there is only one job graph to execute. Instead of 
>> uploading job graph to an already running cluster (like in a session 
>> cluster), the job graph in a Flink job cluster is uploaded before the 
>> cluster is started, together with the Flink framework jars. Please refer to 
>> MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.
>> 
>> I think we need more information to find the root cause of your problem. For 
>> example, can you explain what are the detailed operation steps do you 
>> perform when you say "trying to restart a Job Master".
>> 
>> Thank you~
>> Xintong Song
>> 
>> 
>> On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky 
>> mailto:boris.lublin...@lightbend.com>> wrote:
>> I am trying to experiment with Flink Job server with HA and I am noticing, 
>> that in this case
>> method putJobGraph in the class SubmittedJobGraphStore Is never invoked. (I 
>> can see that it is invoked in the case of session cluster when a job is 
>> added)
>> As a result, when I am trying to restart a Job Master, it finds no running 
>> jobs and is not trying to restore it.
>> Am I missing something?
>> 
>>  
>> 
>> Boris Lublinsky
>> FDP Architect
>> boris.lublin...@lightbend.com <mailto:boris.lublin...@lightbend.com>
>> https://www.lightbend.com/ <https://www.lightbend.com/>
> 



Re: Flink job server with HA

2019-06-03 Thread Xintong Song
If that is the case, then I would suggest you check the following two
things:
1. Is the HA mode configured properly in the Flink configuration? There should
be a config option "high-availability" in your flink-conf.yaml. If not
configured, the default value would be "NONE" (a sketch of the relevant
options follows below).
2. Is "ClassPathJobGraphRetriever#retrieveJobGraph" actually invoked, and
are there any exceptions thrown from it? This is to verify whether the
correct code path for the job cluster is invoked.
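
For point 1, the HA section of flink-conf.yaml would look roughly like this
(values illustrative; the cluster-id in particular must stay stable across
restarts so that the recovered state can be found):

high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/recovery
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.cluster-id: /my-job-cluster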

Thank you~

Xintong Song



On Tue, Jun 4, 2019 at 10:48 AM Boris Lublinsky <
boris.lublin...@lightbend.com> wrote:

> I am running on k8
> Job master runs as a deployment of 1, so just killing a pod restarts it
>
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>
> On Jun 3, 2019, at 9:46 PM, Xintong Song  wrote:
>
> So here are my questions:
> 1. What environment do you run Flink in? Is it locally, on Yarn or Mesos?
> 2. How do you trigger "restart a Job Master"?
>
> Thank you~
> Xintong Song
>
>
>
> On Tue, Jun 4, 2019 at 10:35 AM Boris Lublinsky <
> boris.lublin...@lightbend.com> wrote:
>
>> Thanks,
>> Thats what I thought initially.
>> The issue is that because of this, during restart, it does not know which
>> job was running before (it is obtained from submitted job graph store).
>> Because this is empty, there is no restarted jobs and the cluster does
>> not even try to restore checkpoints.
>> I can see that checkpoints are stored correctly, but they are never
>> accessed.
>>
>> Boris Lublinsky
>> FDP Architect
>> boris.lublin...@lightbend.com
>> https://www.lightbend.com/
>>
>> On Jun 3, 2019, at 9:23 PM, Xintong Song  wrote:
>>
>> Hi Boris,
>>
>> I think what you described that putJobGraph is not invoked in Flink job
>> cluster is by design and should not cause a failure of job recovering. For
>> a Flink job cluster, there is only one job graph to execute. Instead of
>> uploading job graph to an already running cluster (like in a session
>> cluster), the job graph in a Flink job cluster is uploaded before the
>> cluster is started, together with the Flink framework jars. Please refer to
>> MiniDispatcher and SingleJobSubmittedJobGraphStore for the details.
>>
>> I think we need more information to find the root cause of your problem.
>> For example, can you explain what are the detailed operation steps do you
>> perform when you say "trying to restart a Job Master".
>>
>> Thank you~
>> Xintong Song
>>
>>
>>
>> On Mon, Jun 3, 2019 at 10:05 PM Boris Lublinsky <
>> boris.lublin...@lightbend.com> wrote:
>>
>>> I am trying to experiment with Flink Job server with HA and I am
>>> noticing, that in this case
>>> method putJobGraph in the class SubmittedJobGraphStore Is never
>>> invoked. (I can see that it is invoked in the case of session cluster when
>>> a job is added)
>>> As a result, when I am trying to restart a Job Master, it finds no
>>> running jobs and is not trying to restore it.
>>> Am I missing something?
>>>
>>>
>>>
>>> Boris Lublinsky
>>> FDP Architect
>>> boris.lublin...@lightbend.com
>>> https://www.lightbend.com/
>>>
>>>
>>
>

