RE:Flink job on secure Yarn fails after many hours

2016-03-19 Thread Thomas Lamirault
Hi Max,

I will try these workarounds.
Thanks

Thomas


From: Maximilian Michels [m...@apache.org]
Sent: Tuesday, March 15, 2016 16:51
To: user@flink.apache.org
Cc: Niels Basjes
Subject: Re: Flink job on secure Yarn fails after many hours

Hi Thomas,

Niels (CC) and I found out that you need at least Hadoop version 2.6.1
to properly run Kerberos applications on Hadoop clusters. Earlier
versions have critical bugs in the internal security token handling
that may expire a token even though it is still valid.

That said, there is another Hadoop limitation: the maximum internal
token lifetime is one week. To work around this limit, you have two
options:

a) increasing the maximum token lifetime

In yarn-site.xml:

  <property>
    <name>yarn.resourcemanager.delegation.token.max-lifetime</name>
    <value>9223372036854775807</value>
  </property>

In hdfs-site.xml:

  <property>
    <name>dfs.namenode.delegation.token.max-lifetime</name>
    <value>9223372036854775807</value>
  </property>

b) set up the Yarn ResourceManager as a proxy for the HDFS NameNode:

From http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html:

"You can work around this by configuring the ResourceManager as a
proxy user for the corresponding HDFS NameNode so that the
ResourceManager can request new tokens when the existing ones are past
their maximum lifetime."
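For option (b), the proxy-user configuration described in the quote typically goes into the NameNode's core-site.xml. The sketch below is an assumption-laden example, not a value from this thread: it assumes the ResourceManager runs as the `yarn` user, and the wildcard hosts/groups should be tightened for production:

```xml
<!-- core-site.xml on the NameNode: allow the "yarn" user (assumed here to
     be the user running the ResourceManager) to impersonate other users. -->
<property>
  <name>hadoop.proxyuser.yarn.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.yarn.groups</name>
  <value>*</value>
</property>
```

If I recall the Hadoop 2.6 docs correctly, `yarn.resourcemanager.proxy-user-privileges.enabled` also needs to be set to `true` in yarn-site.xml for the ResourceManager to use these proxy privileges when renewing tokens for long-running applications.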

@Niels: Could you comment on what worked best for you?

Best,
Max


On Mon, Mar 14, 2016 at 12:24 PM, Thomas Lamirault
 wrote:
>
> Hello everyone,
>
>
>
> We are facing the same problem now in our Flink applications, launched using
> YARN.
>
> Just want to know if there is any update about this exception?
>
>
>
> Thanks
>
>
>
> Thomas
>
>
>
> 
>
> From: ni...@basj.es [ni...@basj.es] on behalf of Niels Basjes
> [ni...@basjes.nl]
> Sent: Friday, December 4, 2015 10:40
> To: user@flink.apache.org
> Subject: Re: Flink job on secure Yarn fails after many hours
>
> Hi Maximilian,
>
> I just downloaded the version from your google drive and used that to run my 
> test topology that accesses HBase.
> I deliberately started it twice to double the chance to run into this 
> situation.
>
> I'll keep you posted.
>
> Niels
>
>
> On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels  wrote:
>>
>> Hi Niels,
>>
>> Just got back from our CI. The build above would fail with a
>> Checkstyle error. I corrected that. Also I have built the binaries for
>> your Hadoop version 2.6.0.
>>
>> Binaries:
>>
>> https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip
>>
>> Thanks,
>> Max
>>
>> On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels wrote:
>> > [...]
>> > 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>> >   - Actor akka://flink/user/jobmanager#403236912 terminated, stopping
>> >   process...
>> > 21:30:28,286 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> >   - Removing web root dir /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>> >
>> > --
>> > Best regards / Met vriendelijke groeten,
>> >
>> > Niels Basjes
>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes

RE:Flink job on secure Yarn fails after many hours

2016-03-14 Thread Thomas Lamirault
Hello everyone,



We are facing the same problem now in our Flink applications, launched using YARN.

Just want to know if there is any update about this exception?



Thanks



Thomas





From: ni...@basj.es [ni...@basj.es] on behalf of Niels Basjes [ni...@basjes.nl]
Sent: Friday, December 4, 2015 10:40
To: user@flink.apache.org
Subject: Re: Flink job on secure Yarn fails after many hours

Hi Maximilian,

I just downloaded the version from your google drive and used that to run my 
test topology that accesses HBase.
I deliberately started it twice to double the chance to run into this situation.

I'll keep you posted.

Niels


On Thu, Dec 3, 2015 at 11:44 AM, Maximilian Michels [m...@apache.org] wrote:
Hi Niels,

Just got back from our CI. The build above would fail with a
Checkstyle error. I corrected that. Also I have built the binaries for
your Hadoop version 2.6.0.

Binaries:

https://github.com/mxm/flink/archive/kerberos-yarn-heartbeat-fail-0.10.1.zip

Thanks,
Max

On Wed, Dec 2, 2015 at 6:52 PM, Maximilian Michels wrote:
> [...]
> 21:30:28,185 ERROR org.apache.flink.runtime.jobmanager.JobManager
>   - Actor akka://flink/user/jobmanager#403236912 terminated, stopping
>   process...
> 21:30:28,286 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>   - Removing web root dir /tmp/flink-web-e1a44f94-ea6d-40ee-b87c-e3122d5cb9bd
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes



--
Best regards / Met vriendelijke groeten,

Niels Basjes


RE:Flink HA

2016-02-19 Thread Thomas Lamirault
I have resolved my issues.
It seems that Avro had difficulties with my POJO. I changed the handling of
the null values and it works fine.

But is there a way to cancel the old JobGraphs that are stuck in restarting
status, and to keep only the last one to restart? Other than cancelling the JobId manually?

Thanks

Thomas

From: Thomas Lamirault [thomas.lamira...@ericsson.com]
Sent: Friday, February 19, 2016 10:56
To: user@flink.apache.org
Subject: RE: Flink HA

After setting this configuration, I get some exceptions:

java.lang.Exception: Could not restore checkpointed state to operators and functions
        at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:414)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.InvalidClassException: java.util.HashMap; invalid descriptor for field
        at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:710)
        at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:830)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1601)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:294)
        at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.<init>(WindowOperator.java:446)
        at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.restoreState(WindowOperator.java:621)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:406)
        ... 3 more
Caused by: java.lang.IllegalArgumentException: illegal signature
        at java.io.ObjectStreamField.<init>(ObjectStreamField.java:122)
        at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:708)
        ... 13 more


If I run the application in non-HA mode, there is no problem.
What can cause this kind of error?

Thanks

Thomas
From: Thomas Lamirault [thomas.lamira...@ericsson.com]
Sent: Friday, February 19, 2016 09:39
To: user@flink.apache.org
Subject: RE: Flink HA

Thanks for the quick reply!

> state.backend.fs.checkpointdir
Is actually pointing to an HDFS directory but I will modify the recovery.zookeeper.path.root

> This is only relevant if you are using YARN. From your complete
Yes, I omitted to say we will use YARN.

> Does this help?
Yes, a lot :-)

Thomas

From: Ufuk Celebi [u...@apache.org]
Sent: Thursday, February 18, 2016 19:19
To: user@flink.apache.org
Subject: Re: Flink HA

On Thu, Feb 18, 2016 at 6:59 PM, Thomas Lamirault wrote:
> We are trying flink in HA mode.

Great to hear!

> We set in the flink yaml:
>
> state.backend: filesystem
>
> recovery.mode: zookeeper
> recovery.zookeeper.quorum:
>
> recovery.zookeeper.path.root: 
>
> recovery.zookeeper.storageDir: 
>
> recovery.backend.fs.checkpointdir: 

It should be state.backend.fs.checkpointdir.

Just to check: Both state.backend.fs.checkpointdir and
recovery.zookeeper.path.root should point to a file system path.

> yarn.application-attempts: 100

This is only relevant if you are using YARN. From your complete

> We want, in case of an application crash, the pending window to be restored
> when the application restarts.
>
> Are pending data stored into the /blob directory ?
>
> Also, we try to write a script that restarts the application after exceeding the
> max attempts, with the last pending window.
>
> How can I do that? Is a simple restart of the application enough, or do I
> have to "clean" the recovery.zookeeper.path.root?

Restore happens automatically to the most recently checkpointed state.
Everything under  contains the actual state (including
JARs and JobGraph). ZooKeeper contains pointers to this state.
Therefore, you must not delete the ZooKeeper root path.

For the automatic restart, I would recommend using YARN. If you want
to do it manually, you need to restart the JobManager/TaskManager
instances. The application will be recovered automatically from
ZooKeeper/state backend.

Does this help?

– Ufuk

RE:Flink HA

2016-02-19 Thread Thomas Lamirault
After setting this configuration, I get some exceptions:

java.lang.Exception: Could not restore checkpointed state to operators and functions
        at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:414)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:208)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.InvalidClassException: java.util.HashMap; invalid descriptor for field
        at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:710)
        at java.io.ObjectInputStream.readClassDescriptor(ObjectInputStream.java:830)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1601)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
        at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:294)
        at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator$Context.<init>(WindowOperator.java:446)
        at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.restoreState(WindowOperator.java:621)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreStateLazy(StreamTask.java:406)
        ... 3 more
Caused by: java.lang.IllegalArgumentException: illegal signature
        at java.io.ObjectStreamField.<init>(ObjectStreamField.java:122)
        at java.io.ObjectStreamClass.readNonProxy(ObjectStreamClass.java:708)
        ... 13 more


If I run the application in non-HA mode, there is no problem.
What can cause this kind of error?

Thanks

Thomas
From: Thomas Lamirault [thomas.lamira...@ericsson.com]
Sent: Friday, February 19, 2016 09:39
To: user@flink.apache.org
Subject: RE: Flink HA

Thanks for the quick reply!

> state.backend.fs.checkpointdir
Is actually pointing to an HDFS directory but I will modify the recovery.zookeeper.path.root

> This is only relevant if you are using YARN. From your complete
Yes, I omitted to say we will use YARN.

> Does this help?
Yes, a lot :-)

Thomas

From: Ufuk Celebi [u...@apache.org]
Sent: Thursday, February 18, 2016 19:19
To: user@flink.apache.org
Subject: Re: Flink HA

On Thu, Feb 18, 2016 at 6:59 PM, Thomas Lamirault wrote:
> We are trying flink in HA mode.

Great to hear!

> We set in the flink yaml:
>
> state.backend: filesystem
>
> recovery.mode: zookeeper
> recovery.zookeeper.quorum:
>
> recovery.zookeeper.path.root: 
>
> recovery.zookeeper.storageDir: 
>
> recovery.backend.fs.checkpointdir: 

It should be state.backend.fs.checkpointdir.

Just to check: Both state.backend.fs.checkpointdir and
recovery.zookeeper.path.root should point to a file system path.

> yarn.application-attempts: 100

This is only relevant if you are using YARN. From your complete

> We want, in case of an application crash, the pending window to be restored
> when the application restarts.
>
> Are pending data stored into the /blob directory ?
>
> Also, we try to write a script that restarts the application after exceeding the
> max attempts, with the last pending window.
>
> How can I do that? Is a simple restart of the application enough, or do I
> have to "clean" the recovery.zookeeper.path.root?

Restore happens automatically to the most recently checkpointed state.
Everything under  contains the actual state (including
JARs and JobGraph). ZooKeeper contains pointers to this state.
Therefore, you must not delete the ZooKeeper root path.

For the automatic restart, I would recommend using YARN. If you want
to do it manually, you need to restart the JobManager/TaskManager
instances. The application will be recovered automatically from
ZooKeeper/state backend.

Does this help?

– Ufuk

RE:Flink HA

2016-02-19 Thread Thomas Lamirault
Thanks for the quick reply!

> state.backend.fs.checkpointdir
Is actually pointing to an HDFS directory but I will modify the
recovery.zookeeper.path.root

> This is only relevant if you are using YARN. From your complete
Yes, I omitted to say we will use YARN.

> Does this help?
Yes, a lot :-)

Thomas


From: Ufuk Celebi [u...@apache.org]
Sent: Thursday, February 18, 2016 19:19
To: user@flink.apache.org
Subject: Re: Flink HA

On Thu, Feb 18, 2016 at 6:59 PM, Thomas Lamirault
 wrote:
> We are trying flink in HA mode.

Great to hear!

> We set in the flink yaml :
>
> state.backend: filesystem
>
> recovery.mode: zookeeper
> recovery.zookeeper.quorum:
>
> recovery.zookeeper.path.root: 
>
> recovery.zookeeper.storageDir: 
>
> recovery.backend.fs.checkpointdir: 

It should be state.backend.fs.checkpointdir.

Just to check: Both state.backend.fs.checkpointdir and
recovery.zookeeper.path.root should point to a file system path.

> yarn.application-attempts: 100

This is only relevant if you are using YARN. From your complete


> We want, in case of an application crash, the pending window to be restored
> when the application restarts.
>
> Are pending data stored into the /blob directory ?
>
> Also, we try to write a script that restarts the application after exceeding the
> max attempts, with the last pending window.
>
> How can I do that? Is a simple restart of the application enough, or do I
> have to "clean" the recovery.zookeeper.path.root?

Restore happens automatically to the most recently checkpointed state.

Everything under  contains the actual state (including
JARs and JobGraph). ZooKeeper contains pointers to this state.
Therefore, you must not delete the ZooKeeper root path.

For the automatic restart, I would recommend using YARN. If you want
to do it manually, you need to restart the JobManager/TaskManager
instances. The application will be recovered automatically from
ZooKeeper/state backend.


Does this help?

– Ufuk
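Put together, a corrected HA section of flink-conf.yaml along the lines Ufuk describes might look roughly like this; note that all paths and hosts below are illustrative placeholders I have made up, not values from this thread (the original message left them blank):

```yaml
# HA settings sketch for flink-conf.yaml (Flink 0.10.x key names).
state.backend: filesystem
# Note the key: state.backend.fs.checkpointdir,
# not recovery.backend.fs.checkpointdir.
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints    # placeholder path
recovery.mode: zookeeper
recovery.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181        # placeholder hosts
recovery.zookeeper.path.root: /flink                         # placeholder
recovery.zookeeper.storageDir: hdfs:///flink/recovery        # placeholder path
# Only relevant when running on YARN:
yarn.application-attempts: 100
```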

Flink HA

2016-02-18 Thread Thomas Lamirault
Hi !



We are trying flink in HA mode.

Our application is a streaming application with a windowing mechanism.

We set in the flink yaml :



state.backend: filesystem

recovery.mode: zookeeper
recovery.zookeeper.quorum:

recovery.zookeeper.path.root: 

recovery.zookeeper.storageDir: 

recovery.backend.fs.checkpointdir: 

yarn.application-attempts: 100



We want, in case of an application crash, the pending window to be restored when
the application restarts.

Are pending data stored into the /blob directory ?

Also, we try to write a script that restarts the application after exceeding the max
attempts, with the last pending window.

How can I do that? Is a simple restart of the application enough, or do I have
to "clean" the recovery.zookeeper.path.root?



Thanks !



Thomas Lamirault


Flink application with HBase

2015-12-22 Thread Thomas Lamirault
Hello everybody,

I am using Flink (0.10.1) with a streaming source (Kafka), and I write the results
of a flatMap/keyBy/timeWindow/reduce to an HBase table.
I have tried with a class (Sinkclass) that implements SinkFunction, and
a class (HBaseOutputFormat) that implements OutputFormat. In your view, is
it better to use the Sinkclass or the HBaseOutputFormat, for better performance
and cleaner code? (Or are they equivalent?)

Thanks,

B.R / Cordialement

Thomas Lamirault
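To make the trade-off above concrete, here is a rough, hypothetical sketch of the SinkFunction variant, assuming Flink's streaming sink API and the HBase 1.x client; the class, table, and column names are made up for illustration and are not from this thread. For streaming jobs the SinkFunction route is generally the more idiomatic of the two, since OutputFormat comes from the batch API; either way, the expensive connection setup should happen once in open()/configure(), not once per record.

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sink writing (key, count) pairs from the reduce step to HBase.
public class HBaseSink extends RichSinkFunction<Tuple2<String, Long>> {

    private transient Connection connection;
    private transient Table table;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Open the HBase connection once per parallel sink instance.
        connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        table = connection.getTable(TableName.valueOf("results")); // table name is an assumption
    }

    @Override
    public void invoke(Tuple2<String, Long> value) throws Exception {
        // Row key = the keyBy field; column family/qualifier are assumptions.
        Put put = new Put(Bytes.toBytes(value.f0));
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("count"),
                      Bytes.toBytes(value.f1));
        table.put(put);
    }

    @Override
    public void close() throws Exception {
        if (table != null) table.close();
        if (connection != null) connection.close();
    }
}
```

A usage sketch would be `stream.addSink(new HBaseSink());` after the reduce. Note this sketch requires a running HBase cluster and the Flink/HBase client jars on the classpath, so it is not runnable standalone.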