RE: Stream enrichment with ingest mode

2024-02-14 Thread LINZ, Arnaud
Hello,

You’re right, one of our main use cases consists of adding missing fields, 
stored in a “small”, periodically refreshed reference table, to a stream. We 
chose not to use a broadcast stream and a Flink join, because we didn’t want to 
add tricky watermarks and hold back one stream until everything is broadcast 
(it may build a huge state if a window is used, and you don’t always have 
control over the source function to make it wait before emitting).

So we developed tools that load a static in-memory hashmap cache from the 
reference table in the open() method of our enrichment operator, without using 
Flink streams, and launch a thread to periodically refresh the hashmap. We also 
use the same hashing mechanism as Flink to load, on each task manager, only the 
part of the table that is relevant to the keyed stream.
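For readers who want the shape of such an operator, here is a minimal, hypothetical 
sketch of the approach described above (not the actual tooling; Event, EnrichedEvent, 
ReferenceRow and loadReferenceTable() are placeholders for your own types and lookup logic):

import java.util.Collections;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class ReferenceTableEnricher extends RichFlatMapFunction<Event, EnrichedEvent> {

    // Snapshot of the reference table, swapped atomically on refresh.
    private transient volatile Map<String, ReferenceRow> cache;
    private transient ScheduledExecutorService refresher;

    @Override
    public void open(Configuration parameters) throws Exception {
        cache = loadReferenceTable();
        refresher = Executors.newSingleThreadScheduledExecutor();
        // Periodically reload the table outside of the Flink streaming machinery.
        refresher.scheduleAtFixedRate(() -> cache = loadReferenceTable(), 10, 10, TimeUnit.MINUTES);
    }

    @Override
    public void flatMap(Event event, Collector<EnrichedEvent> out) {
        ReferenceRow ref = cache.get(event.getKey());
        if (ref != null) {
            out.collect(new EnrichedEvent(event, ref));
        }
    }

    @Override
    public void close() {
        if (refresher != null) {
            refresher.shutdownNow();
        }
    }

    // Placeholder: query the reference table (JDBC, Kudu, HBase, …), optionally keeping
    // only the rows whose key is assigned to this subtask (e.g. using Flink's
    // KeyGroupRangeAssignment utilities) so each task manager loads only its share.
    private Map<String, ReferenceRow> loadReferenceTable() {
        return Collections.emptyMap();
    }
}

The refresh runs on a plain executor thread, outside of checkpoints and watermarks, 
which is what keeps the approach simple (and also what keeps the cache out of Flink's 
state management).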

IMHO this stuff should be part of the framework; it’s easier to do with Spark 
Streaming… :-)

Best regards,
Arnaud

From: Lars Skjærven
Sent: Wednesday, February 14, 2024 08:12
To: user
Subject: Stream enrichment with ingest mode

Dear all,

A recurring challenge we have with stream enrichment in Flink is having a robust 
mechanism to determine that all messages from the source(s) have been 
consumed/processed before output is collected.

A simple example is two sources of catalogue metadata:
- source A delivers products,
- source B delivers product categories,

For a process function to enrich the categories with the number of products in 
each category, we would use a KeyedCoProcessFunction (or a RichCoFlatMap), keyed 
by category ID, and put both the category and the products in state, then count 
all products for each key and collect the result.

Typically, however, we don't want to start counting before all products are 
included in state (to avoid emitting incomplete aggregations downstream). 
Therefore we use the event lag time (i.e. processing time - current watermark) 
to indicate the "ingest mode" of the processor (e.g. lag time > 30 seconds). When 
in "ingest mode" we register a timer and return without collecting; the timer 
then fires once the watermark has advanced sufficiently.
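As a rough illustration of that pattern (a sketch only; Category, Product and 
CategoryCount are hypothetical types, and the 30 s threshold mirrors the example above):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class CategoryProductCounter
        extends KeyedCoProcessFunction<String, Category, Product, CategoryCount> {

    private static final long MAX_LAG_MS = 30_000L; // "ingest mode" threshold

    private transient ValueState<Category> category;
    private transient ValueState<Long> productCount;

    @Override
    public void open(Configuration parameters) {
        category = getRuntimeContext().getState(new ValueStateDescriptor<>("category", Category.class));
        productCount = getRuntimeContext().getState(new ValueStateDescriptor<>("productCount", Long.class));
    }

    @Override
    public void processElement1(Category c, Context ctx, Collector<CategoryCount> out) throws Exception {
        category.update(c);
        emitOrDefer(ctx, out);
    }

    @Override
    public void processElement2(Product p, Context ctx, Collector<CategoryCount> out) throws Exception {
        Long count = productCount.value();
        productCount.update(count == null ? 1L : count + 1);
        emitOrDefer(ctx, out);
    }

    private void emitOrDefer(Context ctx, Collector<CategoryCount> out) throws Exception {
        long lag = ctx.timerService().currentProcessingTime() - ctx.timerService().currentWatermark();
        if (lag > MAX_LAG_MS && ctx.timestamp() != null) {
            // Still ingesting: defer until the watermark has caught up with this element.
            ctx.timerService().registerEventTimeTimer(ctx.timestamp());
        } else {
            emit(out);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<CategoryCount> out) throws Exception {
        emit(out);
    }

    private void emit(Collector<CategoryCount> out) throws Exception {
        Category c = category.value();
        Long count = productCount.value();
        if (c != null) {
            out.collect(new CategoryCount(c.getId(), count == null ? 0L : count));
        }
    }
}

The timer registered at the element's timestamp only fires once the watermark passes it, 
which is what ends the "ingest mode" for that element.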

This "ingest mode" strategy (and its timers) becomes more complicated when you 
have multiple process functions (with the same need for an ingest mode) 
downstream of the first one. The reason seems to be that watermarks are 
forwarded by the first process function even though no elements are collected. 
Therefore, when elements finally arrive at the second process function, the 
current watermark has already advanced, so the same watermark-based strategy is 
less robust.

I'm curious how others in the community handle this "challenge" of initial 
ingest. Any ideas are greatly appreciated.

Note: we use a custom watermark generator that emits watermarks derived from 
event time, and advances the watermarks when the source is idle for a longer 
period (e.g. 30 seconds).
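A minimal sketch of such a generator (not the actual one; the 5 s out-of-orderness 
bound is an assumption, and the 30 s idle period mirrors the note above):

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

public class IdleAwareWatermarkGenerator<T> implements WatermarkGenerator<T> {

    private static final long OUT_OF_ORDERNESS_MS = 5_000L;  // assumed bound
    private static final long IDLE_TIMEOUT_MS = 30_000L;

    private long maxTimestamp = Long.MIN_VALUE + OUT_OF_ORDERNESS_MS + 1;
    private long lastEventWallClock = System.currentTimeMillis();

    @Override
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        lastEventWallClock = System.currentTimeMillis();
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        long now = System.currentTimeMillis();
        if (now - lastEventWallClock > IDLE_TIMEOUT_MS) {
            // Source idle: advance the watermark based on wall-clock time so that
            // downstream event-time timers can still fire.
            output.emitWatermark(new Watermark(now - OUT_OF_ORDERNESS_MS));
        } else {
            output.emitWatermark(new Watermark(maxTimestamp - OUT_OF_ORDERNESS_MS - 1));
        }
    }
}

It would be wired in with something like 
WatermarkStrategy.forGenerator(ctx -> new IdleAwareWatermarkGenerator<>()).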

Thanks !

L








RE: Deploying the K8S operator sample on GKE Autopilot : Association with remote system [akka.tcp://flink@basic-example.default:6123] has failed,

2024-01-12 Thread LINZ, Arnaud
Hi,
Some more test results from the task/job manager pods:

From the task manager I cannot connect to the job manager:
root@basic-example-taskmanager-5d54f9f94-rbcr4:/opt/flink# wget 
basic-example.default:6123
--2024-01-12 15:16:15--  http://basic-example.default:6123/
Resolving basic-example.default (basic-example.default)... 100.64.3.182
Connecting to basic-example.default 
(basic-example.default)|100.64.3.182|:6123... ^C

From the job manager I can (DNS is OK, the same IP is resolved):
root@basic-example-57774f887d-6bht8:/opt/flink# wget basic-example.default:6123
--2024-01-12 15:16:25--  http://basic-example.default:6123/
Resolving basic-example.default (basic-example.default)... 100.64.3.182
Connecting to basic-example.default 
(basic-example.default)|100.64.3.182|:6123... connected.
HTTP request sent, awaiting response... No data received.
Retrying.

However, the services are created:
basic-example        ClusterIP   None             6123/TCP,6124/TCP   2s
basic-example-rest   ClusterIP   100.87.240.180   8081/TCP            2s

Maybe the job manager only listens on localhost instead of 0.0.0.0 or its real 
IP? Is that something I can control?
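If that turns out to be the case, one thing that might be worth trying (an untested 
guess, and it assumes a Flink version of 1.15 or later, where the bind-host options 
exist) is to set the bind addresses explicitly in the flinkConfiguration section of 
the FlinkDeployment:

jobmanager.bind-host: 0.0.0.0
taskmanager.bind-host: 0.0.0.0

Whether this is actually needed on GKE Autopilot is an open question.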
Thanks,
Arnaud

From: LINZ, Arnaud
Sent: Friday, January 12, 2024 2:07 PM
To: user@flink.apache.org
Subject: FW: Deploying the K8S operator sample on GKE Autopilot : Association 
with remote system [akka.tcp://flink@basic-example.default:6123] has failed,

Hello,

I am trying to follow the “quickstart” guide on a GKE Autopilot k8s cluster:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes-operator/quick-start/
I could install the operator (without the webhook) without issue; however, when running

kubectl create -f https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.7/examples/basic.yaml

the job does not work because the task manager cannot reach the job manager 
(maybe a DNS issue?). Is there some special DNS/network configuration to 
perform in GKE? Has anybody already made it work?
Thanks,
Arnaud

The job manager log is:
2024-01-12 11:01:56,878 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
Custom Source (1/2) 
(c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_0_2) 
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
Custom Source (2/2) 
(c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_1_2) 
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Flat Map -> 
Sink: Print to Std. Out (1/2) 
(c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_0_2) 
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Flat Map -> 
Sink: Print to Std. Out (2/2) 
(c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_1_2) 
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,879 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job 096668d0039ed54215ae334b5d89aa82: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
2024-01-12 11:01:56,880 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job 096668d0039ed54215ae334b5d89aa82: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
2024-01-12 11:01:56,902 INFO  
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to 
trigger checkpoint for job 096668d0039ed54215ae334b5d89aa82 since Checkpoint 
triggering task Source: Custom Source (1/2) of job 
096668d0039ed54215ae334b5d89aa82 is not being executed at the moment. Aborting 
checkpoint. Failure reason: Not all required tasks are currently running..
2024-01-12 11:01:57,014 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need 
request 1 new workers, current worker number 0, declared worker number 1
2024-01-12 11:01:57,015 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Requesting new worker with resource spec WorkerResourceSpec {cpuCores=1.0, 
taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, 
networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939 
bytes), numSlots=2}, current pending count: 1.
2024-01-12 11:01:57,016 INFO  
org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled 
external resources: []
2024-01-12 11:01:57,018 INFO  org.apache.flink.configuration.Configuration  
   [] - Config uses fallback configuration key 
'kubernetes.service-account' instead of key 
'kubernetes.taskmanager.service-account'
20

FW: Deploying the K8S operator sample on GKE Autopilot : Association with remote system [akka.tcp://flink@basic-example.default:6123] has failed,

2024-01-12 Thread LINZ, Arnaud
Hello,

I am trying to follow the “quickstart” guide on a GKE Autopilot k8s cluster:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes-operator/quick-start/
I could install the operator (without the webhook) without issue; however, when running

kubectl create -f https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.7/examples/basic.yaml

the job does not work because the task manager cannot reach the job manager 
(maybe a DNS issue?). Is there some special DNS/network configuration to 
perform in GKE? Has anybody already made it work?
Thanks,
Arnaud

The job manager log is:
2024-01-12 11:01:56,878 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
Custom Source (1/2) 
(c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_0_2) 
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source: 
Custom Source (2/2) 
(c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_1_2) 
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Flat Map -> 
Sink: Print to Std. Out (1/2) 
(c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_0_2) 
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,878 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Flat Map -> 
Sink: Print to Std. Out (2/2) 
(c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_1_2) 
switched from CREATED to SCHEDULED.
2024-01-12 11:01:56,879 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job 096668d0039ed54215ae334b5d89aa82: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=1}]
2024-01-12 11:01:56,880 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] 
- Received resource requirements from job 096668d0039ed54215ae334b5d89aa82: 
[ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN}, 
numberOfRequiredSlots=2}]
2024-01-12 11:01:56,902 INFO  
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to 
trigger checkpoint for job 096668d0039ed54215ae334b5d89aa82 since Checkpoint 
triggering task Source: Custom Source (1/2) of job 
096668d0039ed54215ae334b5d89aa82 is not being executed at the moment. Aborting 
checkpoint. Failure reason: Not all required tasks are currently running..
2024-01-12 11:01:57,014 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - need 
request 1 new workers, current worker number 0, declared worker number 1
2024-01-12 11:01:57,015 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Requesting new worker with resource spec WorkerResourceSpec {cpuCores=1.0, 
taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes, 
networkMemSize=158.720mb (166429984 bytes), managedMemSize=634.880mb (665719939 
bytes), numSlots=2}, current pending count: 1.
2024-01-12 11:01:57,016 INFO  
org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled 
external resources: []
2024-01-12 11:01:57,018 INFO  org.apache.flink.configuration.Configuration  
   [] - Config uses fallback configuration key 
'kubernetes.service-account' instead of key 
'kubernetes.taskmanager.service-account'
2024-01-12 11:01:57,022 INFO  
org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Creating new 
TaskManager pod with name basic-example-taskmanager-1-3 and resource <2048,1.0>.
2024-01-12 11:01:57,095 INFO  
org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod 
basic-example-taskmanager-1-3 is created.
2024-01-12 11:01:57,116 INFO  
org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new 
TaskManager pod: basic-example-taskmanager-1-3
2024-01-12 11:01:57,117 INFO  
org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - 
Requested worker basic-example-taskmanager-1-3 with resource spec 
WorkerResourceSpec {cpuCores=1.0, taskHeapSize=537.600mb (563714445 bytes), 
taskOffHeapSize=0 bytes, networkMemSize=158.720mb (166429984 bytes), 
managedMemSize=634.880mb (665719939 bytes), numSlots=2}.
2024-01-12 11:01:58,902 INFO  
org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to 
trigger checkpoint for job 096668d0039ed54215ae334b5d89aa82 since Checkpoint 
triggering task Source: Custom Source (1/2) of job 
096668d0039ed54215ae334b5d89aa82 is not being executed at the moment. Aborting 
checkpoint. Failure reason: Not all required tasks are currently running..
(…)

The task manager log is:
(…)
2024-01-12 11:02:02,229 INFO  org.apache.flink.core.plugin.DefaultPluginManager 
   [] - Plugin loader with ID found, reusing it: metrics-jmx
2024-01-12 11:02:02,232 INFO  
org.apache.flink.runtime.st

RE: "Authentication failed" in "ConnectionState" when enabling internal SSL on Yarn with self signed certificate

2022-11-22 Thread LINZ, Arnaud
Last update:
My Flink version is in fact 1.14.3. The application works when enabling 
internal SSL in “local” intra-JVM cluster mode, so the certificate seems 
correct.
I see no log on the Yarn server side, only that the application gets killed.
I will try to take stack traces…

From: LINZ, Arnaud
Sent: Tuesday, November 22, 2022 17:41
To: user
Subject: RE: "Authentication failed" in "ConnectionState" when enabling internal SSL on Yarn with self signed certificate

Update:
In fact this “Authentication failed” message also appears when SSL is turned 
off (and when the Yarn application succeeds), so it’s more of a warning and has 
no link with the “freeze” when SSL is turned on.

Thus, when internal SSL is enabled, I have no error in the Yarn log, and the 
only error I get is a “timed out” error like the one you get when you don’t 
have enough resources:
(NoResourceAvailableException: Slot request bulk is not fulfillable! Could not 
allocate the required slot within slot request timeout)
But I do have enough resources.
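One generic way to narrow down where the internal SSL handshake stalls (a suggestion, 
not a confirmed fix for this issue) is to enable JVM SSL debug output for the Flink 
processes via flink-conf.yaml, for instance:

env.java.opts: -Djavax.net.debug=ssl,handshake

and then check the JobManager/TaskManager container logs on the Yarn nodes.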

From: LINZ, Arnaud
Sent: Tuesday, November 22, 2022 17:18
To: user <user@flink.apache.org>
Subject: "Authentication failed" in "ConnectionState" when enabling internal SSL on Yarn with self signed certificate

Hello,
I use Flink 1.14.3 in Yarn cluster mode.
I’ve followed the instructions listed here
(https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-ssl/)
to turn on internal SSL:


$ keytool -genkeypair \
  -alias flink.internal \
  -keystore internal.keystore \
  -dname "CN=flink.internal" \
  -storepass internal_store_password \
  -keyalg RSA \
  -keysize 4096 \
  -storetype PKCS12

security.ssl.internal.enabled: true
security.ssl.internal.keystore: /path/to/flink/conf/internal.keystore
security.ssl.internal.truststore: /path/to/flink/conf/internal.keystore
security.ssl.internal.keystore-password: internal_store_password
security.ssl.internal.truststore-password: internal_store_password
security.ssl.internal.key-password: internal_store_password


I’ve shipped the keystore to every node, and get no error from reading the keystore.
However, the application fails to start (it is stuck in the initializing step), and 
the only error logged in the Yarn containers is:
15:49:46.397 [main-EventThread] ERROR 
org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState - 
Authentication failed


Could you please explain what this “zookeeper” Curator connection does and 
why it no longer works when internal SSL is enabled?



Best regards,

Arnaud









RE: "Authentication failed" in "ConnectionState" when enabling internal SSL on Yarn with self signed certificate

2022-11-22 Thread LINZ, Arnaud
Update:
In fact this “Authentication failed” message also appears when SSL is turned 
off (and when the Yarn application succeeds), so it’s more of a warning and has 
no link with the “freeze” when SSL is turned on.

Thus, when internal SSL is enabled, I have no error in the Yarn log, and the 
only error I get is a “timed out” error like the one you get when you don’t 
have enough resources:
(NoResourceAvailableException: Slot request bulk is not fulfillable! Could not 
allocate the required slot within slot request timeout)
But I do have enough resources.

From: LINZ, Arnaud
Sent: Tuesday, November 22, 2022 17:18
To: user
Subject: "Authentication failed" in "ConnectionState" when enabling internal SSL on Yarn with self signed certificate

Hello,
I use Flink 1.11.2 in Yarn cluster mode.
I’ve followed the instructions listed here
(https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-ssl/)
to turn on internal SSL:


$ keytool -genkeypair \
  -alias flink.internal \
  -keystore internal.keystore \
  -dname "CN=flink.internal" \
  -storepass internal_store_password \
  -keyalg RSA \
  -keysize 4096 \
  -storetype PKCS12

security.ssl.internal.enabled: true
security.ssl.internal.keystore: /path/to/flink/conf/internal.keystore
security.ssl.internal.truststore: /path/to/flink/conf/internal.keystore
security.ssl.internal.keystore-password: internal_store_password
security.ssl.internal.truststore-password: internal_store_password
security.ssl.internal.key-password: internal_store_password


I’ve shipped the keystore to every node, and get no error from reading the keystore.
However, the application fails to start (it is stuck in the initializing step), and 
the only error logged in the Yarn containers is:
15:49:46.397 [main-EventThread] ERROR 
org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState - 
Authentication failed


Could you please explain what this “zookeeper” Curator connection does and 
why it no longer works when internal SSL is enabled?



Best regards,

Arnaud









"Authentication failed" in "ConnectionState" when enabling internal SSL on Yarn with self signed certificate

2022-11-22 Thread LINZ, Arnaud
Hello,
I use Flink 1.11.2 in Yarn cluster mode.
I’ve followed the instructions listed here
(https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/security/security-ssl/)
to turn on internal SSL:


$ keytool -genkeypair \
  -alias flink.internal \
  -keystore internal.keystore \
  -dname "CN=flink.internal" \
  -storepass internal_store_password \
  -keyalg RSA \
  -keysize 4096 \
  -storetype PKCS12

security.ssl.internal.enabled: true
security.ssl.internal.keystore: /path/to/flink/conf/internal.keystore
security.ssl.internal.truststore: /path/to/flink/conf/internal.keystore
security.ssl.internal.keystore-password: internal_store_password
security.ssl.internal.truststore-password: internal_store_password
security.ssl.internal.key-password: internal_store_password


I’ve shipped the keystore to every node, and get no error from reading the keystore.
However, the application fails to start (it is stuck in the initializing step), and 
the only error logged in the Yarn containers is:
15:49:46.397 [main-EventThread] ERROR 
org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState - 
Authentication failed


Could you please explain what this “zookeeper” Curator connection does and 
why it no longer works when internal SSL is enabled?



Best regards,

Arnaud









RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-23 Thread LINZ, Arnaud
Hello,

It’s hard to say what caused the timeout to trigger – I agree with you that it 
should not have stopped the heartbeat thread, but it did. The easy fix was to 
increase the timeout until we no longer saw our app self-killed. The task was 
running a CPU-intensive computation (with a few threads created at some points… 
somehow breaking the “slot number” contract).
For the RAM cache, I believe that the heartbeat may also time out because of a 
busy network.

Cheers,
Arnaud


From: Till Rohrmann
Sent: Thursday, July 22, 2021 11:33
To: LINZ, Arnaud
Cc: Gen Luo; Yang Wang; dev; user
Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

Thanks for your inputs Gen and Arnaud.

I do agree with you, Gen, that we need better guidance for our users on when to 
change the heartbeat configuration. I think this should happen in any case. I 
am, however, not so sure whether we can give a hard threshold like 5000 tasks, 
for example, because as Arnaud said it strongly depends on the workload. Maybe 
we can explain it based on the symptoms a user might experience and what to do then.

Concerning your workloads, Arnaud, I'd be interested to learn a bit more. The 
user code runs in its own thread. This means that its operation won't block the 
main thread/heartbeat. The only thing that can happen is that the user code 
starves the heartbeat in terms of CPU cycles or causes a lot of GC pauses. If 
you are observing the former problem, then we might think about changing the 
priorities of the respective threads. This should then improve Flink's 
stability for these workloads and a shorter heartbeat timeout should be 
possible.

Also for the RAM-cached repositories, what exactly is causing the heartbeat to 
time out? Is it because you have a lot of GC or that the heartbeat thread does 
not get enough CPU cycles?

Cheers,
Till

On Thu, Jul 22, 2021 at 9:16 AM LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hello,

From a user perspective: we have some (rare) use cases where we use “coarse 
grain” datasets, with big beans and tasks that do lengthy operations (such as ML 
training). In these cases we had to increase the timeout to huge values 
(heartbeat.timeout: 50) so that our app is not killed.
I’m aware this is not the way Flink was meant to be used, but it’s a convenient 
way to distribute our workload on datanodes without having to use another 
concurrency framework (such as M/R) that would require recoding our sources 
and sinks.

In some other (most common) cases, our tasks do R/W accesses to RAM-cached 
repositories backed by a key-value store such as Kudu (or HBase). While most of 
those calls are very fast, when the system is under heavy load they may 
sometimes block for more than a few seconds, and having our app killed because 
of a short timeout is not an option.

That’s why I’m not in favor of very short timeouts… because in my experience it 
really depends on what the user code does in the tasks. (I understand that 
normally, since user code is not a JVM-blocking activity such as a GC, it should 
have no impact on heartbeats, but from experience, it really does.)
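For reference, the two settings under discussion are the flink-conf.yaml entries 
below; the values shown are the documented defaults at the time of writing, not a 
recommendation:

heartbeat.interval: 10000
heartbeat.timeout: 50000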

Cheers,
Arnaud


From: Gen Luo <luogen...@gmail.com>
Sent: Thursday, July 22, 2021 05:46
To: Till Rohrmann <trohrm...@apache.org>
Cc: Yang Wang <danrtsey...@gmail.com>; dev <d...@flink.apache.org>; user <user@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

Hi,
Thanks for driving this, @Till Rohrmann <trohrm...@apache.org>. I would give a 
+1 on reducing the heartbeat timeout and interval, though I'm not sure whether 
15s and 3s would be enough either.

IMO, besides the standalone cluster, where the heartbeat mechanism in Flink is 
relied upon entirely, reducing the heartbeat can also help the JM find out 
faster about TaskExecutors in abnormal conditions that cannot respond to 
heartbeat requests, e.g. continuous full GC, even though the TaskExecutor 
process is alive and the condition may not be known to the deployment system. 
Since there are cases that can benefit from this change, I think it could be 
done if it won't break the experience in other scenarios.

If we can identify what blocks the main threads from processing heartbeats, or 
what enlarges the GC costs, we can try to get rid of it to have a more 
predictable heartbeat response time, or give some advice to users whose jobs 
may encounter these issues. For example, as far as I know the JM of a 
large-scale job will be busier and may not be able to process heartbeats in 
time, so we could advise users working with jobs larger than 5000 tasks to 
enlarge their heartbeat interval to 10s and timeout to 50s. The numbers are 
written casually.

As for the issue in FLINK-23216, I think it should be fixed and may not be a 
main concern for this case.

On Wed, Jul 21, 2021 at 6:26 PM Till Rohr

RE: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

2021-07-22 Thread LINZ, Arnaud
Hello,

From a user perspective: we have some (rare) use cases where we use “coarse 
grain” datasets, with big beans and tasks that do lengthy operations (such as ML 
training). In these cases we had to increase the timeout to huge values 
(heartbeat.timeout: 50) so that our app is not killed.
I’m aware this is not the way Flink was meant to be used, but it’s a convenient 
way to distribute our workload on datanodes without having to use another 
concurrency framework (such as M/R) that would require recoding our sources 
and sinks.

In some other (most common) cases, our tasks do R/W accesses to RAM-cached 
repositories backed by a key-value store such as Kudu (or HBase). While most of 
those calls are very fast, when the system is under heavy load they may 
sometimes block for more than a few seconds, and having our app killed because 
of a short timeout is not an option.

That’s why I’m not in favor of very short timeouts… because in my experience it 
really depends on what the user code does in the tasks. (I understand that 
normally, since user code is not a JVM-blocking activity such as a GC, it should 
have no impact on heartbeats, but from experience, it really does.)

Cheers,
Arnaud


From: Gen Luo
Sent: Thursday, July 22, 2021 05:46
To: Till Rohrmann
Cc: Yang Wang; dev; user
Subject: Re: [DISCUSS] FLIP-185: Shorter heartbeat timeout and interval default values

Hi,
Thanks for driving this, @Till Rohrmann. I would give a +1 on reducing the 
heartbeat timeout and interval, though I'm not sure whether 15s and 3s would be 
enough either.

IMO, besides the standalone cluster, where the heartbeat mechanism in Flink is 
relied upon entirely, reducing the heartbeat can also help the JM find out 
faster about TaskExecutors in abnormal conditions that cannot respond to 
heartbeat requests, e.g. continuous full GC, even though the TaskExecutor 
process is alive and the condition may not be known to the deployment system. 
Since there are cases that can benefit from this change, I think it could be 
done if it won't break the experience in other scenarios.

If we can identify what blocks the main threads from processing heartbeats, or 
what enlarges the GC costs, we can try to get rid of it to have a more 
predictable heartbeat response time, or give some advice to users whose jobs 
may encounter these issues. For example, as far as I know the JM of a 
large-scale job will be busier and may not be able to process heartbeats in 
time, so we could advise users working with jobs larger than 5000 tasks to 
enlarge their heartbeat interval to 10s and timeout to 50s. The numbers are 
written casually.

As for the issue in FLINK-23216, I think it should be fixed and may not be a 
main concern for this case.

On Wed, Jul 21, 2021 at 6:26 PM Till Rohrmann <trohrm...@apache.org> wrote:
Thanks for sharing these insights.

I think it is no longer true that the ResourceManager notifies the JobMaster 
about lost TaskExecutors. See FLINK-23216 [1] for more details.

Given the GC pauses, would you then be ok with decreasing the heartbeat timeout 
to 20 seconds? This should give enough time to do the GC and then still 
send/receive a heartbeat request.

I also wanted to add that we are about to get rid of one big cause of blocking 
I/O operations from the main thread. With FLINK-22483 [2] we will get rid of 
Filesystem accesses to retrieve completed checkpoints. This leaves us with one 
additional file system access from the main thread which is the one completing 
a pending checkpoint. I think it should be possible to get rid of this access 
because as Stephan said it only writes information to disk that is already 
written before. Maybe solving these two issues could ease concerns about long 
pauses of unresponsiveness of Flink.

[1] https://issues.apache.org/jira/browse/FLINK-23216
[2] https://issues.apache.org/jira/browse/FLINK-22483

Cheers,
Till

On Wed, Jul 21, 2021 at 4:58 AM Yang Wang <danrtsey...@gmail.com> wrote:
Thanks @Till Rohrmann for starting this discussion.

Firstly, I try to understand the benefit of a shorter heartbeat timeout. IIUC, it 
will make the JobManager aware of a lost TaskManager faster. However, it seems 
that only the standalone cluster could benefit from this. For Yarn and 
native Kubernetes deployments, the Flink ResourceManager should get the 
TaskManager-lost event in a very short time:

* About 8 seconds, 3s for Yarn NM -> Yarn RM, 5s for Yarn RM -> Flink RM
* Less than 1 second, Flink RM has a watch on all the TaskManager pods

Secondly, I am not very confident about decreasing the timeout to 15s. I 
quickly checked the TaskManager GC logs of our internal Flink workloads over 
the past week and found more than 100 full GCs of around 10 seconds, but none 
longer than 15s.
We are using CMS GC for the old generation.


Best,
Yang

Till Rohrmann <trohrm...@apache.org> wrote on Saturday, July 17, 2021 at 1:05 AM:
Hi everyone,

Since Flink 1.5 we have the

RE: Random Task executor shutdown

2020-11-18 Thread LINZ, Arnaud
Hi,
It’s 3.4.10, and it does contain the bug. I’ll patch my Flink client and see if it 
happens again.
Best regards,
Arnaud

From: LINZ, Arnaud
Sent: Wednesday, November 18, 2020 10:35
To: 'Guowei Ma'
Cc: 'user'
Subject: RE: Random Task executor shutdown

Hello,

We are wondering whether it is related to 
https://issues.apache.org/jira/browse/ZOOKEEPER-2775 or not.
What is the version of the shaded ZooKeeper client in Flink 1.10.0?
Best,
Arnaud

From: LINZ, Arnaud
Sent: Wednesday, November 18, 2020 09:39
To: 'Guowei Ma' <guowei@gmail.com>
Cc: user <user@flink.apache.org>
Subject: RE: Random Task executor shutdown

Hello,

Actually the log is more complete when the application ends, and it’s a 
ZooKeeper-related issue.
I took another log.
Job Manager’s log:

(…)
2020-11-12 14:34:09,798 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 73 from task 
d0d19968414641878c35c392e2062c59 of job b62ad2008af85add0ff9e6680be0cc42 at 
container_e38_1604477334666_0733_01_03 @ XXX (dataPort=33692).
2020-11-12 14:34:15,015 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e38_1604477334666_0733_01_05 because: The TaskExecutor is 
shutting down.
2020-11-12 14:34:15,072 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Map (13/15) 
(4010b02f9f8094b38f182d1f55b9be4b) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:359)
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:218)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-11-12 14:34:15,111 INFO  
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
  - Calculating tasks to restart to recover the failed task 
2f6467d98899e64a4721f0a7b6a059a8_12.
(…)

The shutting down task executor log is:

(…)
2020-11-12 14:34:14,497 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: SUSPENDED
2020-11-12 14:34:14,497 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2020-11-12 14:34:14,497 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2020-11-12 14:34:14,879 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.client.ZooKeeperSaslClient
  - Client will use GSSAPI as SASL mechanism.
2020-11-12 14:34:14,879 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
io-master-u1-n02.bie.jupiter.nbyt.fr/10.136.169.130:2181. Will attempt to 
SASL-authenticate using Login Context section 'Client'
2020-11-12 14:34:14,880 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
connection established to 
io-master-u1-n02.bie.jupiter.nbyt.fr/10.136.169.130:2181, initiating session
2020-11-12 14:34:14,881 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
establishment complete on server 
io-master-u1-n02.bie.jupiter.nbyt.fr/10.136.169.130:2181, sessionid = 
0x175924beaf03acf, negotiated timeout = 6
2020-11-12 14:34:14,881 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
 

RE: Random Task executor shutdown

2020-11-18 Thread LINZ, Arnaud
Hello,

We are wondering whether it is related to 
https://issues.apache.org/jira/browse/ZOOKEEPER-2775 or not.
What is the version of the shaded ZooKeeper client in Flink 1.10.0?
Best,
Arnaud

From: LINZ, Arnaud
Sent: Wednesday, November 18, 2020 09:39
To: 'Guowei Ma'
Cc: user
Subject: RE: Random Task executor shutdown

Hello,

Actually the log is more complete when the application ends, and it’s a 
ZooKeeper-related issue.
I took another log.
Job Manager’s log:

(…)
2020-11-12 14:34:09,798 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 73 from task 
d0d19968414641878c35c392e2062c59 of job b62ad2008af85add0ff9e6680be0cc42 at 
container_e38_1604477334666_0733_01_03 @ XXX (dataPort=33692).
2020-11-12 14:34:15,015 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e38_1604477334666_0733_01_05 because: The TaskExecutor is 
shutting down.
2020-11-12 14:34:15,072 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Map (13/15) 
(4010b02f9f8094b38f182d1f55b9be4b) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.
at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:359)
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:218)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-11-12 14:34:15,111 INFO  
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
  - Calculating tasks to restart to recover the failed task 
2f6467d98899e64a4721f0a7b6a059a8_12.
(…)

The shutting down task executor log is:

(…)
2020-11-12 14:34:14,497 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: SUSPENDED
2020-11-12 14:34:14,497 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2020-11-12 14:34:14,497 WARN  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Connection to ZooKeeper suspended. Can no longer retrieve the leader from 
ZooKeeper.
2020-11-12 14:34:14,879 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.client.ZooKeeperSaslClient
  - Client will use GSSAPI as SASL mechanism.
2020-11-12 14:34:14,879 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Opening 
socket connection to server 
io-master-u1-n02.bie.jupiter.nbyt.fr/10.136.169.130:2181. Will attempt to 
SASL-authenticate using Login Context section 'Client'
2020-11-12 14:34:14,880 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Socket 
connection established to 
io-master-u1-n02.bie.jupiter.nbyt.fr/10.136.169.130:2181, initiating session
2020-11-12 14:34:14,881 INFO  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
establishment complete on server 
io-master-u1-n02.bie.jupiter.nbyt.fr/10.136.169.130:2181, sessionid = 
0x175924beaf03acf, negotiated timeout = 6
2020-11-12 14:34:14,881 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: RECONNECTED
2020-11-12 14:34:14,881 INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2020-11-12 14:34:14,881 INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Connection to ZooKe

RE: Random Task executor shutdown

2020-11-18 Thread LINZ, Arnaud
= 6
2020-11-12 14:33:43,566 INFO  
org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager
  - State change: RECONNECTED
2020-11-12 14:33:43,569 INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2020-11-12 14:33:43,570 INFO  
org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - 
Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2020-11-12 14:33:43,592 WARN  
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Session 
0x175924beaf03acf for server 
io-master-u1-n02.bie.jupiter.nbyt.fr/10.136.169.130:2181, unexpected error, 
closing socket connection and attempting reconnect
java.io.IOException: Xid out of order. Got Xid 37 with err 0 expected Xid 38 
for a packet with details: clientPath:null serverPath:null finished:false 
header:: 38,101  replyHeader:: 0,0,-4  request:: 
347892643788,v{'/flink_prd_XXX_StreamFlink/application_1604477334666_0733/leader/resource_manager_lock,'/flink_prd_XXX_StreamFlink/application_1604477334666_0733/leader/b62ad2008af85add0ff9e6680be0cc42/job_manager_lock},v{},v{}
  response:: null
at 
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:823)
at 
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94)
at 
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
at 
org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)

and the last one, there is only one minute. Are there other parameters to adjust 
to make the ZooKeeper synchronization more robust when the network is slowed 
down?
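For reference, the ZooKeeper client settings Flink exposes are the ones below 
(shown with their documented defaults; whether raising them actually helps with 
this particular error is an open question):

high-availability.zookeeper.client.session-timeout: 60000
high-availability.zookeeper.client.connection-timeout: 15000
high-availability.zookeeper.client.retry-wait: 5000
high-availability.zookeeper.client.max-retry-attempts: 3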

Best,
Arnaud


From: Guowei Ma
Sent: Tuesday, November 17, 2020 00:49
To: LINZ, Arnaud
Cc: user
Subject: Re: Random Task executor shutdown

Hi, Arnaud
Would you like to share the log of the task executor that shut down?
BTW, could you check the GC log of the task executor?
Best,
Guowei


On Mon, Nov 16, 2020 at 8:57 PM LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
(reposted with proper subject line -- sorry for the copy/paste)
-Original message-
Hello,

I'm running Flink 1.10 on a yarn cluster. I have a streaming application, that, 
when under heavy load, fails from time to time with this unique error message 
in the whole yarn log:

(...)
2020-11-15 16:18:42,202 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 63 from task 
4cbc940112a596db54568b24f9209aac of job 1e1717d19bd8ea296314077e42e1c7e5 at 
container_e38_1604477334666_0960_01_04 @ xxx (dataPort=33099).
2020-11-15 16:18:55,043 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e38_1604477334666_0960_01_04 because: The TaskExecutor is 
shutting down.
2020-11-15 16:18:55,087 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Map (7/15) 
(c8e92cacddcd4e41f51a2433d07d2153) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.

  at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:359)
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:218)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
at 
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at 
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at 
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
akka.dispatch.forkjoin.ForkJoinW

Random Task executor shutdown

2020-11-16 Thread LINZ, Arnaud
(reposted with proper subject line -- sorry for the copy/paste)
-Original message-
Hello,

I'm running Flink 1.10 on a yarn cluster. I have a streaming application, that, 
when under heavy load, fails from time to time with this unique error message 
in the whole yarn log:

(...)
2020-11-15 16:18:42,202 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 63 from task 
4cbc940112a596db54568b24f9209aac of job 1e1717d19bd8ea296314077e42e1c7e5 at 
container_e38_1604477334666_0960_01_04 @ xxx (dataPort=33099).
2020-11-15 16:18:55,043 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e38_1604477334666_0960_01_04 because: The TaskExecutor is 
shutting down.
2020-11-15 16:18:55,087 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Map (7/15) 
(c8e92cacddcd4e41f51a2433d07d2153) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.

  at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:359)
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:218)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-11-15 16:18:55,092 INFO  
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
  - Calculating tasks to restart to recover the failed task 
2f6467d98899e64a4721f0a7b6a059a8_6.
2020-11-15 16:18:55,101 INFO  
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
  - 230 tasks should be restarted to recover the failed task 
2f6467d98899e64a4721f0a7b6a059a8_6.
(...)

What could be the cause of this failure? Why is there no other error message?

I've tried to increase the value of heartbeat.timeout, thinking that maybe it 
was due to a slow-responding mapper, but it did not solve the issue.

Best regards,
Arnaud





RE: Re: Flink 1.11 not showing logs

2020-11-16 Thread LINZ, Arnaud
Hello,

I'm running Flink 1.10 on a yarn cluster. I have a streaming application, that, 
when under heavy load, fails from time to time with this unique error message 
in the whole yarn log:

(...)
2020-11-15 16:18:42,202 WARN  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late 
message for now expired checkpoint attempt 63 from task 
4cbc940112a596db54568b24f9209aac of job 1e1717d19bd8ea296314077e42e1c7e5 at 
container_e38_1604477334666_0960_01_04 @ xxx (dataPort=33099).
2020-11-15 16:18:55,043 INFO  org.apache.flink.yarn.YarnResourceManager 
- Closing TaskExecutor connection 
container_e38_1604477334666_0960_01_04 because: The TaskExecutor is 
shutting down.
2020-11-15 16:18:55,087 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- Map (7/15) 
(c8e92cacddcd4e41f51a2433d07d2153) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.

  at 
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:359)
at 
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:218)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2020-11-15 16:18:55,092 INFO  
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
  - Calculating tasks to restart to recover the failed task 
2f6467d98899e64a4721f0a7b6a059a8_6.
2020-11-15 16:18:55,101 INFO  
org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy
  - 230 tasks should be restarted to recover the failed task 
2f6467d98899e64a4721f0a7b6a059a8_6.
(...)

What could be the cause of this failure? Why is there no other error message?

I've tried to increase the value of heartbeat.timeout, thinking that maybe it 
was due to a slow-responding mapper, but it did not solve the issue.

Best regards,
Arnaud





RE: Best way to "emulate" a rich Partitioner with open() and close() methods ?

2020-05-29 Thread LINZ, Arnaud
Hello,

Yes, that would definitely do the trick, with an extra mapper after the keyBy to 
remove the tuple so that it stays seamless. It’s less hacky than what I was 
thinking of, thanks!
However, is there any plan in a future release to have rich partitioners? That 
would avoid adding overhead and “intermediate” technical info to the stream 
payload.
Best,
Arnaud

From: Robert Metzger
Sent: Friday, May 29, 2020 13:10
To: LINZ, Arnaud
Cc: user
Subject: Re: Best way to "emulate" a rich Partitioner with open() and close() methods ?

Hi Arnaud,

Maybe I don't fully understand the constraints, but what about
stream.map(new GetKuduPartitionMapper).keyBy(0).addSink(KuduSink());

The map(new GetKuduPartitionMapper) will be a regular RichMapFunction with 
open() and close() where you can handle the connection with Kudu's partitioning 
service.
The map will output a Tuple2 (or something nicer :) ), then 
Flink shuffles your data correctly, and the sinks will process the data 
correctly partitioned.
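A rough sketch of that wiring (hypothetical types; KuduPartitionLookup stands in for 
whatever wrapper around Kudu's client/KuduPartitioner API you would write, and KuduSink 
is the sink discussed in this thread):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.types.Row;

public class GetKuduPartitionMapper extends RichMapFunction<Row, Tuple2<Integer, Row>> {

    private transient KuduPartitionLookup lookup; // hypothetical wrapper around Kudu's client API

    @Override
    public void open(Configuration parameters) throws Exception {
        lookup = KuduPartitionLookup.connect("kudu-master:7051", "my_table"); // hypothetical
    }

    @Override
    public Tuple2<Integer, Row> map(Row row) throws Exception {
        return Tuple2.of(lookup.partitionOf(row), row); // hypothetical partition lookup
    }

    @Override
    public void close() throws Exception {
        if (lookup != null) {
            lookup.close();
        }
    }
}

// Wiring: key by the Kudu partition index, then drop it again before the sink
// (the extra mapper Arnaud mentions in his reply above).
stream.map(new GetKuduPartitionMapper())
      .keyBy(t -> t.f0)
      .map(t -> t.f1)
      .addSink(new KuduSink(…));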

I hope that this is what you were looking for!

Best,
Robert

On Thu, May 28, 2020 at 6:21 PM LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:

Hello,



I would like to improve the performance of my Apache Kudu sink by using the new 
“KuduPartitioner” of the Kudu API to match Flink stream partitions with Kudu 
partitions and lower the network shuffling.

For that, I would like to implement something like

stream.partitionCustom(new KuduFlinkPartitioner<>(…)).addSink(new KuduSink(…));

with KuduFlinkPartitioner an implementation of 
org.apache.flink.api.common.functions.Partitioner that internally makes use of 
the KuduPartitioner client tool of Kudu’s API.



However, for that KuduPartitioner to work, it needs to open – and close at the end 
– a connection to the Kudu table – obviously something that can’t be done for 
each line. But there is no “AbstractRichPartitioner” with open() and close() 
methods that I can use for that (the way I use them in the sink, for instance).

What is the best way to implement this?

I thought of ThreadLocals that would be initialized during the first call to 
int partition(K key, int numPartitions); but I won’t be able to close() things 
nicely as I won’t be notified of job termination.



I thought of putting those static ThreadLocals inside an “Identity Mapper” that 
would be called just prior to the partitioning, with something like:

stream.map(richIdentiyConnectionManagerMapper).partitionCustom(new 
KuduFlinkPartitioner<>(…)).addSink(new KuduSink(…));

with Kudu connections initialized in the mapper’s open(), closed in the 
mapper’s close(), and used in the partitioner’s partition().

However, it looks like an ugly hack breaking every coding principle, but as 
long as the threads are reused between the mapper and the partitioner, I think 
it should work.



Is there a better way to do this?



Best regards,

Arnaud









Best way to "emulate" a rich Partitioner with open() and close() methods ?

2020-05-28 Thread LINZ, Arnaud
Hello,



I would like to upgrade the performance of my Apache Kudu Sink by using the new 
“KuduPartitioner” of the Kudu API to match Flink stream partitions with Kudu 
partitions and lower the network shuffling.

For that, I would like to implement something like

stream.partitionCustom(new KuduFlinkPartitioner<>(…)).addSink(new KuduSink(…));

with KuduFlinkPartitioner an implementation of 
org.apache.flink.api.common.functions.Partitioner that internally makes use of 
the KuduPartitioner client tool of Kudu’s API.



However, for that KuduPartitioner to work, it needs to open – and close at the 
end – a connection to the Kudu table – obviously something that can’t be done 
for each record. But there is no “AbstractRichPartitioner” with open() and 
close() methods that I can use for that (the way I use them in the sink, for 
instance).



What is the best way to implement this?

I thought of ThreadLocals that would be initialized during the first call to 
int partition(K key, int numPartitions); but I won’t be able to close() things 
nicely as I won’t be notified of job termination.



I thought of putting those static ThreadLocals inside an “Identity Mapper” that 
would be called just prior to the partitioning, with something like:

stream.map(richIdentiyConnectionManagerMapper).partitionCustom(new 
KuduFlinkPartitioner<>(…)).addSink(new KuduSink(…));

with Kudu connections initialized in the mapper’s open(), closed in the 
mapper’s close(), and used in the partitioner’s partition().

However, it looks like an ugly hack breaking every coding principle, but as 
long as the threads are reused between the mapper and the partitioner, I think 
it should work.



Is there a better way to do this?



Best regards,

Arnaud









RE: Building with Hadoop 3

2020-03-03 Thread LINZ, Arnaud
Hello,
Have you shared it somewhere on the web already?
Best,
Arnaud

From: vino yang 
Sent: Wednesday, December 4, 2019 11:55
To: Márton Balassi 
Cc: Chesnay Schepler ; Foster, Craig 
; user@flink.apache.org; d...@flink.apache.org
Subject: Re: Building with Hadoop 3

Hi Marton,

Thanks for your explanation. Personally, I look forward to your contribution!

Best,
Vino

Márton Balassi <balassi.mar...@gmail.com> wrote on Wednesday, December 4, 2019 
at 17:15:
Wearing my Cloudera hat I can tell you that we have done this exercise for our 
distros of the 3.0 and 3.1 Hadoop versions. We have not contributed these back 
just yet, but we are open to doing so. If the community is interested we can 
contribute those changes back to flink-shaded and suggest the necessary changes 
to flink too. The task was not overly complex, but it certainly involved a bit 
of dependency hell. :-)

Right now we are focused on internal timelines, but we could invest into 
contributing this back in the end of January timeframe if the community deems 
this a worthwhile effort.

Best,
Marton

On Wed, Dec 4, 2019 at 10:00 AM Chesnay Schepler <ches...@apache.org> wrote:
There's no JIRA and no one actively working on it. I'm not aware of any 
investigations on the matter; hence the first step would be to just try it out.

A flink-shaded artifact isn't a hard requirement; Flink will work with any 2.X 
hadoop distribution (provided that there aren't any dependency clashes).

On 03/12/2019 18:22, Foster, Craig wrote:
Hi:
I don’t see a JIRA for Hadoop 3 support. I see a comment on a JIRA here from a 
year ago that no one is looking into Hadoop 3 support [1]. Is there a document 
or JIRA that now exists which would point to what needs to be done to support 
Hadoop 3? Right now builds with Hadoop 3 don’t work obviously because there’s 
no flink-shaded-hadoop-3 artifacts.

Thanks!
Craig

[1] https://issues.apache.org/jira/browse/FLINK-11086








How to fully re-aggregate a keyed windowed aggregate in the same window ?

2020-01-30 Thread LINZ, Arnaud
Hello,

I would like to compute statistics on a stream every hour. For that, I need to 
compute statistics on the keyed stream, then to re-aggregate them.
I’ve tried the following:

stream.keyBy(mykey)
      .window(1 hour process time)
      .aggregate(my per-key aggregate)
      .windowAll(1 hour process time) // not the same window, adds one hour delay…
      .reduce(fully aggregate intermediary results)
      ... then sink

This works, but I get the first line in the sink 2 hours after the first item 
enters the stream, and 1 hour later than it should be possible to get it.

My question: how do I trigger the reduce step immediately after the first 
aggregation?
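
One approach (a sketch only, not code from this thread) is to use event-time 
windows of the same size for both stages: the per-key results then carry the 
window-end timestamp, so the windowAll stage fires as soon as the same 
watermark passes, instead of waiting another hour of processing time. 
MyPerKeyAgg, Stats (with its merge() method), myKeySelector and mySink are 
placeholders:

DataStream<Stats> perKey = stream
        .keyBy(myKeySelector)
        .window(TumblingEventTimeWindows.of(Time.hours(1)))
        .aggregate(new MyPerKeyAgg());

perKey
        .windowAll(TumblingEventTimeWindows.of(Time.hours(1)))
        .reduce((a, b) -> a.merge(b))
        .addSink(mySink);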

Best regards,
Arnaud







RE: No yarn option in self-built flink version

2019-06-11 Thread LINZ, Arnaud
Hello,

Thanks a lot, it works. However, may I suggest that you update the 
documentation page:

mvn clean install -DskipTests -Pvendor-repos -Dhadoop.version=2.6.0-cdh5.16.1

is of no interest on its own if you don’t include Hadoop; that’s why I thought 
that -Pvendor-repos already implied the -Pinclude-hadoop profile stated in the 
paragraph above…

Arnaud

From: Ufuk Celebi 
Sent: Friday, June 7, 2019 12:00
To: LINZ, Arnaud 
Cc: user ; ches...@apache.org
Subject: Re: No yarn option in self-built flink version

Hey Arnaud,

I think you need to activate the Hadoop profile via -Pinclude-hadoop (the 
default was changed to not include Hadoop, as far as I know).
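
For example, a full build invocation that combines this with the vendor 
repositories used earlier in this thread would presumably look like (not 
verified here):

mvn clean install -DskipTests -Pinclude-hadoop -Pvendor-repos -Dhadoop.version=2.6.0-cdh5.16.1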

For more details, check out:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/flinkDev/building.html#packaging-hadoop-into-the-flink-distribution


If this does not work for you, I would wait for Chesnay's input (cc'd).

– Ufuk


On Fri, Jun 7, 2019 at 11:04 AM LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hello,

I am trying to build my own flink distribution with proper Cloudera 
dependencies.
Reading 
https://ci.apache.org/projects/flink/flink-docs-stable/flinkDev/building.html
I've done :
git clone https://github.com/apache/flink
cd flink
git checkout tags/release-1.8.0
$MAVEN_HOME/bin/mvn clean install -DskipTests -Pvendor-repos 
-Dhadoop.version=2.6.0-cdh5.16.1 -Dfast
cd flink-dist
$MAVEN_HOME/bin/mvn install -DskipTests -Pvendor-repos 
-Dhadoop.version=2.6.0-cdh5.16.1

Everything was successful.

However, when running from flink-dist/target/flink-1.8.0-bin/flink-1.8.0,
running bin/flink -h prints no yarn/Hadoop options.
And running
bin/flink run -m yarn-cluster -yn 4 -yjm 1024m -ytm 4096m 
./examples/batch/WordCount.jar
prints:
Could not build the program from JAR file.

Am I missing something?

Best regards,
Arnaud






No yarn option in self-built flink version

2019-06-07 Thread LINZ, Arnaud
Hello,

I am trying to build my own flink distribution with proper Cloudera 
dependencies.
Reading 
https://ci.apache.org/projects/flink/flink-docs-stable/flinkDev/building.html
I've done :
git clone https://github.com/apache/flink
cd flink
git checkout tags/release-1.8.0
$MAVEN_HOME/bin/mvn clean install -DskipTests -Pvendor-repos 
-Dhadoop.version=2.6.0-cdh5.16.1 -Dfast
cd flink-dist
$MAVEN_HOME/bin/mvn install -DskipTests -Pvendor-repos 
-Dhadoop.version=2.6.0-cdh5.16.1

Everything was successful.

However, when running from flink-dist/target/flink-1.8.0-bin/flink-1.8.0,
running bin/flink -h prints no yarn/Hadoop options.
And running
bin/flink run -m yarn-cluster -yn 4 -yjm 1024m -ytm 4096m 
./examples/batch/WordCount.jar
prints:
Could not build the program from JAR file.

Am I missing something?

Best regards,
Arnaud






RE: Checkpoints and catch-up burst (heavy back pressure)

2019-03-06 Thread LINZ, Arnaud
Hi,
I like the idea, will give it a try.
Thanks,
Arnaud

From: Stephen Connolly 
Sent: Tuesday, March 5, 2019 13:55
To: LINZ, Arnaud 
Cc: zhijiang ; user 
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)



On Tue, 5 Mar 2019 at 12:48, Stephen Connolly 
<stephen.alan.conno...@gmail.com> wrote:


On Fri, 1 Mar 2019 at 13:05, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hi,

I think I should go into more details to explain my use case.
I have one non-parallel source (parallelism = 1) that lists binary files in an 
HDFS directory. The DataSet emitted by the source is a data set of file names, 
not file content. These filenames are rebalanced and sent to workers 
(parallelism = 15) that use a flatmapper that opens the file, reads it, decodes 
it, and sends records (forward mode) to the sinks (with a few 1-to-1 mappings 
in-between). So the flatmap operation is a time-consuming one, as the files are 
more than 200 MB each; the flatmapper will emit millions of records to the sink 
for one source record (filename).

The rebalancing, occurring at the file name level, does not use much I/O, and I 
cannot use one-to-one mode at that point if I want some parallelism since I 
have only one source.

I did not put file decoding directly in the sources because I have no good way 
to distribute files to sources without a controller (the input directory is 
unique, filenames are random and cannot be “attributed” to one particular 
source instance easily).

Crazy idea: if you know the task number and the number of tasks, you can hash 
the filename using a shared algorithm (e.g. md5 or sha1 or crc32) and then just 
check modulo number of tasks == task number.

That would let you run the file listing in parallel without sharing state, 
which would allow file decoding directly in the sources.

if you extend RichParallelSourceFunction you will have:

int index = getRuntimeContext().getIndexOfThisSubtask();
int count = getRuntimeContext().getNumberOfParallelSubtasks();

then a hash function like (DigestUtils here is 
org.apache.commons.codec.digest.DigestUtils):

private static int hash(String string) {
    int result = 0;
    // fold the SHA-1 bytes of the filename into a single int
    for (byte b : DigestUtils.sha1(string)) {
        result = result * 31 + b;
    }
    return result;
}

and just compare the filename like so:

for (String filename : listFiles()) {
    if (Math.floorMod(hash(filename), count) != index) {
        continue; // another subtask owns this file
    }
    // this is our file
    ...
}

Note: if you know the file name patterns, you should tune the hash function to 
distribute them evenly. The SHA1 with prime reduction of the bytes is ok for 
general levelling... but may be poor over 15 buckets with your typical data set 
of filenames


Alternatively, I could have used a dispatcher daemon, separate from the 
streaming app, that distributes files to various directories, each directory 
being associated with one Flink source instance, and put the file reading & 
decoding directly in the source, but that seemed more complex to code and 
operate than the filename source. Would it have been better from the 
checkpointing perspective?

About the ungraceful source sleep(): is there a way, programmatically, to know 
the “load” of the app, or to determine if checkpointing takes too much time, so 
that I only do it when needed?

Thanks,
Arnaud

From: zhijiang <wangzhijiang...@aliyun.com>
Sent: Friday, March 1, 2019 04:59
To: user <user@flink.apache.org>; LINZ, Arnaud <al...@bouyguestelecom.fr>
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)

Hi Arnaud,

Thanks for the further feedback!

For option 1: if 40 min is still not enough, that indicates it might take even 
more time to finish a checkpoint in your case. I have also seen scenarios where 
catching up data takes several hours to finish one checkpoint. If the current 
checkpoint expires because of the timeout, the next triggered checkpoint might 
still fail with a timeout. So it seems better to wait for the current 
checkpoint to finish rather than expire it, unless we cannot bear this long 
time for some reason, such as worrying that a failover would have to restore 
more data during this time.

For option 2: the default network settings should make sense. Lower values 
might cause a performance regression, and higher values would increase the 
in-flight buffers and the checkpoint delay more seriously.

For option 3: if resources are limited, it still does not work on your side.

Sleeping some time in the source, as you mentioned, is an option and might work 
in your case, although it does not seem a graceful way.

I think there is no data skew in your case to cause backpressure, because you 
used the rebalance mode as mentioned. Another option might be to use the 
forward mode, which would be better than the rebalance mode if possible in your 
case. Because the source and downstream task are one-to-one in forward mode, 
the total in-flight buffers are 2+2+8 for one single downstream task before the 
barrier. In rebalance mode, the total in-flight buffers would be (a*2+a*2+8) 
for one single downstream task (`a` is the parallelism of the source vertex), 
because it is an all-to-all connection. The barrier alignment takes more time 
in rebalance mode than in forward mode.

Re: Checkpoints and catch-up burst (heavy back pressure)

2019-03-03 Thread LINZ, Arnaud
Hi,

My source checkpoint is actually the file list. But it's not trivially small as 
I may have hundreds of thousands of files, with long filenames.
My sink checkpoint is a smaller HDFS file list with current sizes.

-------- Original message --------
From: Ken Krugler 
Date: Fri., March 01, 2019 7:05 PM +0100
To: "LINZ, Arnaud" 
CC: zhijiang , user 
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)

Hi Arnaud,

1. What’s your checkpoint configuration? Wondering if you’re writing to HDFS, 
and thus the load you’re putting on it while catching up & checkpointing is too 
high.

If so, then you could monitor the TotalLoad metric (FSNamesystem) in your 
source, and throttle back the emitting of file paths when this (empirically) 
gets too high.

2. I’m wondering what all you are checkpointing, and why.

E.g. if this is just an ETL-ish workflow to pull files, parse them, and write 
out (transformed) results, then you could in theory just checkpoint which files 
have been processed.

This means catching up after a failure could take more time, but your 
checkpoint size will be trivially small.

— Ken


On Mar 1, 2019, at 5:04 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:

Hi,

I think I should go into more details to explain my use case.
I have one non-parallel source (parallelism = 1) that lists binary files in an 
HDFS directory. The DataSet emitted by the source is a data set of file names, 
not file content. These filenames are rebalanced and sent to workers 
(parallelism = 15) that use a flatmapper that opens the file, reads it, decodes 
it, and sends records (forward mode) to the sinks (with a few 1-to-1 mappings 
in-between). So the flatmap operation is a time-consuming one, as the files are 
more than 200 MB each; the flatmapper will emit millions of records to the sink 
for one source record (filename).

The rebalancing, occurring at the file name level, does not use much I/O, and I 
cannot use one-to-one mode at that point if I want some parallelism since I 
have only one source.

I did not put file decoding directly in the sources because I have no good way 
to distribute files to sources without a controller (the input directory is 
unique, filenames are random and cannot be “attributed” to one particular 
source instance easily).
Alternatively, I could have used a dispatcher daemon, separate from the 
streaming app, that distributes files to various directories, each directory 
being associated with one Flink source instance, and put the file reading & 
decoding directly in the source, but that seemed more complex to code and 
operate than the filename source. Would it have been better from the 
checkpointing perspective?

About the ungraceful source sleep(): is there a way, programmatically, to know 
the “load” of the app, or to determine if checkpointing takes too much time, so 
that I only do it when needed?

Thanks,
Arnaud

From: zhijiang <wangzhijiang...@aliyun.com>
Sent: Friday, March 1, 2019 04:59
To: user <user@flink.apache.org>; LINZ, Arnaud <al...@bouyguestelecom.fr>
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)

Hi Arnaud,

Thanks for the further feedback!

For option 1: if 40 min is still not enough, that indicates it might take even 
more time to finish a checkpoint in your case. I have also seen scenarios where 
catching up data takes several hours to finish one checkpoint. If the current 
checkpoint expires because of the timeout, the next triggered checkpoint might 
still fail with a timeout. So it seems better to wait for the current 
checkpoint to finish rather than expire it, unless we cannot bear this long 
time for some reason, such as worrying that a failover would have to restore 
more data during this time.

For option 2: the default network settings should make sense. Lower values 
might cause a performance regression, and higher values would increase the 
in-flight buffers and the checkpoint delay more seriously.

For option 3: if resources are limited, it still does not work on your side.

Sleeping some time in the source, as you mentioned, is an option and might work 
in your case, although it does not seem a graceful way.

I think there is no data skew in your case to cause backpressure, because you 
used the rebalance mode as mentioned. Another option might be to use the 
forward mode, which would be better than the rebalance mode if possible in your 
case. Because the source and downstream task are one-to-one in forward mode, 
the total in-flight buffers are 2+2+8 for one single downstream task before the 
barrier. In rebalance mode, the total in-flight buffers would be (a*2+a*2+8) 
for one single downstream task (`a` is the parallelism of the source vertex), 
because it is an all-to-all connection. The barrier alignment takes more time 
in rebalance mode than in forward mode.

Best,
Zhijiang
--
From:LINZ, Arnaud mailto:al

RE: Checkpoints and catch-up burst (heavy back pressure)

2019-03-02 Thread LINZ, Arnaud
Hello,
When I think about it, I figure out that a barrier for the source covers the 
whole set of files, and therefore the checkpoint will never complete until the 
sinks have caught up.
The simplest way to deal with it without refactoring is to add 2 parameters to 
the source: a file-number threshold detecting the catch-up mode, and a max 
files-per-second limitation when this occurs, slightly lower than the natural 
catch-up rate.
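
As a rough sketch of those two parameters inside the file-name source's run() 
method (not code from this thread: running, catchupThreshold, 
maxFilesPerSecCatchup and listPendingFiles() are placeholders for the existing 
source logic):

@Override
public void run(SourceContext<String> ctx) throws Exception {
    while (running) {
        List<String> pending = listPendingFiles();
        boolean catchupMode = pending.size() > catchupThreshold;
        long minIntervalMs =
                catchupMode ? (long) (1000.0 / maxFilesPerSecCatchup) : 0L;
        for (String fileName : pending) {
            long start = System.currentTimeMillis();
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(fileName);
            }
            long elapsed = System.currentTimeMillis() - start;
            if (minIntervalMs > elapsed) {
                // sleep outside the checkpoint lock so barriers can still be injected
                Thread.sleep(minIntervalMs - elapsed);
            }
        }
    }
}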

-------- Original message --------
From: "LINZ, Arnaud" 
Date: Fri., March 01, 2019 2:04 PM +0100
To: zhijiang , user 
Subject: RE: Checkpoints and catch-up burst (heavy back pressure)

Hi,

I think I should go into more details to explain my use case.
I have one non-parallel source (parallelism = 1) that lists binary files in an 
HDFS directory. The DataSet emitted by the source is a data set of file names, 
not file content. These filenames are rebalanced and sent to workers 
(parallelism = 15) that use a flatmapper that opens the file, reads it, decodes 
it, and sends records (forward mode) to the sinks (with a few 1-to-1 mappings 
in-between). So the flatmap operation is a time-consuming one, as the files are 
more than 200 MB each; the flatmapper will emit millions of records to the sink 
for one source record (filename).

The rebalancing, occurring at the file name level, does not use much I/O, and I 
cannot use one-to-one mode at that point if I want some parallelism since I 
have only one source.

I did not put file decoding directly in the sources because I have no good way 
to distribute files to sources without a controller (the input directory is 
unique, filenames are random and cannot be “attributed” to one particular 
source instance easily).
Alternatively, I could have used a dispatcher daemon, separate from the 
streaming app, that distributes files to various directories, each directory 
being associated with one Flink source instance, and put the file reading & 
decoding directly in the source, but that seemed more complex to code and 
operate than the filename source. Would it have been better from the 
checkpointing perspective?

About the ungraceful source sleep(): is there a way, programmatically, to know 
the “load” of the app, or to determine if checkpointing takes too much time, so 
that I only do it when needed?

Thanks,
Arnaud

From: zhijiang 
Sent: Friday, March 1, 2019 04:59
To: user ; LINZ, Arnaud 
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)

Hi Arnaud,

Thanks for the further feedback!

For option 1: if 40 min is still not enough, that indicates it might take even 
more time to finish a checkpoint in your case. I have also seen scenarios where 
catching up data takes several hours to finish one checkpoint. If the current 
checkpoint expires because of the timeout, the next triggered checkpoint might 
still fail with a timeout. So it seems better to wait for the current 
checkpoint to finish rather than expire it, unless we cannot bear this long 
time for some reason, such as worrying that a failover would have to restore 
more data during this time.

For option 2: the default network settings should make sense. Lower values 
might cause a performance regression, and higher values would increase the 
in-flight buffers and the checkpoint delay more seriously.

For option 3: if resources are limited, it still does not work on your side.

Sleeping some time in the source, as you mentioned, is an option and might work 
in your case, although it does not seem a graceful way.

I think there is no data skew in your case to cause backpressure, because you 
used the rebalance mode as mentioned. Another option might be to use the 
forward mode, which would be better than the rebalance mode if possible in your 
case. Because the source and downstream task are one-to-one in forward mode, 
the total in-flight buffers are 2+2+8 for one single downstream task before the 
barrier. In rebalance mode, the total in-flight buffers would be (a*2+a*2+8) 
for one single downstream task (`a` is the parallelism of the source vertex), 
because it is an all-to-all connection. The barrier alignment takes more time 
in rebalance mode than in forward mode.

Best,
Zhijiang
--
From: LINZ, Arnaud <al...@bouyguestelecom.fr>
Send Time: Friday, March 1, 2019, 00:46
To: zhijiang <wangzhijiang...@aliyun.com>; user <user@flink.apache.org>
Subject: RE: Checkpoints and catch-up burst (heavy back pressure)

Update:
Option 1 does not work. It still fails at the end of the timeout, no matter 
its value.
Should I implement a “bandwidth” management system by using an artificial 
Thread.sleep in the source depending on the back pressure?

From: LINZ, Arnaud
Sent: Thursday, February 28, 2019 15:47
To: 'zhijiang' <wangzhijiang...@aliyun.com>; user <user@flink.apache.org>
Subject: RE: Checkpoints and catch-up burst (heavy back pressure)

Hi Zhijiang,

Thanks for your feedback.

  * 

RE: Checkpoints and catch-up burst (heavy back pressure)

2019-03-01 Thread LINZ, Arnaud
Hi,

I think I should go into more details to explain my use case.
I have one non-parallel source (parallelism = 1) that lists binary files in an 
HDFS directory. The DataSet emitted by the source is a data set of file names, 
not file content. These filenames are rebalanced and sent to workers 
(parallelism = 15) that use a flatmapper that opens the file, reads it, decodes 
it, and sends records (forward mode) to the sinks (with a few 1-to-1 mappings 
in-between). So the flatmap operation is a time-consuming one, as the files are 
more than 200 MB each; the flatmapper will emit millions of records to the sink 
for one source record (filename).

The rebalancing, occurring at the file name level, does not use much I/O, and I 
cannot use one-to-one mode at that point if I want some parallelism since I 
have only one source.

I did not put file decoding directly in the sources because I have no good way 
to distribute files to sources without a controller (the input directory is 
unique, filenames are random and cannot be “attributed” to one particular 
source instance easily).
Alternatively, I could have used a dispatcher daemon, separate from the 
streaming app, that distributes files to various directories, each directory 
being associated with one Flink source instance, and put the file reading & 
decoding directly in the source, but that seemed more complex to code and 
operate than the filename source. Would it have been better from the 
checkpointing perspective?

About the ungraceful source sleep(): is there a way, programmatically, to know 
the “load” of the app, or to determine if checkpointing takes too much time, so 
that I only do it when needed?

Thanks,
Arnaud

From: zhijiang 
Sent: Friday, March 1, 2019 04:59
To: user ; LINZ, Arnaud 
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)

Hi Arnaud,

Thanks for the further feedback!

For option 1: if 40 min is still not enough, that indicates it might take even 
more time to finish a checkpoint in your case. I have also seen scenarios where 
catching up data takes several hours to finish one checkpoint. If the current 
checkpoint expires because of the timeout, the next triggered checkpoint might 
still fail with a timeout. So it seems better to wait for the current 
checkpoint to finish rather than expire it, unless we cannot bear this long 
time for some reason, such as worrying that a failover would have to restore 
more data during this time.

For option 2: the default network settings should make sense. Lower values 
might cause a performance regression, and higher values would increase the 
in-flight buffers and the checkpoint delay more seriously.

For option 3: if resources are limited, it still does not work on your side.

Sleeping some time in the source, as you mentioned, is an option and might work 
in your case, although it does not seem a graceful way.

I think there is no data skew in your case to cause backpressure, because you 
used the rebalance mode as mentioned. Another option might be to use the 
forward mode, which would be better than the rebalance mode if possible in your 
case. Because the source and downstream task are one-to-one in forward mode, 
the total in-flight buffers are 2+2+8 for one single downstream task before the 
barrier. In rebalance mode, the total in-flight buffers would be (a*2+a*2+8) 
for one single downstream task (`a` is the parallelism of the source vertex), 
because it is an all-to-all connection. The barrier alignment takes more time 
in rebalance mode than in forward mode.

Best,
Zhijiang
--
From: LINZ, Arnaud <al...@bouyguestelecom.fr>
Send Time: Friday, March 1, 2019, 00:46
To: zhijiang <wangzhijiang...@aliyun.com>; user <user@flink.apache.org>
Subject: RE: Checkpoints and catch-up burst (heavy back pressure)

Update:
Option 1 does not work. It still fails at the end of the timeout, no matter 
its value.
Should I implement a “bandwidth” management system by using an artificial 
Thread.sleep in the source depending on the back pressure?

From: LINZ, Arnaud
Sent: Thursday, February 28, 2019 15:47
To: 'zhijiang' <wangzhijiang...@aliyun.com>; user <user@flink.apache.org>
Subject: RE: Checkpoints and catch-up burst (heavy back pressure)

Hi Zhijiang,

Thanks for your feedback.

  *   I’ll try option 1; the timeout is 4 min for now, I’ll switch it to 40 min 
and will let you know. Setting it higher than 40 min does not make much sense 
since after 40 min the pending output is already quite large.
  *   Option 3 won’t work; I already take too many resources, and as my 
source is more or less an HDFS directory listing, it will always be far faster 
than any mapper that reads the file and emits records based on its content, or 
any sink that stores the transformed data, unless I put “sleeps” in it (but is 
this really a good idea?)
  *   Option 2: taskmanager.network.memory.buffers-per-channel and 
tas

RE: Checkpoints and catch-up burst (heavy back pressure)

2019-02-28 Thread LINZ, Arnaud
Update:
Option 1 does not work. It still fails at the end of the timeout, no matter 
its value.
Should I implement a “bandwidth” management system by using an artificial 
Thread.sleep in the source depending on the back pressure?

From: LINZ, Arnaud
Sent: Thursday, February 28, 2019 15:47
To: 'zhijiang' ; user 
Subject: RE: Checkpoints and catch-up burst (heavy back pressure)

Hi Zhijiang,

Thanks for your feedback.

  *   I’ll try option 1; the timeout is 4 min for now, I’ll switch it to 40 min 
and will let you know. Setting it higher than 40 min does not make much sense 
since after 40 min the pending output is already quite large.
  *   Option 3 won’t work; I already take too many resources, and as my 
source is more or less an HDFS directory listing, it will always be far faster 
than any mapper that reads the file and emits records based on its content, or 
any sink that stores the transformed data, unless I put “sleeps” in it (but is 
this really a good idea?)
  *   Option 2: taskmanager.network.memory.buffers-per-channel and 
taskmanager.network.memory.buffers-per-gate are currently unset in my 
configuration (so at their defaults of 2 and 8), but for this streaming app I 
have very few exchanges between nodes (just a rebalance after the source that 
emits file names, everything else is local to the node). Should I adjust their 
values nonetheless? To higher or lower values?
Best,
Arnaud
From: zhijiang <wangzhijiang...@aliyun.com>
Sent: Thursday, February 28, 2019 10:58
To: user <user@flink.apache.org>; LINZ, Arnaud <al...@bouyguestelecom.fr>
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)

Hi Arnaud,

I think there are two key points. First, the checkpoint barrier might be 
emitted with a delay from the source under high backpressure because of the 
synchronizing lock.
Second, the barrier has to be queued behind in-flight data buffers, so the 
downstream task has to process all the buffers before the barrier to trigger 
the checkpoint, and this would take some time under back pressure.

There are three ways to work around it:
1. Increase the checkpoint timeout to avoid it expiring too soon.
2. Decrease the network buffer settings to reduce the amount of in-flight 
buffers before the barrier; you can check the config options 
"taskmanager.network.memory.buffers-per-channel" and 
"taskmanager.network.memory.buffers-per-gate".
3. Adjust the parallelism, e.g. increase it for the sink vertex in order to 
process the source data faster and avoid backpressure to some extent.

You could check which way is suitable for your scenario and give it a try.

Best,
Zhijiang
--
From: LINZ, Arnaud <al...@bouyguestelecom.fr>
Send Time: Thursday, February 28, 2019, 17:28
To: user <user@flink.apache.org>
Subject: Checkpoints and catch-up burst (heavy back pressure)

Hello,

I have a simple streaming app that gets data from a source and stores it to 
HDFS using a sink similar to the bucketing file sink. Checkpointing mode is 
“exactly once”.
Everything is fine on a “normal” course as the sink is faster than the source; 
but when we stop the application for a while and then restart it, we have a 
catch-up burst to get all the messages emitted in the meantime.
During this burst, the source is faster than the sink, and all checkpoints fail 
(time out) until the source has totally caught up. This is annoying because the 
sink does not “commit” the data before a successful checkpoint is made, and so 
the app releases all the “catch-up” data as one atomic block that can be huge 
if the streaming app was stopped for a while, adding unwanted stress to all the 
downstream Hive treatments that use the data provided in micro batches, and to 
the Hadoop cluster.

How should I handle the situation? Is there something special to do to get 
checkpoints even during heavy load?

The problem does not seem to be new, but I was unable to find any practical 
solution in the documentation.

Best regards,
Arnaud










RE: Checkpoints and catch-up burst (heavy back pressure)

2019-02-28 Thread LINZ, Arnaud
Hi Zhijiang,

Thanks for your feedback.

  *   I’ll try option 1; the timeout is 4 min for now, I’ll switch it to 40 min 
and will let you know. Setting it higher than 40 min does not make much sense 
since after 40 min the pending output is already quite large.
  *   Option 3 won’t work; I already take too many resources, and as my 
source is more or less an HDFS directory listing, it will always be far faster 
than any mapper that reads the file and emits records based on its content, or 
any sink that stores the transformed data, unless I put “sleeps” in it (but is 
this really a good idea?)
  *   Option 2: taskmanager.network.memory.buffers-per-channel and 
taskmanager.network.memory.buffers-per-gate are currently unset in my 
configuration (so at their defaults of 2 and 8), but for this streaming app I 
have very few exchanges between nodes (just a rebalance after the source that 
emits file names, everything else is local to the node). Should I adjust their 
values nonetheless? To higher or lower values?
Best,
Arnaud
From: zhijiang 
Sent: Thursday, February 28, 2019 10:58
To: user ; LINZ, Arnaud 
Subject: Re: Checkpoints and catch-up burst (heavy back pressure)

Hi Arnaud,

I think there are two key points. First, the checkpoint barrier might be 
emitted with a delay from the source under high backpressure because of the 
synchronizing lock.
Second, the barrier has to be queued behind in-flight data buffers, so the 
downstream task has to process all the buffers before the barrier to trigger 
the checkpoint, and this would take some time under back pressure.

There are three ways to work around it:
1. Increase the checkpoint timeout to avoid it expiring too soon.
2. Decrease the network buffer settings to reduce the amount of in-flight 
buffers before the barrier; you can check the config options 
"taskmanager.network.memory.buffers-per-channel" and 
"taskmanager.network.memory.buffers-per-gate".
3. Adjust the parallelism, e.g. increase it for the sink vertex in order to 
process the source data faster and avoid backpressure to some extent.

You could check which way is suitable for your scenario and give it a try.

Best,
Zhijiang
--
From: LINZ, Arnaud <al...@bouyguestelecom.fr>
Send Time: Thursday, February 28, 2019, 17:28
To: user <user@flink.apache.org>
Subject: Checkpoints and catch-up burst (heavy back pressure)

Hello,

I have a simple streaming app that gets data from a source and stores it to 
HDFS using a sink similar to the bucketing file sink. Checkpointing mode is 
“exactly once”.
Everything is fine on a “normal” course as the sink is faster than the source; 
but when we stop the application for a while and then restart it, we have a 
catch-up burst to get all the messages emitted in the meantime.
During this burst, the source is faster than the sink, and all checkpoints fail 
(time out) until the source has totally caught up. This is annoying because the 
sink does not “commit” the data before a successful checkpoint is made, and so 
the app releases all the “catch-up” data as one atomic block that can be huge 
if the streaming app was stopped for a while, adding unwanted stress to all the 
downstream Hive treatments that use the data provided in micro batches, and to 
the Hadoop cluster.

How should I handle the situation? Is there something special to do to get 
checkpoints even during heavy load?

The problem does not seem to be new, but I was unable to find any practical 
solution in the documentation.

Best regards,
Arnaud










Checkpoints and catch-up burst (heavy back pressure)

2019-02-28 Thread LINZ, Arnaud
Hello,

I have a simple streaming app that gets data from a source and stores it to 
HDFS using a sink similar to the bucketing file sink. Checkpointing mode is 
“exactly once”.
Everything is fine on a “normal” course as the sink is faster than the source; 
but when we stop the application for a while and then restart it, we have a 
catch-up burst to get all the messages emitted in the meantime.
During this burst, the source is faster than the sink, and all checkpoints fail 
(time out) until the source has totally caught up. This is annoying because the 
sink does not “commit” the data before a successful checkpoint is made, and so 
the app releases all the “catch-up” data as one atomic block that can be huge 
if the streaming app was stopped for a while, adding unwanted stress to all the 
downstream Hive treatments that use the data provided in micro batches, and to 
the Hadoop cluster.

How should I handle the situation? Is there something special to do to get 
checkpoints even during heavy load?

The problem does not seem to be new, but I was unable to find any practical 
solution in the documentation.

Best regards,
Arnaud









RE: Kerberos error when restoring from HDFS backend after 24 hours

2019-01-09 Thread LINZ, Arnaud
Hi,

I've managed to correct this by implementing my own FsStateBackend based on the 
original one with proper Kerberos relogin in createCheckpointStorage().
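
As a minimal sketch of that workaround (not the exact code we run: the 
checkpoint directory, principal and keytab are placeholders, and it assumes the 
Flink 1.7 signature CheckpointStorage createCheckpointStorage(JobID)):

import java.io.IOException;

import org.apache.flink.api.common.JobID;
import org.apache.flink.runtime.state.CheckpointStorage;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosAwareFsStateBackend extends FsStateBackend {

    private final String principal;
    private final String keytabPath;

    public KerberosAwareFsStateBackend(String checkpointDir,
                                       String principal,
                                       String keytabPath) {
        super(checkpointDir);
        this.principal = principal;
        this.keytabPath = keytabPath;
    }

    @Override
    public CheckpointStorage createCheckpointStorage(JobID jobId) throws IOException {
        // Refresh the Kerberos login before the storage touches HDFS (mkdirs),
        // so the TGT is still valid more than 24h after the job was started.
        UserGroupInformation.loginUserFromKeytab(principal, keytabPath);
        return super.createCheckpointStorage(jobId);
    }
}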

Regards,
Arnaud

-----Original message-----
From: LINZ, Arnaud
Sent: Friday, January 4, 2019 11:32
To: user 
Subject: Kerberos error when restoring from HDFS backend after 24 hours

Hello and happy new year to all flink users,

I have a streaming application (flink v1.7.0) on a Kerberized cluster, using a 
flink configuration file where the following parameters are set :

security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: X
security.kerberos.login.principal: X

As that is not sufficient, I also log in to Kerberos in the "open()" method of 
sources/sinks that use HDFS or HiveServer2/Impala servers, using 
UserGroupInformation.loginUserFromKeytab(). And as that is still not sufficient 
in some cases (namely the HiveServer2/Impala connection), I also attach a JAAS 
object to the TaskManager by setting the java.security.auth.login.config 
property dynamically. And as that is in some rare cases still not sufficient, I 
run kinit as an external process from the task manager to create a local ticket 
cache...

With all that stuff, everything works fine for several days as long as the 
streaming app does not experience any problem. However, when a problem occurs 
and the job restores from the checkpoint (HDFS backend), I get the following 
exception if this happens more than 24h after the initial application launch 
(24h is the Kerberos ticket validity time):

java.io.IOException: Failed on local exception: java.io.IOException: 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]; Host Details : local host is: "x"; destination host is: "x;
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
org.apache.hadoop.ipc.Client.call(Client.java:1474)
org.apache.hadoop.ipc.Client.call(Client.java:1401)
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
com.sun.proxy.$Proxy9.mkdirs(Unknown Source)
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
com.sun.proxy.$Proxy10.mkdirs(Unknown Source)
org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2742)
org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2713)
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1819)
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:170)
org.apache.flink.core.fs.SafetyNetWrapperFileSystem.mkdirs(SafetyNetWrapperFileSystem.java:112)
org.apache.flink.runtime.state.filesystem.FsCheckpointStorage.(FsCheckpointStorage.java:83)
org.apache.flink.runtime.state.filesystem.FsCheckpointStorage.(FsCheckpointStorage.java:58)
org.apache.flink.runtime.state.filesystem.FsStateBackend.createCheckpointStorage(FsStateBackend.java:444)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:257)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
java.lang.Thread.run(Thread.java:748)
Caused by : java.io.IOException: javax.security.sasl.SaslException: GSS 
initiate failed [Caused by GSSException: No valid credentials provided 
(Mechanism level: Failed to find any Kerberos tgt)]
org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
(...)

Any idea about how I can circumvent this? For instance, can I "hook" the 
restoring process before the mkdir to relog to Kerberos by hand?

Best regards,
Arnaud





Kerberos error when restoring from HDFS backend after 24 hours

2019-01-04 Thread LINZ, Arnaud
Hello and happy new year to all flink users,

I have a streaming application (flink v1.7.0) on a Kerberized cluster, using a 
flink configuration file where the following parameters are set :

security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: X
security.kerberos.login.principal: X

As that is not sufficient, I also log in to Kerberos in the "open()" method of 
sources/sinks that use HDFS or HiveServer2/Impala servers, using 
UserGroupInformation.loginUserFromKeytab(). And as that is still not sufficient 
in some cases (namely the HiveServer2/Impala connection), I also attach a JAAS 
object to the TaskManager by setting the java.security.auth.login.config 
property dynamically. And as that is in some rare cases still not sufficient, I 
run kinit as an external process from the task manager to create a local ticket 
cache...

With all that stuff, everything works fine for several days as long as the 
streaming app does not experience any problem. However, when a problem occurs 
and the job restores from the checkpoint (HDFS backend), I get the following 
exception if this happens more than 24h after the initial application launch 
(24h is the Kerberos ticket validity time):

java.io.IOException: Failed on local exception: java.io.IOException: 
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]; Host Details : local host is: "x"; destination host is: "x;
org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
org.apache.hadoop.ipc.Client.call(Client.java:1474)
org.apache.hadoop.ipc.Client.call(Client.java:1401)
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
com.sun.proxy.$Proxy9.mkdirs(Unknown Source)
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:539)
sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
com.sun.proxy.$Proxy10.mkdirs(Unknown Source)
org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2742)
org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2713)
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:870)
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:866)
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:866)
org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:859)
org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1819)
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.mkdirs(HadoopFileSystem.java:170)
org.apache.flink.core.fs.SafetyNetWrapperFileSystem.mkdirs(SafetyNetWrapperFileSystem.java:112)
org.apache.flink.runtime.state.filesystem.FsCheckpointStorage.(FsCheckpointStorage.java:83)
org.apache.flink.runtime.state.filesystem.FsCheckpointStorage.(FsCheckpointStorage.java:58)
org.apache.flink.runtime.state.filesystem.FsStateBackend.createCheckpointStorage(FsStateBackend.java:444)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:257)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
java.lang.Thread.run(Thread.java:748)
Caused by : java.io.IOException: javax.security.sasl.SaslException: GSS 
initiate failed [Caused by GSSException: No valid credentials provided 
(Mechanism level: Failed to find any Kerberos tgt)]
org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:682)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
(...)

Any idea about how I can circumvent this? For instance, can I "hook" the 
restoring process before the mkdir to relog to Kerberos by hand?

Best regards,
Arnaud






RE: Apache Flink 1.7.0 jar complete ?

2018-11-30 Thread LINZ, Arnaud
Hi,
I was probably too much of an early bird; removing the cache did solve that 
problem.

However, this test program still does not end (cf. 
https://issues.apache.org/jira/browse/FLINK-10832 ), so I still can’t use that 
version…
I wonder why I’m the only one with this problem? I’ve tested it on two 
different computers…


From: Till Rohrmann 
Sent: Friday, November 30, 2018 15:04
To: LINZ, Arnaud 
Cc: user 
Subject: Re: Apache Flink 1.7.0 jar complete ?

Hi Arnaud,

I tried to set up the same testing project as you've described and it worked 
for me. Could you maybe try to clear your Maven repository? Maybe not all 
dependencies had been properly mirrored to Maven Central.

Cheers,
Till

On Fri, Nov 30, 2018 at 2:31 PM Till Rohrmann <trohrm...@apache.org> wrote:
Thanks for reporting this problem Arnaud. I will investigate this problem.

Cheers,
Till

On Fri, Nov 30, 2018 at 12:20 PM LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hi,

When trying to update to 1.7.0, a simple local cluster test fails with :

12:03:55.182 [main] DEBUG o.a.f.s.a.graph.StreamGraphGenerator - Transforming 
SinkTransformation{id=2, name='Print to Std. Out', 
outputType=GenericType, parallelism=1}
12:03:55.182 [main] DEBUG o.a.f.s.a.graph.StreamGraphGenerator - Transforming 
SourceTransformation{id=1, name='Custom Source', outputType=String, 
parallelism=1}
12:03:55.182 [main] DEBUG o.a.f.s.api.graph.StreamGraph - Vertex: 1
12:03:55.182 [main] DEBUG o.a.f.s.api.graph.StreamGraph - Vertex: 2
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/flink/shaded/guava18/com/google/common/hash/Hashing
at 
org.apache.flink.streaming.api.graph.StreamGraphHasherV2.traverseStreamGraphAndGenerateHashes(StreamGraphHasherV2.java:80)
at 
org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:145)
at 
org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:93)
at 
org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:669)
at 
org.apache.flink.optimizer.plan.StreamingPlan.getJobGraph(StreamingPlan.java:40)
at 
org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:92)
at flink.flink_10832.App.testFlink10832(App.java:60)
at flink.flink_10832.App.main(App.java:31)
Caused by: java.lang.ClassNotFoundException: 
org.apache.flink.shaded.guava18.com.google.common.hash.Hashing
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 8 more

Pom. is :
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>flink</groupId>
    <artifactId>flink-10832</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>flink-10832</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.7.0</flink.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.0.13</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.5</version>
        </dependency>
    </dependencies>
</project>

Code is :
public static void testFlink10832() throws Exception {
    // get the execution environment
    final StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
    // get input data from a trivial in-memory source
    @SuppressWarnings("serial")
    final DataStreamSource<String> text = env.addSource(new SourceFunction<String>() {
        @Override
        public void run(final SourceContext<String> ctx) throws Exception {
            for (int count = 0; count < 5; count++) {
                ctx.collect(String.valueOf(count));
            }
        }

        @Override
        public void cancel() {
        }
    });
    text.print().setParallelism(1);
    env.execute("Simple Test");
    System.out.println("If you see this the issue is resolved!");
}

Any idea why ?

Regards,
Arnaud

From: Till Rohrmann <trohrm...@apache.org>
Sent: Friday, November 30, 2018 10:40
To: d...@flink.apache.org; user <user@flink.apache.org>; annou...@apache.org
Subject: [ANNOUNCE] Apache Flink 1.7.0 released

The Apache Flink community is very happy

Apache Flink 1.7.0 jar complete ?

2018-11-30 Thread LINZ, Arnaud
Hi,

When trying to update to 1.7.0, a simple local cluster test fails with :

12:03:55.182 [main] DEBUG o.a.f.s.a.graph.StreamGraphGenerator - Transforming 
SinkTransformation{id=2, name='Print to Std. Out', 
outputType=GenericType, parallelism=1}
12:03:55.182 [main] DEBUG o.a.f.s.a.graph.StreamGraphGenerator - Transforming 
SourceTransformation{id=1, name='Custom Source', outputType=String, 
parallelism=1}
12:03:55.182 [main] DEBUG o.a.f.s.api.graph.StreamGraph - Vertex: 1
12:03:55.182 [main] DEBUG o.a.f.s.api.graph.StreamGraph - Vertex: 2
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/flink/shaded/guava18/com/google/common/hash/Hashing
at 
org.apache.flink.streaming.api.graph.StreamGraphHasherV2.traverseStreamGraphAndGenerateHashes(StreamGraphHasherV2.java:80)
at 
org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:145)
at 
org.apache.flink.streaming.api.graph.StreamingJobGraphGenerator.createJobGraph(StreamingJobGraphGenerator.java:93)
at 
org.apache.flink.streaming.api.graph.StreamGraph.getJobGraph(StreamGraph.java:669)
at 
org.apache.flink.optimizer.plan.StreamingPlan.getJobGraph(StreamingPlan.java:40)
at 
org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:92)
at flink.flink_10832.App.testFlink10832(App.java:60)
at flink.flink_10832.App.main(App.java:31)
Caused by: java.lang.ClassNotFoundException: 
org.apache.flink.shaded.guava18.com.google.common.hash.Hashing
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 8 more

Pom. is :
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>flink</groupId>
    <artifactId>flink-10832</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>flink-10832</name>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.7.0</flink.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
            <version>1.0.13</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.5</version>
        </dependency>
    </dependencies>
</project>

Code is :
public static void testFlink10832() throws Exception {
    // get the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // get input data by connecting to the socket
    @SuppressWarnings("serial")
    final DataStreamSource<String> text = env.addSource(new SourceFunction<String>() {
        @Override
        public void run(final SourceContext<String> ctx) throws Exception {
            for (int count = 0; count < 5; count++) {
                ctx.collect(String.valueOf(count));
            }
        }

        @Override
        public void cancel() {
        }
    });
    text.print().setParallelism(1);
    env.execute("Simple Test");
    System.out.println("If you see this the issue is resolved!");
}

Any idea why ?
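One possible stopgap, sketched here with the caveat that the right flink-shaded-guava version for Flink 1.7.0 is an assumption and should be checked against the Flink dependency tree, is to put the shaded Guava classes on the classpath explicitly:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-shaded-guava</artifactId>
    <!-- assumption: use the 18.0-x release that matches the Flink 1.7.0 dependency tree -->
    <version>18.0-5.0</version>
</dependency>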

Regards,
Arnaud

De : Till Rohrmann 
Envoyé : vendredi 30 novembre 2018 10:40
À : d...@flink.apache.org; user ; annou...@apache.org
Objet : [ANNOUNCE] Apache Flink 1.7.0 released

The Apache Flink community is very happy to announce the release of Apache 
Flink 1.7.0, which is the next major release.

Apache Flink® is an open-source stream processing framework for distributed, 
high-performing, always-available, and accurate data streaming applications.

The release is available for download at:
https://flink.apache.org/downloads.html

Please check out the release blog post for an overview of the new features and 
improvements for this release:
https://flink.apache.org/news/2018/11/30/release-1.7.0.html

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12343585

We would like to thank all contributors of the Apache Flink community who made 
this release possible!

Cheers,
Till




RE: Stopping a streaming app from its own code : behaviour change from 1.3 to 1.6

2018-11-08 Thread LINZ, Arnaud
1. FLINK-10832<https://issues.apache.org/jira/browse/FLINK-10832> created (with heavy difficulties, as typing Java code in a Jira description was an awful experience ☺)


De : LINZ, Arnaud
Envoyé : mercredi 7 novembre 2018 11:43
À : 'user' 
Objet : RE: Stopping a streaming app from its own code : behaviour change from 
1.3 to 1.6

FYI, the code below ends with version 1.6.0 but does not end with 1.6.1. I suspect it's a bug rather than a new feature.

De : LINZ, Arnaud
Envoyé : mercredi 7 novembre 2018 11:14
À : 'user' mailto:user@flink.apache.org>>
Objet : RE: Stopping a streaming app from its own code : behaviour change from 
1.3 to 1.6


Hello,



This has nothing to do with HA. All my unit tests involving a streaming app now fail in "infinite execution".
This simple code never ends:
@Test
public void testFlink162() throws Exception {
    // get the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // get input data
    final DataStreamSource<String> text = env.addSource(new SourceFunction<String>() {
        @Override
        public void run(final SourceContext<String> ctx) throws Exception {
            for (int count = 0; count < 5; count++) {
                ctx.collect(String.valueOf(count));
            }
        }
        @Override
        public void cancel() {
        }
    });
    text.print().setParallelism(1);
    env.execute("Simple Test");
    // Never ends!
}
Is this really a new feature or a critical bug?
In the log, the task executor is stopped:
[2018-11-07 11:11:23,608] INFO Stopped TaskExecutor 
akka://flink/user/taskmanager_0. 
(org.apache.flink.runtime.taskexecutor.TaskExecutor:330)
But execute() does not return.

Arnaud

Log is :
[2018-11-07 11:11:11,432] INFO Running job on local embedded Flink mini cluster 
(org.apache.flink.streaming.api.environment.LocalStreamEnvironment:114)
[2018-11-07 11:11:11,449] INFO Starting Flink Mini Cluster 
(org.apache.flink.runtime.minicluster.MiniCluster:227)
[2018-11-07 11:11:11,636] INFO Starting Metrics Registry 
(org.apache.flink.runtime.minicluster.MiniCluster:238)
[2018-11-07 11:11:11,652] INFO No metrics reporter configured, no metrics will 
be exposed/reported. (org.apache.flink.runtime.metrics.MetricRegistryImpl:113)
[2018-11-07 11:11:11,703] INFO Starting RPC Service(s) 
(org.apache.flink.runtime.minicluster.MiniCluster:249)
[2018-11-07 11:11:12,244] INFO Slf4jLogger started 
(akka.event.slf4j.Slf4jLogger:92)
[2018-11-07 11:11:12,264] INFO Starting high-availability services 
(org.apache.flink.runtime.minicluster.MiniCluster:290)
[2018-11-07 11:11:12,367] INFO Created BLOB server storage directory 
C:\Users\alinz\AppData\Local\Temp\blobStore-fd104a2d-caaf-4740-a762-d292cb2ed108
 (org.apache.flink.runtime.blob.BlobServer:141)
[2018-11-07 11:11:12,379] INFO Started BLOB server at 0.0.0.0:64504 - max 
concurrent requests: 50 - max backlog: 1000 
(org.apache.flink.runtime.blob.BlobServer:203)
[2018-11-07 11:11:12,380] INFO Starting ResourceManger 
(org.apache.flink.runtime.minicluster.MiniCluster:301)
[2018-11-07 11:11:12,409] INFO Starting RPC endpoint for 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at 
akka://flink/user/resourcemanager_f70e9071-b9d0-489a-9dcc-888f75b46864 . 
(org.apache.flink.runtime.rpc.akka.AkkaRpcService:224)
[2018-11-07 11:11:12,432] INFO Proposing leadership to contender 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager@5b1f29fa<mailto:org.apache.flink.runtime.resourcemanager.StandaloneResourceManager@5b1f29fa>
 @ akka://flink/user/resourcemanager_f70e9071-b9d0-489a-9dcc-888f75b46864 
(org.apache.flink.runtime.highavailability.nonha.embedded.EmbeddedLeaderService:274)
[2018-11-07 11:11:12,439] INFO ResourceManager 
akka://flink/user/resourcemanager_f70e9071-b9d0-489a-9dcc-888f75b46864 was 
granted leadership with fencing token 86394924fb97bad612b67f526f84406f 
(org.apache.flink.runtime.resourcemanager.StandaloneResourceManager:953)
[2018-11-07 11:11:12,440] INFO Starting the SlotManager. 
(org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager:185)
[2018-11-07 11:11:12,442] INFO Received confirmation of leadership for leader 
akka://flink/user/resourcemanager_f70e9071-b9d0-489a-9dcc-888f75b46864 , 
session=12b67f52-6f84-406f-8639-4924fb97bad6 
(org.apache.flink.runtime.highavailability.nonha.embedded.EmbeddedLeaderService:229)
[2018-11-07 11:11:12,452] INFO Created BLOB cache storage directory 
C:\Users\alinz\AppData\Local\Temp\blobStore-b2618f73-5ec6-4fdf-ad43-1da6d6c19a4f
 (org.apache.flink.runtime.blob.PermanentBlobCache:107)
[2018-11-07 11:11:12,454] INFO Created BLOB cache storage directory 
C:\Users\alinz\AppData\Local\Temp\blobStore-df6c61d2-3c51-4335-a96e-6b00c82e4d90
 (org.apache.flink.runtime.blob.TransientBlobCache:107)
[2018-11-07 11:11:12,454] INFO Starting 1 TaskManger(s) 
(o

RE: Stopping a streaming app from its own code : behaviour change from 1.3 to 1.6

2018-11-07 Thread LINZ, Arnaud
FYI, the code below ends with version 1.6.0 but does not end with 1.6.1. I suspect it's a bug rather than a new feature.

De : LINZ, Arnaud
Envoyé : mercredi 7 novembre 2018 11:14
À : 'user' 
Objet : RE: Stopping a streaming app from its own code : behaviour change from 
1.3 to 1.6


Hello,



This has nothing to do with HA. All my unit tests involving a streaming app now fail in "infinite execution".
This simple code never ends:
@Test
public void testFlink162() throws Exception {
    // get the execution environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // get input data
    final DataStreamSource<String> text = env.addSource(new SourceFunction<String>() {
        @Override
        public void run(final SourceContext<String> ctx) throws Exception {
            for (int count = 0; count < 5; count++) {
                ctx.collect(String.valueOf(count));
            }
        }
        @Override
        public void cancel() {
        }
    });
    text.print().setParallelism(1);
    env.execute("Simple Test");
    // Never ends!
}
Is this really a new feature or a critical bug?
In the log, the task executor is stopped:
[2018-11-07 11:11:23,608] INFO Stopped TaskExecutor 
akka://flink/user/taskmanager_0. 
(org.apache.flink.runtime.taskexecutor.TaskExecutor:330)
But execute() does not return.

Arnaud

Log is :
[2018-11-07 11:11:11,432] INFO Running job on local embedded Flink mini cluster 
(org.apache.flink.streaming.api.environment.LocalStreamEnvironment:114)
[2018-11-07 11:11:11,449] INFO Starting Flink Mini Cluster 
(org.apache.flink.runtime.minicluster.MiniCluster:227)
[2018-11-07 11:11:11,636] INFO Starting Metrics Registry 
(org.apache.flink.runtime.minicluster.MiniCluster:238)
[2018-11-07 11:11:11,652] INFO No metrics reporter configured, no metrics will 
be exposed/reported. (org.apache.flink.runtime.metrics.MetricRegistryImpl:113)
[2018-11-07 11:11:11,703] INFO Starting RPC Service(s) 
(org.apache.flink.runtime.minicluster.MiniCluster:249)
[2018-11-07 11:11:12,244] INFO Slf4jLogger started 
(akka.event.slf4j.Slf4jLogger:92)
[2018-11-07 11:11:12,264] INFO Starting high-availability services 
(org.apache.flink.runtime.minicluster.MiniCluster:290)
[2018-11-07 11:11:12,367] INFO Created BLOB server storage directory 
C:\Users\alinz\AppData\Local\Temp\blobStore-fd104a2d-caaf-4740-a762-d292cb2ed108
 (org.apache.flink.runtime.blob.BlobServer:141)
[2018-11-07 11:11:12,379] INFO Started BLOB server at 0.0.0.0:64504 - max 
concurrent requests: 50 - max backlog: 1000 
(org.apache.flink.runtime.blob.BlobServer:203)
[2018-11-07 11:11:12,380] INFO Starting ResourceManger 
(org.apache.flink.runtime.minicluster.MiniCluster:301)
[2018-11-07 11:11:12,409] INFO Starting RPC endpoint for 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at 
akka://flink/user/resourcemanager_f70e9071-b9d0-489a-9dcc-888f75b46864 . 
(org.apache.flink.runtime.rpc.akka.AkkaRpcService:224)
[2018-11-07 11:11:12,432] INFO Proposing leadership to contender 
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager@5b1f29fa<mailto:org.apache.flink.runtime.resourcemanager.StandaloneResourceManager@5b1f29fa>
 @ akka://flink/user/resourcemanager_f70e9071-b9d0-489a-9dcc-888f75b46864 
(org.apache.flink.runtime.highavailability.nonha.embedded.EmbeddedLeaderService:274)
[2018-11-07 11:11:12,439] INFO ResourceManager 
akka://flink/user/resourcemanager_f70e9071-b9d0-489a-9dcc-888f75b46864 was 
granted leadership with fencing token 86394924fb97bad612b67f526f84406f 
(org.apache.flink.runtime.resourcemanager.StandaloneResourceManager:953)
[2018-11-07 11:11:12,440] INFO Starting the SlotManager. 
(org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager:185)
[2018-11-07 11:11:12,442] INFO Received confirmation of leadership for leader 
akka://flink/user/resourcemanager_f70e9071-b9d0-489a-9dcc-888f75b46864 , 
session=12b67f52-6f84-406f-8639-4924fb97bad6 
(org.apache.flink.runtime.highavailability.nonha.embedded.EmbeddedLeaderService:229)
[2018-11-07 11:11:12,452] INFO Created BLOB cache storage directory 
C:\Users\alinz\AppData\Local\Temp\blobStore-b2618f73-5ec6-4fdf-ad43-1da6d6c19a4f
 (org.apache.flink.runtime.blob.PermanentBlobCache:107)
[2018-11-07 11:11:12,454] INFO Created BLOB cache storage directory 
C:\Users\alinz\AppData\Local\Temp\blobStore-df6c61d2-3c51-4335-a96e-6b00c82e4d90
 (org.apache.flink.runtime.blob.TransientBlobCache:107)
[2018-11-07 11:11:12,454] INFO Starting 1 TaskManger(s) 
(org.apache.flink.runtime.minicluster.MiniCluster:316)
[2018-11-07 11:11:12,460] INFO Starting TaskManager with ResourceID: 
e84ce076-ec5e-48d6-90dc-4b18ba7c5757 
(org.apache.flink.runtime.taskexecutor.TaskManagerRunner:352)
[2018-11-07 11:11:12,531] INFO Temporary file directory 
'C:\Users\alinz\AppData\Local\Temp': total 476 GB, usable 149 GB (31,30% 
usable) (org.apache.flink.r

RE: Stopping a streaming app from its own code : behaviour change from 1.3 to 1.6

2018-11-07 Thread LINZ, Arnaud
,840] INFO No state backend has been configured, using 
default (Memory / JobManager) MemoryStateBackend (data in heap memory / 
checkpoints to JobManager) (checkpoints: 'null', savepoints: 'null', 
asynchronous: TRUE, maxStateSize: 5242880) 
(org.apache.flink.streaming.runtime.tasks.StreamTask:230)
0
1
2
3
4
[2018-11-07 11:11:13,897] INFO Source: Custom Source -> Sink: Print to Std. Out 
(1/1) (07ae66bef91de06205cf22a337ea1fe2) switched from RUNNING to FINISHED. 
(org.apache.flink.runtime.taskmanager.Task:915)
[2018-11-07 11:11:13,897] INFO Freeing task resources for Source: Custom Source 
-> Sink: Print to Std. Out (1/1) (07ae66bef91de06205cf22a337ea1fe2). 
(org.apache.flink.runtime.taskmanager.Task:818)
[2018-11-07 11:11:13,898] INFO Ensuring all FileSystem streams are closed for 
task Source: Custom Source -> Sink: Print to Std. Out (1/1) 
(07ae66bef91de06205cf22a337ea1fe2) [FINISHED] 
(org.apache.flink.runtime.taskmanager.Task:845)
[2018-11-07 11:11:13,899] INFO Un-registering task and sending final execution 
state FINISHED to JobManager for task Source: Custom Source -> Sink: Print to 
Std. Out 07ae66bef91de06205cf22a337ea1fe2. 
(org.apache.flink.runtime.taskexecutor.TaskExecutor:1337)
[2018-11-07 11:11:13,904] INFO Source: Custom Source -> Sink: Print to Std. Out 
(1/1) (07ae66bef91de06205cf22a337ea1fe2) switched from RUNNING to FINISHED. 
(org.apache.flink.runtime.executiongraph.ExecutionGraph:1316)
[2018-11-07 11:11:13,907] INFO Job Simple Test 
(0ef8697ca98f6d2b565ed928d17c8a49) switched from state RUNNING to FINISHED. 
(org.apache.flink.runtime.executiongraph.ExecutionGraph:1356)
[2018-11-07 11:11:13,908] INFO Stopping checkpoint coordinator for job 
0ef8697ca98f6d2b565ed928d17c8a49. 
(org.apache.flink.runtime.checkpoint.CheckpointCoordinator:320)
[2018-11-07 11:11:13,908] INFO Shutting down 
(org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore:102)
[2018-11-07 11:11:23,579] INFO Shutting down Flink Mini Cluster 
(org.apache.flink.runtime.minicluster.MiniCluster:427)
[2018-11-07 11:11:23,580] INFO Shutting down rest endpoint. 
(org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint:265)
[2018-11-07 11:11:23,582] INFO Stopping TaskExecutor 
akka://flink/user/taskmanager_0. 
(org.apache.flink.runtime.taskexecutor.TaskExecutor:291)
[2018-11-07 11:11:23,583] INFO Shutting down 
TaskExecutorLocalStateStoresManager. 
(org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager:213)
[2018-11-07 11:11:23,591] INFO I/O manager removed spill file directory 
C:\Users\alinz\AppData\Local\Temp\flink-io-6a19871b-3b86-4a47-9b82-28eef7e55814 
(org.apache.flink.runtime.io.disk.iomanager.IOManager:110)
[2018-11-07 11:11:23,591] INFO Shutting down the network environment and its 
components. (org.apache.flink.runtime.io.network.NetworkEnvironment:344)
[2018-11-07 11:11:23,591] INFO Removing cache directory 
C:\Users\alinz\AppData\Local\Temp\flink-web-ui 
(org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint:733)
[2018-11-07 11:11:23,593] INFO Closing the SlotManager. 
(org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager:249)
[2018-11-07 11:11:23,593] INFO Suspending the SlotManager. 
(org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager:212)
[2018-11-07 11:11:23,596] INFO Close ResourceManager connection 
cd021102669258aad77c20645ed08aae: ResourceManager leader changed to new address 
null. (org.apache.flink.runtime.jobmaster.JobMaster:1355)
[2018-11-07 11:11:23,607] INFO Stop job leader service. 
(org.apache.flink.runtime.taskexecutor.JobLeaderService:135)
[2018-11-07 11:11:23,608] INFO Stopped TaskExecutor 
akka://flink/user/taskmanager_0. 
(org.apache.flink.runtime.taskexecutor.TaskExecutor:330)



De : LINZ, Arnaud
Envoyé : mardi 6 novembre 2018 14:23
À : user 
Objet : Stopping a streaming app from its own code : behaviour change from 1.3 
to 1.6


Hello,



In Flink 1.3, I was able to make a clean stop of an HA streaming application just by ending the source "run()" method (with an ending condition).

I tried to update my code to Flink 1.6.2, but that no longer works.



Even when there are no sources and no items to process, the cluster continues executing forever, with an infinite number of messages such as:

Checkpoint triggering task Source: Custom Source (1/2) of job 
3b286f5344c50f0e466bb8ee79a2bb69 is not in state RUNNING but SCHEDULED instead. 
Aborting checkpoint.



Why has this behavior changed? How am I supposed to stop a streaming execution 
from its own code now? Is https://issues.apache.org/jira/browse/FLINK-2111 of 
any use?



Thanks,

Arnaud






Stopping a streaming app from its own code : behaviour change from 1.3 to 1.6

2018-11-06 Thread LINZ, Arnaud
Hello,



In Flink 1.3, I was able to make a clean stop of an HA streaming application just by ending the source "run()" method (with an ending condition).

I tried to update my code to Flink 1.6.2, but that no longer works.



Even when there are no sources and no items to process, the cluster continues executing forever, with an infinite number of messages such as:

Checkpoint triggering task Source: Custom Source (1/2) of job 
3b286f5344c50f0e466bb8ee79a2bb69 is not in state RUNNING but SCHEDULED instead. 
Aborting checkpoint.



Why has this behavior changed? How am I supposed to stop a streaming execution 
from its own code now? Is https://issues.apache.org/jira/browse/FLINK-2111 of 
any use?
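For reference, the cooperative-cancellation shape that a source is expected to have looks roughly like this; a minimal sketch with an illustrative ending condition, not the actual application code:

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class BoundedStringSource implements SourceFunction<String> {
    private static final long serialVersionUID = 1L;

    // set to false by cancel(); run() checks it so the task can stop cleanly
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        int count = 0;
        while (running && count < 5) { // ending condition: stop after five elements
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(String.valueOf(count));
            }
            count++;
        }
        // returning from run() is what signals that the source has finished
    }

    @Override
    public void cancel() {
        running = false;
    }
}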



Thanks,

Arnaud







RE: How to handle multiple yarn sessions and choose at runtime the one to submit a ha streaming job ?

2018-02-07 Thread LINZ, Arnaud
Hi,

For lack of another solution, I made a shell script that copies the original content of FLINK_CONF_DIR into a temporary directory, modifies flink-conf.yaml to set yarn.properties-file.location, and points FLINK_CONF_DIR to that temporary directory before executing flink.
I am now able to select the container I want, but I think this should be made simpler...
I'll open a Jira.
I'll open a Jira.

Best regards,
Arnaud


De : LINZ, Arnaud
Envoyé : jeudi 1 février 2018 16:23
À : user@flink.apache.org
Objet : How to handle multiple yarn sessions and choose at runtime the one to 
submit a ha streaming job ?

Hello,

I am using Flink 1.3.2 and I'm struggling to achieve something that should be 
simple.
For isolation reasons, I want to start multiple long-living YARN session containers (with the same user) and choose at run time, when I start an HA streaming app, which container will hold it.

I start my yarn session with the command line option : 
-Dyarn.properties-file.location=mydir
The session is created and a .yarn-properties-$USER file is generated.

And I've tried the following to submit my job:

CASE 1
flink-conf.yaml : yarn.properties-file.location: mydir
flink run options : none

  *   Uses zookeeper and works  - but I cannot choose the container as the 
property file is global.

CASE 2
flink-conf.yaml : nothing
flink run options : -yid applicationId

  *   Does not use ZooKeeper; tries to connect to the YARN job manager but fails with a "Job submission to the JobManager timed out" error

CASE 3
flink-conf.yaml : nothing
flink run options : -yid applicationId and -yD with all dynamic properties 
found in the "dynamicPropertiesString" of .yarn-properties-$USER file

  *   Same as case 2

CASE 4
flink-conf.yaml : nothing
flink run options : -yD yarn.properties-file.location=mydir

  *   Tries to connect to local (non yarn) job manager (and fails)

CASE 5
Even weirder:
flink-conf.yaml : yarn.properties-file.location: mydir
flink run options : -yD yarn.properties-file.location=mydir

  *   Still tries to connect to local (non yarn) job manager!

What am I doing wrong?

Logs extracts :
CASE 1:
2018:02:01 15:43:20 - Waiting until all TaskManagers have connected
2018:02:01 15:43:20 - Starting client actor system.
2018:02:01 15:43:20 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:20 - Trying to select the network interface and address to use 
by connecting to the leading JobManager.
2018:02:01 15:43:20 - TaskManager will try to connect for 1 milliseconds 
before falling back to heuristics
2018:02:01 15:43:21 - Retrieved new target address 
elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr/10.136.170.193:33970.
2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Slf4jLogger started
2018:02:01 15:43:21 - Starting remoting
2018:02:01 15:43:21 - Remoting started; listening on addresses 
:[akka.tcp://fl...@elara-edge-u2-n01.dev.mlb.jupiter.nbyt.fr:36340]
2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - TaskManager status (2/1)
2018:02:01 15:43:21 - All TaskManagers are connected
2018:02:01 15:43:21 - Submitting job with JobID: 
f69197b0b80a76319a87bde10c1e3f77. Waiting for job completion.
2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Received SubmitJobAndWait(JobGraph(jobId: 
f69197b0b80a76319a87bde10c1e3f77)) but there is no connection to a JobManager 
yet.
2018:02:01 15:43:21 - Received job SND-IMP-SIGNAST 
(f69197b0b80a76319a87bde10c1e3f77).
2018:02:01 15:43:21 - Disconnect from JobManager null.
2018:02:01 15:43:21 - Connect to JobManager 
Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245].
2018:02:01 15:43:21 - Connected to JobManager at 
Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245]
 with leader session id 388af5b8--4923-8ee4-8a4b9bfbb0b9.
2018:02:01 15:43:21 - Sending message to JobManager 
akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager
 to submit job SND-IMP-SIGNAST (f69197b0b80a76319a87bde10c1e3f77) and wait for 
progress
2018:02:01 15:43:21 - Upload jar files to job manager 
akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.
2018:02:01 15:43:21 - Blob client connecting to 
akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager
2018:02:01 15:43:22 - Submit job to the job manager 
akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.
2018:02:01 15:43:22 - Job f69197b0b80a76319a87bde10c1e3f77 was successfully 
submitted to the JobManager akka://flink/deadLetters.
2018:02:01 15:43:22 - 02/01/2018 15:43:22   Job execution switched to status 
RUNNING.

CASE 2:
2018:02:01 15:48:43 - Waiting until all TaskManagers have connected
2018:02:01 15:48:43 - Starting client actor system.
2018:02:01 15:48:43 - Trying to select t

How to handle multiple yarn sessions and choose at runtime the one to submit a ha streaming job ?

2018-02-01 Thread LINZ, Arnaud
Hello,

I am using Flink 1.3.2 and I'm struggling to achieve something that should be 
simple.
For isolation reasons, I want to start multiple long-living YARN session containers (with the same user) and choose at run time, when I start an HA streaming app, which container will hold it.

I start my yarn session with the command line option : 
-Dyarn.properties-file.location=mydir
The session is created and a .yarn-properties-$USER file is generated.

And I've tried the following to submit my job:

CASE 1
flink-conf.yaml : yarn.properties-file.location: mydir
flink run options : none

  *   Uses zookeeper and works  - but I cannot choose the container as the 
property file is global.

CASE 2
flink-conf.yaml : nothing
flink run options : -yid applicationId

  *   Does not use ZooKeeper; tries to connect to the YARN job manager but fails with a "Job submission to the JobManager timed out" error

CASE 3
flink-conf.yaml : nothing
flink run options : -yid applicationId and -yD with all dynamic properties 
found in the "dynamicPropertiesString" of .yarn-properties-$USER file

  *   Same as case 2

CASE 4
flink-conf.yaml : nothing
flink run options : -yD yarn.properties-file.location=mydir

  *   Tries to connect to local (non yarn) job manager (and fails)

CASE 5
Even weirder:
flink-conf.yaml : yarn.properties-file.location: mydir
flink run options : -yD yarn.properties-file.location=mydir

  *   Still tries to connect to local (non yarn) job manager!

What am I doing wrong?

Logs extracts :
CASE 1:
2018:02:01 15:43:20 - Waiting until all TaskManagers have connected
2018:02:01 15:43:20 - Starting client actor system.
2018:02:01 15:43:20 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:20 - Trying to select the network interface and address to use 
by connecting to the leading JobManager.
2018:02:01 15:43:20 - TaskManager will try to connect for 1 milliseconds 
before falling back to heuristics
2018:02:01 15:43:21 - Retrieved new target address 
elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr/10.136.170.193:33970.
2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Slf4jLogger started
2018:02:01 15:43:21 - Starting remoting
2018:02:01 15:43:21 - Remoting started; listening on addresses 
:[akka.tcp://fl...@elara-edge-u2-n01.dev.mlb.jupiter.nbyt.fr:36340]
2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Stopping ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - TaskManager status (2/1)
2018:02:01 15:43:21 - All TaskManagers are connected
2018:02:01 15:43:21 - Submitting job with JobID: 
f69197b0b80a76319a87bde10c1e3f77. Waiting for job completion.
2018:02:01 15:43:21 - Starting ZooKeeperLeaderRetrievalService.
2018:02:01 15:43:21 - Received SubmitJobAndWait(JobGraph(jobId: 
f69197b0b80a76319a87bde10c1e3f77)) but there is no connection to a JobManager 
yet.
2018:02:01 15:43:21 - Received job SND-IMP-SIGNAST 
(f69197b0b80a76319a87bde10c1e3f77).
2018:02:01 15:43:21 - Disconnect from JobManager null.
2018:02:01 15:43:21 - Connect to JobManager 
Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245].
2018:02:01 15:43:21 - Connected to JobManager at 
Actor[akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager#-1554418245]
 with leader session id 388af5b8--4923-8ee4-8a4b9bfbb0b9.
2018:02:01 15:43:21 - Sending message to JobManager 
akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager
 to submit job SND-IMP-SIGNAST (f69197b0b80a76319a87bde10c1e3f77) and wait for 
progress
2018:02:01 15:43:21 - Upload jar files to job manager 
akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.
2018:02:01 15:43:21 - Blob client connecting to 
akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager
2018:02:01 15:43:22 - Submit job to the job manager 
akka.tcp://fl...@elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr:33970/user/jobmanager.
2018:02:01 15:43:22 - Job f69197b0b80a76319a87bde10c1e3f77 was successfully 
submitted to the JobManager akka://flink/deadLetters.
2018:02:01 15:43:22 - 02/01/2018 15:43:22   Job execution switched to status 
RUNNING.

CASE 2:
2018:02:01 15:48:43 - Waiting until all TaskManagers have connected
2018:02:01 15:48:43 - Starting client actor system.
2018:02:01 15:48:43 - Trying to select the network interface and address to use 
by connecting to the leading JobManager.
2018:02:01 15:48:43 - TaskManager will try to connect for 1 milliseconds 
before falling back to heuristics
2018:02:01 15:48:43 - Retrieved new target address 
elara-data-u2-n01.dev.mlb.jupiter.nbyt.fr/10.136.170.193:33970.
2018:02:01 15:48:43 - Slf4jLogger started
2018:02:01 15:48:43 - Starting remoting
2018:02:01 15:48:43 - Remoting started; listening on addresses 
:[akka.tcp://fl...@elara-edge-u2-n01.dev.mlb.jupiter.nbyt.fr:34140]
2018:02:01 15:48:43 - TaskManager status (2/1)
2018:02:01 15:48:43 - All TaskManagers a

RE: How to stop FlinkKafkaConsumer and make job finished?

2018-01-02 Thread LINZ, Arnaud
Hi,

My 2 cents: not being able to programmatically stop a Flink stream cleanly is what the framework lacks most, IMHO. It's a very common use case: each time you want to update the application or change its configuration, you need to stop and restart it cleanly, without triggering alerts, data loss, or anything else.
That's why I never use the provided Flink sources "out of the box". I've made a framework that encapsulates them, adding a monitoring thread that periodically checks for a special "HDFS stop file" and tries to cancel() the source cleanly if the user requested a stop that way (I've found that the HDFS file trick is the easiest way to reach, from an external application, all the task managers running on unknown hosts).
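A rough sketch of that kind of wrapper is below; the class names, the stop-file path handling and the polling interval are illustrative, not the actual framework code:

import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StoppableSourceWrapper<T> implements SourceFunction<T> {
    private static final long serialVersionUID = 1L;

    private final SourceFunction<T> delegate;
    private final String stopFilePath;

    public StoppableSourceWrapper(SourceFunction<T> delegate, String stopFilePath) {
        this.delegate = delegate;
        this.stopFilePath = stopFilePath;
    }

    @Override
    public void run(final SourceContext<T> ctx) throws Exception {
        // monitoring thread: polls HDFS for the stop file and cancels the wrapped source
        final Thread monitor = new Thread(new Runnable() {
            @Override
            public void run() {
                try {
                    final FileSystem fs = FileSystem.get(new Configuration());
                    while (!fs.exists(new Path(stopFilePath))) {
                        Thread.sleep(10000L);
                    }
                    delegate.cancel();
                } catch (Exception e) {
                    // illustration only: give up monitoring and let the source keep running
                }
            }
        });
        monitor.setDaemon(true);
        monitor.start();
        delegate.run(ctx); // blocks until the delegate finishes or is cancelled
    }

    @Override
    public void cancel() {
        delegate.cancel();
    }
}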

I could not use the "special message" trick because in most real production environments you cannot, as a client, post a message to a queue just for your own client's needs: you don't have the proper access rights to do so, and you don't know how other clients connected to the same data may react to fake messages...

Unfortunately most Flink sources cannot be "cancelled" cleanly without changing part of their code. This is the case for the Kafka source.

- If a Kafka consumer source instance is not connected to any partition (for instance because its parallelism exceeds the number of partitions of the consumed topic), we end up in an infinite wait in FlinkKafkaConsumerBase.run() until the thread is interrupted:

// wait until this is canceled
final Object waitLock = new Object();
while (running) {
    try {
        //noinspection SynchronizationOnLocalVariableOrMethodParameter
        synchronized (waitLock) {
            waitLock.wait();
        }
    }
    catch (InterruptedException e) {
        if (!running) {
            // restore the interrupted state, and fall through the loop
            Thread.currentThread().interrupt();
        }
    }
}

So either you change the code, or your monitoring thread interrupts the source thread -- but that triggers the HA mechanism, and the source instance will be relaunched n times before failing.

- BTW it's also the case with RMQSource, as "nextDelivery" in RMQSource.run() is called without a timeout:
@Override
public void run(SourceContext<OUT> ctx) throws Exception {
    while (running) {
        QueueingConsumer.Delivery delivery = consumer.nextDelivery();

So if no message arrives, the while (running) check is never reached and the source cannot be cancelled without a hard interruption.
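A hedged illustration of what a cancellable version of that loop could look like, polling with a timeout so that the running flag is re-checked periodically (the timeout value and the helper shape are illustrative, this is not the actual RMQSource code):

import com.rabbitmq.client.QueueingConsumer;

import java.util.concurrent.atomic.AtomicBoolean;

public final class PollWithTimeout {

    private PollWithTimeout() {
    }

    // Drain deliveries while 'running' stays true, re-checking it at least once per second
    // instead of blocking forever in nextDelivery().
    public static void drain(QueueingConsumer consumer, AtomicBoolean running) throws InterruptedException {
        while (running.get()) {
            QueueingConsumer.Delivery delivery = consumer.nextDelivery(1000L); // wait at most one second
            if (delivery == null) {
                continue; // nothing received within the timeout: loop back and re-check 'running'
            }
            System.out.println("received " + delivery.getBody().length + " bytes");
        }
    }
}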

Best regards,
Arnaud


-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org]
Envoyé : vendredi 29 décembre 2017 10:30
À : Eron Wright 
Cc : Ufuk Celebi ; Jaxon Hu ; user 
; Aljoscha Krettek 
Objet : Re: How to stop FlinkKafkaConsumer and make job finished?

Yes, that sounds like what Jaxon is looking for. :-) Thanks for the pointer 
Eron.

– Ufuk

On Thu, Dec 28, 2017 at 8:13 PM, Eron Wright  wrote:
> I believe you can extend the `KeyedDeserializationSchema` that you
> pass to the consumer to check for end-of-stream markers.
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.4/api/java/o
> rg/apache/flink/streaming/util/serialization/KeyedDeserializationSchem
> a.html#isEndOfStream-T-
>
> Eron
>
> On Wed, Dec 27, 2017 at 12:35 AM, Ufuk Celebi  wrote:
>>
>> Hey Jaxon,
>>
>> I don't think it's possible to control this via the life-cycle
>> methods of your functions.
>>
>> Note that Flink currently does not support graceful stop in a
>> meaningful manner and you can only cancel running jobs. What comes to
>> my mind to cancel on EOF:
>>
>> 1) Extend Kafka consumer to stop emitting records after your EOF
>> record. Look at the flink-connector-kafka-base module. This is
>> probably not feasible and some work to get familiar with the code.
>> Just putting in out there.
>>
>> 2) Throw a "SuccessException" that fails the job. Easy, but not nice.
>>
>> 3) Use an Http client and cancel your job via the Http endpoint
>>
>> (https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/rest_api.html#job-cancellation).
>> Easy, but not nice, since you need quite some logic in your function
>> (e.g. ignore records after EOF record until cancellation, etc.).
>>
>> Maybe Aljoscha (cc'd) has an idea how to do this in a better way.
>>
>> – Ufuk
>>
>>
>> On Mon, Dec 25, 2017 at 8:59 AM, Jaxon Hu  wrote:
>> > I would like to stop FlinkKafkaConsumer consuming data from kafka
>> > manually.
>> > But I find it won't be close when
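For completeness, the end-of-stream marker approach mentioned above would look roughly like this; a hedged sketch against the 1.4-era KeyedDeserializationSchema, with an illustrative marker value:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema;

import java.nio.charset.StandardCharsets;

public class StringWithEofSchema implements KeyedDeserializationSchema<String> {
    private static final long serialVersionUID = 1L;

    @Override
    public String deserialize(byte[] messageKey, byte[] message, String topic, int partition, long offset) {
        return new String(message, StandardCharsets.UTF_8);
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // when the consumer sees this marker, it stops fetching and the source finishes
        return "##EOF##".equals(nextElement);
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return BasicTypeInfo.STRING_TYPE_INFO;
    }
}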

RE: OutOfMemory when looping on dataset filter

2016-12-09 Thread LINZ, Arnaud
Hi,
It works with a local cluster; I am indeed using a YARN cluster here.

Pushing user code to the lib folder of every datanode is not convenient; it's hard to maintain and operate.

If I cannot make the treatment serializable so as to put everything in a group reduce function, I think I'll try materializing the day-split dataset on HDFS and then loop on re-reading it in the job manager. It's probably even faster than looping on the full select.
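A minimal sketch of that materialize-then-reload idea, assuming the events can be written out as text lines and re-read (the paths, the day-extraction stub and the treatment stub are illustrative):

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.core.fs.FileSystem.WriteMode;

import java.util.List;

public class MaterializePerDay {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // stands in for the real "select whole set"; here the events are plain text lines
        final DataSet<String> wholeSet = env.readTextFile("hdfs:///data/events");

        // one job that splits the big dataset into one directory per day
        for (int day = 1; day <= 31; day++) {
            final int d = day;
            wholeSet.filter(new FilterFunction<String>() {
                @Override
                public boolean filter(String line) {
                    return extractDay(line) == d;
                }
            }).writeAsText("hdfs:///tmp/events/day=" + d, WriteMode.OVERWRITE);
        }
        env.execute("split per day");

        // then loop over much smaller per-day reads before collecting to the driver
        for (int day = 1; day <= 31; day++) {
            final List<String> dayData = env.readTextFile("hdfs:///tmp/events/day=" + day).collect();
            applyComplexNonDistributedTreatment(dayData);
        }
    }

    private static int extractDay(String line) {
        return 1; // illustrative stub
    }

    private static void applyComplexNonDistributedTreatment(List<String> dayData) {
        // illustrative stub
    }
}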

Arnaud

De : Stephan Ewen [mailto:se...@apache.org]
Envoyé : vendredi 9 décembre 2016 11:57
À : user@flink.apache.org
Objet : Re: OutOfMemory when looping on dataset filter

Hi Arnaud!

I assume you are using either a standalone setup, or a YARN session?

This looks to me as if classes cannot be properly garbage collected. Since each 
job (each day is executed as a separate job), loads the classes again, the 
PermGen space runs over if classes are not properly collected.

The can be many reasons why classes are not properly collected, most 
prominently some user code or libraries create threads that hold onto objects.

A quick workaround could be to simply add the relevant libraries directly to 
the "lib" folder when starting the YARN session, and not having them in the 
user code jar file. That way, they need not be reloaded for each job.

Greetings,
Stephan



On Fri, Dec 9, 2016 at 11:30 AM, LINZ, Arnaud 
mailto:al...@bouyguestelecom.fr>> wrote:
Hi,
Caching could have been a solution. Another one is using a "group reduce" by day, but for that I need to make "applyComplexNonDistributedTreatment" serializable, and that's not an easy task.

1 & 2 - The OOM in my current test occurs in the 8th iteration (7 were successful). In this test, only the first day has data; on the other days the filter() returns an empty dataset.
3 - The OOM is in a task manager, during the "select" phase.

Digging further, I see it's a PermGen OOM occurring during deserialization, not a heap one.

2016-12-08 17:38:27,835 ERROR org.apache.flink.runtime.taskmanager.Task 
- Task execution failed.
java.lang.OutOfMemoryError: PermGen space
at sun.misc.Unsafe.defineClass(Native Method)
at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:63)
at 
sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
at 
sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:396)
at java.security.AccessController.doPrivileged(Native Method)
at 
sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:395)
at 
sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:113)
at 
sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:331)
at 
java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at 
java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at 
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at 
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.hive.hcatalog.common.HCatUtil.deserialize(HCatUtil.java:117)
at 
org.apache.hive.hcatalog.mapreduce.HCatSplit.readFields(HCatSplit.java

RE: OutOfMemory when looping on dataset filter

2016-12-09 Thread LINZ, Arnaud
Hi,
Caching could have been a solution. Another one is using a "group reduce" by day, but for that I need to make "applyComplexNonDistributedTreatment" serializable, and that's not an easy task.

1 & 2 - The OOM in my current test occurs in the 8th iteration (7 were successful). In this test, only the first day has data; on the other days the filter() returns an empty dataset.
3 - The OOM is in a task manager, during the "select" phase.

Digging further, I see it's a PermGen OOM occurring during deserialization, not a heap one.

2016-12-08 17:38:27,835 ERROR org.apache.flink.runtime.taskmanager.Task 
- Task execution failed.
java.lang.OutOfMemoryError: PermGen space
at sun.misc.Unsafe.defineClass(Native Method)
at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:63)
at 
sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
at 
sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:396)
at java.security.AccessController.doPrivileged(Native Method)
at 
sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:395)
at 
sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:113)
at 
sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:331)
at 
java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1376)
at 
java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:72)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:493)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:468)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.(ObjectStreamClass.java:468)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:365)
at 
java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:602)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at 
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.hive.hcatalog.common.HCatUtil.deserialize(HCatUtil.java:117)
at 
org.apache.hive.hcatalog.mapreduce.HCatSplit.readFields(HCatSplit.java:139)
at 
org.apache.flink.api.java.hadoop.mapreduce.wrapper.HadoopInputSplit.readObject(HadoopInputSplit.java:102)
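If the PermGen limit itself needs a stopgap while the class-loading issue is investigated, one option on these Java 7 JVMs is to raise it via flink-conf.yaml; a hedged example, with an arbitrary value:

env.java.opts: -XX:MaxPermSize=256m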


De : Fabian Hueske [mailto:fhue...@gmail.com]
Envoyé : vendredi 9 décembre 2016 10:51
À : user@flink.apache.org
Objet : Re: OutOfMemory when looping on dataset filter

Hi Arnaud,
Flink does not cache data at the moment.
What happens is that for every day, the complete program is executed, i.e., 
also the program that computes wholeSet.
Each execution should be independent of the others, and all temporary data should be cleaned up.
Since Flink executes programs in a pipelined (or streaming) fashion, wholeSet 
is not kept in memory.
There is also no manual way to pin a DataSet in memory at the moment.

One thing you could try is to push the day filter as close to the original source as possible.
This would reduce the size of intermediate results.
In general, Flink's DataSet API is implemented to work on managed memory. The most common reason for OOMs is user functions that collect data on the heap.
However, this should not accumulate, and it should be cleaned up after a job finishes.
Collect can be a bit fragile here, because it moves all data to the client 
process.

I also have a few questions:
1. After how many iterations of the for loop is the OOM happening?
2. Is the data for all days of the same size?
3. Is the OOM happening in Flink or in the client process which fetches the 
result?
Best, Fabian


2016-12-09 10:35 GMT+01:00 LINZ, Arnaud 
mailto:al...@bouyguestelecom.fr

OutOfMemory when looping on dataset filter

2016-12-09 Thread LINZ, Arnaud
Hello,

I have a non-distributed treatment to apply to a DataSet of timed events, one 
day after another in a flink batch.
My algorithm is:

// wholeSet is too big to fit in RAM with a collect(), so we cut it in pieces
DataSet wholeSet = [Select WholeSet];
for (day 1 to 31) {
List<> dayData = wholeSet.filter(day).collect();
applyComplexNonDistributedTreatment(dayData);
}

Even if each day fits perfectly in RAM (I've made a test where only the first day has data), I quickly get an OOM in a task manager at some point in the loop, so I guess the "wholeSet" is kept several times in RAM.

Two questions :

1)  Is there a better way of handling it where the “select wholeset” is 
made only once ?

2)  Even when the “select wholeset” is made at each iteration, how can I 
completely remove the old set so that I don’t get an OOM ?

Thanks,
Arnaud





RE: Collect() freeze on yarn cluster on strange recover/deserialization error

2016-12-08 Thread LINZ, Arnaud
-yjm works, and suits me better than a global flink-conf.yaml parameter. I had looked for a command-line parameter like that but missed it in the doc, my mistake.
Thanks,
Arnaud

-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] 
Envoyé : jeudi 8 décembre 2016 14:43
À : LINZ, Arnaud ; user@flink.apache.org
Cc : rmetz...@apache.org
Objet : RE: Collect() freeze on yarn cluster on strange recover/deserialization 
error

Good point with the collect() docs. Would you mind opening a JIRA issue for 
that?

I'm not sure whether you can specify it via that key for YARN. Can you try to 
use -yjm 8192 when submitting the job?
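For example, something along these lines (a hedged sketch: the jar name, container count and memory values are illustrative, and the exact flags depend on how the job is deployed):

flink run -m yarn-cluster -yn 4 -yjm 8192 -ytm 4096 my-batch-job.jar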

Looping in Robert who knows best whether this config key is picked up or not 
for YARN.

– Ufuk

On 8 December 2016 at 14:05:41, LINZ, Arnaud (al...@bouyguestelecom.fr) wrote:
> Hi Ufuk,
>  
> Yes, I have a large set of data to collect for a data science job that 
> cannot be distributed easily. Increasing the akka.framesize size do 
> get rid of the collect hang (maybe you should highlight this parameter 
> in the collect() documentation, 10Mb si not that big), thanks.
>  
> However my job manager now fails with OutOfMemory.
>  
> Despite the fact that I have setup
> jobmanager.heap.mb: 8192
>  
> in my flink-conf.yaml, logs shows that it was created with less memory (1374 
> Mb) :
>  
> 2016-12-08 13:50:13,808 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - 
> --
> --
> 2016-12-08 13:50:13,809 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - Starting YARN ApplicationMaster / JobManager (Version: 1.1.3, 
> Rev:8e8d454, Date:10.10.2016 @ 13:26:32 UTC)
> 2016-12-08 13:50:13,809 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - Current user: datcrypt
> 2016-12-08 13:50:13,809 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 
> 1.7/24.45-b08
> 2016-12-08 13:50:13,809 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - Maximum heap size: 1374 MiBytes
> 2016-12-08 13:50:13,810 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - JAVA_HOME: /usr/java/default
> 2016-12-08 13:50:13,811 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - Hadoop version: 2.6.3
> 2016-12-08 13:50:13,811 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - JVM Options:
> 2016-12-08 13:50:13,811 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - -Xmx1434M
> 2016-12-08 13:50:13,811 INFO 
> org.apache.flink.yarn.YarnApplicationMasterRunner
> - 
> -Dlog.file=/data/1/hadoop/yarn/log/application_1480512120243_3635/cont
> ainer_e17_1480512120243_3635_01_01/jobmanager.log
>  
>  
> Is there a command line option of flink / env variable that overrides 
> it or am I missing something ?
> -- Arnaud
>  
> -Message d'origine-
> De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : jeudi 8 décembre 
> 2016 10:49 À : LINZ, Arnaud ; user@flink.apache.org Objet : RE: 
> Collect() freeze on yarn cluster on strange recover/deserialization 
> error
>  
> I also don't get why the job is recovering, but the oversized message 
> is very likely the cause for the freezing collect, because the data set is 
> gather via Akka.
>  
> You can configure the frame size via "akka.framesize", which defaults 
> to 10485760b
> (10 MB).
>  
> Is the collected result larger than that? Could you try to increase 
> the frame size and report back?
>  
> – Ufuk
>  
> On 7 December 2016 at 17:57:22, LINZ, Arnaud (al...@bouyguestelecom.fr) wrote:
> > Hi,
> >
> > Any news? It's maybe caused by an oversized akka payload (many
> > akka.remote.OversizedPayloadException: Discarding oversized payload 
> > sent to 
> > Actor[akka.tcp://flink@172.21.125.20:39449/user/jobmanager#-1264474132]:
> > max allowed size 10485760 bytes, actual size of encoded class 
> > org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMe
> > ss
> > age
> > was 69074412 bytes in the log)
> >
> > How do I set akka's maximum-payload-bytes in my flink cluster?
> >
> > https://issues.apache.org/jira/browse/FLINK-2373 is not clear about 
> > that. I do not use ExecutionEnvironment.createRemoteEnvironment() but 
> > ExecutionEnvironment.getExecutionEnvironment().
> >
> > Do I have to change the way I'm doing things ? How ?
> >
> > Thanks,
> > Arnaud
> >
> > -Message d'origine-
> > De : LINZ, Arnaud
> > Envoyé : mercredi 30 novembre 2016 08:59 À : user@flink.apache.org 

RE: Collect() freeze on yarn cluster on strange recover/deserialization error

2016-12-08 Thread LINZ, Arnaud
Hi Ufuk,

Yes, I have a large set of data to collect for a data science job that cannot be distributed easily. Increasing akka.framesize does get rid of the collect() hang (maybe you should highlight this parameter in the collect() documentation, 10 MB is not that big), thanks.
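For reference, the corresponding flink-conf.yaml entry looks like this; the value below is only an example, sized above the ~69 MB message reported in the log:

akka.framesize: 104857600b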

However, my job manager now fails with an OutOfMemoryError.

Despite the fact that I have set up
jobmanager.heap.mb: 8192

in my flink-conf.yaml, the logs show that it was created with less memory (1374 MiB):

2016-12-08 13:50:13,808 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
- 

2016-12-08 13:50:13,809 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
-  Starting YARN ApplicationMaster / JobManager (Version: 1.1.3, 
Rev:8e8d454, Date:10.10.2016 @ 13:26:32 UTC)
2016-12-08 13:50:13,809 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
-  Current user: datcrypt
2016-12-08 13:50:13,809 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
-  JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 
1.7/24.45-b08
2016-12-08 13:50:13,809 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
-  Maximum heap size: 1374 MiBytes
2016-12-08 13:50:13,810 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
-  JAVA_HOME: /usr/java/default
2016-12-08 13:50:13,811 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
-  Hadoop version: 2.6.3
2016-12-08 13:50:13,811 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
-  JVM Options:
2016-12-08 13:50:13,811 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
- -Xmx1434M
2016-12-08 13:50:13,811 INFO  org.apache.flink.yarn.YarnApplicationMasterRunner 
- 
-Dlog.file=/data/1/hadoop/yarn/log/application_1480512120243_3635/container_e17_1480512120243_3635_01_01/jobmanager.log


Is there a flink command-line option or environment variable that overrides it, or am I missing something?
-- Arnaud

-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] 
Envoyé : jeudi 8 décembre 2016 10:49
À : LINZ, Arnaud ; user@flink.apache.org
Objet : RE: Collect() freeze on yarn cluster on strange recover/deserialization 
error

I also don't get why the job is recovering, but the oversized message is very 
likely the cause for the freezing collect, because the data set is gather via 
Akka.

You can configure the frame size via "akka.framesize", which defaults to 
10485760b (10 MB).

Is the collected result larger than that? Could you try to increase the frame 
size and report back?

– Ufuk

On 7 December 2016 at 17:57:22, LINZ, Arnaud (al...@bouyguestelecom.fr) wrote:
> Hi,
>  
> Any news? It's maybe caused by an oversized akka payload (many 
> akka.remote.OversizedPayloadException: Discarding oversized payload 
> sent to 
> Actor[akka.tcp://flink@172.21.125.20:39449/user/jobmanager#-1264474132]:
> max allowed size 10485760 bytes, actual size of encoded class 
> org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMess
> age
> was 69074412 bytes in the log)
>  
> How do I set akka's maximum-payload-bytes in my flink cluster?
>  
> https://issues.apache.org/jira/browse/FLINK-2373 is not clear about 
> that. I do not use ExecutionEnvironment.createRemoteEnvironment() but 
> ExecutionEnvironment.getExecutionEnvironment().
>  
> Do I have to change the way I'm doing things ? How ?
>  
> Thanks,
> Arnaud
>  
> -Message d'origine-
> De : LINZ, Arnaud
> Envoyé : mercredi 30 novembre 2016 08:59 À : user@flink.apache.org 
> Objet : RE: Collect() freeze on yarn cluster on strange 
> recover/deserialization error
>  
> Hi,
>  
> Don't think so. I always delete the ZK path before launching the batch 
> (with /usr/bin/zookeeper-client -server $FLINK_HA_ZOOKEEPER_SERVERS 
> rmr $FLINK_HA_ZOOKEEPER_PATH_BATCH), and the "recovery" log line appears only 
> before the collect() phase, not at the beginning.
>  
> Full log is availlable here : 
> https://ftpext.bouyguestelecom.fr/?u=JDhCUdcAImsANZQdys86yID6UNq8H2r
>  
> Thanks,
> Arnaud
>  
>  
> -Message d'origine-
> De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : mardi 29 novembre 
> 2016 18:43 À : LINZ, Arnaud ; user@flink.apache.org Objet : Re: 
> Collect() freeze on yarn cluster on strange recover/deserialization 
> error
>  
> Hey Arnaud,
>  
> could this be a left over job that is recovered from ZooKeeper? 
> Recovery only happens if the configured ZK root contains data.
>  
> A job is removed from ZooKeeper only if it terminates (e.g. finishes, 
> fails terminally w/o restarting, cancelled). If you just shut down the 
> cluster this is treated as a fail

RE: Collect() freeze on yarn cluster on strange recover/deserialization error

2016-12-07 Thread LINZ, Arnaud
Hi,

Any news? It may be caused by an oversized Akka payload
(many akka.remote.OversizedPayloadException: Discarding oversized payload sent 
to Actor[akka.tcp://flink@172.21.125.20:39449/user/jobmanager#-1264474132]: max 
allowed size 10485760 bytes, actual size of encoded class 
org.apache.flink.runtime.messages.JobManagerMessages$LeaderSessionMessage was 
69074412 bytes in the log)

How do I set akka's maximum-payload-bytes in my flink cluster? 

https://issues.apache.org/jira/browse/FLINK-2373 is not clear about that. I do 
not use ExecutionEnvironment.createRemoteEnvironment() but 
ExecutionEnvironment.getExecutionEnvironment().

Do I have to change the way I'm doing things ? How ?

Thanks,
Arnaud

-Message d'origine-----
De : LINZ, Arnaud 
Envoyé : mercredi 30 novembre 2016 08:59
À : user@flink.apache.org
Objet : RE: Collect() freeze on yarn cluster on strange recover/deserialization 
error

Hi, 

Don't think so. I always delete the ZK path before launching the batch (with 
/usr/bin/zookeeper-client -server $FLINK_HA_ZOOKEEPER_SERVERS rmr 
$FLINK_HA_ZOOKEEPER_PATH_BATCH), and the "recovery" log line appears only 
before the collect() phase, not at the beginning.

Full log is available here:
https://ftpext.bouyguestelecom.fr/?u=JDhCUdcAImsANZQdys86yID6UNq8H2r 

Thanks,
Arnaud


-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : mardi 29 novembre 2016 18:43 
À : LINZ, Arnaud ; user@flink.apache.org Objet : Re: 
Collect() freeze on yarn cluster on strange recover/deserialization error

Hey Arnaud,

could this be a left over job that is recovered from ZooKeeper? Recovery only 
happens if the configured ZK root contains data.

A job is removed from ZooKeeper only if it terminates (e.g. finishes, fails 
terminally w/o restarting, cancelled). If you just shut down the cluster this 
is treated as a failure.

– Ufuk

The complete JM logs will be helpful to further check what's happening there. 


On 29 November 2016 at 18:15:16, LINZ, Arnaud (al...@bouyguestelecom.fr) wrote:
> Hello,
>  
> I have a Flink 1.1.3 batch application that makes a simple aggregation 
> but freezes when
> collect() is called when the app is deployed on a ha-enabled yarn 
> cluster (it works on a local cluster).
> Just before it hangs, I have the following deserialization error in the logs :
>  
> (...)
> 2016-11-29 15:10:10,422 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph
> - DataSink (collect()) (1/4) (10cae0de2f4e7b6d71f21209072f7c96)
> switched from DEPLOYING to RUNNING
> 2016-11-29 15:10:13,175 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph
> - CHAIN Reduce(Reduce at agregation(YarnAnonymiser.java:114)) -> Map 
> (Key Remover)
> (2/4) (c098cf691c28364ca47d322c7a76259a) switched from RUNNING to 
> FINISHED
> 2016-11-29 15:10:17,816 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph
> - CHAIN Reduce(Reduce at agregation(YarnAnonymiser.java:114)) -> Map 
> (Key Remover)
> (1/4) (aa6953c3c3a7c9d06ff714e13d020e38) switched from RUNNING to 
> FINISHED
> 2016-11-29 15:10:38,060 INFO org.apache.flink.yarn.YarnJobManager - 
> Attempting to recover all jobs.
> 2016-11-29 15:10:38,167 ERROR org.apache.flink.yarn.YarnJobManager - Fatal 
> error:  
> Failed to recover jobs.
> java.io.StreamCorruptedException: invalid type code: 00 at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:199
> 0) at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:17
> 98) at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:199
> 0) at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:17
> 98) at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at java.util.HashMap.readObject(HashMap.java:1184)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess
> orImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017
> ) at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:17
> 98) at
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.jav

RE: Collect() freeze on yarn cluster on strange recover/deserialization error

2016-11-30 Thread LINZ, Arnaud
Hi, 

Don't think so. I always delete the ZK path before launching the batch (with 
/usr/bin/zookeeper-client -server $FLINK_HA_ZOOKEEPER_SERVERS rmr 
$FLINK_HA_ZOOKEEPER_PATH_BATCH), and the "recovery" log line appears only 
before the collect() phase, not at the beginning.

Full log is available here:
https://ftpext.bouyguestelecom.fr/?u=JDhCUdcAImsANZQdys86yID6UNq8H2r 

Thanks,
Arnaud


-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] 
Envoyé : mardi 29 novembre 2016 18:43
À : LINZ, Arnaud ; user@flink.apache.org
Objet : Re: Collect() freeze on yarn cluster on strange recover/deserialization 
error

Hey Arnaud,

could this be a left over job that is recovered from ZooKeeper? Recovery only 
happens if the configured ZK root contains data.

A job is removed from ZooKeeper only if it terminates (e.g. finishes, fails 
terminally w/o restarting, cancelled). If you just shut down the cluster this 
is treated as a failure.

– Ufuk

The complete JM logs will be helpful to further check what's happening there. 


On 29 November 2016 at 18:15:16, LINZ, Arnaud (al...@bouyguestelecom.fr) wrote:
> Hello,
>  
> I have a Flink 1.1.3 batch application that makes a simple aggregation 
> but freezes when
> collect() is called when the app is deployed on a ha-enabled yarn 
> cluster (it works on a local cluster).
> Just before it hangs, I have the following deserialization error in the logs :
>  
> (...)
> 2016-11-29 15:10:10,422 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph
> - DataSink (collect()) (1/4) (10cae0de2f4e7b6d71f21209072f7c96) 
> switched from DEPLOYING to RUNNING
> 2016-11-29 15:10:13,175 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph
> - CHAIN Reduce(Reduce at agregation(YarnAnonymiser.java:114)) -> Map 
> (Key Remover)
> (2/4) (c098cf691c28364ca47d322c7a76259a) switched from RUNNING to 
> FINISHED
> 2016-11-29 15:10:17,816 INFO 
> org.apache.flink.runtime.executiongraph.ExecutionGraph
> - CHAIN Reduce(Reduce at agregation(YarnAnonymiser.java:114)) -> Map 
> (Key Remover)
> (1/4) (aa6953c3c3a7c9d06ff714e13d020e38) switched from RUNNING to 
> FINISHED
> 2016-11-29 15:10:38,060 INFO org.apache.flink.yarn.YarnJobManager - 
> Attempting to recover all jobs.
> 2016-11-29 15:10:38,167 ERROR org.apache.flink.yarn.YarnJobManager - Fatal 
> error:  
> Failed to recover jobs.
> java.io.StreamCorruptedException: invalid type code: 00 at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:199
> 0) at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:17
> 98) at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:199
> 0) at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:17
> 98) at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at java.util.HashMap.readObject(HashMap.java:1184)
> at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source) at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccess
> orImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017
> ) at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:17
> 98) at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:199
> 0) at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:17
> 98) at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:199
> 0) at 
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:17
> 98) at 
> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> at 
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.
> getState(FileSerializableStateHandle.java:58)
> at 
> org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.
> getState(FileSe

Collect() freeze on yarn cluster on strange recover/deserialization error

2016-11-29 Thread LINZ, Arnaud
Hello,

I have a Flink 1.1.3 batch application that makes a simple aggregation but 
freezes when collect() is called and the app is deployed on an HA-enabled YARN 
cluster (it works on a local cluster).
Just before it hangs, I have the following deserialization error in the logs:

(...)
2016-11-29 15:10:10,422 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- DataSink 
(collect()) (1/4) (10cae0de2f4e7b6d71f21209072f7c96) switched from DEPLOYING to 
RUNNING
2016-11-29 15:10:13,175 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- CHAIN 
Reduce(Reduce at agregation(YarnAnonymiser.java:114)) -> Map (Key Remover) 
(2/4) (c098cf691c28364ca47d322c7a76259a) switched from RUNNING to FINISHED
2016-11-29 15:10:17,816 INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph- CHAIN 
Reduce(Reduce at agregation(YarnAnonymiser.java:114)) -> Map (Key Remover) 
(1/4) (aa6953c3c3a7c9d06ff714e13d020e38) switched from RUNNING to FINISHED
2016-11-29 15:10:38,060 INFO  org.apache.flink.yarn.YarnJobManager  
- Attempting to recover all jobs.
2016-11-29 15:10:38,167 ERROR org.apache.flink.yarn.YarnJobManager  
- Fatal error: Failed to recover jobs.
java.io.StreamCorruptedException: invalid type code: 00
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1377)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at java.util.HashMap.readObject(HashMap.java:1184)
at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:58)
at 
org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
at 
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraphs(ZooKeeperSubmittedJobGraphStore.java:173)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply$mcV$sp(JobManager.scala:530)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:526)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:526)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:526)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:522)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:522)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnabl

RE: How to stop job through java API

2016-10-03 Thread LINZ, Arnaud
Hi,

I have a similar issue. Here is how I deal with programmatically stopping 
permanent streaming jobs, and I’m interested in knowing if there is a better 
way now.

Currently, I use hand-made streaming sources that periodically check for some 
flag and end if a stop request was made. Stopping the source allows the ongoing 
streaming jobs to finish smoothly.
That way, by using a “live file” presence check in my sources, and creating a 
“stopped ok file” when the streaming execution ends ok, I’m able to make a 
“stopping shell script” that notifies the streaming chain of a user cancellation 
by deleting the “live file” and checking for the “stopped ok file” in a waiting 
loop before ending.

This stopping shell script allows me to automate “stop/start” scripts when I 
need to perform some task (like a reference table refresh) that requires the 
streaming job to be stopped (without cancelling it from the REST interface, so 
without relying on the snapshot system to prevent the loss of intermediate 
data, and allowing me to implement clean-up code if needed).

The drawback of this method is not being able to use “off-the-shelf” sources 
directly: I always have to wrap or patch them to implement the live-file check, 
as sketched below.
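
To make the pattern concrete, here is a minimal sketch of such a wrapped source 
(the live-file path and poll interval are made up for illustration, and the real 
source logic is elided):

import java.io.File;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class LiveFileSource implements SourceFunction<String> {
    private static final String LIVE_FILE = "/tmp/streaming-job.live"; // hypothetical flag file
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running && new File(LIVE_FILE).exists()) {
            // poll the real source and emit records with ctx.collect(...) here;
            // leaving this loop ends the source, which lets the job finish smoothly
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}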

Best regards,
Arnaud


De : Aljoscha Krettek [mailto:aljos...@apache.org]
Envoyé : mercredi 21 septembre 2016 14:54
À : user@flink.apache.org; Max Michels 
Objet : Re: How to stop job through java API

Hi,
right now this is not possible, I'm afraid.

I'm looping in Max who has done some work in that direction. Maybe he's got 
something to say.

Cheers,
Aljoscha

On Wed, 14 Sep 2016 at 03:54 Will Du <will...@gmail.com> wrote:
Hi folks,
How to stop a flink job given its job_id through the java API?
Thanks,
Will



L'intégrité de ce message n'étant pas assurée sur internet, la société 
expéditrice ne peut être tenue responsable de son contenu ni de ses pièces 
jointes. Toute utilisation ou diffusion non autorisée est interdite. Si vous 
n'êtes pas destinataire de ce message, merci de le détruire et d'avertir 
l'expéditeur.

The integrity of this message cannot be guaranteed on the Internet. The company 
that sent this message cannot therefore be held liable for its content nor 
attachments. Any unauthorized use or dissemination is prohibited. If you are 
not the intended recipient of this message, then please delete it and notify 
the sender.


RE: Flink 1.1.0 : Hadoop 1/2 compatibility mismatch

2016-08-10 Thread LINZ, Arnaud
Hi,
Good for me; my unit tests all passed with this RC version.
Thanks,
Arnaud

-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] 
Envoyé : mardi 9 août 2016 18:33
À : Ufuk Celebi 
Cc : user@flink.apache.org; d...@flink.apache.org
Objet : Re: Flink 1.1.0 : Hadoop 1/2 compatibility mismatch

I've started a vote for 1.1.1 containing hopefully fixed artifacts. If you have 
any spare time, would you mind checking whether it fixes your problem?

The artifacts are here: http://home.apache.org/~uce/flink-1.1.1-rc1/

You would have to add the following repository to your Maven project and update 
the Flink version to 1.1.1:



flink-rc
flink-rc
https://repository.apache.org/content/repositories/orgapacheflink-1101

true


false




Would really appreciate it!


On Tue, Aug 9, 2016 at 2:11 PM, Ufuk Celebi  wrote:
> As noted in the other thread, this is a problem with the Maven 
> artifacts of 1.1.0 :-( I've added a warning to the release note and 
> will start a emergency vote for 1.1.1 which only updates the Maven 
> artifacts.
>
> On Tue, Aug 9, 2016 at 9:45 AM, LINZ, Arnaud  wrote:
>> Hello,
>>
>>
>>
>> I’ve switched to 1.1.0, but part of my code doesn’t work any longer.
>>
>>
>>
>> Despite the fact that I have no Hadoop 1 jar in my dependencies 
>> (2.7.1 clients & flink-hadoop-compatibility_2.10 1.1.0), I have a 
>> weird JobContext version mismatch error, that I was unable to understand.
>>
>>
>>
>> Code is a hive table read in a local batch flink cluster using a M/R 
>> job (from good package mapreduce, not mapred).
>>
>>
>>
>> import org.apache.hadoop.mapreduce.InputFormat;
>>
>> import org.apache.hadoop.mapreduce.Job;
>>
>> import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
>>
>> (…)
>>
>> final Job job = Job.getInstance();
>>
>> final InputFormat<NullWritable, DefaultHCatRecord> hCatInputFormat =
>> (InputFormat<NullWritable, DefaultHCatRecord>) HCatInputFormat.setInput(job,
>> table.getDbName(), table.getTableName(), filter);
>>
>> final HadoopInputFormat<NullWritable, DefaultHCatRecord> inputFormat =
>> new HadoopInputFormat<NullWritable, DefaultHCatRecord>(hCatInputFormat,
>> NullWritable.class, DefaultHCatRecord.class, job);
>>
>> final HCatSchema inputSchema =
>> HCatInputFormat.getTableSchema(job.getConfiguration());
>>
>> return cluster
>> .createInput(inputFormat)
>> .flatMap(new RichFlatMapFunction<Tuple2<NullWritable, DefaultHCatRecord>, T>() {
>> @Override
>> public void flatMap(Tuple2<NullWritable, DefaultHCatRecord> value,
>> Collector<T> out) throws Exception { // NOPMD
>> (...)
>> }
>> }).returns(beanClass);
>>
>>
>>
>>
>>
>> Exception is :
>>
>> org.apache.flink.runtime.client.JobExecutionException: Failed to 
>> submit job
>> 69dba7e4d79c05d2967dca4d4a27cf38 (Flink Java Job at Tue Aug 09 
>> 09:19:41 CEST
>> 2016)
>>
>> at
>> org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runti
>> me$jobmanager$JobManager$$submitJob(JobManager.scala:1281)
>>
>> at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage
>> $1.applyOrElse(JobManager.scala:478)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractP
>> artialFunction.scala:33)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFu
>> nction.scala:33)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFu
>> nction.scala:25)
>>
>> at
>> org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$
>> 1.applyOrElse(LeaderSessionMessageFilter.scala:36)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractP
>> artialFunction.scala:33)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFu
>> nction.scala:33)
>>
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFu
>> nction.scala:25)
>>
>> at
>> org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:
>> 33)
>>
>> 

RE: Classloader issue using AvroParquetInputFormat via HadoopInputFormat

2016-08-09 Thread LINZ, Arnaud
Okay,
That would also solve my issue.
Greetings,
Arnaud

De : Stephan Ewen [mailto:se...@apache.org]
Envoyé : mardi 9 août 2016 12:41
À : user@flink.apache.org
Objet : Re: Classloader issue using AvroParquetInputFormat via HadoopInputFormat

Hi Shannon!

It seems that something in the Maven deployment went wrong with this 
release.

There should be:
  - flink-java (the default, with a transitive dependency on hadoop 2.x for 
hadoop compatibility features)
  - flink-java-hadoop1 (with a transitive dependency on hadoop 1.x for older 
hadoop compatibility features)

Apparently the "flink-java" artifact got overwritten with the 
"flink-java-hadoop1" artifact. Damn.

I think we need to release new artifacts that fix these dependency descriptors.

That needs to be a 1.1.1 release, because maven artifacts cannot be changed 
after they were deployed.

Greetings,
Stephan






On Mon, Aug 8, 2016 at 11:08 PM, Shannon Carey <sca...@expedia.com> wrote:
Correction: I cannot work around the problem. If I exclude hadoop1, I get the 
following exception which appears to be due to flink-java-1.1.0's dependency on 
Hadoop1.

Failed to submit job 4b6366d101877d38ef33454acc6ca500 
(com.expedia.www.flink.jobs.DestinationCountsHistoryJob$)
org.apache.flink.runtime.client.JobExecutionException: Failed to submit job 
4b6366d101877d38ef33454acc6ca500 
(com.expedia.www.flink.jobs.DestinationCountsHistoryJob$)
at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1281)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:478)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at 
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:121)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.JobException: Creating the input splits 
caused an error: Found interface org.apache.hadoop.mapreduce.JobContext, but 
class was expected
at 
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:172)
at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:695)
at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1178)
... 19 more
Caused by: java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.JobContext, but class was expected
at 
org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.createInputSplits(HadoopInputFormatBase.java:158)
at 
org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.createInputSplits(HadoopInputFormatBase.java:56)
at 
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:156)
... 21 more

And if I exclude hadoop2, I get the exception from my previous email with 
AvroParquetInputFormat.



From: Shannon Carey <sca...@expedia.com>
Date: Monday, August 8, 2016 at 2:46 PM
To: "user@flink.apache.org" <user@flink.apache.org>
Subject: Classloader issue using AvroParquetInputFormat via HadoopInputFormat

Hi folks, congrats on 1.1.0!

FYI, after updating to Flink 1.1.0 I get the exception at bottom when 
attempting to run a job that uses AvroParquetInputFormat wrapped in a Flink 
HadoopInputFormat. The ContextUtil.java:71 is trying to execute:

Class.forName("org.apache.hadoop.mapreduce.task.JobContextImpl");

I am using Scala 2.11.7. JobContextImpl is coming from 
flink-shaded-hadoop2:1.1.0. However, its parent class (JobContext) is actually 
being loaded (according to output wi

Flink 1.1.0 : Hadoop 1/2 compatibility mismatch

2016-08-09 Thread LINZ, Arnaud
Hello,

I’ve switched to 1.1.0, but part of my code doesn’t work any longer.

Despite the fact that I have no Hadoop 1 jar in my dependencies (2.7.1 clients 
& flink-hadoop-compatibility_2.10 1.1.0), I have a weird JobContext version 
mismatch error, that I was unable to understand.

Code is a hive table read in a local batch flink cluster using a M/R job (from 
good package mapreduce, not mapred).

import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
(…)
final Job job = Job.getInstance();
final InputFormat<NullWritable, DefaultHCatRecord> hCatInputFormat =
    (InputFormat<NullWritable, DefaultHCatRecord>) HCatInputFormat.setInput(job,
        table.getDbName(), table.getTableName(), filter);

final HadoopInputFormat<NullWritable, DefaultHCatRecord> inputFormat =
    new HadoopInputFormat<NullWritable, DefaultHCatRecord>(hCatInputFormat,
        NullWritable.class, DefaultHCatRecord.class, job);


final HCatSchema inputSchema =
    HCatInputFormat.getTableSchema(job.getConfiguration());
return cluster
    .createInput(inputFormat)
    .flatMap(new RichFlatMapFunction<Tuple2<NullWritable, DefaultHCatRecord>, T>() {
        @Override
        public void flatMap(Tuple2<NullWritable, DefaultHCatRecord> value,
                Collector<T> out) throws Exception { // NOPMD
            (...)
        }
    }).returns(beanClass);


Exception is :
org.apache.flink.runtime.client.JobExecutionException: Failed to submit job 
69dba7e4d79c05d2967dca4d4a27cf38 (Flink Java Job at Tue Aug 09 09:19:41 CEST 
2016)
at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1281)
at 
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:478)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:36)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at 
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at 
org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at 
scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at 
org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at 
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:121)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.JobException: Creating the input splits 
caused an error: Found interface org.apache.hadoop.mapreduce.JobContext, but 
class was expected
at 
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:172)
at 
org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:695)
at 
org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1178)
... 23 more
Caused by: java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.JobContext, but class was expected
at 
org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.createInputSplits(HadoopInputFormatBase.java:158)
at 
org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormatBase.createInputSplits(HadoopInputFormatBase.java:56)
at 
org.apache.flink.runtime.executiongraph.ExecutionJobVertex.(ExecutionJobVertex.java:156)
... 25 more

Any idea what has go

RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

2016-06-16 Thread LINZ, Arnaud
Okay, is there a way to specify the flink-conf.yaml to use on the ./bin/flink 
command-line? I see no such option. I guess I have to set FLINK_CONF_DIR before 
the call?
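
Something like the following is what I have in mind (the directory is a 
placeholder for one holding a batch-specific flink-conf.yaml, and I am assuming 
the launcher honours FLINK_CONF_DIR when picking up its configuration):

export FLINK_CONF_DIR=/tmp/flink/batch-conf   # hypothetical directory with its own flink-conf.yaml
/usr/lib/flink/bin/flink run -m yarn-cluster ...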

-Message d'origine-
De : Maximilian Michels [mailto:m...@apache.org] 
Envoyé : mercredi 15 juin 2016 18:06
À : user@flink.apache.org
Objet : Re: Yarn batch not working with standalone yarn job manager once a 
persistent, HA job manager is launched ?

Hi Arnaud,

One issue per thread please. That makes things a lot easier for us :)

Something positive first: We are reworking the resuming of existing Flink Yarn 
applications. It'll be much easier to resume a cluster using simply the Yarn ID 
or re-discovering the Yarn session using the properties file.

The dynamic properties are a shortcut to modifying the Flink configuration of 
the cluster _only_ upon startup. Afterwards, they are already set at the 
containers. We might change this for the 1.1.0 release. It should work if you 
put "yarn.properties-file.location:
/custom/location" in your flink-conf.yaml before you execute "./bin/flink".

Cheers,
Max

On Wed, Jun 15, 2016 at 3:14 PM, LINZ, Arnaud  wrote:
> Ooopsss
> My mistake, snapshot/restore do works in a local env, I've had a weird 
> configuration issue!
>
> But I still have the property  file path issue  :)
>
> -Message d'origine-
> De : LINZ, Arnaud
> Envoyé : mercredi 15 juin 2016 14:35
> À : 'user@flink.apache.org'  Objet : RE: Yarn 
> batch not working with standalone yarn job manager once a persistent, HA job 
> manager is launched ?
>
> Hi,
>
> I haven't had the time to investigate the bad configuration file path issue 
> yet (if you have any idea why yarn.properties-file.location is ignored you 
> are welcome) , but I'm facing another HA-problem.
>
> I'm trying to make my custom streaming sources HA compliant by implementing 
> snapshotState() & restoreState().  I would like to test that mechanism in my 
> junit tests, because it can be complex, but I was unable to simulate a 
> "recover" on a local flink environment: snapshotState() is never triggered 
> and launching an exception inside the execution chain does not lead to 
> recovery but ends the execution, despite the 
> streamExecEnv.enableCheckpointing(timeout) call.
>
> Is there a way to locally test this mechanism (other than poorly simulating 
> it by explicitly calling snapshot & restore in a overridden source) ?
>
> Thanks,
> Arnaud
>
> -Message d'origine-
> De : LINZ, Arnaud
> Envoyé : lundi 6 juin 2016 17:53
> À : user@flink.apache.org
> Objet : RE: Yarn batch not working with standalone yarn job manager once a 
> persistent, HA job manager is launched ?
>
> I've deleted the '/tmp/.yarn-properties-user' file created for the persistent 
> containter, and the batches do go into their own right container. However, 
> that's not a workable workaround as I'm no longer able to submit streaming 
> apps in the persistant container that way :) So it's really a problem of 
> flink finding the right property file.
>
> I've added -yD yarn.properties-file.location=/tmp/flink/batch inside the 
> batch command line (also configured in the JVM_ARGS var), with no change of 
> behaviour. Note that I do have a standalone yarn container created, but the 
> job is submitted in the other other one.
>
>  Thanks,
> Arnaud
>
> -Message d'origine-
> De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : lundi 6 juin 2016 16:01 À 
> : user@flink.apache.org Objet : Re: Yarn batch not working with standalone 
> yarn job manager once a persistent, HA job manager is launched ?
>
> Thanks for clarification. I think it might be related to the YARN properties 
> file, which is still being used for the batch jobs. Can you try to delete it 
> between submissions as a temporary workaround to check whether it's related?
>
> – Ufuk
>
> On Mon, Jun 6, 2016 at 3:18 PM, LINZ, Arnaud  wrote:
>> Hi,
>>
>> The zookeeper path is only for my persistent container, and I do use a 
>> different one for all my persistent containers.
>>
>> The -Drecovery.mode=standalone was passed inside theJVM_ARGS 
>> ("${JVM_ARGS} -Drecovery.mode=standalone  
>> -Dyarn.properties-file.location=/tmp/flink/batch")
>>
>> I've tried using -yD recovery.mode=standalone on the flink command line too, 
>> but it does not solve the pb; it stills use the pre-existing container.
>>
>> Complete line =
>> /usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu
>> batch1 -ys 4 -yD yarn.heap-cutoff-ratio=0.3 -yD akka.ask.timeout=300s 
>> -yD recovery.mode=

RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

2016-06-15 Thread LINZ, Arnaud
Ooops...
My mistake: snapshot/restore does work in a local env; I had a weird 
configuration issue!

But I still have the property file path issue :)
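
For reference, a minimal local setup that exercises snapshot/restore on failure 
looks roughly like this (the checkpoint interval and retry count are arbitrary 
values for illustration):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
(…)
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(500);       // snapshotState() is only invoked once checkpointing is enabled
env.setNumberOfExecutionRetries(3); // without retries, a failure simply ends the local execution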

-Message d'origine-
De : LINZ, Arnaud 
Envoyé : mercredi 15 juin 2016 14:35
À : 'user@flink.apache.org' 
Objet : RE: Yarn batch not working with standalone yarn job manager once a 
persistent, HA job manager is launched ?

Hi,

I haven't had the time to investigate the bad configuration file path issue yet 
(if you have any idea why yarn.properties-file.location is ignored you are 
welcome) , but I'm facing another HA-problem.

I'm trying to make my custom streaming sources HA compliant by implementing 
snapshotState() & restoreState().  I would like to test that mechanism in my 
junit tests, because it can be complex, but I was unable to simulate a 
"recover" on a local flink environment: snapshotState() is never triggered and 
launching an exception inside the execution chain does not lead to recovery but 
ends the execution, despite the streamExecEnv.enableCheckpointing(timeout) call.

Is there a way to locally test this mechanism (other than poorly simulating it 
by explicitly calling snapshot & restore in a overridden source) ?

Thanks,
Arnaud

-Message d'origine-
De : LINZ, Arnaud
Envoyé : lundi 6 juin 2016 17:53
À : user@flink.apache.org
Objet : RE: Yarn batch not working with standalone yarn job manager once a 
persistent, HA job manager is launched ?

I've deleted the '/tmp/.yarn-properties-user' file created for the persistent 
containter, and the batches do go into their own right container. However, 
that's not a workable workaround as I'm no longer able to submit streaming apps 
in the persistant container that way :) So it's really a problem of flink 
finding the right property file.

I've added -yD yarn.properties-file.location=/tmp/flink/batch inside the batch 
command line (also configured in the JVM_ARGS var), with no change of 
behaviour. Note that I do have a standalone yarn container created, but the job 
is submitted in the other other one.

 Thanks,
Arnaud

-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : lundi 6 juin 2016 16:01 À : 
user@flink.apache.org Objet : Re: Yarn batch not working with standalone yarn 
job manager once a persistent, HA job manager is launched ?

Thanks for clarification. I think it might be related to the YARN properties 
file, which is still being used for the batch jobs. Can you try to delete it 
between submissions as a temporary workaround to check whether it's related?

– Ufuk

On Mon, Jun 6, 2016 at 3:18 PM, LINZ, Arnaud  wrote:
> Hi,
>
> The zookeeper path is only for my persistent container, and I do use a 
> different one for all my persistent containers.
>
> The -Drecovery.mode=standalone was passed inside theJVM_ARGS 
> ("${JVM_ARGS} -Drecovery.mode=standalone  
> -Dyarn.properties-file.location=/tmp/flink/batch")
>
> I've tried using -yD recovery.mode=standalone on the flink command line too, 
> but it does not solve the pb; it stills use the pre-existing container.
>
> Complete line =
> /usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu
> batch1 -ys 4 -yD yarn.heap-cutoff-ratio=0.3 -yD akka.ask.timeout=300s 
> -yD recovery.mode=standalone --class 
> com.bouygtel.kubera.main.segstage.MainGeoSegStage
> /usr/users/datcrypt/alinz/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT
> -allinone.jar  -j /usr/users/datcrypt/alinz/KBR/GOS/log -c 
> /usr/users/datcrypt/alinz/KBR/GOS/cfg/KBR_GOS_Config.cfg
>
> JVM_ARGS =
> -Drecovery.mode=standalone
> -Dyarn.properties-file.location=/tmp/flink/batch
>
>
> Arnaud
>
>
> -Message d'origine-
> De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : lundi 6 juin 2016
> 14:37 À : user@flink.apache.org Objet : Re: Yarn batch not working 
> with standalone yarn job manager once a persistent, HA job manager is 
> launched ?
>
> Hey Arnaud,
>
> The cause of this is probably that both jobs use the same ZooKeeper root 
> path, in which case all task managers connect to the same leading job manager.
>
> I think you forgot to the add the y in the -Drecovery.mode=standalone for the 
> batch jobs, e.g.
>
> -yDrecovery.mode=standalone
>
> Can you try this?
>
> – Ufuk
>
> On Mon, Jun 6, 2016 at 2:19 PM, LINZ, Arnaud  wrote:
>> Hi,
>>
>>
>>
>> I use Flink 1.0.0. I have a persistent yarn container set (a 
>> persistent flink job manager) that I use for streaming jobs ; and I 
>> use the “yarn-cluster” mode to launch my batches.
>>
>>
>>
>> I’ve just switched “HA” mode on for my streaming persistent job 
>> manager and it seems to works ; however my batches are not worki

RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

2016-06-15 Thread LINZ, Arnaud
Hi,

I haven't had the time to investigate the bad configuration file path issue yet 
(if you have any idea why yarn.properties-file.location is ignored, you are 
welcome), but I'm facing another HA problem.

I'm trying to make my custom streaming sources HA compliant by implementing 
snapshotState() & restoreState(). I would like to test that mechanism in my 
junit tests, because it can be complex, but I was unable to simulate a 
"recover" on a local flink environment: snapshotState() is never triggered, and 
throwing an exception inside the execution chain does not lead to recovery but 
ends the execution, despite the streamExecEnv.enableCheckpointing(timeout) call.

Is there a way to locally test this mechanism (other than poorly simulating it 
by explicitly calling snapshot & restore in an overridden source)?

Thanks,
Arnaud

-Message d'origine-
De : LINZ, Arnaud 
Envoyé : lundi 6 juin 2016 17:53
À : user@flink.apache.org
Objet : RE: Yarn batch not working with standalone yarn job manager once a 
persistent, HA job manager is launched ?

I've deleted the '/tmp/.yarn-properties-user' file created for the persistent 
containter, and the batches do go into their own right container. However, 
that's not a workable workaround as I'm no longer able to submit streaming apps 
in the persistant container that way :) So it's really a problem of flink 
finding the right property file.

I've added -yD yarn.properties-file.location=/tmp/flink/batch inside the batch 
command line (also configured in the JVM_ARGS var), with no change of 
behaviour. Note that I do have a standalone yarn container created, but the job 
is submitted in the other other one.

 Thanks,
Arnaud

-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : lundi 6 juin 2016 16:01 À : 
user@flink.apache.org Objet : Re: Yarn batch not working with standalone yarn 
job manager once a persistent, HA job manager is launched ?

Thanks for clarification. I think it might be related to the YARN properties 
file, which is still being used for the batch jobs. Can you try to delete it 
between submissions as a temporary workaround to check whether it's related?

– Ufuk

On Mon, Jun 6, 2016 at 3:18 PM, LINZ, Arnaud  wrote:
> Hi,
>
> The zookeeper path is only for my persistent container, and I do use a 
> different one for all my persistent containers.
>
> The -Drecovery.mode=standalone was passed inside theJVM_ARGS 
> ("${JVM_ARGS} -Drecovery.mode=standalone  
> -Dyarn.properties-file.location=/tmp/flink/batch")
>
> I've tried using -yD recovery.mode=standalone on the flink command line too, 
> but it does not solve the pb; it stills use the pre-existing container.
>
> Complete line =
> /usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu
> batch1 -ys 4 -yD yarn.heap-cutoff-ratio=0.3 -yD akka.ask.timeout=300s 
> -yD recovery.mode=standalone --class 
> com.bouygtel.kubera.main.segstage.MainGeoSegStage
> /usr/users/datcrypt/alinz/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT
> -allinone.jar  -j /usr/users/datcrypt/alinz/KBR/GOS/log -c 
> /usr/users/datcrypt/alinz/KBR/GOS/cfg/KBR_GOS_Config.cfg
>
> JVM_ARGS =
> -Drecovery.mode=standalone
> -Dyarn.properties-file.location=/tmp/flink/batch
>
>
> Arnaud
>
>
> -Message d'origine-
> De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : lundi 6 juin 2016
> 14:37 À : user@flink.apache.org Objet : Re: Yarn batch not working 
> with standalone yarn job manager once a persistent, HA job manager is 
> launched ?
>
> Hey Arnaud,
>
> The cause of this is probably that both jobs use the same ZooKeeper root 
> path, in which case all task managers connect to the same leading job manager.
>
> I think you forgot to the add the y in the -Drecovery.mode=standalone for the 
> batch jobs, e.g.
>
> -yDrecovery.mode=standalone
>
> Can you try this?
>
> – Ufuk
>
> On Mon, Jun 6, 2016 at 2:19 PM, LINZ, Arnaud  wrote:
>> Hi,
>>
>>
>>
>> I use Flink 1.0.0. I have a persistent yarn container set (a 
>> persistent flink job manager) that I use for streaming jobs ; and I 
>> use the “yarn-cluster” mode to launch my batches.
>>
>>
>>
>> I’ve just switched “HA” mode on for my streaming persistent job 
>> manager and it seems to works ; however my batches are not working 
>> any longer because they now execute themselves inside the persistent 
>> container (and fail because it lacks slots) and not in a separate standalone 
>> job manager.
>>
>>
>>
>> My batch launch options:
>>
>>
>>
>> CONTAINER_OPTIONS="-m yarn-cluster -yn $FLINK_NBCONTAINERS -ytm 
>> $FLINK_MEMORY -yqu $FLINK_QUEUE -ys $

RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

2016-06-06 Thread LINZ, Arnaud
I've deleted the '/tmp/.yarn-properties-user' file created for the persistent 
container, and the batches do go into their own, correct container. However, 
that's not a workable workaround as I'm no longer able to submit streaming apps 
in the persistent container that way :)
So it's really a problem of flink finding the right property file.

I've added -yD yarn.properties-file.location=/tmp/flink/batch inside the batch 
command line (also configured in the JVM_ARGS var), with no change of 
behaviour. Note that I do have a standalone yarn container created, but the job 
is submitted in the other one.

 Thanks,
Arnaud

-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] 
Envoyé : lundi 6 juin 2016 16:01
À : user@flink.apache.org
Objet : Re: Yarn batch not working with standalone yarn job manager once a 
persistent, HA job manager is launched ?

Thanks for clarification. I think it might be related to the YARN properties 
file, which is still being used for the batch jobs. Can you try to delete it 
between submissions as a temporary workaround to check whether it's related?

– Ufuk

On Mon, Jun 6, 2016 at 3:18 PM, LINZ, Arnaud  wrote:
> Hi,
>
> The zookeeper path is only for my persistent container, and I do use a 
> different one for all my persistent containers.
>
> The -Drecovery.mode=standalone was passed inside theJVM_ARGS 
> ("${JVM_ARGS} -Drecovery.mode=standalone  
> -Dyarn.properties-file.location=/tmp/flink/batch")
>
> I've tried using -yD recovery.mode=standalone on the flink command line too, 
> but it does not solve the pb; it stills use the pre-existing container.
>
> Complete line =
> /usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu 
> batch1 -ys 4 -yD yarn.heap-cutoff-ratio=0.3 -yD akka.ask.timeout=300s 
> -yD recovery.mode=standalone --class 
> com.bouygtel.kubera.main.segstage.MainGeoSegStage 
> /usr/users/datcrypt/alinz/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT
> -allinone.jar  -j /usr/users/datcrypt/alinz/KBR/GOS/log -c 
> /usr/users/datcrypt/alinz/KBR/GOS/cfg/KBR_GOS_Config.cfg
>
> JVM_ARGS =
> -Drecovery.mode=standalone 
> -Dyarn.properties-file.location=/tmp/flink/batch
>
>
> Arnaud
>
>
> -Message d'origine-
> De : Ufuk Celebi [mailto:u...@apache.org] Envoyé : lundi 6 juin 2016 
> 14:37 À : user@flink.apache.org Objet : Re: Yarn batch not working 
> with standalone yarn job manager once a persistent, HA job manager is 
> launched ?
>
> Hey Arnaud,
>
> The cause of this is probably that both jobs use the same ZooKeeper root 
> path, in which case all task managers connect to the same leading job manager.
>
> I think you forgot to the add the y in the -Drecovery.mode=standalone for the 
> batch jobs, e.g.
>
> -yDrecovery.mode=standalone
>
> Can you try this?
>
> – Ufuk
>
> On Mon, Jun 6, 2016 at 2:19 PM, LINZ, Arnaud  wrote:
>> Hi,
>>
>>
>>
>> I use Flink 1.0.0. I have a persistent yarn container set (a 
>> persistent flink job manager) that I use for streaming jobs ; and I 
>> use the “yarn-cluster” mode to launch my batches.
>>
>>
>>
>> I’ve just switched “HA” mode on for my streaming persistent job 
>> manager and it seems to works ; however my batches are not working 
>> any longer because they now execute themselves inside the persistent 
>> container (and fail because it lacks slots) and not in a separate standalone 
>> job manager.
>>
>>
>>
>> My batch launch options:
>>
>>
>>
>> CONTAINER_OPTIONS="-m yarn-cluster -yn $FLINK_NBCONTAINERS -ytm 
>> $FLINK_MEMORY -yqu $FLINK_QUEUE -ys $FLINK_NBSLOTS -yD 
>> yarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO -yD akka.ask.timeout=300s"
>>
>> JVM_ARGS="${JVM_ARGS} -Drecovery.mode=standalone 
>> -Dyarn.properties-file.location=/tmp/flink/batch"
>>
>>
>>
>> $FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA 
>> $JAR_SUPP $listArgs $ACTION
>>
>>
>>
>> My persistent cluster launch option :
>>
>>
>>
>> export FLINK_HA_OPTIONS="-Dyarn.application-attempts=10
>> -Drecovery.mode=zookeeper
>> -Drecovery.zookeeper.quorum=${FLINK_HA_ZOOKEEPER_SERVERS}
>> -Drecovery.zookeeper.path.root=${FLINK_HA_ZOOKEEPER_PATH}
>> -Dstate.backend=filesystem
>> -Dstate.backend.fs.checkpointdir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PAT
>> H
>> }/checkpoints
>> -Drecovery.zookeeper.storageDir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH}/recovery/"
>>
>>
>>
>> $FLINK_DIR/yarn-session.sh
>> -Dyarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO
>> $F

RE: Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

2016-06-06 Thread LINZ, Arnaud
Hi, 

The zookeeper path is only for my persistent container, and I do use a 
different one for each of my persistent containers.

The -Drecovery.mode=standalone was passed inside the JVM_ARGS ("${JVM_ARGS} 
-Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch")

I've tried using -yD recovery.mode=standalone on the flink command line too, 
but it does not solve the problem; it still uses the pre-existing container.

Complete line = 
/usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 8192 -yqu batch1 -ys 4 
-yD yarn.heap-cutoff-ratio=0.3 -yD akka.ask.timeout=300s -yD 
recovery.mode=standalone --class 
com.bouygtel.kubera.main.segstage.MainGeoSegStage 
/usr/users/datcrypt/alinz/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT-allinone.jar
  -j /usr/users/datcrypt/alinz/KBR/GOS/log -c 
/usr/users/datcrypt/alinz/KBR/GOS/cfg/KBR_GOS_Config.cfg

JVM_ARGS = 
-Drecovery.mode=standalone -Dyarn.properties-file.location=/tmp/flink/batch


Arnaud


-Message d'origine-
De : Ufuk Celebi [mailto:u...@apache.org] 
Envoyé : lundi 6 juin 2016 14:37
À : user@flink.apache.org
Objet : Re: Yarn batch not working with standalone yarn job manager once a 
persistent, HA job manager is launched ?

Hey Arnaud,

The cause of this is probably that both jobs use the same ZooKeeper root path, 
in which case all task managers connect to the same leading job manager.

I think you forgot to add the y in the -Drecovery.mode=standalone for the 
batch jobs, e.g.

-yDrecovery.mode=standalone

Can you try this?

– Ufuk

On Mon, Jun 6, 2016 at 2:19 PM, LINZ, Arnaud  wrote:
> Hi,
>
>
>
> I use Flink 1.0.0. I have a persistent yarn container set (a 
> persistent flink job manager) that I use for streaming jobs ; and I 
> use the “yarn-cluster” mode to launch my batches.
>
>
>
> I’ve just switched “HA” mode on for my streaming persistent job 
> manager and it seems to works ; however my batches are not working any 
> longer because they now execute themselves inside the persistent 
> container (and fail because it lacks slots) and not in a separate standalone 
> job manager.
>
>
>
> My batch launch options:
>
>
>
> CONTAINER_OPTIONS="-m yarn-cluster -yn $FLINK_NBCONTAINERS -ytm 
> $FLINK_MEMORY -yqu $FLINK_QUEUE -ys $FLINK_NBSLOTS -yD 
> yarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO -yD akka.ask.timeout=300s"
>
> JVM_ARGS="${JVM_ARGS} -Drecovery.mode=standalone 
> -Dyarn.properties-file.location=/tmp/flink/batch"
>
>
>
> $FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA 
> $JAR_SUPP $listArgs $ACTION
>
>
>
> My persistent cluster launch option :
>
>
>
> export FLINK_HA_OPTIONS="-Dyarn.application-attempts=10
> -Drecovery.mode=zookeeper
> -Drecovery.zookeeper.quorum=${FLINK_HA_ZOOKEEPER_SERVERS}
> -Drecovery.zookeeper.path.root=${FLINK_HA_ZOOKEEPER_PATH}
> -Dstate.backend=filesystem
> -Dstate.backend.fs.checkpointdir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH
> }/checkpoints 
> -Drecovery.zookeeper.storageDir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH}/recovery/"
>
>
>
> $FLINK_DIR/yarn-session.sh 
> -Dyarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO
> $FLINK_HA_OPTIONS -st -d -n $FLINK_NBCONTAINERS -s $FLINK_NBSLOTS -tm 
> $FLINK_MEMORY -qu $FLINK_QUEUE  -nm ${GANESH_TYPE_PF}_KuberaFlink
>
>
>
> I’ve switched back to the FLINK_HA_OPTIONS="" way of launching the 
> container for now, but I lack HA.
>
>
>
> Is it a (un)known bug or am I missing a magic option?
>
>
>
> Best regards,
>
> Arnaud
>
>
>
>
> 
>
> L'intégrité de ce message n'étant pas assurée sur internet, la société 
> expéditrice ne peut être tenue responsable de son contenu ni de ses 
> pièces jointes. Toute utilisation ou diffusion non autorisée est 
> interdite. Si vous n'êtes pas destinataire de ce message, merci de le 
> détruire et d'avertir l'expéditeur.
>
> The integrity of this message cannot be guaranteed on the Internet. 
> The company that sent this message cannot therefore be held liable for 
> its content nor attachments. Any unauthorized use or dissemination is 
> prohibited. If you are not the intended recipient of this message, 
> then please delete it and notify the sender.


Yarn batch not working with standalone yarn job manager once a persistent, HA job manager is launched ?

2016-06-06 Thread LINZ, Arnaud
Hi,

I use Flink 1.0.0. I have a persistent yarn container set (a persistent flink 
job manager) that I use for streaming jobs ; and I use the “yarn-cluster” mode 
to launch my batches.

I’ve just switched “HA” mode on for my streaming persistent job manager and it 
seems to work; however, my batches are not working any longer because they now 
execute themselves inside the persistent container (and fail because it lacks 
slots) instead of in a separate standalone job manager.

My batch launch options:

CONTAINER_OPTIONS="-m yarn-cluster -yn $FLINK_NBCONTAINERS -ytm $FLINK_MEMORY 
-yqu $FLINK_QUEUE -ys $FLINK_NBSLOTS -yD 
yarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO -yD akka.ask.timeout=300s"
JVM_ARGS="${JVM_ARGS} -Drecovery.mode=standalone 
-Dyarn.properties-file.location=/tmp/flink/batch"

$FLINK_DIR/flink run $CONTAINER_OPTIONS --class $MAIN_CLASS_KUBERA $JAR_SUPP 
$listArgs $ACTION

My persistent cluster launch option :

export FLINK_HA_OPTIONS="-Dyarn.application-attempts=10 
-Drecovery.mode=zookeeper 
-Drecovery.zookeeper.quorum=${FLINK_HA_ZOOKEEPER_SERVERS} 
-Drecovery.zookeeper.path.root=${FLINK_HA_ZOOKEEPER_PATH}  
-Dstate.backend=filesystem 
-Dstate.backend.fs.checkpointdir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH}/checkpoints
 
-Drecovery.zookeeper.storageDir=hdfs:///tmp/${FLINK_HA_ZOOKEEPER_PATH}/recovery/"

$FLINK_DIR/yarn-session.sh -Dyarn.heap-cutoff-ratio=$FLINK_HEAP_CUTOFF_RATIO 
$FLINK_HA_OPTIONS -st -d -n $FLINK_NBCONTAINERS -s $FLINK_NBSLOTS -tm 
$FLINK_MEMORY -qu $FLINK_QUEUE  -nm ${GANESH_TYPE_PF}_KuberaFlink

I’ve switched back to the FLINK_HA_OPTIONS="" way of launching the container 
for now, but I lack HA.

Is it a (un)known bug or am I missing a magic option?

Best regards,
Arnaud




L'intégrité de ce message n'étant pas assurée sur internet, la société 
expéditrice ne peut être tenue responsable de son contenu ni de ses pièces 
jointes. Toute utilisation ou diffusion non autorisée est interdite. Si vous 
n'êtes pas destinataire de ce message, merci de le détruire et d'avertir 
l'expéditeur.

The integrity of this message cannot be guaranteed on the Internet. The company 
that sent this message cannot therefore be held liable for its content nor 
attachments. Any unauthorized use or dissemination is prohibited. If you are 
not the intended recipient of this message, then please delete it and notify 
the sender.


RE: TimeWindow not getting last elements any longer with flink 1.0 vs 0.10.1

2016-03-15 Thread LINZ, Arnaud
Hi,

All right… I find this new behavior dangerous, since with processing time 
windows you will always miss the last elements of a source that does not last 
forever.
I’ve created a source wrapper that sleeps after emitting the last element, so 
that unit tests that use processing time still work.
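
Roughly, the wrapper looks like this (only a sketch of the idea; the drain delay 
is an arbitrary choice):

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class DrainingSource<T> implements SourceFunction<T> {
    private final SourceFunction<T> delegate;
    private final long drainMillis;

    public DrainingSource(SourceFunction<T> delegate, long drainMillis) {
        this.delegate = delegate;
        this.drainMillis = drainMillis;
    }

    @Override
    public void run(SourceContext<T> ctx) throws Exception {
        delegate.run(ctx);         // emit everything from the wrapped source
        Thread.sleep(drainMillis); // keep the source alive so processing-time windows can fire
    }

    @Override
    public void cancel() {
        delegate.cancel();
    }
}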

Cheers,
Arnaud


De : Till Rohrmann [mailto:trohrm...@apache.org]
Envoyé : lundi 14 mars 2016 15:11
À : user@flink.apache.org
Objet : Re: TimeWindow not getting last elements any longer with flink 1.0 vs 
0.10.1


Hi Arnaud,

with version 1.0 the behaviour of window triggering in case of a finite stream 
was slightly changed. If you use event time, then all unfinished windows are 
triggered when your stream ends. This can be motivated by the fact that the end 
of a stream is equivalent to saying that no more elements will arrive until the 
maximum time (infinity) has been reached. This knowledge allows you to emit a 
Long.MaxValue watermark when an event time stream is finished, which will 
trigger all lingering windows.

In contrast to event time, you cannot say the same about a finished processing 
time stream. There we don’t have logical time but the actual processing time we 
use to reason about windows. When a stream finishes, we cannot fast-forward the 
processing time to a point where the windows will fire. This can only happen if 
we keep the operators alive until the wall clock tells us that it’s time to 
fire the windows. However, there is no such feature implemented yet in Flink.

I hope this helps you to understand the failing test cases.

Cheers,
Till
​

On Mon, Mar 14, 2016 at 1:14 PM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hello,

I’ve switched my Flink version from 0.10.1 to 1.0 and I have a regression in 
some  of my unit tests.

To narrow the problem, here is what I’ve figured out:


-  I use a simple Streaming application with a source defined as 
“fromElements("Element 1", "Element 2", "Element 3")

-  I use a simple time window function with a 3 second window : 
timeWindowAll(Time.seconds(3))

-  I use an apply() function and counts the total number of elements I 
get with a global counter

With the previous version, I got all three elements because, not because they 
are  triggered under 3 seconds, but because the source ends
With the 1.0 version, I don’t get any elements, and that’s annoying because as 
the source ends the application ends even if I sleep 5 seconds after the 
execute() method.

(If I replace fromElement with fromCollection with a 1 element list and 
Time.second(3) with Time.millisecond(1), I get a random number of elements)

Is this behavior wanted ? If yes, how do I get my last elements now ?

Best regards,
Arnaud









TimeWindow not getting last elements any longer with flink 1.0 vs 0.10.1

2016-03-14 Thread LINZ, Arnaud
Hello,

I've switched my Flink version from 0.10.1 to 1.0 and I have a regression in 
some of my unit tests.

To narrow the problem down, here is what I've figured out:


-  I use a simple streaming application with a source defined as 
fromElements("Element 1", "Element 2", "Element 3")

-  I use a simple time window with a 3 second size: 
timeWindowAll(Time.seconds(3))

-  I use an apply() function that counts the total number of elements I 
get with a global counter

With the previous version, I got all three elements, not because they were 
triggered within 3 seconds, but because the source ended.
With the 1.0 version, I don't get any elements, and that's annoying because as 
soon as the source ends the application ends, even if I sleep 5 seconds after 
the execute() method.

(If I replace fromElements with fromCollection with a 1-element list and 
Time.seconds(3) with Time.milliseconds(1), I get a random number of elements.)

Is this behavior intended? If so, how do I get my last elements now?

Best regards,
Arnaud








Quick question about enableObjectReuse()

2016-02-09 Thread LINZ, Arnaud
Hi,

I just want to be sure: when I set enableObjectReuse, I don't need to create 
copies of objects that I get as input and return as output, as long as I don't 
keep them inside my user function?
For instance, if I want to join Tuple2(A,B) with C into Tuple3(A,B,C) using a 
Join function, I can write something like:

public Tuple3 join(Tuple2 first, Object second) {
    return Tuple3.of(first.f0, first.f1, second);
}

and not return Tuple3.of(first.f0.clone(), first.f1.clone(), second.clone()) ?
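
For context, a minimal self-contained sketch of that pattern with object reuse 
enabled on the execution config; the concrete tuple types are chosen only for 
illustration:

    import org.apache.flink.api.common.functions.JoinFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;

    public class ObjectReuseJoinSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            env.getConfig().enableObjectReuse();  // reuse mode on

            DataSet<Tuple2<Integer, String>> left =
                    env.fromElements(Tuple2.of(1, "a"), Tuple2.of(2, "b"));
            DataSet<Tuple2<Integer, Double>> right =
                    env.fromElements(Tuple2.of(1, 0.5), Tuple2.of(2, 1.5));

            left.join(right)
                .where(0)
                .equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, Double>,
                        Tuple3<Integer, String, Double>>() {
                    @Override
                    public Tuple3<Integer, String, Double> join(Tuple2<Integer, String> first,
                                                                Tuple2<Integer, Double> second) {
                        // No defensive copies: the output only forwards the input fields
                        // and nothing is kept across invocations.
                        return Tuple3.of(first.f0, first.f1, second.f1);
                    }
                })
                .print();
        }
    }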


Best regards,
Arnaud







RE: Left join with unbalanced dataset

2016-02-02 Thread LINZ, Arnaud
Thanks,

Giving the batch an outrageous amount of memory with a 0.5 heap ratio makes it 
succeed.

I've figured out which dataset is consuming the most memory: I have a big join 
that multiplies the size of the input set before a group reduce.
I would like to optimize my code by reducing the size of the join output.

The outline of the processing is:
DataSet A = (K1, K2, V1) where (K1,K2) is the key. A is huge.
DataSet B = (K1, V2) where there are multiple values V2 for the same K1 (say 5)

I do something like: A.join(B).on(K1).groupBy(K1,K2).reduce()
As B contains 5 lines for each key of A, A.join(B) is 5 times the size of A.

Flink does not start the reduce operation until all joined lines have been 
created (the memory bottleneck is during the collection of all lines); but 
theoretically it would be possible.
I see no "join group" operator that could do something like 
"A.groupBy(K1,K2).join(B).on(K1).reduce()"

Is there a way to do this?

The other way I see is to load B in memory on all nodes and use a hash map to 
produce the A.join(B) lines on the fly. B is not that small, but I think it 
would still save RAM (a sketch follows below).
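
For illustration, a minimal sketch of that hash-map variant using a DataSet 
broadcast variable (hypothetical types and names; the groupBy/reduce step would 
follow on the result as before):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class BroadcastJoinSketch {
        // A = (K1, K2, V1), B = (K1, V2); emits (K1, K2, V2) without a sorted join.
        public static DataSet<Tuple3<String, String, String>> join(
                DataSet<Tuple3<String, String, Long>> a, DataSet<Tuple2<String, String>> b) {
            return a.flatMap(new RichFlatMapFunction<Tuple3<String, String, Long>,
                                                     Tuple3<String, String, String>>() {
                        private transient Map<String, List<String>> bByKey;

                        @Override
                        public void open(Configuration parameters) {
                            // Build the in-memory hash map of B on every task manager.
                            bByKey = new HashMap<>();
                            List<Tuple2<String, String>> broadcast =
                                    getRuntimeContext().getBroadcastVariable("B");
                            for (Tuple2<String, String> entry : broadcast) {
                                List<String> values = bByKey.get(entry.f0);
                                if (values == null) {
                                    values = new ArrayList<>();
                                    bByKey.put(entry.f0, values);
                                }
                                values.add(entry.f1);
                            }
                        }

                        @Override
                        public void flatMap(Tuple3<String, String, Long> value,
                                            Collector<Tuple3<String, String, String>> out) {
                            List<String> matches = bByKey.get(value.f0);
                            for (String v2 : matches == null
                                    ? Collections.<String>emptyList() : matches) {
                                out.collect(Tuple3.of(value.f0, value.f1, v2));
                            }
                        }
                    })
                    // e.g. the caller then applies .groupBy(0, 1).reduce(...) as before
                    .withBroadcastSet(b, "B");
        }
    }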

Best regards,
Arnaud

-Original Message-
From: Ufuk Celebi [mailto:u...@apache.org]
Sent: Tuesday, February 2, 2016 15:27
To: user@flink.apache.org
Subject: Re: Left join with unbalanced dataset


> On 02 Feb 2016, at 15:15, LINZ, Arnaud  wrote:
>
> Hi,
>
> Running again with more RAM made the treatement go further, but Yarn still 
> killed one container for memory consumption. I will experiment various memory 
> parameters.

OK, the killing of the container probably triggered the 
RemoteTransportException.

Can you tell me how many containers you are using, how much physical memory the 
machines have and how much the containers get?

You can monitor memory usage by setting

taskmanager.debug.memory.startLogThread: true

in the config. This will periodically log the memory consumption to the task 
manager logs. Can you try this and check the logs for the memory consumption?

You can also have a look at it in the web frontend under the Task Manager tab.

– Ufuk






RE: Left join with unbalanced dataset

2016-02-02 Thread LINZ, Arnaud
Hi,

Running again with more RAM made the treatment go further, but YARN still 
killed one container for memory consumption. I will experiment with various 
memory parameters.

How do I retrieve the log of a specific task manager post-mortem? I don't use a 
permanent Flink/Yarn container (it's killed upon batch completion).


-Original Message-
From: Ufuk Celebi [mailto:u...@apache.org]
Sent: Tuesday, February 2, 2016 14:41
This means that the task at task manager bt1shli2/172.21.125.27:49771 failed 
during the production of the intermediate data. It’s independent of the memory 
problem.

Could you please check the logs of that task manager? Sorry for the 
inconvenience! I hope that we can resolve this shortly.

– Ufuk






RE: Left join with unbalanced dataset

2016-02-02 Thread LINZ, Arnaud
Hi,

Unfortunately, it still fails, but with a different error (see below).
Note that I did not rebuild & reinstall Flink; I just used a jar compiled 
against 0.10-SNAPSHOT, submitted as a batch job on the "0.10.0" Flink installation.

org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
Error at remote task manager 'bt1shli2/172.21.125.27:49771'.
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.decodeMsg(PartitionRequestClientHandler.java:241)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelRead(PartitionRequestClientHandler.java:164)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:744)

Caused by: org.apache.flink.runtime.io.network.partition.ProducerFailedException
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.writeAndFlushNextMessageIfPossible(PartitionRequestQueue.java:164)
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestQueue.userEventTriggered(PartitionRequestQueue.java:96)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:279)
at 
io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:265)
at 
io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:108)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:279)
at 
io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:265)
at 
io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:108)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:279)
at 
io.netty.channel.AbstractChannelHandlerContext.fireUserEventTriggered(AbstractChannelHandlerContext.java:265)
at 
io.netty.channel.ChannelInboundHandlerAdapter.userEventTriggered(ChannelInboundHandlerAdapter.java:108)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeUserEventTriggered(AbstractChannelHandlerContext.java:279)
at 
io.netty.channel.AbstractChannelHandlerContext.access$500(AbstractChannelHandlerContext.java:32)
at 
io.netty.channel.AbstractChannelHandlerContext$6.run(AbstractChannelHandlerContext.java:270)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)


-Original Message-
From: Ufuk Celebi [mailto:u...@apache.org]
Sent: Tuesday, February 2, 2016 13:52
To: user@flink.apache.org
Subject: Re: Left join with unbalanced dataset


> On 02 Feb 2016, at 13:28, LINZ, Arnaud  wrote:
>
> Thanks,
> I’m using the official 0.10 release. I will try to use the 0.10 snapshot.
>
> FYI, setting the heap cut-off ratio to 0.5 lead to the following error :

That’s the error Stephan was referring to. Does the snapshot version fix it for 
you?

I will prepare a 0.10.2 bug fix release, which includes the fix.

– Ufuk





RE: Left join with unbalanced dataset

2016-02-02 Thread LINZ, Arnaud
apacity(PooledByteBuf.java:111)
 at 
io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
 at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
 at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
 at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
 at 
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:92)
 at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:228)


From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On behalf of Stephan Ewen
Sent: Tuesday, February 2, 2016 11:30
To: user@flink.apache.org
Subject: Re: Left join with unbalanced dataset

Hi Arnaud!

Which version of Flink are you using? In 0.10.1, the Netty library version that 
we use has changed behavior and allocates a lot of off-heap memory. That would 
be my guess as to the cause. In 1.0-SNAPSHOT that should be fixed, and also on 
0.10-SNAPSHOT.

If that turns out to be the cause, the good news is that we started discussing 
a 0.10.2 maintenance release that should also have a fix for that.

Greetings,
Stephan


On Tue, Feb 2, 2016 at 11:12 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hi,

Changing to an outer join did not change the error; nor did balancing the join 
with another dataset, dividing the parallelism level by 2, or doubling the 
memory.
Heap size & thread count are OK under JVisualVM, so the problem is elsewhere.

Does Flink use off-heap memory? How can I monitor it?

Thanks,
Arnaud


10:58:53,384 INFO  org.apache.flink.yarn.YarnJobManager 
 - Status of job 8b2ea62e16b82ccc2242bb5549d434a5 (KUBERA-GEO-BRUT2SEGMENT) 
changed to FAILING.
java.lang.Exception: The data preparation for task 'CHAIN GroupReduce 
(GroupReduce at process(TransfoStage2StageOnTaz.java:106)) -> Map (Map at 
writeExternalTable(HiveHCatDAO.java:206))' , caused an error: Error obtaining 
the sorted input: Thread 'SortMerger spilling thread' terminated due to an 
exception: java.io.IOException: I/O channel already closed. Could not fulfill: 
org.apache.flink.runtime.io.disk.iomanager.SegmentWriteRequest@2327bac5
  at 
org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465)
  at 
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
  at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 
'SortMerger spilling thread' terminated due to an exception: 
java.io.IOException: I/O channel already closed. Could not fulfill: 
org.apache.flink.runtime.io.disk.iomanager.SegmentWriteRequest@2327bac5
  at 
org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
  at 
org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1089)
  at 
org.apache.flink.runtime.operators.GroupReduceDriver.prepare(GroupReduceDriver.java:94)
  at 
org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:459)
  ... 3 more
Caused by: java.io.IOException: Thread 'SortMerger spilling thread' terminated 
due to an exception: java.io.IOException: I/O channel already closed. Could not 
fulfill: 
org.apache.flink.runtime.io.disk.iomanager.SegmentWriteRequest@2327bac5
  at 
org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: com.esotericsoftware.kryo.KryoException: java.io.IOException: I/O 
channel already closed. Could not fulfill: 
org.apache.flink.runtime.io.disk.iomanager.SegmentWriteRequest@2327bac5
  at com.esotericsoftware.kryo.io.Output.flush(Output.java:165)
  at 
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.serialize(KryoSerializer.java:194)
  at 
org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.copy(KryoSerializer.java:247)
  at 
org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.copy(TupleSerializerBase.java:73)
  at 
org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.copy(TupleSerializerBase.java:73)
  at 
org.apache.flink.api.java.typeutils.runtime.TupleSerializerBase.copy(TupleSerializerBase.java:73)
  at 
org.apache.flink.runtime.operators.sort.NormalizedKeySorter.writeToOutput(NormalizedKeySorter.java:499)
  at 
org.apache.flink.runtime.operators.sort.UnilateralSortMerger$Spil

RE: Left join with unbalanced dataset

2016-02-02 Thread LINZ, Arnaud
 Reason is: [Disassociated].

10:58:54,470 INFO  org.apache.flink.yarn.YarnJobManager 
 - Container container_e11_1453202008841_2794_01_25 is completed with 
diagnostics: Container 
[pid=14331,containerID=container_e11_1453202008841_2794_01_25] is running 
beyond physical memory limits. Current usage: 8.0 GB of 8 GB physical memory 
used; 9.1 GB of 16.8 GB virtual memory used. Killing container.

Dump of the process-tree for container_e11_1453202008841_2794_01_25 :

  |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE

  |- 14331 14329 14331 14331 (bash) 0 0 108646400 308 /bin/bash -c 
/usr/java/default/bin/java -Xms5376m -Xmx5376m -XX:MaxDirectMemorySize=5376m  
-Dlog.file=/data/3/hadoop/yarn/log/application_1453202008841_2794/container_e11_1453202008841_2794_01_25/taskmanager.log
 -Dlogback.configurationFile=file:logback.xml 
-Dlog4j.configuration=file:log4j.properties 
org.apache.flink.yarn.YarnTaskManagerRunner --configDir . 1> 
/data/3/hadoop/yarn/log/application_1453202008841_2794/container_e11_1453202008841_2794_01_25/taskmanager.out
 2> 
/data/3/hadoop/yarn/log/application_1453202008841_2794/container_e11_1453202008841_2794_01_25/taskmanager.err
 --streamingMode batch

  |- 14348 14331 14331 14331 (java) 565583 11395 9636184064 2108473 
/usr/java/default/bin/java -Xms5376m -Xmx5376m -XX:MaxDirectMemorySize=5376m 
-Dlog.file=/data/3/hadoop/yarn/log/application_1453202008841_2794/container_e11_1453202008841_2794_01_25/taskmanager.log
 -Dlogback.configurationFile=file:logback.xml 
-Dlog4j.configuration=file:log4j.properties 
org.apache.flink.yarn.YarnTaskManagerRunner --configDir . --streamingMode batch



Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143



10:58:54,471 INFO  org.apache.flink.yarn.YarnJobManager



From: LINZ, Arnaud
Sent: Monday, February 1, 2016 09:40
To: user@flink.apache.org
Subject: RE: Left join with unbalanced dataset

Hi,
Thanks, I can’t believe I missed the outer join operators… Will try them and 
will keep you informed.
I use the “official” 0.10 release from the maven repo. The off-heap memory I 
use is the one HDFS I/O uses (codec, DFSOutputstream threads…), but I don’t 
have many open files at once, and doubling the amount of memory did not solve 
the problem.
Arnaud


From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On behalf of Stephan Ewen
Sent: Sunday, January 31, 2016 20:57
To: user@flink.apache.org
Subject: Re: Left join with unbalanced dataset

Hi!

YARN killing the application seems strange. The memory use that YARN sees 
should not change even when one node gets a lot of data.

Can you share what version of Flink (plus commit hash) you are using and 
whether you use off-heap memory or not?

Thanks,
Stephan


On Sun, Jan 31, 2016 at 10:47 AM, Till Rohrmann <trohrm...@apache.org> wrote:
Hi Arnaud,

the unmatched elements of A will only end up on the same worker node if they 
all share the same key. Otherwise, they will be evenly spread out across your 
cluster. However, I would also recommend you to use Flink's leftOuterJoin.

Cheers,
Till

On Sun, Jan 31, 2016 at 5:27 AM, Chiwan Park <chiwanp...@apache.org> wrote:
Hi Arnaud,

To join two datasets, the community recommends using the join operation rather 
than the cogroup operation. For a left join, you can use the leftOuterJoin 
method. Flink's optimizer decides the distributed join execution strategy using 
statistics about the datasets, such as their size. Additionally, you can set a 
join hint to help the optimizer decide the strategy.

In the transformations section [1] of the Flink documentation, you can find the 
outer join operations described in detail.

I hope this helps.

[1]: 
https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations
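
For illustration, a minimal sketch of such a left outer join on the DataSet API 
(hypothetical tuple types; the right-hand record is null for unmatched keys):

    import org.apache.flink.api.common.functions.JoinFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;

    public class LeftOuterJoinSketch {
        public static DataSet<Tuple3<String, Long, String>> leftJoin(
                DataSet<Tuple2<String, Long>> a, DataSet<Tuple2<String, String>> b) {

            return a.leftOuterJoin(b)
                    .where(0)    // key of A
                    .equalTo(0)  // key of B
                    .with(new JoinFunction<Tuple2<String, Long>, Tuple2<String, String>,
                            Tuple3<String, Long, String>>() {
                        @Override
                        public Tuple3<String, Long, String> join(Tuple2<String, Long> left,
                                                                 Tuple2<String, String> right) {
                            // right is null when the key of A has no match in B.
                            return Tuple3.of(left.f0, left.f1,
                                    right == null ? null : right.f1);
                        }
                    });
        }
    }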

Regards,
Chiwan Park

> On Jan 30, 2016, at 6:43 PM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
>
> Hello,
>
> I have a very big dataset A to left join with a dataset B that is half its 
> size. That is to say, half of A records will be matched with one record of B, 
> and the other half with null values.
>
> I used a CoGroup for that, but my batch fails because yarn kills the 
> container due to memory problems.
>
> I guess that’s because one worker will get half of A dataset (the unmatched 
> ones), and that’s too much for a single JVM
>
> Am I right in my diagnostic ? Is there a better way to left join unbalanced 
> datasets ?
>
> Best regards,
>
> Arnaud
>
>
>

RE: Left join with unbalanced dataset

2016-02-01 Thread LINZ, Arnaud
Hi,
Thanks, I can’t believe I missed the outer join operators… Will try them and 
will keep you informed.
I use the “official” 0.10 release from the maven repo. The off-heap memory I 
use is the one HDFS I/O uses (codec, DFSOutputstream threads…), but I don’t 
have many open files at once, and doubling the amount of memory did not solve 
the problem.
Arnaud


From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On behalf of Stephan Ewen
Sent: Sunday, January 31, 2016 20:57
To: user@flink.apache.org
Subject: Re: Left join with unbalanced dataset

Hi!

YARN killing the application seems strange. The memory use that YARN sees 
should not change even when one node gets a lot of data.

Can you share what version of Flink (plus commit hash) you are using and 
whether you use off-heap memory or not?

Thanks,
Stephan


On Sun, Jan 31, 2016 at 10:47 AM, Till Rohrmann <trohrm...@apache.org> wrote:
Hi Arnaud,

the unmatched elements of A will only end up on the same worker node if they 
all share the same key. Otherwise, they will be evenly spread out across your 
cluster. However, I would also recommend you to use Flink's leftOuterJoin.

Cheers,
Till

On Sun, Jan 31, 2016 at 5:27 AM, Chiwan Park <chiwanp...@apache.org> wrote:
Hi Arnaud,

To join two datasets, the community recommends using join operation rather than 
cogroup operation. For left join, you can use leftOuterJoin method. Flink’s 
optimizer decides distributed join execution strategy using some statistics of 
the datasets such as size of the dataset. Additionally, you can set join hint 
to help optimizer decide the strategy.

In transformations section [1] of Flink documentation, you can find about outer 
join operation in detail.

I hope this helps.

[1]: 
https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations

Regards,
Chiwan Park

> On Jan 30, 2016, at 6:43 PM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
>
> Hello,
>
> I have a very big dataset A to left join with a dataset B that is half its 
> size. That is to say, half of A records will be matched with one record of B, 
> and the other half with null values.
>
> I used a CoGroup for that, but my batch fails because yarn kills the 
> container due to memory problems.
>
> I guess that’s because one worker will get half of A dataset (the unmatched 
> ones), and that’s too much for a single JVM
>
> Am I right in my diagnostic ? Is there a better way to left join unbalanced 
> datasets ?
>
> Best regards,
>
> Arnaud
>
>
>




Left join with unbalanced dataset

2016-01-30 Thread LINZ, Arnaud
Hello,

I have a very big dataset A to left join with a dataset B that is half its 
size. That is to say, half of A records will be matched with one record of B, 
and the other half with null values.

I used a CoGroup for that, but my batch fails because yarn kills the container 
due to memory problems.

I guess that's because one worker will get half of the A dataset (the unmatched 
records), and that's too much for a single JVM.

Am I right in my diagnosis? Is there a better way to left join unbalanced 
datasets?

Best regards,

Arnaud






RE: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

2015-12-14 Thread LINZ, Arnaud
Hi,
I've just run into another exception, a java.lang.IndexOutOfBoundsException in 
the zlib library this time.
Therefore I suspect a problem in Hadoop's codec pool usage. I'm investigating 
and will keep you informed.

Thanks,
Arnaud


From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On behalf of Stephan Ewen
Sent: Monday, December 14, 2015 10:54
To: user@flink.apache.org
Subject: Re: Crash in a simple "mapper style" streaming app likely due to a 
memory leak ?

Hi!

That is curious. Can you tell us a bit more about your setup?

  - Did you set Flink to use off-heap memory in the config?
  - What parallelism do you run the job with?
  - What Java and Flink versions are you using?

Even better, can you paste the first part of the TaskManager's log (where it 
prints the environment) here?

Thanks,
Stephan


On Mon, Dec 14, 2015 at 9:57 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hello,

I did have an off-heap memory leak in my streaming application, due to :
https://issues.apache.org/jira/browse/HADOOP-12007.

Now that I use the CodecPool to close that leak, I get under load the following 
error :

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
java.lang.OutOfMemoryError: Direct buffer memory
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:153)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:224)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:224)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)
at 
io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:737)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:310)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:744)
Caused by: io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: 
Direct buffer memory
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:234)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
... 9 more
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at 
io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:651)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:358)
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:111)
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
at 
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:92)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:228)
... 10 more


But the JVM Heap is ok (monitored by JVisualVM) and the memory size of the JVM 
process is half what it was with the memory leak when Yarn killed the container.

Note that I have added a “PartitionBy” in my stream process before the sink and 
my app is no longer a simple “mapper style” app.

Do you known the cause of the error 

RE: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

2015-12-14 Thread LINZ, Arnaud
Hello,

I did have an off-heap memory leak in my streaming application, due to :
https://issues.apache.org/jira/browse/HADOOP-12007.
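
For reference, a minimal sketch (not the actual sink code) of the CodecPool 
pattern referred to below, with hypothetical codec and stream names:

    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CodecPool;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.Compressor;

    public class PooledCompressionSketch {
        // Compress 'data' to 'rawOut' using a pooled native compressor, so each call
        // does not leak a fresh native compressor (the issue described in HADOOP-12007).
        public static void writeCompressed(Configuration conf, String codecName,
                                           OutputStream rawOut, byte[] data) throws Exception {
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodecByName(codecName);
            Compressor compressor = CodecPool.getCompressor(codec);
            try {
                CompressionOutputStream out = codec.createOutputStream(rawOut, compressor);
                out.write(data);
                out.finish();   // flush the compressed trailer without closing rawOut
            } finally {
                CodecPool.returnCompressor(compressor);  // always hand the native buffers back
            }
        }
    }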

Now that I use the CodecPool to close that leak, I get under load the following 
error :

org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: 
java.lang.OutOfMemoryError: Direct buffer memory
at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:153)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:224)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)
at 
io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:224)
at 
io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:246)
at 
io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:737)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:310)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at java.lang.Thread.run(Thread.java:744)
Caused by: io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError: 
Direct buffer memory
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:234)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
... 9 more
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at 
io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:651)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:358)
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:111)
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
at 
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:92)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:228)
... 10 more


But the JVM Heap is ok (monitored by JVisualVM) and the memory size of the JVM 
process is half what it was with the memory leak when Yarn killed the container.

Note that I have added a “PartitionBy” in my stream process before the sink and 
my app is no longer a simple “mapper style” app.

Do you know the cause of the error and how to correct it?

Best regards,

Arnaud



From: LINZ, Arnaud
Sent: Friday, November 13, 2015 15:49
To: 'user@flink.apache.org'
Subject: RE: Crash in a simple "mapper style" streaming app likely due to a 
memory leak ?

Hi Robert,

Thanks, it works with 50% -- at least way past the previous crash point.

In my opinion (I lack real metrics), the part that uses the most memory is the 
M2 mapper, instantiated once per slot.
The most complex part is the Sink (it does use a lot of hdfs files, flushing 
threads etc.) ; but I expect the “RichSinkFunction” to be instantiated only 
once per slot ? I’m really surprised by that memory usage, I will try using a 
monitoring app on the yarn jvm to understand.

How do I set this yarn.heap-cutoff-ratio  parameter for a specific application 
? I don’t want to modify the “root-protected” flink-conf.yaml for all the users 
& flink jobs with that value.

Regards,
Arn

RE: HA Mode and standalone containers compatibility ?

2015-12-03 Thread LINZ, Arnaud
Hi,
I’ve tried to put that parameter in the JVM_ARGS, but not with much success.

# JVM_ARGS :  -DCluster.Parallelisme=150  -Drecovery.mode=standalone 
-Dyarn.properties-file.location=/tmp/flink/batch
(…)
2015:12:03 15:25:42 (ThrdExtrn) - INFO - (...)jobs.exec.ExecutionProcess$1.run 
- > Found YARN properties file /tmp/.yarn-properties-voyager

Arnaud


From: Robert Metzger [mailto:rmetz...@apache.org]
Sent: Thursday, December 3, 2015 14:03
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?

There is a configuration parameter called "yarn.properties-file.location" which 
allows setting a custom path for the properties file.
If the batch and streaming jobs are using different configuration files, it 
should work.

On Thu, Dec 3, 2015 at 1:51 PM, Ufuk Celebi <u...@apache.org> wrote:
I opened an issue for it and it will fixed with the next 0.10.2 release.

@Robert: are you aware of another workaround for the time being?

On Thu, Dec 3, 2015 at 1:20 PM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hi,
It works fine with that file renamed.  Is there a way to specify its path for a 
specific execution to have a proper workaround ?
Thanks,
Arnaud

-Original Message-
From: Ufuk Celebi [mailto:u...@apache.org]
Sent: Thursday, December 3, 2015 11:53
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?

> On 03 Dec 2015, at 11:47, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
>
> Oopss... False joy.

OK, I think this is a bug in the YARN Client and the way it uses the 
.properties files to submit jobs.

As a work around: Can you mv the /tmp/.yarn-properties-voyager file and submit 
the batch job?

mv /tmp/.yarn-properties-voyager /tmp/.bak.yarn-properties-voyager

– Ufuk







RE: HA Mode and standalone containers compatibility ?

2015-12-03 Thread LINZ, Arnaud
Hi,
It works fine with that file renamed.  Is there a way to specify its path for a 
specific execution to have a proper workaround ?
Thanks,
Arnaud

-Original Message-
From: Ufuk Celebi [mailto:u...@apache.org]
Sent: Thursday, December 3, 2015 11:53
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?


> On 03 Dec 2015, at 11:47, LINZ, Arnaud  wrote:
>
> Oopss... False joy.

OK, I think this is a bug in the YARN Client and the way it uses the 
.properties files to submit jobs.

As a work around: Can you mv the /tmp/.yarn-properties-voyager file and submit 
the batch job?

mv /tmp/.yarn-properties-voyager /tmp/.bak.yarn-properties-voyager

– Ufuk






RE: HA Mode and standalone containers compatibility ?

2015-12-03 Thread LINZ, Arnaud
..
11:39:24,206 INFO  org.apache.flink.yarn.ApplicationClient  
 - Received address of new leader 
akka.tcp://flink@172.21.125.16:59907/user/jobmanager with session ID null.
11:39:24,206 INFO  org.apache.flink.yarn.ApplicationClient  
 - Disconnect from JobManager null.
11:39:24,210 INFO  org.apache.flink.yarn.ApplicationClient  
 - Trying to register at JobManager 
akka.tcp://flink@172.21.125.16:59907/user/jobmanager.
11:39:24,377 INFO  org.apache.flink.yarn.ApplicationClient  
 - Successfully registered at the JobManager 
Actor[akka.tcp://flink@172.21.125.16:59907/user/jobmanager#-801507205]
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (0/48)
TaskManager status (12/48)
TaskManager status (12/48)
TaskManager status (12/48)
TaskManager status (12/48)
TaskManager status (46/48)
TaskManager status (46/48)
TaskManager status (46/48)
TaskManager status (46/48)
All TaskManagers are connected
Using the parallelism provided by the remote cluster (192). To use another 
parallelism, set it at the ./bin/flink client.
12/03/2015 11:39:55  Job execution switched to status RUNNING.
12/03/2015 11:39:55  CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:508) 
(com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at 
readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1)(1/150) switched to 
SCHEDULED 
12/03/2015 11:39:55  CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:508) 
(com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at 
readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1)(1/150) switched to 
DEPLOYING
=> The job starts

Then it crashes :

org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not 
enough free slots available to run the job. You can decrease the operator 
parallelism or increase the number of slots per TaskManager in the 
configuration. Task to schedule: < Attempt #0 (CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:508) 
(com.bouygtel.kuberasdk.hive.HiveHCatDAO$1)) -> FlatMap (FlatMap at 
readTable(HiveHCatDAO.java:120)) -> Map (Key Extractor 1) (5/150)) @ 
(unassigned) - [SCHEDULED] > with groupID < 7b9e554a93d3ea946d13d239a99bb6ae > 
in sharing group < SlotSharingGroup [0c9285747d113d8dd85962602b674497, 
9f30db9a30430385e1cd9d0f5010ed9e, 36b825566212059be3f888e3bbdf0d96, 
f95ba68c3916346efe497b937393eb49, e73522cce11e699022c285180fd1024d, 
988b776310ef3d8a2a3875227008a30e, 7b9e554a93d3ea946d13d239a99bb6ae, 
08af3a01b9cb49b76e6aedcd57d57788, 3f91660c6ab25f0f77d8e55d54397b01] >. 
Resources available to scheduler: Number of instances=6, total number of 
slots=24, available slots=0

It states that I have only 24 slots on my 48-container cluster!




-Original Message-
From: LINZ, Arnaud
Sent: Thursday, December 3, 2015 11:26
To: user@flink.apache.org
Subject: RE: HA Mode and standalone containers compatibility ?

Hi,

The batch job does not need to be HA.
I stopped everything, cleaned the temp files, added -Drecovery.mode=standalone 
and it seems to work now !
Strange, but good for me for now.

Thanks,
Arnaud

-Message d'origine-
From: Ufuk Celebi [mailto:u...@apache.org] Sent: Thursday, December 3, 2015 11:11 
To: user@flink.apache.org Subject: Re: HA Mode and standalone containers 
compatibility ?

Hey Arnaud,

thanks for reporting this. I think Till’s suggestion will help to debug this 
(checking whether a second YARN application has been started)…

You don’t want to run the batch application in HA mode, correct?

It sounds like the batch job is submitted with the same config keys. Could you 
start the batch job explicitly with -Drecovery.mode=standalone?

If you do want the batch job to be HA as well, you have to configure separate 
Zookeeper root paths:

recovery.zookeeper.path.root: /flink-streaming-1 # for the streaming session

recovery.zookeeper.path.root: /flink-batch # for the batch session

– Ufuk

> On 03 Dec 2015, at 11:01, LINZ, Arnaud  wrote:
> 
> Yes, it does interfere, I do have additional task managers. My batch 
> application comes in my streaming cluster Flink’s GUI instead of creating its 
> own container with its own GUI despite the –m yarn-cluster option.
>  
> From: Till Rohrmann [mailto:trohrm...@apache.org] Sent: Thursday, December 3, 
> 2015 10:36 To: user@flink.apache.org Subject: Re: HA Mode and 
> standalone containers compatibility ?
>  
> Hi Arnaud,
>  
> as long as you don't have HA activated for your batch jobs, HA shouldn't have 
> an influence on the 

RE: HA Mode and standalone containers compatibility ?

2015-12-03 Thread LINZ, Arnaud
Hi,

The batch job does not need to be HA.
I stopped everything, cleaned the temp files, added -Drecovery.mode=standalone 
and it seems to work now !
Strange, but good for me for now.

Thanks,
Arnaud

-Original Message-
From: Ufuk Celebi [mailto:u...@apache.org]
Sent: Thursday, December 3, 2015 11:11
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?

Hey Arnaud,

thanks for reporting this. I think Till’s suggestion will help to debug this 
(checking whether a second YARN application has been started)…

You don’t want to run the batch application in HA mode, correct?

It sounds like the batch job is submitted with the same config keys. Could you 
start the batch job explicitly with -Drecovery.mode=standalone?

If you do want the batch job to be HA as well, you have to configure separate 
Zookeeper root paths:

recovery.zookeeper.path.root: /flink-streaming-1 # for the streaming session

recovery.zookeeper.path.root: /flink-batch # for the batch session

– Ufuk

> On 03 Dec 2015, at 11:01, LINZ, Arnaud  wrote:
> 
> Yes, it does interfere, I do have additional task managers. My batch 
> application comes in my streaming cluster Flink’s GUI instead of creating its 
> own container with its own GUI despite the –m yarn-cluster option.
>  
> From: Till Rohrmann [mailto:trohrm...@apache.org] Sent: Thursday, December 3, 
> 2015 10:36 To: user@flink.apache.org Subject: Re: HA Mode and 
> standalone containers compatibility ?
>  
> Hi Arnaud,
>  
> as long as you don't have HA activated for your batch jobs, HA shouldn't have 
> an influence on the batch execution. If it interferes, then you should see 
> additional task manager connected to the streaming cluster when you execute 
> the batch job. Could you check that? Furthermore, could you check that 
> actually a second yarn application is started when you run the batch jobs?
>  
> Cheers,
> Till
>  
> On Thu, Dec 3, 2015 at 9:57 AM, LINZ, Arnaud  wrote:
> Hello,
> 
>  
> 
> I have both streaming applications & batch applications. Since the memory 
> needs are not the same, I was using a long-living container for my streaming 
> apps and new short-lived containers for hosting each batch execution.
> 
>  
> 
> For that, I submit streaming jobs with "flink run"  and batch jobs with 
> "flink run -m yarn-cluster"
> 
>  
> 
> This was working fine until I turned zookeeper HA mode on for my streaming 
> applications.
> 
> Even if I don't set it up in the yaml flink configuration file, but with -D 
> options on the yarn_session.sh command line, now my batch jobs try to run in 
> the streaming container, and fails because of the lack of ressources.
> 
>  
> 
> My HA options are :
> 
> -Dyarn.application-attempts=10 -Drecovery.mode=zookeeper 
> -Drecovery.zookeeper.quorum=h1r1en01:2181 
> -Drecovery.zookeeper.path.root=/flink  -Dstate.backend=filesystem 
> -Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints 
> -Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/
> 
>  
> 
> Am I missing something ?
> 
>  
> 
> Best regards,
> 
> Aranud
> 
>  
> 



RE: HA Mode and standalone containers compatibility ?

2015-12-03 Thread LINZ, Arnaud
More details :

Command =
/usr/lib/flink/bin/flink run -m yarn-cluster -yn 48 -ytm 5120 -yqu batch1 -ys 4 
--class com.bouygtel.kubera.main.segstage.MainGeoSegStage 
/home/voyager/KBR/GOS/lib/KUBERA-GEO-SOURCE-0.0.1-SNAPSHOT-allinone.jar  -j 
/home/voyager/KBR/GOS/log -c /home/voyager/KBR/GOS/cfg/KBR_GOS_Config.cfg


Start of trace is :
Found YARN properties file /tmp/.yarn-properties-voyager
YARN properties set default parallelism to 24
Using JobManager address from YARN properties 
bt1shli3.bpa.bouyguestelecom.fr/172.21.125.28:36700
YARN cluster mode detected. Switching Log4j output to console


The content of /tmp/.yarn-properties-voyager is related to the streaming cluster:

#Generated YARN properties file
#Thu Dec 03 11:03:06 CET 2015
parallelism=24
dynamicPropertiesString=yarn.heap-cutoff-ratio\=0.6@@yarn.application-attempts\=10@@recovery.mode\=zookeeper@@recovery.zookeeper.quorum\=h1r1en01\:2181@@recovery.zookeeper.path.root\=/flink@@state.backend\=filesystem@@state.backend.fs.checkpointdir\=hdfs\:///tmp/flink/checkpoints@@recovery.zookeeper.storageDir\=hdfs\:///tmp/flink/recovery/
jobManager=172.21.125.28\:36700




From: LINZ, Arnaud
Sent: Thursday, December 3, 2015 11:01
To: user@flink.apache.org
Subject: RE: HA Mode and standalone containers compatibility ?

Yes, it does interfere: I do have additional task managers. My batch 
application shows up in my streaming cluster's Flink GUI instead of creating 
its own container with its own GUI, despite the -m yarn-cluster option.

From: Till Rohrmann [mailto:trohrm...@apache.org]
Sent: Thursday, December 3, 2015 10:36
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?

Hi Arnaud,

as long as you don't have HA activated for your batch jobs, HA shouldn't have 
an influence on the batch execution. If it interferes, then you should see 
additional task manager connected to the streaming cluster when you execute the 
batch job. Could you check that? Furthermore, could you check that actually a 
second yarn application is started when you run the batch jobs?

Cheers,
Till

On Thu, Dec 3, 2015 at 9:57 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:

Hello,



I have both streaming applications & batch applications. Since the memory needs 
are not the same, I was using a long-living container for my streaming apps and 
new short-lived containers for hosting each batch execution.



For that, I submit streaming jobs with "flink run"  and batch jobs with "flink 
run -m yarn-cluster"



This was working fine until I turned zookeeper HA mode on for my streaming 
applications.

Even if I don't set it up in the yaml flink configuration file, but with -D 
options on the yarn_session.sh command line, now my batch jobs try to run in 
the streaming container, and fails because of the lack of ressources.



My HA options are :

-Dyarn.application-attempts=10 -Drecovery.mode=zookeeper 
-Drecovery.zookeeper.quorum=h1r1en01:2181 -Drecovery.zookeeper.path.root=/flink 
 -Dstate.backend=filesystem 
-Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints 
-Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/



Am I missing something ?



Best regards,

Aranud






RE: HA Mode and standalone containers compatibility ?

2015-12-03 Thread LINZ, Arnaud
Yes, it does interfere: I do have additional task managers. My batch 
application shows up in my streaming cluster's Flink GUI instead of creating 
its own container with its own GUI, despite the -m yarn-cluster option.

From: Till Rohrmann [mailto:trohrm...@apache.org]
Sent: Thursday, December 3, 2015 10:36
To: user@flink.apache.org
Subject: Re: HA Mode and standalone containers compatibility ?

Hi Arnaud,

as long as you don't have HA activated for your batch jobs, HA shouldn't have 
an influence on the batch execution. If it interferes, then you should see 
additional task manager connected to the streaming cluster when you execute the 
batch job. Could you check that? Furthermore, could you check that actually a 
second yarn application is started when you run the batch jobs?

Cheers,
Till

On Thu, Dec 3, 2015 at 9:57 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:

Hello,



I have both streaming applications & batch applications. Since the memory needs 
are not the same, I was using a long-living container for my streaming apps and 
new short-lived containers for hosting each batch execution.



For that, I submit streaming jobs with "flink run"  and batch jobs with "flink 
run -m yarn-cluster"



This was working fine until I turned zookeeper HA mode on for my streaming 
applications.

Even if I don't set it up in the yaml flink configuration file, but with -D 
options on the yarn_session.sh command line, now my batch jobs try to run in 
the streaming container, and fails because of the lack of ressources.



My HA options are :

-Dyarn.application-attempts=10 -Drecovery.mode=zookeeper 
-Drecovery.zookeeper.quorum=h1r1en01:2181 -Drecovery.zookeeper.path.root=/flink 
 -Dstate.backend=filesystem 
-Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints 
-Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/



Am I missing something ?



Best regards,

Aranud






HA Mode and standalone containers compatibility ?

2015-12-03 Thread LINZ, Arnaud
Hello,



I have both streaming applications & batch applications. Since the memory needs 
are not the same, I was using a long-living container for my streaming apps and 
new short-lived containers for hosting each batch execution.



For that, I submit streaming jobs with "flink run"  and batch jobs with "flink 
run -m yarn-cluster"



This was working fine until I turned zookeeper HA mode on for my streaming 
applications.

Even though I don't set it in the YAML Flink configuration file, but only with -D 
options on the yarn_session.sh command line, my batch jobs now try to run in 
the streaming container and fail because of the lack of resources.



My HA options are :

-Dyarn.application-attempts=10 -Drecovery.mode=zookeeper 
-Drecovery.zookeeper.quorum=h1r1en01:2181 -Drecovery.zookeeper.path.root=/flink 
 -Dstate.backend=filesystem 
-Dstate.backend.fs.checkpointdir=hdfs:///tmp/flink/checkpoints 
-Drecovery.zookeeper.storageDir=hdfs:///tmp/flink/recovery/



Am I missing something ?



Best regards,

Arnaud





Way to get accumulators values *during* job execution ?

2015-12-02 Thread LINZ, Arnaud
Hello,

I use Grafana/Graphite to monitor my applications. The Flink GUI is really 
nice, but it disappears after the job completes and is consequently not 
suitable for long-term monitoring.

For batch applications, I simply send the accumulators’ values at the end of 
the job to my Graphite instance.
For streaming applications, it’s more complex as the job never ends. It would 
be nice to have a way of getting the current accumulator values (like in the GUI) 
and push them periodically to Graphite from a monitoring thread. Is there any API to 
get the values during execution ?

Best regards,
Arnaud
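A minimal sketch of one possible stopgap (this is not an existing Flink API; the Graphite host, port 2003 and the metric path are illustrative assumptions): the operator keeps its own counter and a background thread pushes it periodically to a Graphite plaintext listener.

import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class GraphiteReportingMapper extends RichMapFunction<String, String> {

    private final AtomicLong processed = new AtomicLong();
    private transient ScheduledExecutorService reporter;

    @Override
    public void open(Configuration parameters) {
        // one reporting thread per parallel subtask
        reporter = Executors.newSingleThreadScheduledExecutor();
        reporter.scheduleAtFixedRate(this::report, 10, 10, TimeUnit.SECONDS);
    }

    private void report() {
        long now = System.currentTimeMillis() / 1000L;
        try (Socket socket = new Socket("graphite.example.com", 2003);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            // Graphite plaintext protocol: "<path> <value> <timestamp>\n"
            out.printf("flink.myjob.subtask%d.processed %d %d\n",
                    getRuntimeContext().getIndexOfThisSubtask(), processed.get(), now);
        } catch (IOException e) {
            // best effort: monitoring must never break the job
        }
    }

    @Override
    public String map(String value) {
        processed.incrementAndGet();
        return value;
    }

    @Override
    public void close() {
        if (reporter != null) {
            reporter.shutdownNow();
        }
    }
}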





Flink Streaming Core 0.10 in maven repos

2015-11-23 Thread LINZ, Arnaud
Hello,

Small question: I can't find the Streaming Core component in 0.10 version in 
the maven repo :
http://mvnrepository.com/artifact/org.apache.flink/flink-streaming-core

Thus, in my pom file, this artifact is the only one of my Flink dependencies 
still stuck at the 0.10-SNAPSHOT version.
Is there something wrong with the publication of that component for the 0.10 release ?

Greetings,
Arnaud






RE: Crash in a simple "mapper style" streaming app likely due to a memory leak ?

2015-11-13 Thread LINZ, Arnaud
Hi Robert,

Thanks, it works with 50% -- at least way past the previous crash point.

In my opinion (I lack real metrics), the part that uses the most memory is the 
M2 mapper, instantiated once per slot.
The most complex part is the sink (it uses a lot of HDFS files, flushing 
threads, etc.), but I expect the “RichSinkFunction” to be instantiated only 
once per slot? I’m really surprised by that memory usage; I will try attaching a 
monitoring tool to the YARN JVM to understand.

How do I set this yarn.heap-cutoff-ratio parameter for a specific application? 
I don’t want to modify the “root-protected” flink-conf.yaml with that value for 
all users & Flink jobs.

Regards,
Arnaud

De : Robert Metzger [mailto:rmetz...@apache.org]
Envoyé : vendredi 13 novembre 2015 15:16
À : user@flink.apache.org
Objet : Re: Crash in a simple "mapper style" streaming app likely due to a 
memory leak ?

Hi Arnaud,

can you try running the job again with the configuration value of 
"yarn.heap-cutoff-ratio" set to 0.5?
As you can see, the container has been killed because it used more than 12 GB: 
"12.1 GB of 12 GB physical memory used;"
You can also see from the logs that we limit the JVM heap space to 9.2 GB: 
"java -Xms9216m -Xmx9216m"

In an ideal world, we would tell the JVM to limit its memory usage to 12 GB, 
but sadly the heap space is not the only memory the JVM allocates. It also 
allocates direct memory and other things outside the heap. Therefore, we give only 75% 
of the container memory to the heap.
In your case, I assume that each JVM has multiple HDFS clients, a lot of 
local threads, etc., which is why the memory might not suffice.
With a cutoff ratio of 0.5, we'll only use 6 GB for the heap.

That value might be a bit too high, but I want to make sure that we first 
identify the issue.
If the job runs with a 50% cutoff, you can try to reduce it again towards 
25% (that's the default value, contrary to what the documentation says).

I hope that helps.

Regards,
Robert


On Fri, Nov 13, 2015 at 2:58 PM, LINZ, Arnaud 
mailto:al...@bouyguestelecom.fr>> wrote:
Hello,

I use the brand new 0.10 version and I have problems running a streaming 
execution. My topology is linear : a custom source SC scans a directory and 
emits hdfs file names ; a first mapper M1 opens the file and emits its lines ; 
a filter F filters lines ; another mapper M2 transforms them ; and a 
mapper/sink M3->SK stores them in HDFS.

SC->M1->F->M2->M3->SK

The M2 transformer uses a fair amount of RAM because, when it opens, it loads an 
11M-row static table into a hash map to enrich the lines. I use 55 slots on YARN: 
11 containers of 12 GB x 5 slots each.

To my understanding, I should not have any memory problem since each record is 
independent: no join, no key, no aggregation, no window => it’s a simple flow 
mapper, with RAM simply used as a buffer. However, if I submit enough input 
data, I systematically crash my app with a “Connection unexpectedly closed by 
remote task manager” exception, and the first error in the YARN log shows that “a 
container is running beyond physical memory limits”.

If I increase the container size, I simply need to feed in more data to get the 
crash happen.

Any idea?

Greetings,
Arnaud

_
Exceptions in Flink dashboard detail :

Root Exception :
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: 
Connection unexpectedly closed by remote task manager 
'bt1shli6/172.21.125.31:33186<http://172.21.125.31:33186>'. This might indicate 
that the remote task manager was lost.
   at 
org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.channelInactive(PartitionRequestClientHandler.java:119)
(…)





RE: Join Stream with big ref table

2015-11-13 Thread LINZ, Arnaud
Hello,

I’ve worked around my problem by not using the HiveServer2 JDBC driver to read 
the ref table. Apparently, despite all the right options passed to the Statement 
object, it handles RAM poorly: converting the table to text format and reading 
the HDFS files directly works without any problem and with plenty of free memory…

Greetings,
Arnaud
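A minimal sketch of the pattern described above, assuming the reference table has been exported as a tab-separated text file on HDFS (the path and the field layout are illustrative assumptions): open() loads it into a heap HashMap, and map() enriches each record with a plain lookup.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RefTableEnricher extends RichMapFunction<String, String> {

    private transient Map<String, String> refTable;

    @Override
    public void open(Configuration parameters) throws Exception {
        refTable = new HashMap<>();
        Path path = new Path("hdfs:///tmp/ref_table.tsv");           // illustrative path
        FileSystem fs = path.getFileSystem(new org.apache.hadoop.conf.Configuration());
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t", 2);               // key <TAB> enrichment value
                if (fields.length == 2) {
                    refTable.put(fields[0], fields[1]);
                }
            }
        }
    }

    @Override
    public String map(String record) {
        // assume the record's first tab-separated field is the lookup key
        String key = record.split("\t", 2)[0];
        return record + "\t" + refTable.getOrDefault(key, "UNKNOWN");
    }
}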

De : LINZ, Arnaud
Envoyé : jeudi 12 novembre 2015 17:48
À : 'user@flink.apache.org' 
Objet : Join Stream with big ref table

Hello,

I have to enrich a stream with a big reference table (11,000,000 rows). I 
cannot use “join” because I cannot window the stream ; so in the “open()” 
function of each mapper I read the content of the table and put it in a HashMap 
(stored on the heap).

11M rows is quite big but it should take less than 100Mb in RAM, so it’s 
supposed to be easy. However, I systematically run into a Java Out Of Memory 
error, even with huge 64Gb containers (5 slots / container).

Path, ID: akka.tcp://flink@172.21.125.28:43653/user/taskmanager (4B4D0A725451E933C39E891AAE80B53B)
Data Port: 41982
Last Heartbeat: 2015-11-12, 17:46:14
All Slots: 5
Free Slots: 5
CPU Cores: 32
Physical Memory: 126.0 GB
Free Memory: 46.0 GB
Flink Managed Memory: 31.5 GB


I don’t clearly understand why this happens and how to fix it. Any clue?








Join Stream with big ref table

2015-11-12 Thread LINZ, Arnaud
Hello,

I have to enrich a stream with a big reference table (11,000,000 rows). I 
cannot use “join” because I cannot window the stream ; so in the “open()” 
function of each mapper I read the content of the table and put it in a HashMap 
(stored on the heap).

11M rows is quite big but it should take less than 100Mb in RAM, so it’s 
supposed to be easy. However, I systematically run into a Java Out Of Memory 
error, even with huge 64Gb containers (5 slots / container).

Path, ID: akka.tcp://flink@172.21.125.28:43653/user/taskmanager (4B4D0A725451E933C39E891AAE80B53B)
Data Port: 41982
Last Heartbeat: 2015-11-12, 17:46:14
All Slots: 5
Free Slots: 5
CPU Cores: 32
Physical Memory: 126.0 GB
Free Memory: 46.0 GB
Flink Managed Memory: 31.5 GB


I don’t clearly understand why this happens and how to fix it. Any clue?








RE: Multiple keys in reduceGroup ?

2015-10-22 Thread LINZ, Arnaud
Hi,

Thanks a lot for the explanation. I cannot even say that it wasn’t stated in 
the documentation, I’ve simply missed the iterator part :


“by default, user defined functions (like map() or reduce()) are getting new 
objects on each call (or through an iterator). So it is possible to keep 
references to the objects inside the function (for example in a List).
There is a switch at the ExecutionConfig which allows users to enable the object 
reuse mode:
env.getExecutionConfig().enableObjectReuse()
For mutable types, Flink will reuse object instances. In practice that means 
that a map() function will always receive the same object instance (with its 
fields set to new values). The object reuse mode will lead to better 
performance because fewer objects are created, but the user has to manually 
take care of what they are doing with the object references.”
Greetings,
Arnaud

De : Till Rohrmann [mailto:trohrm...@apache.org]
Envoyé : jeudi 22 octobre 2015 13:45
À : user@flink.apache.org
Objet : Re: Multiple keys in reduceGroup ?


You don’t modify the objects; however, the ReusingKeyGroupedIterator, which is 
the iterator you have in your reduce function, does. Internally it uses two 
objects, in your case of type Tuple2<InputRecord, Reference>, to deserialize 
the input records. These two objects are alternately returned when you call 
next on the iterator. Since you only store references to these two objects in 
your ArrayList, you will see any changes made to these two objects.

However, this only explains why the values of your elements change, and not the 
key. To understand why you observe different keys in your group, you have to 
know that the ReusingKeyGroupedIterator does a look-ahead to see whether the 
next element has the same key value. The look-ahead is stored in one of the two 
objects. When the iterator detects that the next element has a new key, it 
finishes the iteration. That is why you’ll see the key value of the next 
group in half of your elements.

If you want to accumulate input data while using object reuse mode, you should 
copy the input elements.
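A minimal sketch of that copying, assuming object reuse stays enabled and records of an illustrative type Tuple2<Long, String>: each reused instance is copied before being buffered, so the iterator's look-ahead cannot overwrite the buffered values.

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class CopyingReduce implements GroupReduceFunction<Tuple2<Long, String>, String> {

    @Override
    public void reduce(Iterable<Tuple2<Long, String>> values, Collector<String> out) {
        final List<Tuple2<Long, String>> buffer = new ArrayList<>();
        for (Tuple2<Long, String> value : values) {
            // Tuple2#copy() creates a new tuple instance; with immutable field types
            // (Long, String) that is enough, mutable field types would need a deeper copy.
            buffer.add(value.copy());
        }
        out.collect(buffer.size() + " records for key " + buffer.get(0).f0);
    }
}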
​

On Thu, Oct 22, 2015 at 1:30 PM, LINZ, Arnaud 
mailto:al...@bouyguestelecom.fr>> wrote:
Hi,

I was using primitive types, and enableObjectReuse was turned on. My next move 
was to turn it off, and it did solve the problem.
It also increased execution time by 10%, but it’s hard to say whether this overhead 
is due to the copy or to the change of behavior of the reduceGroup algorithm 
once it gets the right data.

Since I never modify my objects, why isn’t object reuse working?

Best regards,
Arnaud


De : Till Rohrmann [mailto:trohrm...@apache.org<mailto:trohrm...@apache.org>]
Envoyé : jeudi 22 octobre 2015 12:36
À : user@flink.apache.org<mailto:user@flink.apache.org>
Objet : Re: Multiple keys in reduceGroup ?

If not, could you provide us with the program and test data to reproduce the 
error?

Cheers,
Till

On Thu, Oct 22, 2015 at 12:34 PM, Aljoscha Krettek 
mailto:aljos...@apache.org>> wrote:
Hi,
but he’s comparing it to a primitive long, so shouldn’t the Long key be unboxed 
and the comparison still be valid?

My question is whether you enabled object-reuse-mode on the 
ExecutionEnvironment?

Cheers,
Aljoscha
> On 22 Oct 2015, at 12:31, Stephan Ewen 
> mailto:se...@apache.org>> wrote:
>
> Hi!
>
> You are checking for equality / inequality with "!=" - can you check with 
> "equals()" ?
>
> The key objects will most certainly be different in each record (as they are 
> deserialized individually), but they should be equal.
>
> Stephan
>
>
> On Thu, Oct 22, 2015 at 12:20 PM, LINZ, Arnaud 
> mailto:al...@bouyguestelecom.fr>> wrote:
> Hello,
>
>
>
> Trying to understand why my code was giving strange results, I’ve ended up 
> adding “useless” controls in my code and came with what seems to me a bug. I 
> group my dataset according to a key, but in the reduceGroup function I am 
> passed values with different keys.
>
>
>
> My code has the following pattern (mix of java & pseudo-code in []) :
>
>
>
> inputDataSet [of InputRecord]
>
> .joinWithTiny(referencesDataSet [of Reference])
>
> .where([InputRecord SecondaryKeySelector]).equalTo([Reference KeySelector])
>
>
> .groupBy([PrimaryKeySelector : Tuple2 -> 
> value.f0.getPrimaryKey()])
>
> .sortGroup([DateKeySelector], Order.ASCENDING)
>
> .reduceGroup(new ReduceFunction() {
>
> @Override
>
>public void reduce(Iterable< Tuple2> values,  
> Collector out) throws Exception {
>
>  // Issue : all values do not share the same key
>
>   final List> listValues = new 
> ArrayList>();
>
>  for (final Tuple2value : values) { 
> listValues.add(value); }
>
>
>
> final long primkey = listValues.get(0).f0.g

RE: Multiple keys in reduceGroup ?

2015-10-22 Thread LINZ, Arnaud
Hi,

I was using primitive types, and enableObjectReuse was turned on. My next move 
was to turn it off, and it did solve the problem.
It also increased execution time by 10%, but it’s hard to say whether this overhead 
is due to the copy or to the change of behavior of the reduceGroup algorithm 
once it gets the right data.

Since I never modify my objects, why isn’t object reuse working?

Best regards,
Arnaud


De : Till Rohrmann [mailto:trohrm...@apache.org]
Envoyé : jeudi 22 octobre 2015 12:36
À : user@flink.apache.org
Objet : Re: Multiple keys in reduceGroup ?

If not, could you provide us with the program and test data to reproduce the 
error?

Cheers,
Till

On Thu, Oct 22, 2015 at 12:34 PM, Aljoscha Krettek 
mailto:aljos...@apache.org>> wrote:
Hi,
but he’s comparing it to a primitive long, so shouldn’t the Long key be unboxed 
and the comparison still be valid?

My question is whether you enabled object-reuse-mode on the 
ExecutionEnvironment?

Cheers,
Aljoscha
> On 22 Oct 2015, at 12:31, Stephan Ewen 
> mailto:se...@apache.org>> wrote:
>
> Hi!
>
> You are checking for equality / inequality with "!=" - can you check with 
> "equals()" ?
>
> The key objects will most certainly be different in each record (as they are 
> deserialized individually), but they should be equal.
>
> Stephan
>
>
> On Thu, Oct 22, 2015 at 12:20 PM, LINZ, Arnaud 
> mailto:al...@bouyguestelecom.fr>> wrote:
> Hello,
>
>
>
> Trying to understand why my code was giving strange results, I’ve ended up 
> adding “useless” controls in my code and came with what seems to me a bug. I 
> group my dataset according to a key, but in the reduceGroup function I am 
> passed values with different keys.
>
>
>
> My code has the following pattern (mix of java & pseudo-code in []) :
>
>
>
> inputDataSet [of InputRecord]
>
> .joinWithTiny(referencesDataSet [of Reference])
>
> .where([InputRecord SecondaryKeySelector]).equalTo([Reference KeySelector])
>
>
> .groupBy([PrimaryKeySelector : Tuple2 -> 
> value.f0.getPrimaryKey()])
>
> .sortGroup([DateKeySelector], Order.ASCENDING)
>
> .reduceGroup(new ReduceFunction() {
>
> @Override
>
>public void reduce(Iterable< Tuple2> values,  
> Collector out) throws Exception {
>
>  // Issue : all values do not share the same key
>
>   final List> listValues = new 
> ArrayList>();
>
>  for (final Tuple2value : values) { 
> listValues.add(value); }
>
>
>
> final long primkey = listValues.get(0).f0.getPrimaryKey();
>
>for (int i = 1; i < listValues.size(); i++) {
>
> if (listValues.get(i).f0.getPrimaryKey() != primkey) {
>
>   throw new IllegalStateException(primkey + " != " + 
> listValues.get(i).f0.getPrimaryKey());
>
> è This exception is fired !
>
>}
>
> }
>
> }
>
> }) ;
>
>
>
> I use the current 0.10 snapshot. The issue appears in local cluster mode unit 
> tests as well as in yarn mode (however it’s ok when I test it with very few 
> elements).
>
>
>
> The sortGroup is not the cause of the problem, as I do get the same error 
> without it.
>
>
>
> Have I misunderstood the grouping concept or is it really an awful bug?
>
>
>
> Best regards,
>
> Arnaud
>
>
>
>
>
>
>
>
>
>



Multiple keys in reduceGroup ?

2015-10-22 Thread LINZ, Arnaud
Hello,

Trying to understand why my code was giving strange results, I’ve ended up 
adding “useless” checks in my code and came up with what seems to me to be a bug. I 
group my dataset according to a key, but in the reduceGroup function I am 
passed values with different keys.

My code has the following pattern (mix of java & pseudo-code in []) :

inputDataSet [of InputRecord]
    .joinWithTiny(referencesDataSet [of Reference])
    .where([InputRecord SecondaryKeySelector]).equalTo([Reference KeySelector])
    .groupBy([PrimaryKeySelector : Tuple2<InputRecord, Reference> value -> value.f0.getPrimaryKey()])
    .sortGroup([DateKeySelector], Order.ASCENDING)
    .reduceGroup(new GroupReduceFunction<Tuple2<InputRecord, Reference>, [OutputType]>() {
        @Override
        public void reduce(Iterable<Tuple2<InputRecord, Reference>> values,
                Collector<[OutputType]> out) throws Exception {
            // Issue : all values do not share the same key
            final List<Tuple2<InputRecord, Reference>> listValues =
                    new ArrayList<Tuple2<InputRecord, Reference>>();
            for (final Tuple2<InputRecord, Reference> value : values) {
                listValues.add(value);
            }

            final long primkey = listValues.get(0).f0.getPrimaryKey();
            for (int i = 1; i < listValues.size(); i++) {
                if (listValues.get(i).f0.getPrimaryKey() != primkey) {
                    throw new IllegalStateException(primkey + " != "
                            + listValues.get(i).f0.getPrimaryKey());
                    // ==> This exception is fired !
                }
            }
        }
    });

I use the current 0.10 snapshot. The issue appears in local cluster mode unit 
tests as well as in yarn mode (however it’s ok when I test it with very few 
elements).

The sortGroup is not the cause of the problem, as I do get the same error 
without it.

Have I misunderstood the grouping concept or is it really an awful bug?

Best regards,
Arnaud








RE: Flink batch runs OK but Yarn container fails in batch mode with -m yarn-cluster

2015-10-20 Thread LINZ, Arnaud
Hi, 
Sorry for the long delay, I missed this mail.
I was using the 0.10 snapshot. I've upgraded it today and it seems to work now; 
I get a SUCCEEDED final state too.

Best regards,
Arnaud

-Message d'origine-
De : Maximilian Michels [mailto:m...@apache.org] 
Envoyé : jeudi 8 octobre 2015 14:34
À : user@flink.apache.org; LINZ, Arnaud 
Objet : Re: Flink batch runs OK but Yarn container fails in batch mode with -m 
yarn-cluster

Hi Arnaud,

I've looked into the problem but I couldn't reproduce it using Flink 0.9.0, 
Flink 0.9.1 and the current master snapshot (f332fa5). I always ended up with 
the final state SUCCEEDED.

Which version of Flink were you using?

Best regards,
Max

On Thu, Sep 3, 2015 at 10:48 AM, Robert Metzger  wrote:
> Hi Arnaud,
>
> I think that's a bug ;)
> I'll file a JIRA to fix it for the next release.
>
> On Thu, Sep 3, 2015 at 10:26 AM, LINZ, Arnaud 
> 
> wrote:
>>
>> Hi,
>>
>>
>>
>> I am wondering why, despite the fact that my java main() methods runs 
>> OK and exit with 0 code value, the Yarn container status set by the 
>> englobing flink execution is FAILED with diagnostic "Flink YARN 
>> Client requested shutdown."?
>>
>>
>>
>> Command line :
>>
>> flink run -m yarn-cluster -yn 20 -ytm 8192 -yqu batch1 -ys 8 --class 
>>   
>>
>>
>>
>> End of yarn log :
>>
>>
>>
>> Status of job 6ac47ddc8331ffd0b1fa9a3b5a551f86 
>> (KUBERA-GEO-BRUT2SEGMENT) changed to FINISHED.
>>
>> 10:03:00,618 INFO
>> org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1- Stopping
>> YARN JobManager with status FAILED and diagnostic Flink YARN Client 
>> requested shutdown.
>>
>> 10:03:00,625 INFO  
>> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl
>> - Waiting for application to be successfully unregistered.
>>
>> 10:03:00,874 INFO
>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolPro
>> xy  - Closing proxy : h1r2dn12.bpa.bouyguestelecom.fr:45454
>>
>> (… more closing proxy …)
>>
>> 10:03:00,877 INFO
>> org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolPro
>> xy  - Closing proxy : h1r2dn01.bpa.bouyguestelecom.fr:45454
>>
>> 10:03:00,883 INFO
>> org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1- Stopping
>> JobManager akka://flink/user/jobmanager#1737010364.
>>
>> 10:03:00,895 INFO  
>> akka.remote.RemoteActorRefProvider$RemotingTerminator
>> - Shutting down remote daemon.
>>
>> 10:03:00,896 INFO  
>> akka.remote.RemoteActorRefProvider$RemotingTerminator
>> - Remote daemon shut down; proceeding with flushing remote transports.
>>
>> 10:03:00,918 INFO  
>> akka.remote.RemoteActorRefProvider$RemotingTerminator
>> - Remoting shut down.
>>
>>
>>
>> End of log4j log:
>>
>>
>>
>> 2015:09:03 10:03:00 (main) - INFO -
>> com.bouygtel.kuberasdk.main.Application.mainMethod - Fin ok 
>> traitement
>>
>> 2015:09:03 10:03:00 (Thread-14) - INFO - Classe Inconnue.Methode 
>> Inconnue
>> - Shutting down FlinkYarnCluster from the client shutdown hook
>>
>> 2015:09:03 10:03:00 (Thread-14) - INFO - Classe Inconnue.Methode 
>> Inconnue
>> - Sending shutdown request to the Application Master
>>
>> 2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-2) - INFO - 
>> Classe Inconnue.Methode Inconnue - Sending StopYarnSession request to 
>> ApplicationMaster.
>>
>> 2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-2) - INFO - 
>> Classe Inconnue.Methode Inconnue - Remote JobManager has been stopped 
>> successfully. Stopping local application client
>>
>> 2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-2) - INFO - 
>> Classe Inconnue.Methode Inconnue - Stopped Application client.
>>
>> 2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-15) - INFO - 
>> Classe Inconnue.Methode Inconnue - Shutting down remote daemon.
>>
>> 2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-15) - INFO - 
>> Classe Inconnue.Methode Inconnue - Remote daemon shut down; 
>> proceeding with flushing remote transports.
>>
>> 2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-15) - INFO - 
>> Classe Inconnue.Methode Inconnue - Remoting shut down.
>>
>> 2015:09:03 10:03:00 (Thread-14) - INFO - Classe Inconnue.Methode 
>> Inconnue
>> - Deleting files in
>> hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/datcrypt/.flink/appl
>> ication_1441011714087_0730
>>
>&

Flink batch runs OK but Yarn container fails in batch mode with -m yarn-cluster

2015-09-03 Thread LINZ, Arnaud
Hi,



I am wondering why, despite the fact that my java main() methods runs OK and 
exit with 0 code value, the Yarn container status set by the englobing flink 
execution is FAILED with diagnostic "Flink YARN Client requested shutdown."?



Command line :

flink run -m yarn-cluster -yn 20 -ytm 8192 -yqu batch1 -ys 8 --class  
 



End of yarn log :



Status of job 6ac47ddc8331ffd0b1fa9a3b5a551f86 (KUBERA-GEO-BRUT2SEGMENT) 
changed to FINISHED.

10:03:00,618 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1   
 - Stopping YARN JobManager with status FAILED and diagnostic Flink YARN Client 
requested shutdown.

10:03:00,625 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl
 - Waiting for application to be successfully unregistered.

10:03:00,874 INFO  
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
Closing proxy : h1r2dn12.bpa.bouyguestelecom.fr:45454

(… more closing proxy …)

10:03:00,877 INFO  
org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - 
Closing proxy : h1r2dn01.bpa.bouyguestelecom.fr:45454

10:03:00,883 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1   
 - Stopping JobManager akka://flink/user/jobmanager#1737010364.

10:03:00,895 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator
 - Shutting down remote daemon.

10:03:00,896 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator
 - Remote daemon shut down; proceeding with flushing remote transports.

10:03:00,918 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator
 - Remoting shut down.



End of log4j log:



2015:09:03 10:03:00 (main) - INFO - 
com.bouygtel.kuberasdk.main.Application.mainMethod - Fin ok traitement

2015:09:03 10:03:00 (Thread-14) - INFO - Classe Inconnue.Methode Inconnue - 
Shutting down FlinkYarnCluster from the client shutdown hook

2015:09:03 10:03:00 (Thread-14) - INFO - Classe Inconnue.Methode Inconnue - 
Sending shutdown request to the Application Master

2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-2) - INFO - Classe 
Inconnue.Methode Inconnue - Sending StopYarnSession request to 
ApplicationMaster.

2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-2) - INFO - Classe 
Inconnue.Methode Inconnue - Remote JobManager has been stopped successfully. 
Stopping local application client

2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-2) - INFO - Classe 
Inconnue.Methode Inconnue - Stopped Application client.

2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-15) - INFO - Classe 
Inconnue.Methode Inconnue - Shutting down remote daemon.

2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-15) - INFO - Classe 
Inconnue.Methode Inconnue - Remote daemon shut down; proceeding with flushing 
remote transports.

2015:09:03 10:03:00 (flink-akka.actor.default-dispatcher-15) - INFO - Classe 
Inconnue.Methode Inconnue - Remoting shut down.

2015:09:03 10:03:00 (Thread-14) - INFO - Classe Inconnue.Methode Inconnue - 
Deleting files in 
hdfs://h1r1nn01.bpa.bouyguestelecom.fr:8020/user/datcrypt/.flink/application_1441011714087_0730

2015:09:03 10:03:00 (Thread-15) - INFO - Classe Inconnue.Methode Inconnue - 
Application application_1441011714087_0730 finished with state FINISHED and 
final state FAILED at 1441267380623

2015:09:03 10:03:00 (Thread-14) - WARN - Classe Inconnue.Methode Inconnue - The 
short-circuit local reads feature cannot be used because libhadoop cannot be 
loaded.

2015:09:03 10:03:01 (Thread-14) - INFO - Classe Inconnue.Methode Inconnue - 
YARN Client is shutting down



Greetings,

Arnaud





RE: How to force the parallelism on small streams?

2015-09-02 Thread LINZ, Arnaud
Hi,

You are right, but in fact it does not solve my problem, since I have a 
parallelism of 100 everywhere. Each of my 100 sources emits only a few lines (say 14 
max), and only the first 14 downstream nodes will receive data.
Same problem when replacing rebalance() with shuffle().

But I found a workaround: setting the parallelism to 1 for the source (I don't need 
100 directory scanners anyway) forces the data to be rebalanced evenly between the 
mappers.

Greetings,
Arnaud


-Message d'origine-
De : Matthias J. Sax [mailto:mj...@apache.org] 
Envoyé : mercredi 2 septembre 2015 17:56
À : user@flink.apache.org
Objet : Re: How to force the parallelism on small streams?

Hi,

If I understand you correctly, you want to have 100 mappers. Thus you need to 
apply .setParallelism() after .map():

> addSource(myFileSource).rebalance().map(myFileMapper).setParallelism(100)

The order of commands you used sets the dop for the source to 100 (which might 
be ignored if the provided source function "myFileSource" does not implement the 
"ParallelSourceFunction" interface). The dop for the mapper should be the 
default value.

Using .rebalance() is absolutely correct. It distributes the emitted tuples in 
a round-robin fashion to all consumer tasks.
-Matthias
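A minimal sketch of the ordering discussed here, combining Matthias' advice with Arnaud's workaround above: a single directory-scanner source (dop 1), an explicit rebalance(), and the heavy mapper at dop 100. The source and mapper bodies are illustrative placeholders.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class RebalanceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.addSource(new SourceFunction<String>() {
                private volatile boolean running = true;
                @Override
                public void run(SourceContext<String> ctx) {
                    // stand-in for the directory scanner: emit a handful of file names
                    for (int i = 0; i < 14 && running; i++) {
                        ctx.collect("hdfs:///input/file-" + i);
                    }
                }
                @Override
                public void cancel() {
                    running = false;
                }
            }).setParallelism(1)               // one directory scanner is enough
            .rebalance()                       // round-robin the file names to all mappers
            .map(new MapFunction<String, String>() {
                @Override
                public String map(String fileName) {
                    return "processed " + fileName;   // stand-in for the heavy per-file work
                }
            }).setParallelism(100)
            .print();                          // simple sink so the sketch is a complete topology

        env.execute("rebalance sketch");
    }
}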

On 09/02/2015 05:41 PM, LINZ, Arnaud wrote:
> Hi,
> 
>  
> 
> I have a source that provides few items since it gives file names to 
> the mappers. The mapper opens the file and process records. As the 
> files are huge, one input line (a filename) gives a consequent work to the 
> next stage.
> 
> My topology looks like :
> 
> addSource(myFileSource).rebalance().setParallelism(100).map(myFileMapp
> er)
> 
> If 100 mappers are created, about 85 end immediately and only a few 
> process the files (for hours). I suspect an optimization making that 
> there is a minimum number of lines to pass to the next node or it is 
> “shutdown” ; but in my case I do want the lines to be evenly 
> distributed to each mapper.
> 
> How to enforce that ?
> 
>  
> 
> Greetings,
> 
> Arnaud
> 
> 



How to force the parallelism on small streams?

2015-09-02 Thread LINZ, Arnaud
Hi,

I have a source that provides only a few items, since it emits file names for the 
mappers. The mapper opens the file and processes its records. As the files are huge, 
one input line (a file name) represents a substantial amount of work for the next stage.
My topology looks like :
addSource(myFileSource).rebalance().setParallelism(100).map(myFileMapper)
If 100 mappers are created, about 85 end immediately and only a few process the 
files (for hours). I suspect an optimization such that a minimum number of lines is 
required before data is passed to the next node, or the node is “shut down”; but in my case I 
do want the lines to be evenly distributed to each mapper.
How do I enforce that ?

Greetings,
Arnaud





RE: Best way for simple logging in jobs?

2015-08-31 Thread LINZ, Arnaud
Hi,
For unknown reasons, the stdout/stderr output of my jobs wasn’t retrieved by 
YARN. Same thing for the slf4j logger: outside local cluster mode, I could not see 
any log output from the nodes. I’ve spent a few hours trying to find out why, but I 
gave up.
Since I need “real time” logging & monitoring in the main driver application, 
I’ve ended up implementing a simple push/pull socket message queue using JeroMQ 
(pure-Java 0MQ). That allows me to easily publish log & metrology information 
from the worker nodes to the driver program.
Arnaud
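A very small sketch of such a JeroMQ push/pull channel (the port, the driver host name and the message format are illustrative assumptions; the calls are from JeroMQ's ZMQ facade): the driver binds a PULL socket in a monitoring thread and prints whatever the workers push from their PUSH sockets.

import org.zeromq.ZMQ;

public class LogChannel {

    // Driver side: run this in a monitoring thread of the main program.
    public static void runDriver() {
        ZMQ.Context context = ZMQ.context(1);
        ZMQ.Socket pull = context.socket(ZMQ.PULL);
        pull.bind("tcp://*:5555");
        while (!Thread.currentThread().isInterrupted()) {
            System.out.println("[worker] " + pull.recvStr());
        }
        pull.close();
        context.term();
    }

    // Worker side: call from open() of a rich function, keep the returned socket
    // and push log / metrology lines to the driver host.
    public static ZMQ.Socket connectWorker(String driverHost) {
        ZMQ.Context context = ZMQ.context(1);
        ZMQ.Socket push = context.socket(ZMQ.PUSH);
        push.connect("tcp://" + driverHost + ":5555");
        push.send("worker connected");
        return push;
    }
}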


De : ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] De la part de Stephan 
Ewen
Envoyé : lundi 31 août 2015 11:05
À : user@flink.apache.org
Objet : Re: Best way for simple logging in jobs?

@Arnaud

Are you looking for a separate user log file next to the system log file, or 
would Robert's suggestion work?

On Fri, Aug 28, 2015 at 4:20 PM, Robert Metzger 
mailto:rmetz...@apache.org>> wrote:
Hi,

Creating a slf4j logger like this:

private static final Logger LOG = 
LoggerFactory.getLogger(PimpedKafkaSink.class);

Works for me. The messages also end up in the regular YARN logs.

Also system out should end up in YARN actually (when retrieving the logs from 
the log aggregation).

Regards,

Robert

On Fri, Aug 28, 2015 at 3:55 PM, LINZ, Arnaud 
mailto:al...@bouyguestelecom.fr>> wrote:

Hi,



I am wondering if it’s possible to get my own logs inside the job functions 
(sources, mappers, sinks…).  It would be nice if I could get those logs in the 
Yarn’s logs, but writing System.out/System.err has no effect.



For now I’m using a “StringBuffer” accumulator but it does not work in 
streaming apps before v0.10, and only show results at the end.



I’ll probably end up using a HDFS logging system but there is maybe a smarter 
way ?



Greetings,

Arnaud









RE: "Flink YARN Client requested shutdown" in flink -m yarn-cluster mode?

2015-08-28 Thread LINZ, Arnaud
Hi Robert,

As we saw together, my mistake was to launch the job in detached mode (-yd) while 
my main function did not wait after submission and ended immediately. 
Sorry for my misunderstanding of this option.

Best regards,
Arnaud

De : Robert Metzger [mailto:rmetz...@apache.org]
Envoyé : vendredi 28 août 2015 11:03
À : user@flink.apache.org
Objet : Re: "Flink YARN Client requested shutdown" in flink -m yarn-cluster 
mode?

Is the log from 0.9-SNAPSHOT or 0.10-SNAPSHOT?

Can you send me (if you want privately as well) the full log of the yarn 
application:

yarn logs -applicationId .

We need to find out why the TaskManagers are shutting down. That is most likely 
logged in the TaskManager logs.


On Fri, Aug 28, 2015 at 10:57 AM, LINZ, Arnaud 
mailto:al...@bouyguestelecom.fr>> wrote:

Hello,



I’ve moved my version from 0.9.0 and tried both 0.9-SNAPSHOT & 0.10-SNAPSHOT to 
continue my batch execution on my secured cluster thanks to [FLINK-2555].

My application works nicely in local mode and also in yarn mode using a job 
container started with yarn-session.sh, but it fails in -m yarn-cluster mode



Yarn logs indicate that  “Flink YARN Client requested shutdown” but I did 
nothing like that (or not intentionally). The nodes are not even starting and 
the exec() does not return any JobExecutionResult.



My command line was :

flink run -m yarn-cluster -yd -yn 2 -ytm 1500 -yqu default -ys 4 --class 
  



Any idea what I’ve done wrong?



Greetings,

Arnaud



PS - Yarn log extract :

(…)

09:56:29,111 INFO  org.apache.flink.yarn.YarnTaskManager
 - Successful registration at JobManager 
(akka.tcp://flink@172.19.115.51:54806/user/jobmanager<http://flink@172.19.115.51:54806/user/jobmanager>),
 starting network stack and library cache.

09:56:29,817 INFO  org.apache.flink.runtime.io.network.netty.NettyClient
 - Successful initialization (took 73 ms).

09:56:29,889 INFO  org.apache.flink.runtime.io.network.netty.NettyServer
 - Successful initialization (took 55 ms). Listening on SocketAddress 
/172.19.115.52:41920<http://172.19.115.52:41920>.

09:56:29,890 INFO  org.apache.flink.yarn.YarnTaskManager
 - Determined BLOB server address to be 
/172.19.115.51:38505<http://172.19.115.51:38505>. Starting BLOB cache.

09:56:29,893 INFO  org.apache.flink.runtime.blob.BlobCache  
 - Created BLOB cache storage directory 
/tmp/blobStore-7150f7d7-f7a3-4c4c-9cda-3877da5aacd6

09:56:52,367 INFO  org.apache.flink.yarn.YarnTaskManager
 - Received task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (1/3)

09:56:52,375 INFO  org.apache.flink.yarn.YarnTaskManager
 - Received task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (3/3)

09:56:52,383 INFO  org.apache.flink.runtime.taskmanager.Task
 - Loading JAR files for task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (1/3)

09:56:52,387 INFO  org.apache.flink.runtime.taskmanager.Task
 - Loading JAR files for task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (3/3)

09:56:52,394 INFO  org.apache.flink.yarn.YarnTaskManager
 - Received task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 2) (1/3)

09:56:52,402 INFO  org.apache.flink.runtime.taskmanager.Task
 - Loading JAR files for task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 2) (1/3)

09:56:52,425 INFO  org.apache.flink.yarn.YarnTaskManager
 - Received task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 2) (2/3)

09:56:52,429 INFO  org.apache.flink.runtime.taskmanager.Task
 - Loading JAR files for task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink

Best way for simple logging in jobs?

2015-08-28 Thread LINZ, Arnaud
Hi,



I am wondering if it’s possible to get my own logs inside the job functions 
(sources, mappers, sinks…). It would be nice if I could get those logs in the 
YARN logs, but writing to System.out/System.err has no effect.



For now I’m using a “StringBuffer” accumulator, but it does not work in 
streaming apps before v0.10 and only shows results at the end.



I’ll probably end up using an HDFS logging system, but maybe there is a smarter 
way ?



Greetings,

Arnaud







"Flink YARN Client requested shutdown" in flink -m yarn-cluster mode?

2015-08-28 Thread LINZ, Arnaud
Hello,



I’ve moved my version from 0.9.0 and tried both 0.9-SNAPSHOT & 0.10-SNAPSHOT to 
continue my batch execution on my secured cluster thanks to [FLINK-2555].

My application works nicely in local mode and also in yarn mode using a job 
container started with yarn-session.sh, but it fails in -m yarn-cluster mode



Yarn logs indicate that  “Flink YARN Client requested shutdown” but I did 
nothing like that (or not intentionally). The nodes are not even starting and 
the exec() does not return any JobExecutionResult.



My command line was :

flink run -m yarn-cluster -yd -yn 2 -ytm 1500 -yqu default -ys 4 --class 
  



Any idea what I’ve done wrong?



Greetings,

Arnaud



PS - Yarn log extract :

(…)

09:56:29,111 INFO  org.apache.flink.yarn.YarnTaskManager
 - Successful registration at JobManager 
(akka.tcp://flink@172.19.115.51:54806/user/jobmanager), starting network stack 
and library cache.

09:56:29,817 INFO  org.apache.flink.runtime.io.network.netty.NettyClient
 - Successful initialization (took 73 ms).

09:56:29,889 INFO  org.apache.flink.runtime.io.network.netty.NettyServer
 - Successful initialization (took 55 ms). Listening on SocketAddress 
/172.19.115.52:41920.

09:56:29,890 INFO  org.apache.flink.yarn.YarnTaskManager
 - Determined BLOB server address to be /172.19.115.51:38505. Starting BLOB 
cache.

09:56:29,893 INFO  org.apache.flink.runtime.blob.BlobCache  
 - Created BLOB cache storage directory 
/tmp/blobStore-7150f7d7-f7a3-4c4c-9cda-3877da5aacd6

09:56:52,367 INFO  org.apache.flink.yarn.YarnTaskManager
 - Received task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (1/3)

09:56:52,375 INFO  org.apache.flink.yarn.YarnTaskManager
 - Received task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (3/3)

09:56:52,383 INFO  org.apache.flink.runtime.taskmanager.Task
 - Loading JAR files for task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (1/3)

09:56:52,387 INFO  org.apache.flink.runtime.taskmanager.Task
 - Loading JAR files for task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (3/3)

09:56:52,394 INFO  org.apache.flink.yarn.YarnTaskManager
 - Received task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 2) (1/3)

09:56:52,402 INFO  org.apache.flink.runtime.taskmanager.Task
 - Loading JAR files for task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 2) (1/3)

09:56:52,425 INFO  org.apache.flink.yarn.YarnTaskManager
 - Received task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 2) (2/3)

09:56:52,429 INFO  org.apache.flink.runtime.taskmanager.Task
 - Loading JAR files for task CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 2) (2/3)

09:56:52,454 INFO  org.apache.flink.yarn.YarnTaskManager
 - Stopping YARN TaskManager with final application status FAILED and 
diagnostics: Flink YARN Client requested shutdown

09:56:52,480 INFO  org.apache.flink.yarn.YarnTaskManager
 - Stopping TaskManager akka://flink/user/taskmanager#2116513584.

09:56:52,483 INFO  org.apache.flink.yarn.YarnTaskManager
 - Cancelling all computations and discarding all cached data.

09:56:52,486 INFO  org.apache.flink.runtime.taskmanager.Task
 - Attempting to fail task externally CHAIN DataSource (at 
createInput(ExecutionEnvironment.java:502) 
(org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat)) -> FlatMap 
(FlatMap at readTable(HiveDAO.java:107)) -> Map (Key Extractor 1) (3/3)

09:56:52,486 INFO  org.apache

RE: HadoopDataOutputStream maybe does not expose enough methods of org.apache.hadoop.fs.FSDataOutputStream

2015-08-27 Thread LINZ, Arnaud
Hi,

Ok, I’ve created  FLINK-2580 to track this issue (and FLINK-2579, which is 
totally unrelated).

I think I’m going to set up my dev environment to start contributing a little 
more than just complaining ☺.

Best regards,
Arnaud

De : ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] De la part de Stephan 
Ewen
Envoyé : mercredi 26 août 2015 20:12
À : user@flink.apache.org
Objet : Re: HadoopDataOutputStream maybe does not expose enough methods of 
org.apache.hadoop.fs.FSDataOutputStream

I think that is a very good idea.

Originally, we wrapped the Hadoop FS classes for convenience (they were 
changing, we wanted to keep the system independent of Hadoop), but these are no 
longer relevant reasons, in my opinion.

Let's start with your proposal and see if we can actually get rid of the 
wrapping in a way that is friendly to existing users.

Would you open an issue for this?

Greetings,
Stephan


On Wed, Aug 26, 2015 at 6:23 PM, LINZ, Arnaud 
mailto:al...@bouyguestelecom.fr>> wrote:
Hi,

I’ve noticed that when you use org.apache.flink.core.fs.FileSystem to write 
into an HDFS file, calling 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(), it returns a 
HadoopDataOutputStream that wraps an org.apache.hadoop.fs.FSDataOutputStream 
(under its org.apache.hadoop.hdfs.client.HdfsDataOutputStream wrapper).

However, FSDataOutputStream exposes many methods like flush(), getPos(), etc., but 
HadoopDataOutputStream only wraps write & close.

For instance, flush() calls the default, empty implementation of OutputStream 
instead of the Hadoop one, and that’s confusing. Moreover, because of the 
restrictive OutputStream interface, hsync() and hflush() are not exposed to 
Flink; maybe having a getWrappedStream() would be convenient.

(For now, that prevents me from using Flink’s FileSystem object; I directly use 
Hadoop’s.)

Regards,
Arnaud










HadoopDataOutputStream maybe does not expose enough methods of org.apache.hadoop.fs.FSDataOutputStream

2015-08-26 Thread LINZ, Arnaud
Hi,

I’ve noticed that when you use org.apache.flink.core.fs.FileSystem to write 
into an HDFS file, calling 
org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(), it returns a 
HadoopDataOutputStream that wraps an org.apache.hadoop.fs.FSDataOutputStream 
(under its org.apache.hadoop.hdfs.client.HdfsDataOutputStream wrapper).

However, FSDataOutputStream exposes many methods like flush(), getPos(), etc., but 
HadoopDataOutputStream only wraps write & close.

For instance, flush() calls the default, empty implementation of OutputStream 
instead of the Hadoop one, and that’s confusing. Moreover, because of the 
restrictive OutputStream interface, hsync() and hflush() are not exposed to 
Flink; maybe having a getWrappedStream() would be convenient.

(For now, that prevents me from using Flink’s FileSystem object; I directly use 
Hadoop’s.)

Regards,
Arnaud
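A minimal sketch of that workaround, going through Hadoop's own API, whose FSDataOutputStream does expose hflush()/hsync() (the path is an illustrative assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectHdfsWrite {
    public static void main(String[] args) throws Exception {
        Path path = new Path("hdfs:///tmp/flink-demo/out.txt");
        FileSystem fs = path.getFileSystem(new Configuration());
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("some record\n");
            out.hflush(); // make the data visible to readers without closing the stream
        }
    }
}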









RE: [0.10-SNAPSHOT ] When naming yarn application (yarn-session -nm), flink run without -m fails.

2015-08-26 Thread LINZ, Arnaud
Oops… It seems it was rather a write-permission problem on the conf dir…
Sorry, it works!

BTW, it’s not really nice to have an application write to the configuration dir; 
it’s often a root-protected directory under /usr/lib/flink. Is there a parameter to 
put that file elsewhere ?


De : Robert Metzger [mailto:rmetz...@apache.org]
Envoyé : mercredi 26 août 2015 14:42
À : user@flink.apache.org
Objet : Re: [0.10-SNAPSHOT ] When naming yarn application (yarn-session -nm), 
flink run without -m fails.

Hi Arnaud,

usually, you don't have to specify the JobManager address manually 
with the -m argument, because it is read from the 
conf/.yarn-session.properties file.

Give me a few minutes to reproduce the issue.

On Wed, Aug 26, 2015 at 2:39 PM, LINZ, Arnaud 
mailto:al...@bouyguestelecom.fr>> wrote:
Hi,
Using the last nightly build, it seems that if you call yarn-session.sh with the -nm 
option to give a nice application name, then you cannot submit a job with flink 
run without specifying the ever-changing -m  address, since it is no longer 
found automatically.

Regards,
Arnaud






[0.10-SNAPSHOT ] When naming yarn application (yarn-session -nm), flink run without -m fails.

2015-08-26 Thread LINZ, Arnaud
Hi,
Using the last nightly build, it seems that if you call yarn-session.sh with the -nm 
option to give a nice application name, then you cannot submit a job with flink 
run without specifying the ever-changing -m  address, since it is no longer 
found automatically.

Regards,
Arnaud





Source & job parallelism

2015-08-25 Thread LINZ, Arnaud
Hi,

I have a streaming source that extends RichParallelSourceFunction, but for some 
reason I don’t want parallelism at the source level, so I use :
env.addSource(mySource).setParallelism(1).map(mymapper)

I do want parallelism at the mapper level, because it’s a long task, and I 
would like the source to dispatch data to several mappers.

However, I don’t seem to get parallelism on the mapper; it looks like the 
setParallelism() call does not apply only to the source.
Is that right? If yes, how can I mix parallelism levels ?

Best regards,
Arnaud





Using HadoopInputFormat files from Flink/Yarn in a secure cluster gives an error

2015-08-20 Thread LINZ, Arnaud
Hello,

My application handles some HDFS files as input and output, both in the jobs and in 
the driver application.
It works in local cluster mode, but when I submit it to a YARN 
cluster and try to use a HadoopInputFormat (that comes from an HCatalog 
request), I get the following error: "Delegation Token can be issued only with 
kerberos or web authentication" (full stack trace below).

Code which I believe causes the error (it's not clear from the stack trace, as 
the nearest point in my code is "execEnv.execute()"):

public synchronized DataSet<T> readTable(String dbName, String tableName,
        String filter, ExecutionEnvironment cluster,
        final HiveBeanFactory factory) throws IOException {

    // login kerberos if needed (via
    // UserGroupInformation.loginUserFromKeytab(getKerberosPrincipal(), getKerberosKeytab());)
    HdfsTools.getFileSystem();

    // Create M/R job and configure it
    final Job job = Job.getInstance();
    job.setJobName("Flink source for Hive Table " + dbName + "." + tableName);

    // Create the source
    @SuppressWarnings({ "unchecked", "rawtypes" })
    final HadoopInputFormat<NullWritable, DefaultHCatRecord> inputFormat =
            new HadoopInputFormat<NullWritable, DefaultHCatRecord>( // CHECKSTYLE:ON
                    (InputFormat) HCatInputFormat.setInput(job, dbName, tableName, filter), //
                    NullWritable.class, //
                    DefaultHCatRecord.class, //
                    job);

    final HCatSchema inputSchema =
            HCatInputFormat.getTableSchema(job.getConfiguration());

    @SuppressWarnings("serial")
    final DataSet<T> dataSet = cluster
            // Read the table
            .createInput(inputFormat)
            // Map to bean (the key is useless)
            .flatMap(new FlatMapFunction<Tuple2<NullWritable, DefaultHCatRecord>, T>() {
                @Override
                public void flatMap(Tuple2<NullWritable, DefaultHCatRecord> value,
                        Collector<T> out) throws Exception { // NOPMD
                    final T record = factory.fromHive(value.f1, inputSchema);
                    if (record != null) {
                        out.collect(record);
                    }
                }
            }).returns(beanClass);

    return dataSet;
}

Maybe I need to explicitly obtain a token on each node in the initialization of 
the HadoopInputFormat (by overriding configure())? That would be difficult, since 
the key file is only on the driver's local drive…
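
For what it's worth, here is a rough sketch of the alternative I am considering: 
obtaining the HDFS delegation token on the driver side (where the keytab is 
available) and storing it in the job credentials before submission. The helper 
class and the path are purely illustrative, and I have not verified that this is 
how Flink expects it to be done.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.security.TokenCache;

public final class DelegationTokenSketch {

    private DelegationTokenSketch() {
    }

    /** Driver-side: fetch HDFS delegation tokens into the job credentials. */
    public static void addHdfsTokens(Job job) throws Exception {
        final Configuration conf = job.getConfiguration();
        final Path home = FileSystem.get(conf).getHomeDirectory();
        // Asks the NameNode for a delegation token and stores it in the job's
        // credentials, so the distributed tasks do not need access to the keytab.
        TokenCache.obtainTokensForNamenodes(job.getCredentials(), new Path[] { home }, conf);
    }
}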

StackTrace :

Found YARN properties file /usr/lib/flink/bin/../conf/.yarn-properties
Using JobManager address from YARN properties 
bt1svlmw.bpa.bouyguestelecom.fr/172.19.115.52:50494
Secure Hadoop environment setup detected. Running in secure context.
2015:08:20 15:04:17 (main) - INFO - 
com.bouygtel.kuberasdk.main.Application.mainMethod - Début traitement
15:04:18,005 INFO  org.apache.hadoop.security.UserGroupInformation  
 - Login successful for user alinz using keytab file 
/usr/users/alinz/alinz.keytab
15:04:20,139 WARN  org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory  
 - The short-circuit local reads feature cannot be used because libhadoop 
cannot be loaded.
Error : Execution Kubera KO : java.lang.IllegalStateException: Error while 
executing Flink application
com.bouygtel.kuberasdk.main.ApplicationBatch.execCluster(ApplicationBatch.java:84)
com.bouygtel.kubera.main.segment.ApplicationGeoSegment.batchExec(ApplicationGeoSegment.java:68)
com.bouygtel.kuberasdk.main.ApplicationBatch.exec(ApplicationBatch.java:51)
com.bouygtel.kuberasdk.main.Application.mainMethod(Application.java:81)
com.bouygtel.kubera.main.segment.MainGeoSmooth.main(MainGeoSmooth.java:44)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437)
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
org.apache.flink.client.program.Client.run(Client.java:315)
org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:584)
org.apache.flink.client.CliFrontend.run(CliFrontend.java:290)
org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:873)
org.apache.flink.client.CliFrontend$2.run(CliFrontend.java:870)
org.apache.flink.runtime.security.SecurityUtils$1.run(SecurityUtils.java:50)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:415)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
org.apache.flink.runtime.security.SecurityUtils.runSecured(SecurityUtils.java:47)
org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:870)
org.apache.flink.client.CliFrontend.main(CliFrontend.java:922)

Caused by: org.apache.flink.client.program.ProgramInvocationException: The 
program execution failed: Failed to submit job dddaf104260eb0f56ff336727ceeb49e 
(KUBERA-GEO-BRUT2SEGMENT)
org.

Streaming window : count with timeout ?

2015-07-17 Thread LINZ, Arnaud
Hello,

The data in my stream carry timestamps that may be slightly out of order, but I 
need to process the data in the proper order. To do this, I use a windowing 
function and sort the items in a flatMap.

However, the source may sometimes send data in "bulk batches" and sometimes "on 
the fly". If I choose a time window, it suits the "on the fly" behavior well, but 
when processing bulk batches I may have too many elements to sort within the 
specified time interval.

If I choose a "count.of" window, batches are processed efficiently, but in the 
"on the fly" case I may have to wait forever until the count is reached.

What I need, then, is a "count window with timeout" or a "time window with a max 
element count": I would like to specify both a maximum count and a maximum time, 
so that the window fits either source behavior.

Do you have any idea how I can do that ?
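
To make the idea concrete, here is a rough sketch of the kind of "fire on count 
or timeout" trigger I have in mind. It is written against the newer window 
Trigger API (which does not exist in 0.9), so it only illustrates the intent; the 
class and its details are assumptions on my side.

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.Window;

/** Fires when either maxCount elements arrived or timeoutMillis passed since the first element. */
public class CountOrTimeoutTrigger<T, W extends Window> extends Trigger<T, W> {

    private final long maxCount;
    private final long timeoutMillis;

    private final ReducingStateDescriptor<Long> countDesc =
            new ReducingStateDescriptor<>("count", new Sum(), LongSerializer.INSTANCE);

    public CountOrTimeoutTrigger(long maxCount, long timeoutMillis) {
        this.maxCount = maxCount;
        this.timeoutMillis = timeoutMillis;
    }

    @Override
    public TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception {
        final ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        if (count.get() == null) {
            // First element of this window pane: arm the timeout timer.
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + timeoutMillis);
        }
        count.add(1L);
        if (count.get() >= maxCount) {
            count.clear();
            return TriggerResult.FIRE_AND_PURGE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception {
        // Timeout reached: emit whatever has been collected so far.
        // (For simplicity, the timer is not cancelled when the count fires first.)
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(W window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
    }

    private static class Sum implements ReduceFunction<Long> {
        @Override
        public Long reduce(Long a, Long b) {
            return a + b;
        }
    }
}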

Best regards,
Arnaud






getIndexOfThisSubtask : starts at 0 or 1 ?

2015-07-16 Thread LINZ, Arnaud
Hello,

According to the documentation, getIndexOfThisSubtask starts from 1:

/**
 * Gets the number of the parallel subtask. The numbering starts from 1 
and goes up to the parallelism,
 * as returned by {@link #getNumberOfParallelSubtasks()}.
 *
 * @return The number of the parallel subtask.
 */
int getIndexOfThisSubtask();

but in my code on 0.9.0 it starts at 0 and goes up to 
getNumberOfParallelSubtasks() - 1.

I suppose the Javadoc is wrong, then.
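
For reference, a minimal probe of the kind that shows this (the class name is 
just illustrative; it assumes a RichMapFunction for access to the runtime context):

import org.apache.flink.api.common.functions.RichMapFunction;

public class SubtaskIndexProbe extends RichMapFunction<String, String> {
    @Override
    public String map(String value) {
        final int idx = getRuntimeContext().getIndexOfThisSubtask();          // observed: 0-based
        final int total = getRuntimeContext().getNumberOfParallelSubtasks();  // so idx is in [0, total - 1]
        return "subtask " + idx + "/" + total + ": " + value;
    }
}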

Best regards,
Arnaud





No accumulator results in streaming

2015-07-16 Thread LINZ, Arnaud
Hello,

I've been struggling with this simple issue for hours now: I am unable to get the 
accumulator results of a streaming job; the accumulator map in the 
JobExecutionResult is always empty.

Simple test code (directly inspired by the documentation):

My source =

public static class oneRandomNumberSource implements SourceFunction<Integer>, Serializable {

    @Override
    public void run(SourceContext<Integer> ctx) throws Exception {
        final Random rnd = new Random(29172);
        ctx.collect(rnd.nextInt());
    }

    @Override
    public void cancel() {
    }
}

My exec program =

public static final String COUNTER_NBLINE = "num-lines";

void test() {
    final StreamExecutionEnvironment env = getCluster();
    final SourceFunction<Integer> source = new oneRandomNumberSource();

    env.addSource(source).addSink(new RichSinkFunction<Integer>() {

        private IntCounter numLines = new IntCounter();

        @Override
        public void open(Configuration parameters) throws Exception { // NOPMD
            getRuntimeContext().addAccumulator(COUNTER_NBLINE, this.numLines);
        }

        @Override
        public void invoke(Integer value) throws Exception {
            System.err.println(value);
            numLines.add(1);
        }
    });

    try {
        final JobExecutionResult result = env.execute();
        System.out.println(result.getAccumulatorResult(COUNTER_NBLINE)); // Problem: always null
    }
    catch (Exception e) {
        e.printStackTrace();
    }
}
Console output :

07/16/2015 14:11:58 Job execution switched to status RUNNING.
07/16/2015 14:11:58 Custom Source(1/1) switched to SCHEDULED
07/16/2015 14:11:58 Custom Source(1/1) switched to DEPLOYING
07/16/2015 14:11:58 Stream Sink(1/4) switched to SCHEDULED
07/16/2015 14:11:58 Stream Sink(1/4) switched to DEPLOYING
07/16/2015 14:11:58 Stream Sink(2/4) switched to SCHEDULED
07/16/2015 14:11:58 Stream Sink(2/4) switched to DEPLOYING
07/16/2015 14:11:58 Stream Sink(3/4) switched to SCHEDULED
07/16/2015 14:11:58 Stream Sink(3/4) switched to DEPLOYING
07/16/2015 14:11:58 Stream Sink(4/4) switched to SCHEDULED
07/16/2015 14:11:58 Stream Sink(4/4) switched to DEPLOYING
07/16/2015 14:11:58 Custom Source(1/1) switched to RUNNING
07/16/2015 14:11:58 Stream Sink(1/4) switched to RUNNING
07/16/2015 14:11:58 Stream Sink(2/4) switched to RUNNING
07/16/2015 14:11:58 Stream Sink(4/4) switched to RUNNING
07/16/2015 14:11:58 Stream Sink(3/4) switched to RUNNING
07/16/2015 14:11:58 Custom Source(1/1) switched to FINISHED
07/16/2015 14:11:58 Stream Sink(4/4) switched to FINISHED
07/16/2015 14:11:58 Stream Sink(3/4) switched to FINISHED
-329782788
07/16/2015 14:11:58 Stream Sink(2/4) switched to FINISHED
07/16/2015 14:11:58 Stream Sink(1/4) switched to FINISHED
07/16/2015 14:11:58 Job execution switched to status FINISHED.
null

What am I doing wrong?
Flink version is 0.9.0.

Best regards,
Arnaud





RE: How to cancel a Flink DataSource from the driver code?

2015-07-15 Thread LINZ, Arnaud
Hi Roger,

In fact, I am implementing a different use case from the one you know about, with 
more sources than just Kafka: we now also use Flink in the BI team (to which I 
belong).

The problem with the web interface is that it is not easily scriptable and, to my 
understanding, it does not allow cleanup code to be called upon cancellation. I 
would like to integrate with my company's standard BI production environment, 
which requires being able to call start, status and stop scripts.

I think I will implement such a mechanism by periodically testing, in my source, 
for the existence of a specific "heartbeat" HDFS file, and quitting the run() 
method if this file no longer exists because it has been deleted by a stop 
script.
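
A rough sketch of what I have in mind (the marker path, the helper method and the 
record type are purely illustrative, and I have not tested this on a real cluster):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class HeartbeatControlledSource implements SourceFunction<String> {

    private final String heartbeatFile;          // e.g. "/app/myjob/RUNNING" (illustrative)
    private volatile boolean running = true;

    public HeartbeatControlledSource(String heartbeatFile) {
        this.heartbeatFile = heartbeatFile;
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        final FileSystem fs = FileSystem.get(new org.apache.hadoop.conf.Configuration());
        final Path marker = new Path(heartbeatFile);
        // The stop script deletes the marker file; the source then leaves run() cleanly.
        while (running && fs.exists(marker)) {
            ctx.collect(readNextRecord());       // placeholder for the real source logic
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private String readNextRecord() {
        return "record";                          // hypothetical helper producing data
    }
}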

Arnaud

From: Robert Metzger [mailto:rmetz...@apache.org]
Sent: Thursday, July 2, 2015 09:48
To: user@flink.apache.org
Subject: Re: How to cancel a Flink DataSource from the driver code?

Hi Arnaud,

when using the PersistentKafkaSource, you can always cancel the job in the web 
interface and start it again. We will continue reading from Kafka where you 
left off.
You can probably also send the cancel request manually to the web interface, at 
this URL: 
http://localhost:8081/jobsInfo?get=cancel&job=68c53a77f11d34695ac1aea4f098af82
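
For scripting, something like this plain-Java sketch could issue that request 
(the host, port and job id are just the example values above):

import java.net.HttpURLConnection;
import java.net.URL;

public class CancelJobRequestSketch {
    public static void main(String[] args) throws Exception {
        // Example values from above; in a real script the job id would be a parameter.
        final URL url = new URL(
                "http://localhost:8081/jobsInfo?get=cancel&job=68c53a77f11d34695ac1aea4f098af82");
        final HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        System.out.println("Cancel request returned HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}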

But I don't think there is a way to submit a topology in a non-blocking way, so 
that env.execute() returns immediately with the JobId.


On Thu, Jul 2, 2015 at 9:35 AM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hi Stephan,

I think that a clean shutdown is a major feature for building a complex 
persistent service that uses Flink Streaming for a data-quality-critical task, so 
I'll mark my code with a // FIXME comment while waiting for this feature to 
become available!

Greetings,
Arnaud



From: ewenstep...@gmail.com [mailto:ewenstep...@gmail.com] On behalf of Stephan Ewen
Sent: Wednesday, July 1, 2015 15:58
To: user@flink.apache.org
Subject: Re: How to cancel a Flink DataSource from the driver code?

Hi Arnaud!

There is a pending issue and pull request that is adding a "cancel()" call to 
the command line interface.

https://github.com/apache/flink/pull/750

It would be possible to extend that such that the driver can also cancel the 
program.

Greetings,
Stephan


On Wed, Jul 1, 2015 at 3:33 PM, LINZ, Arnaud <al...@bouyguestelecom.fr> wrote:
Hello,

I really looked through the documentation but unfortunately could not find the 
answer: how do you cancel your SourceFunction from your "driver" code (i.e., from 
a monitoring thread that can initiate a proper shutdown)? Calling cancel() on the 
object passed to addSource() has no effect, since it does not apply to the 
marshalled, distributed object(s).

Best regards,
Arnaud








