Re: Flink doesn't free YARN slots after restarting

2017-08-10 Thread Bowen Li
Hi Till, Any idea why it happened? I've tried different configurations for configuring our Flink cluster, but the cluster always fails after 4 or 5 hours. According to the log, looks like the total number of slots becomes 0 at the end, and YarnClusterClient shuts down application master

stream partitioning to avoid network overhead

2017-08-10 Thread Karthik Deivasigamani
Hi, I have a use case where we read messages from a Kafka topic and invoke a webservice. The web-service call can take a take couple of seconds and then gives us back on avg 800KB of data. This data is set to another operator which does the parsing and then it gets sent to sink which saves the

How can I cancel a Flink job safely without a special stop message in the stream?

2017-08-10 Thread Zor X.L.
Hi, What we want to do is cancelling the Flink job after all upstream data were processed. We use Kafka as our input and output, and use the SQL capability of Table API by the way. A possible solution is: * embed a stop message at the tail of upstream * do what should be done in

Re: Keyed CEP checkpoint fails

2017-08-10 Thread Daiqing Li
Hi Dawid, After rewriting dashcode with Objects.hash for all the fields, I still get the same error. One thing special is checkpoints always fail at 428, after trying many times. Does it mean anything? > On Aug 10, 2017, at 9:14 AM, Dawid Wysakowicz > wrote: > >

Re: Can i use lot of keyd states or should i use 1 big key state.

2017-08-10 Thread shashank agarwal
Thanks Aljoscha and Stephan for clearing the doubt. On Wed, Aug 9, 2017 at 7:37 PM, Aljoscha Krettek wrote: > Hi, > > If you have one keyed state, say "count by email id", and many different > keys you will only have one column in RocksDB (or one HashTable). Actually, >

Re: difference between checkpoints & savepoints

2017-08-10 Thread Stefan Richter
I most of the things you are asking for are already there: you can configure checkpoint interval + externalized cp, the backend, and the location for savepoints and externalized checkpoints. You can restart from savepoints and externalized checkpoints from the CLI. One point I am not entirely

Re: Using latency markers

2017-08-10 Thread Kien Truong
Hi, I just want to say we're having the same issues. There's no metric for latency when we attempted to export the metrics through graphite either. Regards, Kien On 8/10/2017 7:36 PM, Aljoscha Krettek wrote: Hi, I must admit that I've never used this but I'll try and look into it.

Re: Delay between job manager recovery and job recovery (1.3.2)

2017-08-10 Thread Gyula Fóra
Sweet, thanks Aljoscha for the quick help. Gyula Aljoscha Krettek ezt írta (időpont: 2017. aug. 10., Cs, 15:33): > Don't worry! :-) I found that you can configure this via > "high-availability.job.delay" > (in HighAvailabilityOptions). > > Best, > Aljoscha > > On 10. Aug

Re: Classloader and removal of native libraries

2017-08-10 Thread Aljoscha Krettek
Hi Conrad, I'm afraid you're running in the same problem that we already encountered with loading the native RocksDB library:

Re: Using Hadoop 2.8.0 in Flink Project for S3A Path Style Access

2017-08-10 Thread Aljoscha Krettek
I created an issue for this: https://issues.apache.org/jira/browse/FLINK-7413 > On 10. Aug 2017, at 16:05, Mustafa AKIN wrote: > > Yes, it would probably work. I cloned master repo and compiled with 2.8.0 and > it worked.

Re: Using Hadoop 2.8.0 in Flink Project for S3A Path Style Access

2017-08-10 Thread Mustafa AKIN
Yes, it would probably work. I cloned master repo and compiled with 2.8.0 and it worked. It would be nice to have 2.8 binaries since Hadoop 2.8.1 is also released Mustafa Akın www.mustafaak.in On Wed, Aug 9, 2017 at 9:00 PM, Eron Wright wrote: > For reference:

Re: difference between checkpoints & savepoints

2017-08-10 Thread Henri Heiskanen
Hi, But I still need to resolve the latest checkpoint and pass that as an argument. My question still is that why all this can not be handled by Flink core? Why not just have parameters enable savepoints, location of savepoints and state backend and system would then automatically do checkpoints

Classloader and removal of native libraries

2017-08-10 Thread Conrad Crampton
Hi, First time posting here so ‘hi’. I have been using Flink (1.31 now) for a couple of months now and loving it. My deployment is to JobManager running as a long running session on Yarn. I have a problem where I have a Flink streaming job that involves loading native libraries as part of one of

Re: Delay between job manager recovery and job recovery (1.3.2)

2017-08-10 Thread Aljoscha Krettek
Don't worry! :-) I found that you can configure this via "high-availability.job.delay" (in HighAvailabilityOptions). Best, Aljoscha > On 10. Aug 2017, at 15:13, Gyula Fóra wrote: > > Here is actually the whole log for the relevant parts at least: >

Re: Keyed CEP checkpoint fails

2017-08-10 Thread Dawid Wysakowicz
Yes, with the information I have, the conclusion would be the same, that I think the reason is problem with hashcode. Without some data to reproduce it unfortunately I won’t be able to help you further. I could just advise you to debug the method SharedBuffer#serialize and pay attention to the

Re: Delay between job manager recovery and job recovery (1.3.2)

2017-08-10 Thread Gyula Fóra
Here is actually the whole log for the relevant parts at least: https://gist.github.com/gyfora/b70dd18c048b862751b194f613514300 Sorry for not pasting it earlier. Gyula Gyula Fóra ezt írta (időpont: 2017. aug. 10., Cs, 15:04): > Oh, I found this in the log that seems to

Re: Delay between job manager recovery and job recovery (1.3.2)

2017-08-10 Thread Gyula Fóra
Oh, I found this in the log that seems to explain it: 2017-08-10 13:13:56,795 INFO org.apache.flink.yarn.YarnJobManager - Delaying recovery of all jobs by 12 milliseconds. I wonder why is this... Aljoscha Krettek ezt írta (időpont: 2017. aug. 10., Cs, 14:41): > Hi, > >

Re: Keyed CEP checkpoint fails

2017-08-10 Thread Daiqing Li
Oh sorry, the data in {} is not empty because I hide private information about my model. Do you have that same conclusion? > On Aug 10, 2017, at 8:52 AM, Dawid Wysakowicz > wrote: > > You are right, I won’t be able to reproduce this problem without data. One >

Aggregation based on Timestamp

2017-08-10 Thread Madhukar Thota
Hi We have use case where we have thousands of Telegraf agents sending data to kafka( some of them are sending 10s interval, 15s interval and 30s interval). We would like to aggregate the incoming data to 1 minuter interval based on the hostname as key before we write into influxdb. Is it

Re: Keyed CEP checkpoint fails

2017-08-10 Thread Dawid Wysakowicz
You are right, I won’t be able to reproduce this problem without data. One thing I can tell though that I think the problem is indeed with the hashcode. Unforunately I don’t know Gson, but one strange thing I noticed is the exception message: SharedBufferEntry(ValueTimeWrapper({},

Re: Delay between job manager recovery and job recovery (1.3.2)

2017-08-10 Thread Aljoscha Krettek
Hi, Let me also investigate that? Did you observe this in 1.3.2 and not in 1.3.0 and/or 1.3.1 or did you directly go from 1.2.x to 1.3.2? Best, Aljoscha > On 10. Aug 2017, at 13:31, Gyula Fóra wrote: > > Hi! > In some cases it seems to take a long time for the job to

Re: Using latency markers

2017-08-10 Thread Aljoscha Krettek
Hi, I must admit that I've never used this but I'll try and look into it. Best, Aljoscha > On 10. Aug 2017, at 11:10, Gyula Fóra wrote: > > Hi all! > > Does anyone have a working example of using the latency markers to test for > the topology latency? > We are using

Re: Keyed CEP checkpoint fails

2017-08-10 Thread Daiqing Li
Hi, Here is the code. But I am not sure if you can reproduce the problem without data source. Best, Daiqing On Thu, Aug 10, 2017 at 8:15 AM, Dawid Wysakowicz < wysakowicz.da...@gmail.com> wrote: > As @Kostas asked in your previous thread would be possible for you to > share your code for that

Re: difference between checkpoints & savepoints

2017-08-10 Thread Stefan Richter
Hi, but I think this is exactly the case for externalized checkpoints. Periodic savepoints are problematic because, their lifecycle is meant to be under the control of the user and Flink can not make any assumptions when they can be dropped. So in the conservative scenario, savepoints would

Re: Keyed CEP checkpoint fails

2017-08-10 Thread Dawid Wysakowicz
As @Kostas asked in your previous thread would be possible for you to share your code for that job or at least a minimal example to reproduce this behaviour. I fear we won’t be able to help you without any further info. Regards, Dawid > On 10 Aug 2017, at 14:10, Daiqing Li

Keyed CEP checkpoint fails

2017-08-10 Thread Daiqing Li
Hi Flink user, I am using FsStateBackend and AWS EMR YARN to run a CEP job. I got this exception after running for a while. Could anyone give me some help to debug this? I try parallelism 1, and it has the same problem. I also try reimplemented hashcode and equals method. I use UUID as hashcode

Delay between job manager recovery and job recovery (1.3.2)

2017-08-10 Thread Gyula Fóra
Hi! In some cases it seems to take a long time for the job to start the zookeeper based job recovery after recovering from a JM failure. Looking at the logs there is a 2 minute gap between the last recovered TM was started successfully and the job recovery: 2017-08-10 13:14:06,369 INFO

[streaming] mappers(reducers) read database again to get the changed config

2017-08-10 Thread Sendoh
Hi Flink users, We have an usecase that streaming Flink jobs reading small configuration(no more than 100 rows/10 columns) to transform data source according to the config. When there is change, database should be read again and the config object is changed. Did someone has similar usage or

Re: difference between checkpoints & savepoints

2017-08-10 Thread Henri Heiskanen
Hi, It would be super helpful if Flink would provide out of the box functionality for writing automatic savepoints and then starting from the latest savepoint. If external checkpoints would support rescaling then 1st requirement is met, but one would still need to e.g. find the latest checkpoint

Using latency markers

2017-08-10 Thread Gyula Fóra
Hi all! Does anyone have a working example of using the latency markers to test for the topology latency? We are using Flink 1.3.2 and it seems like however we tune it, whatever job we use all we get is NaN in the metrics. Maybe we are completely missing something... Thanks! Gyula

Re: difference between checkpoints & savepoints

2017-08-10 Thread Stefan Richter
Hi, I would explain the main conceptual difference as follows: - Checkpoints are periodically triggered by the system for fault tolerance. They are used to automatically recover from failures. Because of their automatic and periodical nature, they should be lightweight to produce and will

Re: Experiencing long latency while using sockets

2017-08-10 Thread Fabian Hueske
Great! Thanks for reporting back :-) 2017-08-09 22:52 GMT+02:00 Chao Wang : > It seems that the observed long latencies were due to certain one-time > internal mechanism that only occurred after Flink has received the first > message. Based on my measurement that mechanism

Re: Synchronized Kafka sources

2017-08-10 Thread 魏偉哲
Hi Yunus, I'm not sure if there is a way to synchronize two Kafka sources in Flink, but I have another opinion on this question. How about adjust number of shards and parallelism of consumers on A and B? For example, making A have higher parallelism and B have lower parallelism so that you can