Re: Replaying logs with microsecond delay

2018-05-15 Thread Dhruv Kumar
Yes, thanks! -- Dhruv Kumar PhD Candidate Department of Computer Science and Engineering University of Minnesota www.dhruvkumar.me > On May 15, 2018, at 21:31, Xingcan Cui wrote: > > Yes, that makes sense and maybe you could

Re: Replaying logs with microsecond delay

2018-05-15 Thread Xingcan Cui
Yes, that makes sense and maybe you could also generate dynamic intervals according to the time spans. Thanks, Xingcan > On May 16, 2018, at 9:41 AM, Dhruv Kumar wrote: > > As a part of my PhD research, I have been working on few optimization > algorithms which try to

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread sihua zhou
Hi Juho, if I'm not misunderstand, you saied your're rescaling the job from the checkpoint? If yes, I think that behavior is not guaranteed yet, you can find this on the doc https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html#difference-to-savepoints. So, I

Re: Better way to clean up state when connect

2018-05-15 Thread Chengzhi Zhao
Thanks again Xingcan! Appreciate your help! On Tue, May 15, 2018, 9:31 PM Xingcan Cui wrote: > Hi Chengzhi, > > more details about partitioning mechanisms can be found at > https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/#physical-partitioning > .

Re: Replaying logs with microsecond delay

2018-05-15 Thread Dhruv Kumar
As a part of my PhD research, I have been working on few optimization algorithms which try to jointly optimize delay and traffic (WAN traffic) in a geo-distributed streaming analytics setting. So, to show that the optimization actually works in real life, I am trying to implement these

Re: Better way to clean up state when connect

2018-05-15 Thread Xingcan Cui
Hi Chengzhi, more details about partitioning mechanisms can be found at https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/#physical-partitioning . Best, Xingcan > On

Re: Replaying logs with microsecond delay

2018-05-15 Thread Xingcan Cui
Hi Dhruv, since there are timestamps associated with each record, I was wondering why you try to replay them with a fixed interval. Can you give a little explanation about that? Thanks, Xingcan > On May 16, 2018, at 2:11 AM, Ted Yu wrote: > > Please see the following: >

Flink recovers even cancelled jobs after zookeeper failure

2018-05-15 Thread Yunus Olgun
Hi, We are using Flink 1.4.0 at zookeeper high availability mode and with externalized checkpoints. Today after we have restarted a zookeeper node, several Flink clusters have lost connection to the zookeeper. This triggered a leader election at effected clusters. After the leader election, the

Recovering from 1 of the nodes/slots of a Task Manager failing without resetting entire state during Recovery

2018-05-15 Thread Vijay Balakrishnan
Hi, I have been going through the book "Real time streaming with Apache Flink". How do I recover state for just a single node/slot in a TaskManager without having the recovery reset the application state for all the Task Managers ? They mention the following: *Reset the state of the whole

Re: Replaying logs with microsecond delay

2018-05-15 Thread Dhruv Kumar
Thanks a lot, Ted. Appreciate your help! The approaches specified in the below links, are giving a very good level of accuracy. Solves my problem for now. Thanks -- Dhruv Kumar PhD Candidate Department of Computer Science and Engineering

Re: Better way to clean up state when connect

2018-05-15 Thread Chengzhi Zhao
Hi Xingcan, Thanks a lot for providing your inputs on the possible solutions here. Can you please clarify on how to broadcasted in Flink? Appreciate your help!! Best, Chengzhi On Tue, May 15, 2018 at 10:22 AM, Xingcan Cui wrote: > Hi Chengzhi, > > currently, it's

Re: Replaying logs with microsecond delay

2018-05-15 Thread Ted Yu
Please see the following: http://www.rationaljava.com/2015/10/measuring-microsecond-in-java.html https://stackoverflow.com/questions/11498585/how-to-suspend-a-java-thread-for-a-small-period-of-time-like-100-nanoseconds On Tue, May 15, 2018 at 10:40 AM, Dhruv Kumar wrote:

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Juho Autio
I was able to reproduce this error. I just happened to notice an important detail about the original failure: - checkpoint was created with a 1-node cluster (parallelism=8) - restored on a 2-node cluster (parallelism=16), caused that null exception I tried restoring again from the problematic

Replaying logs with microsecond delay

2018-05-15 Thread Dhruv Kumar
Hi I am trying to replay a log file in which each record has a timestamp associated with it. The time difference between the records is of the order of microseconds. I am trying to replay this log maintaining the same delay between the records (using Thread.sleep()) and sending it to a socket.

Re: Checkpoint is not triggering as per configuration

2018-05-15 Thread Tao Xia
Great, I will give a try. Thanks, Tao On Tue, May 15, 2018 at 12:50 AM, Piotr Nowojski wrote: > Hi, > > This one: https://issues.apache.org/jira/browse/FLINK-2491 > > 1. What if you set `org.apache.flink.streaming.api.functions.source. >

Re: AvroInputFormat Serialisation Issue

2018-05-15 Thread Timo Walther
Flink should not interact poorly with heavily nested schemas. So this might be another bug that is worth investigating. Can you share an example that reproduces your issues with us? Which Flink version are you using? Contributors are always welcome :) I will also take a look into the

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Jason Kania
Thanks for your help. The job manager launch on two nodes of the cluster is provided as well as the logs for the task managers, one that worked and one that could not seem to find the find an address which I am assuming is for the job manager. The logs are from nodes aaa-1 and aaa-2. Thanks,

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Till Rohrmann
Hi Jason, the client logs would indeed be very interesting to further debug this problem. What you have to make sure is that the client has the same HA configuration settings as the cluster because the client needs to talk to your ZooKeeper quorum in order to retrieve the leader address. When

Re: data enrichment with SQL use case

2018-05-15 Thread miki haiat
HI guys , This is how i tried to solve my enrichment case https://gist.github.com/miko-code/d615aa05b65579f4366ba9fe8a8275fd Currently we need to use *keyby()* before the process function. My concern is if i have in flight N

Re: Better way to clean up state when connect

2018-05-15 Thread Xingcan Cui
Hi Chengzhi, currently, it's impossible to process both a stream and a (dynamically updated) dataset in a single job. I'll provide you with some workarounds, all of which are based on that the file for active test names is not so large. (1) You may define your own stream source[1] which should

Re: Async Source Function in Flink

2018-05-15 Thread Timo Walther
Hi Frederico, Flink's AsyncFunction is meant for enriching a record with information that needs to be queried externally. So I guess you can't use it for your use case because an async call is initiated by the input. However, your custom SourceFunction could implement a similar asynchronous

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Timo Walther
Can you change the log level to DEBUG and share the logs with us? Maybe Till (in CC) has some idea? Regards, Timo Am 15.05.18 um 15:18 schrieb Jason Kania: Hi Timo, Thanks for the response. Yes, we are running with a cloud provider, a cloud system provided by our national government for R

Zookeeper DR backup needed for Flink HA mode?

2018-05-15 Thread David Corley
We're looking at DR scenarios for our Flink cluster. We already use Zookeeper for JM HA. We use a HDFS cluster that's replicated off-site, and our high-availability.zookeeper.storageDir property is configure to use HDFS. However, in the event of a site-failure, is it also essential that we have a

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Jason Kania
Hi Timo, Thanks for the response. Yes, we are running with a cloud provider, a cloud system provided by our national government for R purposes. The thing is that we also have Kafka and Cassandra on the same nodes and they have no issues in this environment, it is just Flink in an HA

Re: Akka heartbeat configurations

2018-05-15 Thread Timo Walther
Hi, increasing the time to detect a dead task manager usually increases the amount of elements that need to be reprocessed in case of a failure. Once a dead task manager is identified, the entire application is rolled back to the latest successful checkpointed/consistent state of the

Re: AvroInputFormat Serialisation Issue

2018-05-15 Thread Timo Walther
Hi Padarn, usually people are using the AvroInputFormat with the Avro class generated by an Avro schema. But after looking into the implementation, one should also be able to use the GenericRecord class as a parameter. So your exception seems to be a bug if it works locally but not

Re: Flink does not read from some Kafka Partitions

2018-05-15 Thread Timo Walther
Hi Ruby, which Flink version are you using? When looking into the code of the org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase you can see that the behavior for using partition discovery or not depends on the Flink version. Regards, Timo Am 15.05.18 um 02:01 schrieb Ruby

Re: minPauseBetweenCheckpoints for failed checkpoints

2018-05-15 Thread Timo Walther
Hi Dmitry, I think the minPauseBetweenCheckpoints is intended for pausing between successful checkpoints. Usually a user wants to get a successful checkpoint as quickly as possible again. Stefan (in CC) might know more about. Regards, Timo Am 15.05.18 um 03:28 schrieb Dmitry Minaev:

Re: Question about the behavior of TM when it lost the zookeeper client session in HA mode

2018-05-15 Thread Piotr Nowojski
Hi, It looks like there was an error in asynchronous job of sending the records to Kafka. Probably this is a collateral damage of loosing connection to zookeeper. Piotrek > On 15 May 2018, at 13:33, Ufuk Celebi wrote: > > Hey Tony, > > thanks for the detailed report. > >

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Timo Walther
Hi Jason, this sounds more like a network connection/firewall issue to me. Can you tell us a bit more about your environment? Are you running your Flink cluster on a cloud provider? Regards, Timo Am 15.05.18 um 05:15 schrieb Jason Kania: Hi, I am using the 1.4.2 release on ubuntu and

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter
Hi, > Am 15.05.2018 um 10:34 schrieb Juho Autio : > > Ok, that should be possible to provide. Are there any specific packages to > set on trace level? Maybe just go with org.apache.flink.* on TRACE? The following packages would be helpful:

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Juho Autio
> Btw having a trace level log of a restart from a problematic checkpoint could actually be helpful Ok, that should be possible to provide. Are there any specific packages to set on trace level? Maybe just go with org.apache.flink.* on TRACE? > did the „too many open files“ problem only happen

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter
Btw having a trace level log of a restart from a problematic checkpoint could actually be helpful if we cannot find the problem from the previous points. This can give a more detailed view of what checkpoint files are mapped to which operator. I am having one more question: did the „too many

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter
What I would like to see from the logs is (also depending a bit on your log level): - all exceptions. - in which context exactly the „too many open files“ problem occurred, because I think for checkpoint consistency it should not matter as a checkpoint with such a problem should never succeed.

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Juho Autio
Thanks all. I'll have to see about sharing the logs & configuration.. Is there something special that you'd like to see from the logs? It may be easier for me to get specific lines and obfuscate sensitive information instead of trying to do that for the full logs. We basically have:

Re: Checkpoint is not triggering as per configuration

2018-05-15 Thread Piotr Nowojski
Hi, This one: https://issues.apache.org/jira/browse/FLINK-2491 1. What if you set `org.apache.flink.streaming.api.functions.source.FileProcessingMode#PROCESS_CONTINUOUSLY`? This will prevent split source from finishing, so checkpointing

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter
Hi, I agree, this looks like a bug. Can you tell us your exact configuration of the state backend, e.g. if you are using incremental checkpoints or not. Are you using the local recovery feature? Are you restarting the job from a checkpoint or a savepoint? Can you provide logs for both the job

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Aljoscha Krettek
Hi Juho, As Sihua said, this shouldn't happen and indicates a bug. Did you only encounter this once or can you easily reproduce the problem? Best, Aljoscha > On 15. May 2018, at 05:57, sihua zhou wrote: > > Hi Juho, > in fact, from your code I can't see any possible that