date:20180515

Re: Replaying logs with microsecond delay

2018-05-15 Thread Dhruv Kumar

Yes, thanks! -- Dhruv Kumar PhD Candidate Department of Computer Science and Engineering University of Minnesota www.dhruvkumar.me > On May 15, 2018, at 21:31, Xingcan Cui wrote: > > Yes, that makes sense and maybe you could

Re: Replaying logs with microsecond delay

2018-05-15 Thread Xingcan Cui

Yes, that makes sense and maybe you could also generate dynamic intervals according to the time spans. Thanks, Xingcan > On May 16, 2018, at 9:41 AM, Dhruv Kumar wrote: > > As a part of my PhD research, I have been working on few optimization > algorithms which try to

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread sihua zhou

Hi Juho, if I'm not misunderstand, you saied your're rescaling the job from the checkpoint? If yes, I think that behavior is not guaranteed yet, you can find this on the doc https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html#difference-to-savepoints. So, I

Re: Better way to clean up state when connect

2018-05-15 Thread Chengzhi Zhao

Thanks again Xingcan! Appreciate your help! On Tue, May 15, 2018, 9:31 PM Xingcan Cui wrote: > Hi Chengzhi, > > more details about partitioning mechanisms can be found at > https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/#physical-partitioning > .

Re: Replaying logs with microsecond delay

2018-05-15 Thread Dhruv Kumar

As a part of my PhD research, I have been working on few optimization algorithms which try to jointly optimize delay and traffic (WAN traffic) in a geo-distributed streaming analytics setting. So, to show that the optimization actually works in real life, I am trying to implement these

Re: Better way to clean up state when connect

2018-05-15 Thread Xingcan Cui

Hi Chengzhi, more details about partitioning mechanisms can be found at https://ci.apache.org/projects/flink/flink-docs-master/dev/stream/operators/#physical-partitioning . Best, Xingcan > On

Re: Replaying logs with microsecond delay

2018-05-15 Thread Xingcan Cui

Hi Dhruv, since there are timestamps associated with each record, I was wondering why you try to replay them with a fixed interval. Can you give a little explanation about that? Thanks, Xingcan > On May 16, 2018, at 2:11 AM, Ted Yu wrote: > > Please see the following: >

Flink recovers even cancelled jobs after zookeeper failure

2018-05-15 Thread Yunus Olgun

Hi, We are using Flink 1.4.0 at zookeeper high availability mode and with externalized checkpoints. Today after we have restarted a zookeeper node, several Flink clusters have lost connection to the zookeeper. This triggered a leader election at effected clusters. After the leader election, the

Recovering from 1 of the nodes/slots of a Task Manager failing without resetting entire state during Recovery

2018-05-15 Thread Vijay Balakrishnan

Hi, I have been going through the book "Real time streaming with Apache Flink". How do I recover state for just a single node/slot in a TaskManager without having the recovery reset the application state for all the Task Managers ? They mention the following: *Reset the state of the whole

Re: Replaying logs with microsecond delay

2018-05-15 Thread Dhruv Kumar

Thanks a lot, Ted. Appreciate your help! The approaches specified in the below links, are giving a very good level of accuracy. Solves my problem for now. Thanks -- Dhruv Kumar PhD Candidate Department of Computer Science and Engineering

Re: Better way to clean up state when connect

2018-05-15 Thread Chengzhi Zhao

Hi Xingcan, Thanks a lot for providing your inputs on the possible solutions here. Can you please clarify on how to broadcasted in Flink? Appreciate your help!! Best, Chengzhi On Tue, May 15, 2018 at 10:22 AM, Xingcan Cui wrote: > Hi Chengzhi, > > currently, it's

Re: Replaying logs with microsecond delay

2018-05-15 Thread Ted Yu

Please see the following: http://www.rationaljava.com/2015/10/measuring-microsecond-in-java.html https://stackoverflow.com/questions/11498585/how-to-suspend-a-java-thread-for-a-small-period-of-time-like-100-nanoseconds On Tue, May 15, 2018 at 10:40 AM, Dhruv Kumar wrote:

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Juho Autio

I was able to reproduce this error. I just happened to notice an important detail about the original failure: - checkpoint was created with a 1-node cluster (parallelism=8) - restored on a 2-node cluster (parallelism=16), caused that null exception I tried restoring again from the problematic

Replaying logs with microsecond delay

2018-05-15 Thread Dhruv Kumar

Hi I am trying to replay a log file in which each record has a timestamp associated with it. The time difference between the records is of the order of microseconds. I am trying to replay this log maintaining the same delay between the records (using Thread.sleep()) and sending it to a socket.

Re: Checkpoint is not triggering as per configuration

2018-05-15 Thread Tao Xia

Great, I will give a try. Thanks, Tao On Tue, May 15, 2018 at 12:50 AM, Piotr Nowojski wrote: > Hi, > > This one: https://issues.apache.org/jira/browse/FLINK-2491 > > 1. What if you set `org.apache.flink.streaming.api.functions.source. >

Re: AvroInputFormat Serialisation Issue

2018-05-15 Thread Timo Walther

Flink should not interact poorly with heavily nested schemas. So this might be another bug that is worth investigating. Can you share an example that reproduces your issues with us? Which Flink version are you using? Contributors are always welcome :) I will also take a look into the

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Jason Kania

Thanks for your help. The job manager launch on two nodes of the cluster is provided as well as the logs for the task managers, one that worked and one that could not seem to find the find an address which I am assuming is for the job manager. The logs are from nodes aaa-1 and aaa-2. Thanks,

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Till Rohrmann

Hi Jason, the client logs would indeed be very interesting to further debug this problem. What you have to make sure is that the client has the same HA configuration settings as the cluster because the client needs to talk to your ZooKeeper quorum in order to retrieve the leader address. When

Re: data enrichment with SQL use case

2018-05-15 Thread miki haiat

HI guys , This is how i tried to solve my enrichment case https://gist.github.com/miko-code/d615aa05b65579f4366ba9fe8a8275fd Currently we need to use *keyby()* before the process function. My concern is if i have in flight N

Re: Better way to clean up state when connect

2018-05-15 Thread Xingcan Cui

Hi Chengzhi, currently, it's impossible to process both a stream and a (dynamically updated) dataset in a single job. I'll provide you with some workarounds, all of which are based on that the file for active test names is not so large. (1) You may define your own stream source[1] which should

Re: Async Source Function in Flink

2018-05-15 Thread Timo Walther

Hi Frederico, Flink's AsyncFunction is meant for enriching a record with information that needs to be queried externally. So I guess you can't use it for your use case because an async call is initiated by the input. However, your custom SourceFunction could implement a similar asynchronous

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Timo Walther

Can you change the log level to DEBUG and share the logs with us? Maybe Till (in CC) has some idea? Regards, Timo Am 15.05.18 um 15:18 schrieb Jason Kania: Hi Timo, Thanks for the response. Yes, we are running with a cloud provider, a cloud system provided by our national government for R

Zookeeper DR backup needed for Flink HA mode?

2018-05-15 Thread David Corley

We're looking at DR scenarios for our Flink cluster. We already use Zookeeper for JM HA. We use a HDFS cluster that's replicated off-site, and our high-availability.zookeeper.storageDir property is configure to use HDFS. However, in the event of a site-failure, is it also essential that we have a

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Jason Kania

Hi Timo, Thanks for the response. Yes, we are running with a cloud provider, a cloud system provided by our national government for R purposes. The thing is that we also have Kafka and Cassandra on the same nodes and they have no issues in this environment, it is just Flink in an HA

Re: Akka heartbeat configurations

2018-05-15 Thread Timo Walther

Hi, increasing the time to detect a dead task manager usually increases the amount of elements that need to be reprocessed in case of a failure. Once a dead task manager is identified, the entire application is rolled back to the latest successful checkpointed/consistent state of the

Re: AvroInputFormat Serialisation Issue

2018-05-15 Thread Timo Walther

Hi Padarn, usually people are using the AvroInputFormat with the Avro class generated by an Avro schema. But after looking into the implementation, one should also be able to use the GenericRecord class as a parameter. So your exception seems to be a bug if it works locally but not

Re: Flink does not read from some Kafka Partitions

2018-05-15 Thread Timo Walther

Hi Ruby, which Flink version are you using? When looking into the code of the org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase you can see that the behavior for using partition discovery or not depends on the Flink version. Regards, Timo Am 15.05.18 um 02:01 schrieb Ruby

Re: minPauseBetweenCheckpoints for failed checkpoints

2018-05-15 Thread Timo Walther

Hi Dmitry, I think the minPauseBetweenCheckpoints is intended for pausing between successful checkpoints. Usually a user wants to get a successful checkpoint as quickly as possible again. Stefan (in CC) might know more about. Regards, Timo Am 15.05.18 um 03:28 schrieb Dmitry Minaev:

Re: Question about the behavior of TM when it lost the zookeeper client session in HA mode

2018-05-15 Thread Piotr Nowojski

Hi, It looks like there was an error in asynchronous job of sending the records to Kafka. Probably this is a collateral damage of loosing connection to zookeeper. Piotrek > On 15 May 2018, at 13:33, Ufuk Celebi wrote: > > Hey Tony, > > thanks for the detailed report. > >

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-15 Thread Timo Walther

Hi Jason, this sounds more like a network connection/firewall issue to me. Can you tell us a bit more about your environment? Are you running your Flink cluster on a cloud provider? Regards, Timo Am 15.05.18 um 05:15 schrieb Jason Kania: Hi, I am using the 1.4.2 release on ubuntu and

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter

Hi, > Am 15.05.2018 um 10:34 schrieb Juho Autio : > > Ok, that should be possible to provide. Are there any specific packages to > set on trace level? Maybe just go with org.apache.flink.* on TRACE? The following packages would be helpful:

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Juho Autio

> Btw having a trace level log of a restart from a problematic checkpoint could actually be helpful Ok, that should be possible to provide. Are there any specific packages to set on trace level? Maybe just go with org.apache.flink.* on TRACE? > did the „too many open files“ problem only happen

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter

Btw having a trace level log of a restart from a problematic checkpoint could actually be helpful if we cannot find the problem from the previous points. This can give a more detailed view of what checkpoint files are mapped to which operator. I am having one more question: did the „too many

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter

What I would like to see from the logs is (also depending a bit on your log level): - all exceptions. - in which context exactly the „too many open files“ problem occurred, because I think for checkpoint consistency it should not matter as a checkpoint with such a problem should never succeed.

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Juho Autio

Thanks all. I'll have to see about sharing the logs & configuration.. Is there something special that you'd like to see from the logs? It may be easier for me to get specific lines and obfuscate sensitive information instead of trying to do that for the full logs. We basically have:

Re: Checkpoint is not triggering as per configuration

2018-05-15 Thread Piotr Nowojski

Hi, This one: https://issues.apache.org/jira/browse/FLINK-2491 1. What if you set `org.apache.flink.streaming.api.functions.source.FileProcessingMode#PROCESS_CONTINUOUSLY`? This will prevent split source from finishing, so checkpointing

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Stefan Richter

Hi, I agree, this looks like a bug. Can you tell us your exact configuration of the state backend, e.g. if you are using incremental checkpoints or not. Are you using the local recovery feature? Are you restarting the job from a checkpoint or a savepoint? Can you provide logs for both the job

Re: Missing MapState when Timer fires after restored state

2018-05-15 Thread Aljoscha Krettek

Hi Juho, As Sihua said, this shouldn't happen and indicates a bug. Did you only encounter this once or can you easily reproduce the problem? Best, Aljoscha > On 15. May 2018, at 05:57, sihua zhou wrote: > > Hi Juho, > in fact, from your code I can't see any possible that

38 matches

Mail list logo