Efficient Stateful Processing with Time-Series Data and Enrichments

2018-05-23 Thread Mike Urbach
Hi, I have a two-part question related to processing and storing large amounts of time-series data. The first part is related to the preferred way to keep state on the time-series data in an efficient way, and the second part is about how to further enrich the processed data and feed it back into

Re: increasing parallelism increases the end2end latency in flink sql

2018-05-23 Thread Yan Zhou [FDS Science]
The BoundedOutOfOrdernessTimestampExtractor is assigned to the DataStream after the Kafka consumer. The graph looks like: KafkaSource -> map2Pojo -> BoundedOutOfOrdernessTimestampExtractor -> Table -> .. From: Yan Zhou [FDS Science] Sent:

increasing parallelism increases the end2end latency in flink sql

2018-05-23 Thread Yan Zhou [FDS Science]
Hi, My application assigns timestamps to Kafka events with BoundedOutOfOrdernessTimestampExtractor and then converts them to a table. Finally, a Flink SQL over-window aggregation is run against the table. When I double the parallelism of my Flink application, the end-to-end latency doubles. What
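
As background for this thread, here is a minimal plain-Python sketch of the idea behind a bounded-out-of-orderness watermark: the watermark trails the highest timestamp seen so far by a fixed bound, and an operator with several input channels can only advance to the minimum watermark across them. The class and function names are hypothetical and this is not Flink's API. It is also not a diagnosis of the latency reported above, only a model that may help when reasoning about it.

```python
from typing import List, Optional


class BoundedOutOfOrdernessWatermarks:
    """Toy model: watermark = highest timestamp seen so far minus a fixed bound."""

    def __init__(self, max_out_of_orderness_ms: int) -> None:
        self.max_out_of_orderness_ms = max_out_of_orderness_ms
        self.current_max_ts: Optional[int] = None

    def on_event(self, ts: int) -> int:
        # Track the largest event timestamp observed; late events do not move it back.
        if self.current_max_ts is None or ts > self.current_max_ts:
            self.current_max_ts = ts
        return ts

    def current_watermark(self) -> Optional[int]:
        # No watermark can be derived before the first event arrives.
        if self.current_max_ts is None:
            return None
        return self.current_max_ts - self.max_out_of_orderness_ms


def combined_watermark(channel_watermarks: List[Optional[int]]) -> Optional[int]:
    """Event time of an operator with several inputs: the minimum over all channels.

    Returns None while any channel has not produced a watermark yet.
    """
    if any(w is None for w in channel_watermarks):
        return None
    return min(channel_watermarks)


wm = BoundedOutOfOrdernessWatermarks(max_out_of_orderness_ms=1000)
for ts in [5000, 4200, 7000, 6500]:
    wm.on_event(ts)
print(wm.current_watermark())
print(combined_watermark([wm.current_watermark(), 3000]))
```

In this model a single slow or idle input channel holds back the combined watermark, which is one thing worth checking when the number of parallel source instances changes.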

Starting beam pipeline from savepoint

2018-05-23 Thread borisbolvig
Is it possible to start Beam pipelines from savepoints when running on Flink (v.1.4)? I am running Flink in a remote environment and executing the pipelines as .jars, specifying the Flink jobmanager via command-line arguments. This is as opposed to passing the .jar to `flink run`. In this way the jar

Re: Multiple hdfs

2018-05-23 Thread Rong Rong
+1 on viewfs, I was going to add that :-) To add to this, viewfs can be used as a federation layer for supporting multiple HDFS clusters for checkpoint/savepoint HA purposes as well. -- Rong On Wed, May 23, 2018 at 6:29 AM, Stephan Ewen wrote: > I think that Hadoop recommends

Re: Decrease initial source read speed

2018-05-23 Thread Andrei Shumanski
Hi Piotr, Thanks! I will try it. It is a bit of an ugly solution, but it may work :) > On 23 May 2018, at 16:11, Piotr Nowojski wrote: > > One more remark. Currently there is an unwritten assumption in Flink that the time > to process records is proportional to the number of bytes. As

Re: Decrease initial source read speed

2018-05-23 Thread Piotr Nowojski
One more remark. Currently there is an unwritten assumption in Flink that the time to process records is proportional to the number of bytes. As you noted, this breaks in the case of mixed workloads (especially with file paths sent as records). There is an interesting workaround for this problem though. You could

Re: Decrease initial source read speed

2018-05-23 Thread Piotr Nowojski
Hi, Yes, if you have a mixed workload in your pipeline, it is a matter of finding the right balance. The situation will be better in Flink 1.5, but the underlying issue will remain: in 1.5.0 there will also be no way to change the network buffer configuration between stages of a single job.

Re: Multiple hdfs

2018-05-23 Thread Stephan Ewen
I think that Hadoop recommends solving such setups with a viewfs:// that spans both HDFS clusters, so that the two different clusters look like different paths within one file system. This is similar to mounting different file systems into one directory tree in Unix. On Tue, May 22, 2018 at 4:41 PM, Kien
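
To illustrate the mounting analogy, here is a small plain-Python sketch of how a mount table can map paths in one logical namespace onto two underlying clusters. The mount points and namenode addresses are hypothetical, and the resolver is only a model of the idea, not Hadoop's viewfs implementation or its configuration format.

```python
from typing import Dict

# Hypothetical mount table: logical mount point -> root URI of the backing cluster.
MOUNT_TABLE: Dict[str, str] = {
    "/cluster-a": "hdfs://namenode-a:8020/",
    "/cluster-b": "hdfs://namenode-b:8020/",
}


def resolve(view_path: str, mount_table: Dict[str, str]) -> str:
    """Translate a path in the unified namespace into a URI on the backing cluster."""
    # Try longer mount points first so nested mounts would take precedence.
    for mount_point in sorted(mount_table, key=len, reverse=True):
        if view_path == mount_point or view_path.startswith(mount_point + "/"):
            rest = view_path[len(mount_point):].lstrip("/")
            return mount_table[mount_point].rstrip("/") + "/" + rest
    raise KeyError("no mount point covers " + view_path)


print(resolve("/cluster-a/checkpoints/job1", MOUNT_TABLE))
print(resolve("/cluster-b/savepoints/job1", MOUNT_TABLE))
```

The application only ever sees paths in the single logical tree; which physical cluster serves a path is decided by the mount table.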

Re: Decrease initial source read speed

2018-05-23 Thread Andrei Shumanski
Hi Piotr, Thank you very much for your response. I will try the new feature of Flink 1.5 when it is released. But I am not sure that minimising buffer sizes will work in all scenarios. If I understand correctly, these settings affect the whole Flink instance. We might have a flow like this:

Re: When is 1.5 coming out

2018-05-23 Thread Fabian Hueske
Hi Vishal, Release candidate 5 (RC5) has been published and the voting period ends later today. Unless we find a blocking issue, we can push the release out today. FYI, if you are interested in the release progress, you can subscribe to the dev mailing list (or just check out the archives at

Re: Leader Retrieval Timeout with HA Job Manager

2018-05-23 Thread Till Rohrmann
Hi Jason, sorry for the late response. The logs do indeed look strange, because both JMs are granted leadership without the other getting its leadership revoked. It would be interesting to take a look at the Znode under `/flink/flink-ha/leader//job_manager_lock`

When is 1.5 coming out

2018-05-23 Thread Vishal Santoshi
It has been some time, and I would like to know when it will officially be released.

Re: Decrease initial source read speed

2018-05-23 Thread Piotr Nowojski
Hi, Yes, Flink 1.5.0 will come with better tools to handle this problem. Namely, you will be able to limit the “in flight” data by controlling the number of assigned credits per channel/input gate. Even without any configuration, Flink 1.5.0 will buffer less data out of the box, thus mitigating
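
The general idea of limiting in-flight data with credits can be sketched in a few lines of plain Python. This is a toy model with hypothetical class names, not Flink's network stack: the receiver announces one credit per free buffer, and the sender transmits a record only while it holds a credit.

```python
from collections import deque
from typing import Deque, Iterable


class Receiver:
    """Owns a fixed number of buffers and hands them out as credits."""

    def __init__(self, buffers: int) -> None:
        self.free_buffers = buffers
        self.queue: Deque[int] = deque()

    def grant_credits(self) -> int:
        # Announce every currently free buffer as a credit and reserve it.
        credits = self.free_buffers
        self.free_buffers = 0
        return credits

    def receive(self, record: int) -> None:
        self.queue.append(record)

    def process_one(self) -> int:
        # Processing a record frees its buffer, which can be granted again later.
        record = self.queue.popleft()
        self.free_buffers += 1
        return record


class Sender:
    """Sends records only while it holds credits from the receiver."""

    def __init__(self, records: Iterable[int]) -> None:
        self.backlog: Deque[int] = deque(records)
        self.credits = 0

    def add_credits(self, n: int) -> None:
        self.credits += n

    def send_to(self, receiver: Receiver) -> int:
        sent = 0
        while self.credits > 0 and self.backlog:
            receiver.receive(self.backlog.popleft())
            self.credits -= 1
            sent += 1
        return sent


receiver = Receiver(buffers=2)
sender = Sender(range(5))
sender.add_credits(receiver.grant_credits())
print(sender.send_to(receiver))  # limited by the two granted credits
print(sender.send_to(receiver))  # nothing more until new credits arrive
```

In this model the amount of data in flight never exceeds the number of buffers the receiver has granted, so a slow receiver throttles the sender directly instead of letting data pile up in between.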

Re: strange behavior with jobmanager.rpc.address on standalone HA cluster

2018-05-23 Thread Till Rohrmann
Alright, try to grab the logs if you see this problem reoccurring. I would be interested in understanding why this happens. Cheers, Till On Fri, May 18, 2018 at 9:45 PM, Derek VerLee wrote: > Till, > > Thanks for the response. Sorry for the delayed reply. > > The flink

Re: Fwd: Decrease initial source read speed

2018-05-23 Thread Fabian Hueske
Hi Andrei, With the current version of Flink, there is no general solution to this problem. The upcoming version 1.5.0 of Flink adds a feature called credit-based flow control which might help here. I'm adding @Piotr to this thread who knows more about the details of this new feature. Best,

Re: Order of events with chained keyBy calls of same key

2018-05-23 Thread Fabian Hueske
Hi, I've posted an answer on SO. Best, Fabian 2018-05-22 18:11 GMT+02:00 Shimony, Shay : > Hi everyone, > > > > I have this question in StackOverflow, and would be happy if you could > answer. > > https://stackoverflow.com/questions/50340107/order-of- >

Re: program args size for running jobs

2018-05-23 Thread Fabian Hueske
Hi Esteban, If you need the parameters to configure specific operators (instead of the overall flow), you could pass the parameters as a file using the distributed cache [1]. Note that the docs point to the DataSet (batch) API, but the feature works the same way for DataStream programs as well.

Re: How to restore state from savepoint with flink SQL

2018-05-23 Thread Fabian Hueske
Hi, At the moment, you can only restore a query from a savepoint if the query is not modified and the same Flink version is used. Since SQL queries are automatically translated into data flows, it is not transparent to the user which operators will be created. We would need to expose an

Re: Limitations with Retract Streams on SQL

2018-05-23 Thread Fabian Hueske
Hi Gregory, Rong's analysis is correct. The UNION with duplicate elimination is translated into a UNION ALL and a subsequent grouping operator on all attributes without an aggregation function. Flink assumes that all grouping operators can produce retractions (updates) and window-grouped
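
The rewrite described here can be checked on a generic SQL engine. The following sketch uses Python's built-in sqlite3 module with two hypothetical tables `a` and `b`. It only demonstrates that UNION returns the same rows as UNION ALL followed by a GROUP BY on all attributes; it says nothing about how Flink plans or executes the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, name TEXT);
    CREATE TABLE b (id INTEGER, name TEXT);
    INSERT INTO a VALUES (1, 'x'), (2, 'y'), (2, 'y');
    INSERT INTO b VALUES (2, 'y'), (3, 'z');
    """
)

# UNION with duplicate elimination.
union_rows = sorted(
    conn.execute("SELECT id, name FROM a UNION SELECT id, name FROM b").fetchall()
)

# The same result expressed as UNION ALL plus grouping on every attribute,
# with no aggregation function.
rewritten_rows = sorted(
    conn.execute(
        """
        SELECT id, name
        FROM (SELECT id, name FROM a UNION ALL SELECT id, name FROM b)
        GROUP BY id, name
        """
    ).fetchall()
)

print(union_rows == rewritten_rows)
```

Because the second form contains a grouping operator, a streaming engine that treats every grouping operator as a potential source of updates would treat the plain UNION the same way.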

General forum of the event processing ?

2018-05-23 Thread Esa Heikkinen
Hi, This is a little bit off topic for this mailing list, but do you know of any good general forum or mailing list for (complex) event processing? Best, Esa