Union of streams performance issue (10x)

2019-07-13 Thread Peter Zende
Hi all, We have a pipeline (runs on YARN, Flink v1.7.1) which consumes a union of Kafka and HDFS sources. We noticed that the throughput is 10 times higher if only one of these sources is consumed. While trying to identify the problem I implemented a no-op source which was unioned with one of the…
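The symptom in this thread — a union of a fast source and a slow source running far slower than either alone — can be illustrated outside Flink. A minimal Python sketch (the `round_robin_union` function is hypothetical, not Flink API; Flink's actual union semantics are more involved) showing how an interleaving merge is paced by its slowest input:

```python
def round_robin_union(*sources):
    """Round-robin merge of several iterators, loosely analogous to how a
    union operator interleaves its inputs: one slow or no-op input keeps
    getting polled and throttles the merged stream."""
    iters = [iter(s) for s in sources]
    while iters:
        for it in list(iters):   # iterate over a copy so we can remove
            try:
                yield next(it)
            except StopIteration:
                iters.remove(it)

fast = (f"kafka-{i}" for i in range(4))
slow = (f"hdfs-{i}" for i in range(2))
print(list(round_robin_union(fast, slow)))
# ['kafka-0', 'hdfs-0', 'kafka-1', 'hdfs-1', 'kafka-2', 'kafka-3']
```

The sketch only models the interleaving; in a real Flink job, network buffers and backpressure between the unioned subtasks add further coupling.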

Re: Retain metrics counters across task restarts

2019-04-16 Thread Peter Zende
Hi Zhijiang, Thanks for the clarification; we were thinking about the very same solution, so we'll go in this direction. Best, Peter. zhijiang wrote (on Mon, Apr 15, 2019, 4:28): > Hi Peter, > > The lifecycle of these metrics is coupled with the lifecycle of the task, so the > metrics would be…

Retain metrics counters across task restarts

2019-04-13 Thread Peter Zende
Hi all, We're exposing metrics from our Flink (v1.7.1) pipeline to Prometheus, e.g. the total number of processed records. This works fine until any of the tasks is restarted within this YARN application. Then the counter is reset and it starts incrementing values from 0. How can we retain…
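Since (per the reply in this thread) metric lifecycle is tied to task lifecycle, the usual workaround is to back the counter with checkpointed state and re-register the metric from the restored value when the task reopens. A minimal Python sketch of that pattern — the `RestorableCounter` class is hypothetical, not a Flink or Prometheus API:

```python
class RestorableCounter:
    """Counter whose value is kept in checkpointed state, so a restarted
    task resumes from the last snapshot instead of zero."""

    def __init__(self, restored=0):
        # On restart, pass the value recovered from state; first run uses 0.
        self.value = restored

    def inc(self, n=1):
        self.value += n

    def snapshot(self):
        # In Flink this would happen in snapshotState(); here it just
        # returns the value to be persisted alongside the checkpoint.
        return self.value

# First run of the task: count 5 records, then checkpoint.
c = RestorableCounter()
for _ in range(5):
    c.inc()
state = c.snapshot()

# Task restarts: the metric is re-registered from the restored state.
c2 = RestorableCounter(restored=state)
c2.inc()
print(c2.value)  # 6, not 1
```

The same idea maps onto Flink by storing the count in operator/keyed state and initializing the Prometheus counter from it in `open()`.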

Sharing savepoints between pipelines

2019-04-13 Thread Peter Zende
Hi all, Our intention is to create a savepoint from the current prod pipeline (running on Flink 1.7.1) and bring up another one behind the scenes using this savepoint, to avoid reprocessing all the data to build up the local state again. It looks like it's technically possible but we're unsure about the…
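The essence of the question — one pipeline writes a state snapshot, a second pipeline bootstraps from it — can be sketched with plain files. This is only an analogy (`take_savepoint` / `start_from_savepoint` are hypothetical names, not Flink CLI or API); in Flink the real caveat is that operator UIDs and state types in the second job must be compatible with the savepoint:

```python
import json
import os
import tempfile

def take_savepoint(state, path):
    """Persist the pipeline's keyed state (analogous to `flink savepoint`)."""
    with open(path, "w") as f:
        json.dump(state, f)

def start_from_savepoint(path):
    """Bootstrap a second pipeline's state (analogous to `flink run -s <path>`)."""
    with open(path) as f:
        return json.load(f)

prod_state = {"user-1": 10, "user-2": 3}
path = os.path.join(tempfile.mkdtemp(), "savepoint.json")
take_savepoint(prod_state, path)

# The shadow pipeline starts with the same local state, no reprocessing.
shadow_state = start_from_savepoint(path)
print(shadow_state == prod_state)  # True
```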

Re: Stopping of a streaming job empties state store on HDFS

2018-06-15 Thread Peter Zende
…state. Thanks, Peter. On 2018-06-11 11:31 GMT+02:00, Stefan Richter wrote: > Hi, > > On 08.06.2018 at 01:16, Peter Zende wrote: > > Hi all, > > We have a streaming pipeline (Flink 1.4.2) for which we implemented > stoppable sources to be able to gracefully…

Stopping of a streaming job empties state store on HDFS

2018-06-07 Thread Peter Zende
Hi all, We have a streaming pipeline (Flink 1.4.2) for which we implemented stoppable sources to be able to gracefully exit from the job with the YARN state "finished/succeeded". This works fine; however, after creating a savepoint, stopping the job (stop event) and restarting it, we noticed that the…

MapWithState for two keyed stream

2018-05-09 Thread Peter Zende
Hi all, Is it possible to define two DataStream sources - one which reads from Kafka, the other from HDFS - and apply mapWithState with a CoFlatMapFunction? The idea would be to read historical data from HDFS along with the live stream from Kafka and, based on some business logic, write the output to…
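The two-input pattern asked about here — one callback per stream sharing per-key state, as in a `CoFlatMapFunction` on connected keyed streams — can be sketched in Python. The `merge_keyed_streams` function is a hypothetical illustration, not Flink API, and it deliberately processes the historical input first; in a real connected-streams job there is no ordering guarantee between the two inputs, which is exactly the tricky part of this design:

```python
from collections import defaultdict

def merge_keyed_streams(historical, live):
    """Per-key state fed by two inputs, analogous to connect() plus a
    CoFlatMapFunction: flatMap1 seeds state from the HDFS records,
    flatMap2 enriches live Kafka records with that state."""
    state = defaultdict(list)
    out = []
    for key, value in historical:   # "flatMap1": load historical data into state
        state[key].append(value)
    for key, value in live:         # "flatMap2": join live records against state
        out.append((key, value, list(state[key])))
    return out

hist = [("a", 1), ("a", 2), ("b", 7)]
live = [("a", 99), ("c", 5)]
print(merge_keyed_streams(hist, live))
# [('a', 99, [1, 2]), ('c', 5, [])]
```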

Init RocksDB state backend during startup

2018-05-04 Thread Peter Zende
Hi, We use RocksDB with FsStateBackend (HDFS) to store the state used by the mapWithState operator. Is it possible to initialize / populate this state during streaming application startup? Our intention is to reprocess the historical data from HDFS in a batch job and save the latest state of the…
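The two-phase idea described in this thread — a batch job reduces the history to the latest state per key, and the streaming job starts from that state — can be sketched as follows. Both functions are hypothetical illustrations, not Flink API (in modern Flink this is what the State Processor API addresses, but that arrived after the 1.7 era this thread is about):

```python
def latest_state_per_key(records):
    """Batch step: reduce historical (key, timestamp, value) records to
    the latest value per key."""
    latest = {}
    for key, ts, value in records:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    return {k: v for k, (ts, v) in latest.items()}

def run_stream(initial_state, events):
    """Streaming step: a mapWithState-like update, seeded with the
    batch-computed state instead of starting empty."""
    state = dict(initial_state)
    for key, value in events:
        state[key] = value
    return state

hdfs = [("a", 1, "old"), ("a", 2, "new"), ("b", 1, "x")]
state0 = latest_state_per_key(hdfs)
print(run_stream(state0, [("c", "y")]))
# {'a': 'new', 'b': 'x', 'c': 'y'}
```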

Wiring batch and stream together

2018-05-02 Thread Peter Zende
Hi, We have a Flink streaming pipeline (1.4.2) which reads from Kafka, uses mapWithState with RocksDB, and writes the updated states to Cassandra. We would also like to reprocess the ingested records from HDFS. For this we are considering computing the latest state of the records over the whole dataset in…
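Because the sink described here does keyed upserts, "computing the latest state over the whole dataset" reduces to a last-write-wins fold per key, which converges to the same final rows regardless of replay order. A minimal Python sketch (the `upsert_latest` function is a hypothetical stand-in for the batch reprocessing step, not Flink or Cassandra API):

```python
def upsert_latest(rows):
    """Last-write-wins upserts over (key, timestamp, value) rows, in the
    spirit of a keyed Cassandra table: replaying the full history in any
    order yields the row with the highest timestamp per key."""
    table = {}
    for key, ts, value in rows:
        if key not in table or ts >= table[key][0]:
            table[key] = (ts, value)
    return {k: v for k, (ts, v) in table.items()}

# Rows arrive out of order, as they might when reprocessing from HDFS.
history = [("u1", 3, "c"), ("u1", 1, "a"), ("u2", 2, "b")]
print(upsert_latest(history))
# {'u1': 'c', 'u2': 'b'}
```

The order-independence is what makes a batch recomputation over the whole dataset safe to wire next to the live streaming path.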