Re: kafka consumer parallelism

2017-10-02 Thread Timo Walther
Hi, I'm not a Kafka expert, but I think you need more than 1 Kafka partition to process multiple documents at the same time. Also make sure to send the documents to different partitions. Regards, Timo On 10/2/17 at 6:46 PM, r. r. wrote: Hello I'm running a job with "flink run -p5"
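Timo's point generalizes: a Kafka consumer group assigns each partition to exactly one consumer instance, so a topic with a single partition can keep at most one of the five parallel source subtasks busy. A minimal plain-Java sketch of that assignment rule (illustrative round-robin, not the actual Kafka client assignor):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionAssignment {
    // Each partition goes to exactly one consumer, so consumer instances
    // beyond the partition count simply stay idle.
    public static Map<Integer, List<Integer>> assign(int numPartitions, int numConsumers) {
        Map<Integer, List<Integer>> byConsumer = new HashMap<>();
        for (int c = 0; c < numConsumers; c++) byConsumer.put(c, new ArrayList<>());
        for (int p = 0; p < numPartitions; p++) byConsumer.get(p % numConsumers).add(p);
        return byConsumer;
    }

    public static void main(String[] args) {
        // 1 partition, 5 consumer subtasks (parallelism 5): only subtask 0 gets data.
        Map<Integer, List<Integer>> a = assign(1, 5);
        System.out.println(a.get(0)); // [0]
        System.out.println(a.get(1)); // []
    }
}
```

With 3 partitions and parallelism 5, two subtasks would still be idle, which matches what the original poster observed in the UI.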

Re: Session Window set max timeout

2017-10-02 Thread Timo Walther
Hi, I would recommend implementing a custom trigger in this case. You can override the default trigger of your window: https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/windows.html#triggers This is the interface where you can control the triggering:
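The core of such a trigger is a cap on how long a session may stay open. In a real Flink `Trigger` you would implement `onElement`/`onProcessingTime`, register a timer via the `TriggerContext`, and return `TriggerResult.FIRE_AND_PURGE` when the cap is hit; the framework-free sketch below (hypothetical class name) shows only that decision logic:

```java
import java.time.Duration;

public class MaxDurationTrigger {
    // Once a session has been open longer than maxDuration, fire-and-purge
    // regardless of whether events are still arriving.
    private final long maxDurationMs;
    private long windowStartMs = -1;

    public MaxDurationTrigger(Duration maxDuration) {
        this.maxDurationMs = maxDuration.toMillis();
    }

    /** Returns true when the window should fire (analogous to FIRE_AND_PURGE). */
    public boolean onElement(long elementTimeMs) {
        if (windowStartMs < 0) windowStartMs = elementTimeMs; // first event opens the session
        return elementTimeMs - windowStartMs >= maxDurationMs;
    }
}
```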

Re: Flink Watermark and timing

2017-10-02 Thread Timo Walther
Hi Björn, I don't know if I understood your example correctly, but I think your explanation "All events up to and equal to the watermark should be handled in the previous window" is not 100% correct. Watermarks just indicate the progress ("until here we have seen all events with a lower timestamp than
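One way to make the semantics concrete: a watermark is a promise that no more events with a timestamp at or below it are expected, and it typically trails the highest timestamp seen by a fixed out-of-orderness bound. A self-contained sketch of that bookkeeping (mirroring the idea behind Flink's `BoundedOutOfOrdernessTimestampExtractor`; class name here is illustrative):

```java
public class BoundedOutOfOrdernessWatermark {
    // The watermark trails the highest timestamp seen so far by a fixed bound,
    // leaving room for late, out-of-order events.
    private final long maxOutOfOrdernessMs;
    private long maxTimestampSeen = Long.MIN_VALUE;

    public BoundedOutOfOrdernessWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    public long onEvent(long timestampMs) {
        maxTimestampSeen = Math.max(maxTimestampSeen, timestampMs);
        return currentWatermark();
    }

    /** Promise: no further events with timestamp <= this value are expected. */
    public long currentWatermark() {
        return maxTimestampSeen - maxOutOfOrdernessMs;
    }
}
```

Note that a slightly late event does not move the watermark backwards; it only ever advances.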

Re: Avoid duplicate messages while restarting a job for an application upgrade

2017-10-02 Thread Piotr Nowojski
We are planning to work on this clean shutdown after releasing Flink 1.4. Implementing it properly would require some work, for example: - adding some checkpoint options to carry information about a “closing”/“shutting down” event - adding clean shutdown to the source functions API - implement handling

kafka consumer parallelism

2017-10-02 Thread r. r.
Hello I'm running a job with "flink run -p5" and additionally set env.setParallelism(5). The source of the stream is Kafka; the job uses FlinkKafkaConsumer010. In the Flink UI, though, I notice that if I send 3 documents to Kafka, only one 'instance' of the consumer seems to receive Kafka's records and

Re: Enriching data from external source with cache

2017-10-02 Thread Derek VerLee
Thanks Timo, watching the video now. I did try out the method with iteration in a simple prototype and it works.  But you are right, combining it with the other requirements into a single process function has so far resulted in more complexity than I'd like, and

Re: Avoid duplicate messages while restarting a job for an application upgrade

2017-10-02 Thread Antoine Philippot
Thanks Piotr for your answer. We sadly can't use Kafka 0.11 for now (and won't for a while). We cannot afford tens of thousands of duplicated messages for each application upgrade; can I help by working on this feature? Do you have any hint or details on this part of that "todo list"? Le lun. 2

Re: Avoid duplicate messages while restarting a job for an application upgrade

2017-10-02 Thread Piotr Nowojski
Hi, for failure recovery with Kafka 0.9 it is not possible to avoid duplicated messages. Using Flink 1.4 (not yet released) combined with Kafka 0.11, it will be possible to achieve exactly-once end-to-end semantics when writing to Kafka. However, this is still a work in progress:

Session Window set max timeout

2017-10-02 Thread ant burton
Is it possible to limit session windowing to a max of n seconds/hours etc? i.e. I would like a session window, but if a session runs for an unacceptable amount of time, I would like to close it. Thanks,

At end of complex parallel flow, how to force end step with parallel=1?

2017-10-02 Thread Garrett Barton
I have a complex algorithm implemented using the DataSet API, and by default it runs with parallelism 90 for good performance. At the end I want to perform a clustering of the resulting data, and to do that correctly I need to pass all the data through a single thread/process. I read in the docs that as
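In the DataSet API the usual answer is to set the parallelism on the final operator only, e.g. `result.reduceGroup(new Clustering()).setParallelism(1)`, so the rest of the plan keeps its parallelism of 90. The plain-Java sketch below (hypothetical names, no Flink dependency) illustrates the shape of that plan: parallel partials feeding one single-threaded global step:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FinalSingleThreadStep {
    // Per-slot work: each of the N parallel instances sees only its slice.
    public static int parallelPartial(List<Integer> slice) {
        return slice.stream().mapToInt(Integer::intValue).sum();
    }

    // Global step: runs single-threaded and sees every partial result at once,
    // like the last operator of the plan with .setParallelism(1).
    public static int globalStep(List<Integer> partials) {
        return partials.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        List<List<Integer>> slices =
                Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4), Arrays.asList(5));
        List<Integer> partials = slices.stream()
                .map(FinalSingleThreadStep::parallelPartial)
                .collect(Collectors.toList());
        System.out.println(globalStep(partials)); // 15
    }
}
```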

Avoid duplicate messages while restarting a job for an application upgrade

2017-10-02 Thread Antoine Philippot
Hi, I'm working on a Flink streaming app with a Kafka 0.9 to Kafka 0.9 use case which handles around 100k messages per second. To upgrade our application we used to run a flink cancel with savepoint command followed by a flink run with the previously saved savepoint and the new application fat jar as
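The upgrade flow described above can be sketched with the Flink 1.3-era CLI; the job id, savepoint directory, and jar name below are placeholders:

```shell
# 1. Cancel the running job while atomically taking a savepoint:
flink cancel -s hdfs:///flink/savepoints <jobId>

# 2. Start the new application fat jar from the savepoint just written:
flink run -s hdfs:///flink/savepoints/savepoint-<id> -p 5 new-app-fat.jar
```

Note this gives at-least-once between cancel and restart with Kafka 0.9, which is exactly the duplicate-message window discussed in this thread.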

Fwd: Consult about flink on mesos cluster

2017-10-02 Thread Bo Yu
Hello all, this is Bo. I ran into some problems when I tried to use Flink on my Mesos cluster (1 master, 2 slaves (CPU has 32 cores)). I tried to start mesos-appmaster.sh in Marathon; the job manager started without problems. mesos-appmaster.sh -Djobmanager.heap.mb=1024

Re: how many 'run -c' commands to start?

2017-10-02 Thread r. r.
Thanks, Chesnay, that was indeed the problem. It also explains why -p5 was not working for me from the cmdline Best regards Robert > Original message >From: Chesnay Schepler ches...@apache.org >Subject: Re: how many 'run -c' commands to start? >To: "r. r."

Re: How flink monitor source stream task(Time Trigger) is running?

2017-10-02 Thread yunfan123
Thank you. "If SourceFunction.run methods returns without an exception Flink assumes that it has cleanly shutdown and that there were simply no more elements to collect/create by this task. " This sentence solve my confusion. -- Sent from:

Re: How to deal with blocking calls inside a Sink?

2017-10-02 Thread Federico D'Ambrosio
Hi Timo, thank you for your response. Just yesterday I tried using the JDBC connector and unfortunately I found out that the HivePreparedStatement and HiveStatement implementations still don't have an addBatch implementation, whose interface is used in the connector. The first dirty solution

Flink Watermark and timing

2017-10-02 Thread Björn Zachrisson
Hi, I have a question regarding the timing of events. According to https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/event_time.html#event-time-and-watermarks all events up to and equal to the watermark should be handled in "the previous window". In my case I use event-timestamp. I'm

In-memory cache

2017-10-02 Thread Marchant, Hayden
We have an operator in our streaming application that needs to access 'reference data' that is updated by another Flink streaming application. This reference data has about ~10,000 entries and has a small footprint. This reference data needs to be updated ~ every 100 ms. The required latency
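For reference data this small (~10,000 entries) and refreshed this often (~100 ms), one option is an operator-local snapshot cache that reloads on a fixed interval. A framework-free sketch (hypothetical class; `loader` stands in for whatever store the other Flink job publishes to):

```java
import java.util.Map;
import java.util.function.Supplier;

public class RefreshingCache<K, V> {
    // Operator-local reference data, reloaded whenever the snapshot is older
    // than refreshIntervalMs. Time is passed in explicitly for testability.
    private final Supplier<Map<K, V>> loader;
    private final long refreshIntervalMs;
    private volatile Map<K, V> snapshot = Map.of();
    private volatile long loadedAtMs = Long.MIN_VALUE;

    public RefreshingCache(Supplier<Map<K, V>> loader, long refreshIntervalMs) {
        this.loader = loader;
        this.refreshIntervalMs = refreshIntervalMs;
    }

    public V get(K key, long nowMs) {
        if (loadedAtMs == Long.MIN_VALUE || nowMs - loadedAtMs >= refreshIntervalMs) {
            snapshot = loader.get();   // swap in a fresh snapshot atomically
            loadedAtMs = nowMs;
        }
        return snapshot.get(key);
    }
}
```

Whether a 100 ms refresh meets the latency requirement depends on how quickly the producing job publishes its updates; connecting the two streams directly would avoid the polling interval entirely.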

Re: How to deal with blocking calls inside a Sink?

2017-10-02 Thread Timo Walther
Hi Federico, would it help to buffer events first and perform batches of insertions for better throughput? I saw some similar work recently here: https://tech.signavio.com/2017/postgres-flink-sink But I would first try the AsyncIO approach, because this is actually a use case it was made
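The buffering idea can be kept separate from the Flink plumbing: collect records and hand them to a flush callback in batches, calling the same flush from the sink's `close()` (and on checkpoints, if at-least-once delivery matters). A framework-free sketch with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class BatchingBuffer<T> {
    // Buffers records and hands them to flushFn in batches, the way a
    // RichSinkFunction can batch Hive/JDBC inserts instead of writing per event.
    private final int batchSize;
    private final Consumer<List<T>> flushFn;
    private final List<T> buffer = new ArrayList<>();

    public BatchingBuffer(int batchSize, Consumer<List<T>> flushFn) {
        this.batchSize = batchSize;
        this.flushFn = flushFn;
    }

    public void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) flush();
    }

    /** Also call this from close()/snapshot so the tail of the buffer is not lost. */
    public void flush() {
        if (!buffer.isEmpty()) {
            flushFn.accept(new ArrayList<>(buffer)); // hand over a copy, then reset
            buffer.clear();
        }
    }
}
```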

Re: Enriching data from external source with cache

2017-10-02 Thread Timo Walther
Hi Derek, maybe the following talk can inspire you on how to do this with joins and async IO: https://www.youtube.com/watch?v=Do7C4UJyWCM (around the 17th minute). Basically, you split the stream and wait for an Async IO result in a downstream operator. But I think having a transient guava cache
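The async I/O side of that suggestion boils down to issuing one non-blocking lookup per event and joining the results later, which is what `AsyncDataStream` does for you in Flink. A plain-Java analogue using `CompletableFuture` (the lookup function is a hypothetical stand-in for a real enrichment call):

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

public class AsyncEnrichment {
    // One non-blocking lookup per event; results are joined afterwards rather
    // than blocking the pipeline on each individual call.
    public static CompletableFuture<String> lookup(String key) {
        return CompletableFuture.supplyAsync(() -> key + "-enriched");
    }

    public static List<String> enrichAll(List<String> keys) {
        List<CompletableFuture<String>> futures = keys.stream()
                .map(AsyncEnrichment::lookup)
                .collect(Collectors.toList());          // fire all lookups first
        return futures.stream()
                .map(CompletableFuture::join)           // then collect in input order
                .collect(Collectors.toList());
    }
}
```

Because all futures are created before any is joined, the lookups overlap, while output order still matches input order (as with Flink's `orderedWait`).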

Re: ArrayIndexOutOfBoundExceptions while processing valve output watermark and while applying ReduceFunction in reducing state

2017-10-02 Thread Federico D'Ambrosio
As a followup: the flink job has currently an uptime of almost 24 hours, with no checkpoint failed or restart whereas, with async snapshots, it would have already crashed 50 or so times. Regards, Federico 2017-09-30 19:01 GMT+02:00 Federico D'Ambrosio < federico.dambro...@smartlab.ws>: > Thank

How to deal with blocking calls inside a Sink?

2017-10-02 Thread Federico D'Ambrosio
Hi, I've implemented a sink for Hive as a RichSinkFunction, but once I integrated it into my current Flink job, I noticed that the processing of the events slowed down badly, I guess because of some blocking calls that need to be made when interacting with the Hive streaming API. So, what can be

Re: Windowing isn't applied per key

2017-10-02 Thread Timo Walther
Hi Marcus, at first glance your pipeline looks correct. It should not be executed with a parallelism of one if not specified explicitly. Which time semantics are you using? If it is event time, I would check your timestamp and watermark assignment. Maybe you can also check in the web