Hi Stephan,

I'm skeptical about two things:

- Using processing time will result in inaccurately bounded aggregates (or do you want to group by event time in a processing-time window?)
- Writing to and reading from Cassandra might be expensive (I'm not sure what you mean by "cheaper in the end"), and it is not integrated with Flink's checkpointing mechanism for fault tolerance.
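To illustrate the first point with a minimal sketch (plain Java, no Flink APIs; the timestamps and window size are made up): a reading that arrives one window length late lands in a different 15-minute bucket under processing time than the one its event time belongs to, so the per-window sums no longer reflect when the values were actually measured.

```java
public class WindowAssignment {
    // Usual tumbling-window floor: start of the window containing ts.
    static long windowStart(long ts, long size) {
        return ts - (ts % size);
    }

    public static void main(String[] args) {
        long size = 15 * 60 * 1000L;          // 15-minute windows
        long eventTime = 1_000_000L;          // when the reading was taken
        long arrivalTime = eventTime + size;  // it arrives one window later

        long byEventTime = windowStart(eventTime, size);       // 900000
        long byProcessingTime = windowStart(arrivalTime, size); // 1800000

        // Event-time assignment puts the reading in the window it belongs
        // to; processing-time assignment buckets it by arrival, so the
        // delayed reading is counted in the wrong aggregate.
        System.out.println(byEventTime == byProcessingTime); // false
    }
}
```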
To me, the stateful FlatMapOperator looks like the best approach. There is an upcoming feature for registering timers in user functions, i.e., a function is called when the timer expires. This could help to overcome the problem of closing a window when no new data arrives.

Best, Fabian

2016-11-14 8:39 GMT+01:00 Stephan Epping <stephan.epp...@zweitag.de>:

> Hello Fabian,
>
> Thank you very much. What is your opinion on the following solution:
>
> - Window data per time window, e.g. 15 minutes
> - using processing time as trigger, e.g. every 15 minutes
> - which results in an aggregate over sensor values
> - then use Cassandra to select the previous aggregate (as there can be
> multiple for the same time window due to processing time)
> - then update the aggregate and put it into a Cassandra sink again
>
> The Cassandra select will be a bit slower than using in-memory/Flink
> state, but will be cheaper in the end. Further, what consequences does
> this have? For example, replaying events will be more difficult, right?
> Also, what about snapshots? Will they work with the mentioned design?
>
> kind regards,
> Stephan
>
>
> On 11 Nov 2016, at 00:39, Fabian Hueske <fhue...@gmail.com> wrote:
>
> Hi Stephan,
>
> I just wrote an answer to your SO question.
>
> Best, Fabian
>
> 2016-11-10 11:01 GMT+01:00 Stephan Epping <stephan.epp...@zweitag.de>:
>
>> Hello,
>>
>> I found this question in the Nabble archive (
>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Maintaining-watermarks-per-key-instead-of-per-operator-instance-tp7288.html
>> ) but was unable to / don't know how to reply.
>>
>> Here is my question regarding the mentioned thread:
>>
>> Hello,
>>
>> I have similar requirements (see StackOverflow
>> http://stackoverflow.com/questions/40465335/apache-flink-multiple-window-aggregations-and-late-data).
>> I am pretty new to Flink, could you elaborate on a possible solution?
>> We can guarantee good ordering by sensor_id, so watermarking by key would
>> be the only reasonable way for us
>> (*sensorData.keyBy('id').timeWindow(1.minute).sum('value')*). Could I do
>> my own watermarking after *sensorData.keyBy('id').overwriteWatermarking()*...
>> per key? Or maybe use custom state plus a custom trigger? What happens if
>> a sensor dies or is removed completely? How can this be detected, since
>> watermarks would be ignored for window garbage collection? Or could we
>> dynamically schedule a job for each sensor? That would result in 1000 jobs.
>>
>> Thanks,
>> Stephan
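The "stateful FlatMapOperator plus registered timers" idea from Fabian's reply can be sketched outside of Flink roughly like this: per key we keep a running aggregate and a flush deadline, and when the clock passes the deadline the window is emitted even if no new record for that key ever arrives. This is a single-threaded simulation in plain Java; in Flink the same shape would use keyed state and the (then upcoming) timer callback, and all names here are illustrative, not Flink API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class KeyedWindowFlusher {
    static final long WINDOW = 15 * 60 * 1000L; // 15-minute tumbling windows

    final Map<String, Double> sums = new HashMap<>();
    final Map<String, Long> deadlines = new HashMap<>();
    final List<String> emitted = new ArrayList<>();

    // Analogous to flatMap(): fold the value into per-key state and
    // register a flush deadline the first time we see the key's window.
    void process(String key, double value, long eventTime) {
        sums.merge(key, value, Double::sum);
        deadlines.putIfAbsent(key, eventTime - (eventTime % WINDOW) + WINDOW);
    }

    // Analogous to the timer callback: flush every key whose deadline has
    // passed, whether or not new data arrived for it.
    void onTimer(long now) {
        Iterator<Map.Entry<String, Long>> it = deadlines.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (e.getValue() <= now) {
                emitted.add(e.getKey() + "=" + sums.remove(e.getKey()));
                it.remove();
            }
        }
    }

    public static void main(String[] args) {
        KeyedWindowFlusher f = new KeyedWindowFlusher();
        f.process("sensor-1", 1.0, 100_000L);
        f.process("sensor-1", 2.0, 200_000L);
        f.process("sensor-2", 5.0, 150_000L);
        // sensor-2 goes silent; the timer still closes its window.
        f.onTimer(WINDOW);
        System.out.println(f.emitted); // both keys flushed despite no new data
    }
}
```

The timer is what addresses the "sensor dies or is removed" question above: the window for a silent key is closed by the deadline, not by a watermark that would never advance for that key.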