Re: can we do Flink CEP on event stream or batch or both?

2019-06-19 Thread kant kodali
Hi Fabian,

Actually, now that I had gone through my use case I can say that the
equality matches are more like expressions.

for example the *sum(col1, col2) of datasetA = col3 datasetB.*

And these expressions can include, sum, if & else, trim, substring,
absolute_value etc.. and they are submitted by the user in Adhoc fashion.
My job is to apply these expressions on two different streams and identify
the breaks and report.

Any suggestions would be appreciated.

Thanks!



On Tue, Apr 30, 2019 at 2:20 AM Fabian Hueske  wrote:

> Hi,
>
> Stateful streaming applications are typically designed to run continuously
> (i.e., until forever or until they are not needed anymore or replaced).
> May jobs run for weeks or months.
>
> IMO, using CEP for "simple" equality matches would add too much complexity
> for a use case that can be easily solved with a stateful function.
> If your task is to ensure that two streams have the same events, I'd
> recommend to implement a custom DataStream application with a stateful
> ProcessFunction.
> Holding state for two years is certainly possible if you know exactly
> which events to keep, i.e., you do not store the full stream but only those
> few events that have not had a match yet.
>
> If you need to run the same logic also on batch data, you might want to
> check if you can use SQL or the Table API which are designed to work on
> static and streaming data with the same processing semantics.
>
> Best,
> Fabian
>
>
> Am Di., 30. Apr. 2019 um 06:49 Uhr schrieb kant kodali  >:
>
>> Hi All,
>>
>> I have the following questions.
>>
>> 1) can we do Flink CEP on event stream or batch?
>> 2) If we can do streaming I wonder how long can we keep the stream
>> stateful? I also wonder if anyone successfully had done any stateful
>> streaming for days or months(with or without CEP)? or is stateful streaming
>> is mainly to keep state only for a few hours?
>>
>> I have a use case where events are ingested from multiple sources and in
>> theory, the sources are supposed to have the same events however in
>> practice the sources will not have the same events so when the events are
>> ingested from multiple sources the goal is to detect where the "breaks"
>> are(meaning the missing events like exists in one source but not in other)?
>> so I realize this is the typical case for CEP.
>>
>> Also, in this particular use case events that supposed to come 2 years
>> ago can come today and if so, need to update those events also in real time
>> or near real time. Sure there wouldn't be a lot of events that were missed
>> 2 years ago but there will be a few. What would be the best approach?
>>
>> One solution I can think of is to do Stateful CEP with a window of one
>> day or whatever short time period where most events will occur and collect
>> the events that fall beyond that time period(The late ones) into some Kafka
>> topic and have a separate stream analyze the time period of the late ones,
>> construct the corresponding NFA and run through it again.  Please let me
>> know how this sounds or if there is a better way to do it.
>>
>> Thanks!
>>
>>
>>
>>


Re: can we do Flink CEP on event stream or batch or both?

2019-04-30 Thread Fabian Hueske
Hi,

Stateful streaming applications are typically designed to run continuously
(i.e., until forever or until they are not needed anymore or replaced).
May jobs run for weeks or months.

IMO, using CEP for "simple" equality matches would add too much complexity
for a use case that can be easily solved with a stateful function.
If your task is to ensure that two streams have the same events, I'd
recommend to implement a custom DataStream application with a stateful
ProcessFunction.
Holding state for two years is certainly possible if you know exactly which
events to keep, i.e., you do not store the full stream but only those few
events that have not had a match yet.

If you need to run the same logic also on batch data, you might want to
check if you can use SQL or the Table API which are designed to work on
static and streaming data with the same processing semantics.

Best,
Fabian


Am Di., 30. Apr. 2019 um 06:49 Uhr schrieb kant kodali :

> Hi All,
>
> I have the following questions.
>
> 1) can we do Flink CEP on event stream or batch?
> 2) If we can do streaming I wonder how long can we keep the stream
> stateful? I also wonder if anyone successfully had done any stateful
> streaming for days or months(with or without CEP)? or is stateful streaming
> is mainly to keep state only for a few hours?
>
> I have a use case where events are ingested from multiple sources and in
> theory, the sources are supposed to have the same events however in
> practice the sources will not have the same events so when the events are
> ingested from multiple sources the goal is to detect where the "breaks"
> are(meaning the missing events like exists in one source but not in other)?
> so I realize this is the typical case for CEP.
>
> Also, in this particular use case events that supposed to come 2 years ago
> can come today and if so, need to update those events also in real time or
> near real time. Sure there wouldn't be a lot of events that were missed 2
> years ago but there will be a few. What would be the best approach?
>
> One solution I can think of is to do Stateful CEP with a window of one day
> or whatever short time period where most events will occur and collect the
> events that fall beyond that time period(The late ones) into some Kafka
> topic and have a separate stream analyze the time period of the late ones,
> construct the corresponding NFA and run through it again.  Please let me
> know how this sounds or if there is a better way to do it.
>
> Thanks!
>
>
>
>


can we do Flink CEP on event stream or batch or both?

2019-04-29 Thread kant kodali
Hi All,

I have the following questions.

1) can we do Flink CEP on event stream or batch?
2) If we can do streaming I wonder how long can we keep the stream
stateful? I also wonder if anyone successfully had done any stateful
streaming for days or months(with or without CEP)? or is stateful streaming
is mainly to keep state only for a few hours?

I have a use case where events are ingested from multiple sources and in
theory, the sources are supposed to have the same events however in
practice the sources will not have the same events so when the events are
ingested from multiple sources the goal is to detect where the "breaks"
are(meaning the missing events like exists in one source but not in other)?
so I realize this is the typical case for CEP.

Also, in this particular use case events that supposed to come 2 years ago
can come today and if so, need to update those events also in real time or
near real time. Sure there wouldn't be a lot of events that were missed 2
years ago but there will be a few. What would be the best approach?

One solution I can think of is to do Stateful CEP with a window of one day
or whatever short time period where most events will occur and collect the
events that fall beyond that time period(The late ones) into some Kafka
topic and have a separate stream analyze the time period of the late ones,
construct the corresponding NFA and run through it again.  Please let me
know how this sounds or if there is a better way to do it.

Thanks!