On Tue, Oct 11, 2016 at 10:57 AM, Reynold Xin <r...@databricks.com> wrote:

>
> On Tue, Oct 11, 2016 at 10:55 AM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> *Complex event processing and state management:* Several groups I've
>>> talked to want to run a large number (tens or hundreds of thousands now,
>>> millions in the near future) of state machines over low-rate partitions of
>>> a high-rate stream. Covering these use cases translates roughly into a
>>> three sub-requirements: maintaining lots of persistent state efficiently,
>>> feeding tuples to each state machine in the right order, and exposing
>>> convenient programmer APIs for complex event detection and signal
>>> processing tasks.
>>>
>>
>> I've heard this one too, but don't know of anyone actively working on
>> it.  Would be awesome to open a JIRA and start discussing what the APIs
>> would look like.
>>
>
> There is an existing ticket for CEP: https://issues.apache.org
> /jira/browse/SPARK-14745
>
>
>
Yeah, Mario and Sachin opened up that CEP ticket a while back, and they had
an early prototype (https://github.com/apache/spark/pull/12518) on the old
DStream APIs. The project stalled out due to uncertainty about how state
management and streaming query languages would work on Structured
Streaming. The people who were working on it are now focusing on other
issues.

Getting CEP to work efficiently is a whole-stack affair. You need optimizer
support for things like pulling out common subexpressions from event specs
and deciding between eager vs. lazy evaluation for predicates. You need
good fine-grained state management in the engine, including support for
efficiently handling out-of-order event arrival. And CEP workloads with a
large number of interdependent, stateful tasks will put stress on the
scheduling layer.

Fred

Reply via email to