Re: Spark support for Complex Event Processing (CEP)

Mich Talebzadeh Thu, 28 Apr 2016 09:23:46 -0700

In a commercial (C)EP engine like, say, StreamBase, or its competitor
Apama, the arrival of an input event **immediately** triggers further
downstream processing.


This is admittedly an asynchronous approach, not a synchronous clock-driven
micro-batch approach like Spark's.

I suppose if one wants to split hairs / be philosophical, the clock rate of
the microprocessor chip underlies everything.  But I don't think that
is quite the point.

The point is that an asynchronous event-driven approach is as continuous /
immediate as **the underlying computer hardware will ever allow**. It
is not limited by an architectural software clock.

So it is asynchronous vs synchronous that is the key issue, not just the
exact speed of the software clock in the synchronous approach.
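
To make the distinction concrete, here is a rough sketch in plain Python. It is not StreamBase, Apama or Spark code; the function names are made up for illustration and processing time is ignored. It only compares how long an event sits waiting before processing can start under each model:

```python
def micro_batch_wait_ms(arrival_ms: int, batch_interval_ms: int) -> int:
    """Synchronous, clock-driven model: an event waits for the next batch
    boundary. An event landing exactly on a boundary is assumed to have
    missed that batch, so it waits a full interval."""
    next_boundary = (arrival_ms // batch_interval_ms + 1) * batch_interval_ms
    return next_boundary - arrival_ms


def event_driven_wait_ms(arrival_ms: int) -> int:
    """Asynchronous model: the arrival itself triggers processing, so there
    is no software clock to wait for (hardware limits are not modelled)."""
    return 0


if __name__ == "__main__":
    # Columns: arrival time, micro-batch wait, event-driven wait (all in ms).
    for arrival in (1, 250, 499, 500):
        print(arrival, micro_batch_wait_ms(arrival, 500), event_driven_wait_ms(arrival))
```

Shrinking the batch interval shrinks the micro-batch wait but never removes it, whereas the event-driven wait is zero by construction. That is the asynchronous vs synchronous point, independent of the exact clock speed.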

It is also true that latencies down to the single-digit microsecond
level can sometimes matter in financial trading, but only rarely.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 28 April 2016 at 12:44, Michael Segel <msegel_had...@hotmail.com> wrote:

> I don’t.
>
> I believe that there have been a couple of hackathons, like one done in
> Chicago a few years back using public transportation data.
>
> The first question is what sort of data do you get from the city?
>
> I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).
> Or they could provide more information. Like last stop, distance to next
> stop, avg current velocity…
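
As an illustration only, the minimal record described above could be modelled like this in Python. The class name, the comma-separated input format, and the units are all assumptions made for this sketch; a real city feed will define its own schema:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BusPosition:
    # Minimal fields mentioned above.
    time_stamp: float          # assumed: seconds since the epoch
    bus_id: str
    route: str
    x: float                   # GPS coordinate
    y: float                   # GPS coordinate
    # Optional extras a richer feed might provide (units are assumptions).
    last_stop: Optional[str] = None
    distance_to_next_stop_m: Optional[float] = None
    avg_velocity_mps: Optional[float] = None


def parse_minimal(line: str) -> BusPosition:
    """Parse a hypothetical 'time_stamp,bus_id,route,x,y' line."""
    ts, bus_id, route, x, y = line.strip().split(",")
    return BusPosition(float(ts), bus_id, route, float(x), float(y))


if __name__ == "__main__":
    print(parse_minimal("1461830400,bus42,7A,41.88,-87.63"))
```

Keeping the extra fields optional means the same record type works whether the city provides the simple or the richer variant.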
>
> Then there is the frequency of the updates. Every second? Every 3 seconds?
> 5 or 6 seconds…
>
> This will determine how much work you have to do.
>
> Maybe they provide the routes of the buses via a different API call since
> that data is relatively static.
>
> This will drive your solution more than the underlying technology.
>
> Oh and while I focused on buses, there are also rail and other modes of
> public transportation like light rail, trains, etc…
>
> HTH
>
> -Mike
>
>
> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi>
> wrote:
>
>
> Do you know of any good examples of how to use Spark Streaming for tracking
> public transportation systems?
>
> Or an example using Storm or some other tool?
>
> Regards
> Esa Heikkinen
>
> 28.4.2016, 3:16, Michael Segel wrote:
>
> Uhm…
> I think you need to clarify a couple of things…
>
> First there is this thing called analog signal processing…. Is that
> continuous enough for you?
>
> But more to the point, Spark Streaming does micro batching so if you’re
> processing a continuous stream of tick data, you will have more than 50K
> ticks per second while markets are open and trading.  Even at 50K a
> second, that would mean 1 every 0.02 ms or 50 ticks a ms.
>
> And you don’t want to wait until you have a batch to start processing, but
> you want to process when the data hits the queue and pull it from the queue
> as quickly as possible.
>
> Spark streaming will be able to pull batches in as little as 500ms. So if
> you pull a batch at t0 and immediately have a tick in your queue, you won’t
> process that data until t0+500ms. And said batch would contain 25,000
> entries.
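
The figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the 50K ticks/s rate and the 500ms batch interval are the numbers quoted above, and the function name is made up for illustration.

```python
def tick_stats(ticks_per_second: int, batch_interval_ms: int) -> dict:
    """Derive per-millisecond and per-batch figures from a tick rate."""
    return {
        # 50,000 ticks/s spread over 1,000 ms
        "ticks_per_ms": ticks_per_second / 1000,
        # average gap between two consecutive ticks
        "ms_between_ticks": 1000 / ticks_per_second,
        # ticks that pile up while one batch interval elapses
        "ticks_per_batch": ticks_per_second * batch_interval_ms // 1000,
        # a tick arriving right after a pull waits one full interval
        "worst_case_wait_ms": batch_interval_ms,
    }


if __name__ == "__main__":
    print(tick_stats(50_000, 500))
```

With those inputs the sketch gives 50 ticks per ms, 0.02 ms between ticks, and 25,000 ticks per 500ms batch, which matches the numbers in the text.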
>
> Depending on what you are doing… that 500ms delay can be enough to be
> fatal to your trading process.
>
> If you don’t like stock data, there are other examples mainly when pulling
> data from real time embedded systems.
>
>
> If you go back and read what I said, if the gap between events in your
> data flow is >> 500ms (much slower) and/or the time to process is >> 500ms
> (much longer), you could use Spark Streaming.  If not… and there are
> applications which require that type of speed… then you shouldn’t use
> Spark Streaming.
>
> If you do have that constraint, then you can look at systems like
> Storm, Flink, Samza or whatever, where you have a continuous queue and
> listener and no micro batch delays.
> Then for each bolt (Storm) you can have a Spark context for processing the
> data. (Depending on what sort of processing you want to do.)
>
> To put this in perspective… if you’re using spark streaming / akka / storm
> /etc to handle real time requests from the web, 500ms added delay can be a
> long time.
>
> Choose the right tool.
>
> For the OP’s problem: sure, tracking public transportation could be done
> using Spark Streaming. It could also be done using half a dozen other tools
> because the interval between updates is much longer than 500ms.
>
> HTH
>
>
> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> A couple of things.
>
> There is no such thing as Continuous Data Streaming, just as there is no
> such thing as Continuous Availability.
>
> There are such things as Discrete Data Streaming and High Availability, but
> they only reduce the finite unavailability to a minimum. In terms of
> business needs, 5 sigma is good enough and acceptable. Even candles set to
> a predefined time interval of, say, 2, 4 or 15 seconds overlap. No FX-savvy
> trader makes a sell or buy decision on the basis of a 2-second candlestick.
>
> Any calculation based on measurements is itself subject to a finite error,
> as defined by its Confidence Level (CL) using the standard deviation.
>
> OK, so far I have never come across a tool that requires that level of
> granularity. What Flink etc. offer here is, in practical terms, of little
> value and does not make commercial sense.
>
> Now with regard to your needs, Spark micro batching is perfectly adequate.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi>
> wrote:
>
>>
>> Hi
>>
>> Thanks for the answer.
>>
>> I have developed a log file analyzer for an RTPIS (Real Time Passenger
>> Information System), where buses drive their lines and the system tries to
>> estimate the arrival times at the bus stops. There are many different log
>> files (and events), and the analysis can be very complex. Spatial data
>> can also be included in the log data.
>>
>> The analyzer also has a query (or analysis) language, which describes an
>> expected behavior. This can be a requirement of the system. The analyzer
>> can also be thought of as a test oracle.
>>
>> I have published a paper (SPLST'15 conference) about my analyzer and its
>> language. The paper is maybe too technical, but it can be found at:
>> http://ceur-ws.org/Vol-1525/paper-19.pdf
>>
>> I do not know yet where it belongs. I think it could be some kind of "CEP
>> with delays". Or do you know better?
>> My analyzer can also do somewhat more complex and time-consuming
>> analyses. There is no need for real time.
>>
>> And is it possible to do "CEP with delays" reasonably with some existing
>> analyzer (for example Spark)?
>>
>> Regards
>> PhD student at Tampere University of Technology, Finland,
>> www.tut.fi
>> Esa Heikkinen
>>
>> 27.4.2016, 15:51, Michael Segel wrote:
>>
>> Spark and CEP? It depends…
>>
>> Ok, I know that’s not the answer you want to hear, but it’s a bit more
>> complicated…
>>
>> If you consider Spark Streaming, you have some issues.
>> Spark Streaming isn’t a Real Time solution because it is a micro batch
>> solution. The smallest window is 500ms.  This means that if your compute
>> time is >> 500ms and/or your event flow is >> 500ms, this could work
>> (e.g. 'real time' image processing on a system that is capturing 60FPS,
>> because the processing time is >> 500ms).
>>
>> So Spark Streaming wouldn’t be the best solution….
>>
>> However, you can combine spark with other technologies like Storm, Akka,
>> etc .. where you have continuous streaming.
>> So you could instantiate a spark context per worker in storm…
>>
>> I think if there are no class collisions between Akka and Spark, you
>> could use Akka, which may have a better potential for communication between
>> workers.
>> So here you can handle CEP events.
>>
>> HTH
>>
>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> please see my other reply
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi>
>> wrote:
>>
>>> Hi
>>>
>>> I have followed with interest the discussion about CEP and Spark. It is
>>> quite close to my research, which is complex analysis of log files and
>>> "history" data (not actually real time streams).
>>>
>>> I have a few questions:
>>>
>>> 1) Is CEP only for (real time) stream data and not for "history" data?
>>>
>>> 2) Is it possible to search "backward" (upstream) with CEP using a given
>>> time window, i.e. when the start time of the time window is earlier than
>>> the current stream time?
>>>
>>> 3) Do you know any good tools or software for applying CEP to log
>>> data?
>>>
>>> 4) Do you know any good (scientific) papers I should read about CEP?
>>>
>>>
>>> Regards
>>> PhD student at Tampere University of Technology, Finland,
>>> www.tut.fi
>>> Esa Heikkinen
>>>
>>>
>>>
>>
>> The opinions expressed here are mine, while they may reflect a cognitive
>> thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com
>>
>>
>>
>>
>>
>>
>>
>
