Also, regarding the point "First there is this thing called analog signal processing…. Is that continuous enough for you?"
I agree that analog signal processing (a sine wave, an AM radio signal) is truly continuous. However, here we are talking about digital data, which will always be sent as bytes, typically with the bytes grouped into messages. In other words, when we send data it is never truly continuous. We are sending discrete messages.

HTH,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 28 April 2016 at 17:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> In a commercial (C)EP engine such as StreamBase, or its competitor Apama, the arrival of an input event *immediately* triggers further downstream processing.
>
> This is admittedly an asynchronous approach, not a synchronous clock-driven micro-batch approach like Spark's.
>
> I suppose if one wants to split hairs / be philosophical, the clock rate of the microprocessor chip underlies everything. But I don't think that is quite the point.
>
> The point is that an asynchronous event-driven approach is as continuous / immediate as *the underlying computer hardware will ever allow*. It is not limited by an architectural software clock.
>
> So it is asynchronous vs synchronous that is the key issue, not just the exact speed of the software clock in the synchronous approach.
>
> It is also true that latencies down to the single-digit microsecond level can sometimes matter in financial trading, but rarely.
>
> HTH
>
> Dr Mich Talebzadeh
>
> http://talebzadehmich.wordpress.com
>
> On 28 April 2016 at 12:44, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> I don’t.
>> I believe that there have been a couple of hackathons, like one done in Chicago a few years back, using public transportation data.
>>
>> The first question is what sort of data you get from the city.
>>
>> It could be as simple as time_stamp, bus_id, route and GPS (x, y). Or they could provide more information, like last stop, distance to next stop, average current velocity…
>>
>> Then there is the frequency of the updates. Every second? Every 3 seconds? 5 or 6 seconds…
>>
>> This will determine how much work you have to do.
>>
>> Maybe they provide the routes of the buses via a different API call, since that data is relatively static.
>>
>> This will drive your solution more than the underlying technology.
>>
>> Oh, and while I focused on buses, there are also other modes of public transportation, like light rail, trains, etc.
>>
>> HTH
>>
>> -Mike
>>
>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>
>> Do you know any good examples of how to use Spark Streaming to track public transportation systems? Or a Storm or some other tool example?
>>
>> Regards
>> Esa Heikkinen
>>
>> On 28.4.2016 at 3:16, Michael Segel wrote:
>>
>> Uhm… I think you need to clarify a couple of things…
>>
>> First there is this thing called analog signal processing…. Is that continuous enough for you?
>>
>> But more to the point, Spark Streaming does micro-batching, so if you’re processing a continuous stream of tick data, you will have more than 50K ticks per second while the markets are open and trading. Even at 50K a second, that is one tick every 0.02 ms, or 50 ticks per ms.
>>
>> And you don’t want to wait until you have a batch to start processing; you want to process the data when it hits the queue and pull it from the queue as quickly as possible.
>>
>> Spark Streaming can pull batches in as little as 500 ms.
So if you pull a batch at t0 and immediately have a tick arrive in your queue, you won’t process that tick until t0 + 500 ms. And that batch would contain 25,000 entries.
>>
>> Depending on what you are doing, that 500 ms delay can be enough to be fatal to your trading process.
>>
>> If you don’t like stock data, there are other examples, mainly pulling data from real-time embedded systems.
>>
>> If you go back and read what I said: if the gap between events is much longer than 500 ms, and/or the time to process is much longer than 500 ms, you could use Spark Streaming. If not, and there are applications which require that type of speed, then you shouldn’t use Spark Streaming.
>>
>> If you do have that constraint, then you can look at systems like Storm / Flink / Samza / whatever, where you have a continuous queue and listener and no micro-batch delays. Then for each bolt (Storm) you can have a Spark context for processing the data. (Depending on what sort of processing you want to do.)
>>
>> To put this in perspective: if you’re using Spark Streaming / Akka / Storm / etc. to handle real-time requests from the web, 500 ms of added delay can be a long time.
>>
>> Choose the right tool.
>>
>> For the OP’s problem: sure, tracking public transportation could be done using Spark Streaming. It could also be done using half a dozen other tools, because the data is generated far less often than every 500 ms.
>>
>> HTH
>>
>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> A couple of things.
>>
>> There is no such thing as continuous data streaming, just as there is no such thing as continuous availability.
>>
>> There is such a thing as discrete data streaming and high availability, which reduce the finite unavailability to a minimum. In terms of business needs, 5 sigma is good enough and acceptable. Even candles set to a predefined time interval of, say, 2, 4 or 15 seconds overlap.
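[Editor's aside: the tick-rate arithmetic quoted above, 50K ticks per second against a 500 ms batch window, can be checked with a short back-of-the-envelope sketch. The figures are the ones given in the thread, not measurements.]

```python
# Back-of-the-envelope check of the micro-batch figures discussed in the thread.
# Assumed inputs (from the thread): 50,000 ticks/s, 500 ms batch window.
tick_rate = 50_000        # ticks per second (assumed market data rate)
batch_window_ms = 500     # smallest Spark Streaming batch interval, per the thread

ticks_per_ms = tick_rate / 1_000                    # ticks arriving each millisecond
batch_size = tick_rate * batch_window_ms // 1_000   # ticks accumulated per batch

print(f"{ticks_per_ms:.0f} ticks/ms")               # 50 ticks per ms
print(f"{batch_size} entries per batch")            # 25,000 entries
print(f"worst-case added latency: {batch_window_ms} ms")
```

This confirms the numbers in the email: 50 ticks per millisecond, and a tick that arrives just after a batch starts waits up to one full 500 ms window, in a batch of 25,000 entries.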
No FX-savvy trader makes a sell or buy decision on the basis of a 2-second candlestick.
>>
>> The calculation itself is subject to finite measurement error, as defined by its confidence level (CL) using the standard deviation.
>>
>> So far I have never come across a tool that requires that level of granularity. That stuff from Flink etc. is, in practical terms, of little value and does not make commercial sense.
>>
>> Now with regard to your needs, Spark micro-batching is perfectly adequate.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> http://talebzadehmich.wordpress.com
>>
>> On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>
>>> Hi
>>>
>>> Thanks for the answer.
>>>
>>> I have developed a log file analyzer for an RTPIS (Real Time Passenger Information System), where buses drive routes and the system tries to estimate arrival times at the bus stops. There are many different log files (and events), and the analysis situations can be very complex. Spatial data can also be included in the log data.
>>>
>>> The analyzer also has a query (or analysis) language, which describes an expected behavior. This can be a requirement on the system. The analyzer can also be thought of as a test oracle.
>>>
>>> I have published a paper (at the SPLST'15 conference) about my analyzer and its language. The paper is maybe too technical, but it can be found at: http://ceur-ws.org/Vol-1525/paper-19.pdf
>>>
>>> I do not know yet where it belongs. I think it can be some kind of "CEP with delays". Or do you know better?
>>> My analyzer can also do somewhat more complex and time-consuming analyses.
There is no need for real time.
>>>
>>> And is it possible to do "CEP with delays" reasonably with some existing analyzer (for example Spark)?
>>>
>>> Regards
>>> Esa Heikkinen
>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>>
>>> On 27.4.2016 at 15:51, Michael Segel wrote:
>>>
>>> Spark and CEP? It depends…
>>>
>>> OK, I know that’s not the answer you want to hear, but it’s a bit more complicated…
>>>
>>> If you consider Spark Streaming, you have some issues. Spark Streaming isn’t a real-time solution because it is a micro-batch solution; the smallest window is 500 ms. This means it could work if your compute time is much longer than 500 ms and/or the gap between events is much longer than 500 ms (e.g. "real-time" image processing on a system that is capturing 60 FPS, because the processing time is much longer than 500 ms).
>>>
>>> So Spark Streaming wouldn’t be the best solution….
>>>
>>> However, you can combine Spark with other technologies like Storm, Akka, etc., where you have continuous streaming. So you could instantiate a Spark context per worker in Storm…
>>>
>>> I think that if there are no class collisions between Akka and Spark, you could use Akka, which may have better potential for communication between workers. So here you can handle CEP events.
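[Editor's aside: the micro-batch vs. event-driven distinction discussed in this thread can be illustrated with a toy sketch. No real streaming framework is used; the 500 ms window and the event timestamps are illustrative assumptions.]

```python
# Toy model of the latency difference between a clock-driven micro-batch
# system and an idealised event-driven (CEP-style) system.
BATCH_MS = 500  # assumed micro-batch window, as discussed in the thread

def microbatch_processing_time(event_ms: int) -> int:
    """An event waits until the end of the batch window it falls into."""
    return ((event_ms // BATCH_MS) + 1) * BATCH_MS

def event_driven_processing_time(event_ms: int) -> int:
    """An event is handled immediately on arrival (idealised, zero overhead)."""
    return event_ms

# Events arriving early in a window pay nearly the full window as extra latency.
for t in (1, 250, 499, 501):
    delay = microbatch_processing_time(t) - event_driven_processing_time(t)
    print(f"event at t={t} ms: micro-batch adds {delay} ms of latency")
```

An event arriving 1 ms into a window is not processed until t = 500 ms, which is the "t0 + 500 ms" effect described earlier in the thread; an event-driven engine has no such floor, only whatever the hardware and queue impose.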
>>> HTH
>>>
>>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Please see my other reply.
>>>
>>> Dr Mich Talebzadeh
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>
>>>> Hi
>>>>
>>>> I have followed the discussion about CEP and Spark with interest. It is quite close to my research, which is complex analysis of log files and "history" data (not actually of real-time streams).
>>>>
>>>> I have a few questions:
>>>>
>>>> 1) Is CEP only for (real-time) stream data, and not for "history" data?
>>>>
>>>> 2) Is it possible to search "backward" (upstream) with CEP in a given time window, if the start time of the window is earlier than the current stream time?
>>>>
>>>> 3) Do you know any good tools or software for using "CEP" on log data?
>>>>
>>>> 4) Do you know any good (scientific) papers I should read about CEP?
>>>>
>>>> Regards
>>>> Esa Heikkinen
>>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>>
>>> The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental. Use at your own risk.
>>> Michael Segel
>>> michael_segel (AT) hotmail.com