Hi
I'll try to explain my case.
The situation with my logs and solution is not so simple. There are many
types of logs, and they come from many sources.
They are in CSV format, and the header line contains the column names.
This is a simplified description of the input logs:
LOG A: bus coordinate logs (every bus has its own log):
- timestamp
- bus number
- coordinates
LOG B: bus login/logout (to/from line) message log:
- timestamp
- bus number
- line number
LOG C: log from central computers:
- timestamp
- bus number
- bus stop number
- estimated arrival time to bus stop
LOG A is updated every 30 seconds (I also have another system with a
1-second interval). LOG B is updated when a bus starts from the terminal
bus stop and when it arrives at the final bus stop on a line. LOG C is
updated when the central computer sends a new arrival time estimate to a
bus stop.
I also need metadata for the logs (and the analyzer), for example the
coordinates of the bus stop areas.
The main purpose of the analysis is to check the accuracy (error) of the
estimated arrival times at the bus stops.
Because there are many buses and lines, it is too time-consuming to
check all of them, so I check only specific lines with specific bus
stops. Many buses (logged in to lines) come to a given bus stop, and I
am interested in only a certain bus.
To do that, I have to read the logs partly out of time order
("upstream"), in this sequence:
1. From LOG C, find the bus number.
2. From LOG A, find when the bus left the terminal bus stop.
3. From LOG B, find when the bus sent a login to the line.
4. From LOG A, find when the bus entered the bus stop area.
5. From LOG C, find the last estimated arrival time for the bus stop,
and calculate the error between the real and estimated values.
To my understanding, (almost) all log file analyzers read all the data
(lines) from the log files in time order. I need only a specific part
of the log (lines). To achieve that, my solution reads the logs in an
arbitrary order (within a given time window).
I know this solution is not suitable for all cases (for example very
fast analysis or very big data). It is suitable for very complex
(targeted) analysis. It can be too slow and memory-consuming, but
well-done pre-processing of the log data can help a lot.
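The search sequence above can be sketched in a few lines of plain Python. This is only a toy illustration: the data values, column names, and stop-area metadata are all invented, and real logs would of course be CSV files with a header line rather than in-memory lists.

```python
# Toy sketch of the targeted, out-of-time-order search described above.
# All data, names, and coordinates here are hypothetical.

# LOG C: central computer (timestamp, bus, stop, estimated arrival time)
log_c = [
    {"ts": 100, "bus": "42", "stop": "7", "eta": 175},
    {"ts": 150, "bus": "42", "stop": "7", "eta": 172},
]
# LOG B: login/logout messages (timestamp, bus, line)
log_b = [{"ts": 90, "bus": "42", "line": "3"}]
# LOG A: coordinate log for bus 42 (timestamp, bus, coordinates)
log_a = [
    {"ts": 95,  "bus": "42", "xy": (0.0, 0.0)},   # at terminal stop
    {"ts": 170, "bus": "42", "xy": (5.0, 9.0)},   # inside stop 7's area
]
# Metadata: bus stop number -> stop area coordinates (exact match assumed)
stop_area = {"7": (5.0, 9.0)}

def arrival_error(bus, stop):
    # 1. From LOG C, find the records for this bus and stop.
    estimates = [r for r in log_c if r["bus"] == bus and r["stop"] == stop]
    # 2. From LOG A, find when the bus left the terminal stop.
    departure = min(r["ts"] for r in log_a if r["bus"] == bus)
    # 3. From LOG B, find when the bus logged in to the line.
    login = next(r["ts"] for r in log_b if r["bus"] == bus)
    # 4. From LOG A, find when the bus entered the stop area.
    arrival = next(r["ts"] for r in log_a
                   if r["bus"] == bus and r["xy"] == stop_area[stop])
    # 5. Take the last estimate sent before the real arrival and
    #    compute the error (real minus estimated arrival time).
    last = max((r for r in estimates if r["ts"] <= arrival),
               key=lambda r: r["ts"])
    return arrival - last["eta"]

print(arrival_error("42", "7"))  # real 170 vs. last estimate 172 -> -2
```

The point of the sketch is only the access pattern: each step jumps between different logs and looks backward in time, which is exactly what a strictly forward, time-ordered reader cannot do.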
---
Esa Heikkinen
28.4.2016, 14:44, Michael Segel wrote:
I don’t.
I believe that there have been a couple of hack-a-thons, like one done
in Chicago a few years back, using public transportation data.
The first question is what sort of data do you get from the city?
I mean it could be as simple as time_stamp, bus_id, route and GPS
(x,y). Or they could provide more information. Like last stop,
distance to next stop, avg current velocity…
Then there is the frequency of the updates. Every second? Every 3
seconds? 5 or 6 seconds…
This will determine how much work you have to do.
Maybe they provide the routes of the buses via a different API call,
since they're relatively static.
This will drive your solution more than the underlying technology.
Oh, and while I focused on buses, there are also rail and other modes
of public transportation like light rail, trains, etc…
HTH
-Mike
On Apr 28, 2016, at 4:10 AM, Esa Heikkinen
<esa.heikki...@student.tut.fi> wrote:
Do you know any good examples how to use Spark streaming in tracking
public transportation systems ?
Or Storm or some other tool example ?
Regards
Esa Heikkinen
28.4.2016, 3:16, Michael Segel wrote:
Uhm…
I think you need to clarify a couple of things…
First there is this thing called analog signal processing…. Is that
continuous enough for you?
But more to the point, Spark Streaming does micro-batching, so if
you're processing a continuous stream of tick data, you will have more
than 50K ticks per second while markets are open and trading. Even at
50K a second, that would mean 1 every 0.02 ms, or 50 ticks a ms.
And you don’t want to wait until you have a batch to start
processing, but you want to process when the data hits the queue and
pull it from the queue as quickly as possible.
Spark streaming will be able to pull batches in as little as 500ms.
So if you pull a batch at t0 and immediately have a tick in your
queue, you won’t process that data until t0+500ms. And said batch
would contain 25,000 entries.
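The figures quoted above (50K ticks/s, a 500 ms batch interval) can be checked with a few lines of arithmetic, here in plain Python:

```python
# Check the quoted figures: 50K ticks/s against 500 ms micro-batches.
ticks_per_second = 50_000
ms_per_tick = 1000 / ticks_per_second        # 0.02 ms between ticks
ticks_per_ms = ticks_per_second / 1000       # 50 ticks per millisecond
batch_interval_ms = 500                      # smallest Spark Streaming batch
ticks_per_batch = ticks_per_ms * batch_interval_ms  # 25,000 per batch
print(ms_per_tick, ticks_per_ms, ticks_per_batch)
```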
Depending on what you are doing… that 500ms delay can be enough to
be fatal to your trading process.
If you don’t like stock data, there are other examples mainly when
pulling data from real time embedded systems.
If you go back and read what I said: if your data flow is much slower
than one event per 500ms, and/or the time to process is much longer
than 500ms, you could use Spark Streaming. If not… and there are
applications which require that type of speed… then you shouldn't use
Spark Streaming.
If you do have that constraint, then you can look at systems like
Storm/Flink/Samza/whatever, where you have a continuous queue and
listener and no micro-batch delays.
Then for each bolt (storm) you can have a spark context for
processing the data. (Depending on what sort of processing you want
to do.)
To put this in perspective… if you’re using spark streaming / akka /
storm /etc to handle real time requests from the web, 500ms added
delay can be a long time.
Choose the right tool.
For the OP's problem: sure, tracking public transportation could be
done using Spark Streaming. It could also be done using half a dozen
other tools, because the data is generated much less often than every
500ms.
HTH
On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
A couple of things.
There is no such thing as Continuous Data Streaming, just as there is
no such thing as Continuous Availability.
There is such a thing as Discrete Data Streaming and High
Availability, but they reduce the finite unavailability to a minimum.
In terms of business needs, 5 SIGMA is good enough and acceptable.
Even candles set to a predefined time interval, say 2, 4, or 15
seconds, overlap. No FX-savvy trader makes a sell or buy decision on
the basis of a 2-second candlestick.
The calculation itself in measurements is subject to finite error, as
defined by its Confidence Level (CL) using the standard deviation
function.
OK, so far I have never seen a tool that requires that level of
granularity. That stuff from Flink etc. is in practical terms of
little value and does not make commercial sense.
Now with regard to your needs, Spark micro batching is perfectly
adequate.
HTH
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
On 27 April 2016 at 22:10, Esa Heikkinen
<esa.heikki...@student.tut.fi> wrote:
Hi
Thanks for the answer.
I have developed a log file analyzer for an RTPIS (Real Time
Passenger Information System), where buses drive lines and the system
tries to estimate the arrival times at the bus stops. There are many
different log files (and events), and the analysis situation can be
very complex. Spatial data can also be included in the log data.
The analyzer also has a query (or analysis) language, which describes
an expected behavior. This can be a requirement of the system. The
analyzer can also be thought of as a test oracle.
I have published a paper (at the SPLST'15 conference) about my
analyzer and its language. The paper is maybe too technical, but it
can be found at:
http://ceur-ws.org/Vol-1525/paper-19.pdf
I do not know yet where it belongs. I think it can be some kind of
"CEP with delays". Or do you know better?
My analyzer can also do somewhat more complex and time-consuming
analyses; there is no need for real time.
Is it possible to do "CEP with delays" reasonably with some existing
analyzer (for example Spark)?
Regards
PhD student at Tampere University of Technology, Finland,
www.tut.fi
Esa Heikkinen
27.4.2016, 15:51, Michael Segel wrote:
Spark and CEP? It depends…
Ok, I know that's not the answer you want to hear, but it's a bit
more complicated…
If you consider Spark Streaming, you have some issues.
Spark Streaming isn't a real-time solution because it is a micro-batch
solution. The smallest window is 500ms. This means that this could
work if your compute time is >> 500ms and/or your event interval is
>> 500ms.
(e.g. 'real time' image processing on a system that is
capturing 60FPS because the processing time is >> 500ms. )
So Spark Streaming wouldn’t be the best solution….
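The micro-batch delay being discussed here can be modeled in a few lines of plain Python (this is only a toy model of batch boundaries, not Spark API code): an event that arrives just after a batch is formed waits almost a full batch interval before it is processed.

```python
# Toy model of micro-batch latency: events are only processed at batch
# boundaries, so the added latency is the time until the next boundary.
BATCH_MS = 500  # smallest Spark Streaming batch interval

def added_latency_ms(arrival_ms):
    # Time from an event's arrival until the next batch boundary.
    return (BATCH_MS - arrival_ms % BATCH_MS) % BATCH_MS

print(added_latency_ms(1))    # arrives just after t0 -> prints 499
print(added_latency_ms(499))  # arrives just before a boundary -> prints 1
```

A true continuous-streaming system (Storm, Flink, etc.) would instead process each event as it arrives, so this boundary-induced wait does not appear.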
However, you can combine Spark with other technologies like Storm,
Akka, etc., where you have continuous streaming. So you could
instantiate a Spark context per worker in Storm…
I think that if there are no class collisions between Akka and Spark,
you could use Akka, which may have better potential for communication
between workers.
So here you can handle CEP events.
HTH
On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:
please see my other reply
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
On 27 April 2016 at 10:40, Esa Heikkinen
<esa.heikki...@student.tut.fi> wrote:
Hi
I have followed the discussion about CEP and Spark with interest. It
is quite close to my research, which is complex analysis of log files
and "history" data (not actually of real-time streams).
I have a few questions:
1) Is CEP only for (real-time) stream data and not for "history" data?
2) Is it possible to search "backward" (upstream) with CEP within a
given time window, i.e. if the start time of the time window is
earlier than the current stream time?
3) Do you know any good tools or software for "CEP" on log data?
4) Do you know any good (scientific) papers I should read about CEP?
Regards
PhD student at Tampere University of Technology, Finland,
www.tut.fi
Esa Heikkinen
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
The opinions expressed here are mine, while they may reflect a
cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com