Do you know of any good examples of using Spark Streaming to track public transportation systems?

Or an example using Storm or some other tool?

Regards
Esa Heikkinen

28.4.2016, 3:16, Michael Segel wrote:
Uhm…
I think you need to clarify a couple of things…

First there is this thing called analog signal processing…. Is that continuous enough for you?

But more to the point, Spark Streaming does micro-batching, so if you’re processing a continuous stream of tick data, you will have more than 50K ticks per second while markets are open and trading. Even at 50K a second, that means one tick every 0.02 ms, or 50 ticks per ms.

And you don’t want to wait until you have a batch to start processing, but you want to process when the data hits the queue and pull it from the queue as quickly as possible.

Spark streaming will be able to pull batches in as little as 500ms. So if you pull a batch at t0 and immediately have a tick in your queue, you won’t process that data until t0+500ms. And said batch would contain 25,000 entries.

Depending on what you are doing… that 500ms delay can be enough to be fatal to your trading process.
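
The arithmetic above can be checked with a few lines of Python. This is only a sketch using the figures quoted in this message (50,000 ticks per second, 500 ms batch interval); the function names are made up for illustration and are not part of any Spark API.

```python
# Figures taken from the message: 50,000 ticks/s and a 500 ms batch interval.
TICKS_PER_SECOND = 50_000
BATCH_INTERVAL_MS = 500


def ticks_per_ms(ticks_per_second: int) -> float:
    """Average number of ticks arriving in one millisecond."""
    return ticks_per_second / 1000


def ms_between_ticks(ticks_per_second: int) -> float:
    """Average gap between two consecutive ticks, in milliseconds."""
    return 1000 / ticks_per_second


def ticks_per_batch(ticks_per_second: int, batch_interval_ms: int) -> int:
    """Number of ticks that accumulate during one batch interval."""
    return ticks_per_second * batch_interval_ms // 1000


def worst_case_wait_ms(batch_interval_ms: int) -> int:
    """A tick that arrives right after a batch is pulled waits one full interval."""
    return batch_interval_ms


if __name__ == "__main__":
    print(ticks_per_ms(TICKS_PER_SECOND))                        # 50.0
    print(ms_between_ticks(TICKS_PER_SECOND))                    # 0.02
    print(ticks_per_batch(TICKS_PER_SECOND, BATCH_INTERVAL_MS))  # 25000
    print(worst_case_wait_ms(BATCH_INTERVAL_MS))                 # 500
```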

If you don’t like stock data, there are other examples, mainly when pulling data from real-time embedded systems.


If you go back and read what I said: if the interval between your events is >> 500ms (much slower) and/or the time to process is >> 500ms (much longer), you could use Spark Streaming. If not… and there are applications which require that type of speed… then you shouldn’t use Spark Streaming.

If you do have that constraint, then you can look at systems like Storm, Flink, Samza or whatever, where you have a continuous queue and listener and no micro-batch delays. Then for each bolt (Storm) you can have a Spark context for processing the data. (Depending on what sort of processing you want to do.)

To put this in perspective… if you’re using Spark Streaming, Akka, Storm, etc. to handle real-time requests from the web, 500ms of added delay can be a long time.

Choose the right tool.

For the OP’s problem: sure, tracking public transportation could be done using Spark Streaming. It could also be done using half a dozen other tools, because the interval between data points is much longer than 500ms.
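
As a rough illustration of why micro-batching is enough for this kind of data, here is a sketch in plain Python that groups bus position reports into 500 ms batches and keeps the newest report per bus. It deliberately does not use the Spark API; the `Position` fields and both function names are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Position(NamedTuple):
    bus_id: str   # hypothetical vehicle identifier
    ts_ms: int    # report timestamp in milliseconds
    lat: float
    lon: float


def micro_batches(events: Iterable[Position], batch_ms: int = 500) -> Dict[int, List[Position]]:
    """Group events by the batch interval their timestamp falls into."""
    batches: Dict[int, List[Position]] = defaultdict(list)
    for ev in events:
        batches[ev.ts_ms // batch_ms].append(ev)
    return dict(batches)


def latest_positions(events: Iterable[Position], batch_ms: int = 500) -> Dict[str, Position]:
    """Process batches in time order, keeping the newest report per bus."""
    latest: Dict[str, Position] = {}
    batches = micro_batches(events, batch_ms)
    for batch_no in sorted(batches):
        for ev in batches[batch_no]:
            current = latest.get(ev.bus_id)
            if current is None or ev.ts_ms >= current.ts_ms:
                latest[ev.bus_id] = ev
    return latest


if __name__ == "__main__":
    reports = [
        Position("12", 0, 61.4978, 23.7610),
        Position("7", 400, 61.5000, 23.7700),
        Position("12", 1000, 61.4985, 23.7625),
    ]
    for bus_id, pos in sorted(latest_positions(reports).items()):
        print(bus_id, pos.ts_ms, pos.lat, pos.lon)
```

If a bus reports its position only every second or so, most 500 ms batches hold at most one report per vehicle, so the batch delay is small compared with the data interval.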

HTH


On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

A couple of things.

There is no such thing as Continuous Data Streaming, just as there is no such thing as Continuous Availability.

There is such a thing as Discrete Data Streaming and High Availability, but they only reduce the finite unavailability to a minimum. In terms of business needs, 5 sigma is good enough and acceptable. Even candles set to a predefined time interval, say 2, 4 or 15 seconds, overlap. No FX-savvy trader makes a sell or buy decision on the basis of a 2-second candlestick.

The calculation itself is subject to finite measurement error, as defined by its Confidence Level (CL) using the standard deviation function.

OK, so far I have never come across a tool that requires that level of granularity. The stuff from Flink etc. is, in practical terms, of little value and does not make commercial sense.

Now with regard to your needs, Spark micro batching is perfectly adequate.

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:


    Hi

    Thanks for the answer.

    I have developed a log file analyzer for an RTPIS (Real Time
    Passenger Information System), where buses drive their lines and
    the system tries to estimate the arrival times at the bus stops.
    There are many different log files (and events), and the analysis
    can be very complex. Spatial data can also be included
    in the log data.

    The analyzer also has a query (or analysis) language, which
    describes an expected behavior. This can be a requirement of the
    system. The analyzer can also be thought of as a test oracle.

    I have published a paper (SPLST'15 conference) about my analyzer
    and its language. The paper may be too technical, but it can be found at:
    http://ceur-ws.org/Vol-1525/paper-19.pdf

    I do not know yet where it belongs. I think it could be some kind of "CEP
    with delays". Or do you know better?
    My analyzer can also do somewhat more complex and
    time-consuming analyses. There is no need for real time.

    And is it possible to do "CEP with delays" reasonably with some
    existing analyzer (for example Spark)?
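
To make the idea of checking an expected behavior against stored logs concrete, here is a minimal sketch in plain Python. It is not the analyzer or the query language from the paper; the event kinds (`PREDICTED`, `ARRIVED`), the fields, and the 120-second tolerance are all assumptions invented for this example. The rule it checks: every predicted arrival should be matched by a logged arrival of the same bus at the same stop within the tolerance.

```python
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    kind: str     # "PREDICTED" or "ARRIVED" (hypothetical event names)
    bus_id: str
    stop_id: str
    ts: int       # seconds; for PREDICTED this is the estimated arrival time


def check_predictions(log: List[Event], tolerance_s: int = 120) -> List[Tuple[Event, str]]:
    """Return the predictions that violate the expected behavior, with a reason."""
    # Index all logged arrivals by (bus, stop) so each prediction can be matched.
    arrivals: Dict[Tuple[str, str], List[int]] = {}
    for ev in log:
        if ev.kind == "ARRIVED":
            arrivals.setdefault((ev.bus_id, ev.stop_id), []).append(ev.ts)

    violations: List[Tuple[Event, str]] = []
    for ev in log:
        if ev.kind != "PREDICTED":
            continue
        times = arrivals.get((ev.bus_id, ev.stop_id), [])
        if not times:
            violations.append((ev, "no arrival logged"))
        elif min(abs(t - ev.ts) for t in times) > tolerance_s:
            violations.append((ev, "arrival outside tolerance"))
    return violations


if __name__ == "__main__":
    log = [
        Event("PREDICTED", "12", "A", 1000),
        Event("ARRIVED", "12", "A", 1060),
        Event("PREDICTED", "12", "B", 2000),
        Event("ARRIVED", "12", "B", 2500),
        Event("PREDICTED", "7", "A", 3000),
    ]
    for event, reason in check_predictions(log):
        print(event.bus_id, event.stop_id, reason)
```

Because the whole log is available up front, the check can look both forward and backward in time, which is what distinguishes this kind of offline analysis from processing a live stream.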

    Regards
    PhD student at Tampere University of Technology, Finland,
    www.tut.fi
    Esa Heikkinen

    27.4.2016, 15:51, Michael Segel wrote:
    Spark and CEP? It depends…

    Ok, I know that’s not the answer you want to hear, but it’s a bit
    more complicated…

    If you consider Spark Streaming, you have some issues.
    Spark Streaming isn’t a real-time solution because it is a micro-
    batch solution. The smallest window is 500ms. This means that
    if your compute time is >> 500ms and/or the interval between your
    events is >> 500ms, this could work.
    (e.g. 'real time' image processing on a system that is capturing
    60 FPS, because the processing time is >> 500ms.)

    So Spark Streaming wouldn’t be the best solution….

    However, you can combine spark with other technologies like
    Storm, Akka, etc .. where you have continuous streaming.
    So you could instantiate a spark context per worker in storm…

    I think if there are no class collisions between Akka and Spark,
    you could use Akka, which may have a better potential for
    communication between workers.
    So here you can handle CEP events.

    HTH

    On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh
    <mich.talebza...@gmail.com> wrote:

    please see my other reply

    Dr Mich Talebzadeh



    On 27 April 2016 at 10:40, Esa Heikkinen
    <esa.heikki...@student.tut.fi> wrote:

        Hi

        I have followed with interest the discussion about CEP and
        Spark. It is quite close to my research, which is a complex
        analyzing for log files and "history" data  (not actually
        for real time streams).

        I have a few questions:

        1) Is CEP only for (real-time) stream data and not for
        "history" data?

        2) Is it possible to search "backward" (upstream) with CEP
        over a given time window, i.e. when the start time of the
        window is earlier than the current stream time?

        3) Do you know any good tools or software for applying CEP
        to log data?

        4) Do you know any good (scientific) papers I should read
        about CEP?
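
For question 2, when the data is a stored log rather than a live stream, a backward search over a time window is simple because every earlier event is still available. The sketch below shows one way to do it in plain Python with the standard `bisect` module; the event layout (a list of `(timestamp, payload)` tuples sorted by timestamp) and the function names are assumptions made for this example, and it says nothing about how a particular CEP engine handles this.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, payload); layout is hypothetical


def events_in_window(events: List[Event], start_ts: int, end_ts: int) -> List[Event]:
    """Return events with start_ts <= timestamp <= end_ts.

    `events` must already be sorted by timestamp.
    """
    timestamps = [ts for ts, _ in events]
    lo = bisect_left(timestamps, start_ts)   # first index with ts >= start_ts
    hi = bisect_right(timestamps, end_ts)    # first index with ts > end_ts
    return events[lo:hi]


def look_back(events: List[Event], current_ts: int, window_s: int) -> List[Event]:
    """Search 'backward': the window starts window_s seconds before current_ts."""
    return events_in_window(events, current_ts - window_s, current_ts)


if __name__ == "__main__":
    log = [(10, "a"), (20, "b"), (30, "c"), (40, "d")]
    print(look_back(log, current_ts=35, window_s=20))  # [(20, 'b'), (30, 'c')]
```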


        Regards
        PhD student at Tampere University of Technology, Finland,
        www.tut.fi
        Esa Heikkinen




    The opinions expressed here are mine, while they may reflect a
    cognitive thought, that is purely accidental.
    Use at your own risk.
    Michael Segel
    michael_segel (AT) hotmail.com







