Also, regarding the point "First there is this thing called analog signal processing…. Is that continuous enough for you?"
I agree that analog signal processing (a sine wave, an AM radio signal) is truly continuous. However, here we are talking about digital data, which will always be sent as bytes, typically with the bytes grouped into messages. In other words, when we send data it is never truly continuous. We are sending discrete messages.

HTH,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

On 28 April 2016 at 17:22, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> In a commercial (C)EP engine such as StreamBase, or its competitor Apama, the arrival of an input event *immediately* triggers further downstream processing.
>
> This is admittedly an asynchronous approach, not a synchronous clock-driven micro-batch approach like Spark's.
>
> I suppose if one wants to split hairs / be philosophical, the clock rate of the microprocessor chip underlies everything. But I don't think that is quite the point.
>
> The point is that an asynchronous event-driven approach is as continuous / immediate as *the underlying computer hardware will ever allow*. It is not limited by an architectural software clock.
>
> So it is asynchronous vs synchronous that is the key issue, not just the exact speed of the software clock in the synchronous approach.
>
> It is also true that latencies down to the single-digit microsecond level can sometimes matter in financial trading, but rarely.
>
> HTH
>
> Dr Mich Talebzadeh
>
> http://talebzadehmich.wordpress.com
>
> On 28 April 2016 at 12:44, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> I don’t.
>> I believe that there have been a couple of hackathons, like one done in Chicago a few years back, using public transportation data.
>>
>> The first question is what sort of data you get from the city.
>>
>> It could be as simple as time_stamp, bus_id, route and GPS (x, y). Or they could provide more information, like last stop, distance to next stop, average current velocity…
>>
>> Then there is the frequency of the updates. Every second? Every 3 seconds? 5 or 6 seconds…
>>
>> This will determine how much work you have to do.
>>
>> Maybe they provide the routes of the buses via a different API call, since that data is relatively static.
>>
>> This will drive your solution more than the underlying technology.
>>
>> Oh, and while I focused on buses, there are also other modes of public transportation, like light rail, trains, etc.
>>
>> HTH
>>
>> -Mike
>>
>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>
>> Do you know any good examples of how to use Spark Streaming to track public transportation systems? Or a Storm or some other tool example?
>>
>> Regards
>> Esa Heikkinen
>>
>> On 28.4.2016 at 3:16, Michael Segel wrote:
>>
>> Uhm… I think you need to clarify a couple of things…
>>
>> First there is this thing called analog signal processing…. Is that continuous enough for you?
>>
>> But more to the point, Spark Streaming does micro-batching, so if you’re processing a continuous stream of tick data, you will have more than 50K ticks per second while the markets are open and trading. Even at 50K a second, that is one tick every 0.02 ms, or 50 ticks per ms.
>>
>> And you don’t want to wait until you have a batch to start processing; you want to process the data when it hits the queue and pull it from the queue as quickly as possible.
>>
>> Spark Streaming can pull batches in as little as 500 ms.
So if you pull a batch at t0 and immediately have a tick arrive in your queue, you won’t process that tick until t0 + 500 ms. And that batch would contain 25,000 entries.
>>
>> Depending on what you are doing, that 500 ms delay can be enough to be fatal to your trading process.
>>
>> If you don’t like stock data, there are other examples, mainly pulling data from real-time embedded systems.
>>
>> If you go back and read what I said: if the gap between events is much longer than 500 ms, and/or the time to process is much longer than 500 ms, you could use Spark Streaming. If not, and there are applications which require that type of speed, then you shouldn’t use Spark Streaming.
>>
>> If you do have that constraint, then you can look at systems like Storm / Flink / Samza / whatever, where you have a continuous queue and listener and no micro-batch delays. Then for each bolt (Storm) you can have a Spark context for processing the data. (Depending on what sort of processing you want to do.)
>>
>> To put this in perspective: if you’re using Spark Streaming / Akka / Storm / etc. to handle real-time requests from the web, 500 ms of added delay can be a long time.
>>
>> Choose the right tool.
>>
>> For the OP’s problem: sure, tracking public transportation could be done using Spark Streaming. It could also be done using half a dozen other tools, because the data is generated far less often than every 500 ms.
>>
>> HTH
>>
>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> A couple of things.
>>
>> There is no such thing as continuous data streaming, just as there is no such thing as continuous availability.
>>
>> There is such a thing as discrete data streaming and high availability, which reduce the finite unavailability to a minimum. In terms of business needs, 5 sigma is good enough and acceptable. Even candles set to a predefined time interval of, say, 2, 4 or 15 seconds overlap.
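[Editor's aside: the tick-rate arithmetic quoted above, 50K ticks per second against a 500 ms batch window, can be checked with a short back-of-the-envelope sketch. The figures are the ones given in the thread, not measurements.]

```python
# Back-of-the-envelope check of the micro-batch figures discussed in the thread.
# Assumed inputs (from the thread): 50,000 ticks/s, 500 ms batch window.
tick_rate = 50_000        # ticks per second (assumed market data rate)
batch_window_ms = 500     # smallest Spark Streaming batch interval, per the thread

ticks_per_ms = tick_rate / 1_000                    # ticks arriving each millisecond
batch_size = tick_rate * batch_window_ms // 1_000   # ticks accumulated per batch

print(f"{ticks_per_ms:.0f} ticks/ms")               # 50 ticks per ms
print(f"{batch_size} entries per batch")            # 25,000 entries
print(f"worst-case added latency: {batch_window_ms} ms")
```

This confirms the numbers in the email: 50 ticks per millisecond, and a tick that arrives just after a batch starts waits up to one full 500 ms window, in a batch of 25,000 entries.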
No FX-savvy trader makes a sell or buy decision on the basis of a 2-second candlestick.
>>
>> The calculation itself is subject to finite measurement error, as defined by its confidence level (CL) using the standard deviation.
>>
>> So far I have never come across a tool that requires that level of granularity. That stuff from Flink etc. is, in practical terms, of little value and does not make commercial sense.
>>
>> Now with regard to your needs, Spark micro-batching is perfectly adequate.
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> http://talebzadehmich.wordpress.com
>>
>> On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>
>>> Hi
>>>
>>> Thanks for the answer.
>>>
>>> I have developed a log file analyzer for an RTPIS (Real Time Passenger Information System), where buses drive routes and the system tries to estimate arrival times at the bus stops. There are many different log files (and events), and the analysis situations can be very complex. Spatial data can also be included in the log data.
>>>
>>> The analyzer also has a query (or analysis) language, which describes an expected behavior. This can be a requirement on the system. The analyzer can also be thought of as a test oracle.
>>>
>>> I have published a paper (at the SPLST'15 conference) about my analyzer and its language. The paper is maybe too technical, but it can be found at: http://ceur-ws.org/Vol-1525/paper-19.pdf
>>>
>>> I do not know yet where it belongs. I think it can be some kind of "CEP with delays". Or do you know better?
>>> My analyzer can also do somewhat more complex and time-consuming analyses.
There is no need for real time.
>>>
>>> And is it possible to do "CEP with delays" reasonably with some existing analyzer (for example Spark)?
>>>
>>> Regards
>>> Esa Heikkinen
>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>>
>>> On 27.4.2016 at 15:51, Michael Segel wrote:
>>>
>>> Spark and CEP? It depends…
>>>
>>> OK, I know that’s not the answer you want to hear, but it’s a bit more complicated…
>>>
>>> If you consider Spark Streaming, you have some issues. Spark Streaming isn’t a real-time solution because it is a micro-batch solution; the smallest window is 500 ms. This means it could work if your compute time is much longer than 500 ms and/or the gap between events is much longer than 500 ms (e.g. "real-time" image processing on a system that is capturing 60 FPS, because the processing time is much longer than 500 ms).
>>>
>>> So Spark Streaming wouldn’t be the best solution….
>>>
>>> However, you can combine Spark with other technologies like Storm, Akka, etc., where you have continuous streaming. So you could instantiate a Spark context per worker in Storm…
>>>
>>> I think that if there are no class collisions between Akka and Spark, you could use Akka, which may have better potential for communication between workers. So here you can handle CEP events.
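[Editor's aside: the micro-batch vs. event-driven distinction discussed in this thread can be illustrated with a toy sketch. No real streaming framework is used; the 500 ms window and the event timestamps are illustrative assumptions.]

```python
# Toy model of the latency difference between a clock-driven micro-batch
# system and an idealised event-driven (CEP-style) system.
BATCH_MS = 500  # assumed micro-batch window, as discussed in the thread

def microbatch_processing_time(event_ms: int) -> int:
    """An event waits until the end of the batch window it falls into."""
    return ((event_ms // BATCH_MS) + 1) * BATCH_MS

def event_driven_processing_time(event_ms: int) -> int:
    """An event is handled immediately on arrival (idealised, zero overhead)."""
    return event_ms

# Events arriving early in a window pay nearly the full window as extra latency.
for t in (1, 250, 499, 501):
    delay = microbatch_processing_time(t) - event_driven_processing_time(t)
    print(f"event at t={t} ms: micro-batch adds {delay} ms of latency")
```

An event arriving 1 ms into a window is not processed until t = 500 ms, which is the "t0 + 500 ms" effect described earlier in the thread; an event-driven engine has no such floor, only whatever the hardware and queue impose.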
>>> HTH
>>>
>>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Please see my other reply.
>>>
>>> Dr Mich Talebzadeh
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>> On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>
>>>> Hi
>>>>
>>>> I have followed the discussion about CEP and Spark with interest. It is quite close to my research, which is complex analysis of log files and "history" data (not actually of real-time streams).
>>>>
>>>> I have a few questions:
>>>>
>>>> 1) Is CEP only for (real-time) stream data, and not for "history" data?
>>>>
>>>> 2) Is it possible to search "backward" (upstream) with CEP in a given time window, if the start time of the window is earlier than the current stream time?
>>>>
>>>> 3) Do you know any good tools or software for using "CEP" on log data?
>>>>
>>>> 4) Do you know any good (scientific) papers I should read about CEP?
>>>>
>>>> Regards
>>>> Esa Heikkinen
>>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>>
>>> The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental. Use at your own risk.
>>> Michael Segel
>>> michael_segel (AT) hotmail.com