If you’re getting the logs, then it really isn’t CEP unless you consider the event to be the log from the bus. This doesn’t sound like there is a time constraint.
Your bus schedule is fairly fixed and changes infrequently. Your bus stops are relatively fixed points (within a couple of meters). So you’re taking bus A, which is scheduled to drive route 123, and you want to compare its nearest location to the bus stop at time T and see how close it is to the scheduled route.

Or am I missing something?

-Mike

> On Apr 29, 2016, at 3:54 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>
> Hi
>
> I will try to explain my case.
>
> The situation is not so simple in my logs and solution. There are many types
> of logs, and they come from many sources. They are in CSV format, and the
> header line includes the names of the columns.
>
> This is a simplified description of the input logs.
>
> LOG A's: bus coordinate logs (every bus has its own log):
> - timestamp
> - bus number
> - coordinates
>
> LOG B: bus login/logout (to/from line) message log:
> - timestamp
> - bus number
> - line number
>
> LOG C: log from central computers:
> - timestamp
> - bus number
> - bus stop number
> - estimated arrival time to bus stop
>
> LOG A is updated every 30 seconds (I also have another system with a
> 1-second interval). LOG B is updated when a bus starts from the terminal bus
> stop and when it arrives at the final bus stop of a line. LOG C is updated
> when the central computer sends a new arrival time estimate to a bus stop.
>
> I also need metadata for the logs (and the analyzer), for example the
> coordinates of the bus stop areas.
>
> The main purpose of the analysis is to check the accuracy (error) of the
> estimated arrival time at bus stops.
>
> Because there are many buses and lines, it is too time-consuming to check all
> of them. So I check only specific lines with specific bus stops. Many buses
> (logged in to lines) come to one bus stop, and I am interested in only a
> certain bus.
>
> To do that, I have to read the logs partly out of time order (upstream), in
> this sequence:
> 1. Search LOG C for the bus number
> 2. Search LOG A for when the bus left the terminal bus stop
> 3. Search LOG B for when the bus sent a login to the line
> 4. Search LOG A for when the bus entered the bus stop
> 5. Search LOG C for the last estimated arrival time to the bus stop and
>    calculate the error between the real and estimated values
>
> In my understanding, (almost) all log file analyzers read all data (lines) in
> time order from log files. I need only a specific part of the log (lines).
> To achieve that, my solution is to read the logs in an arbitrary order (within
> a given time window).
>
> I know this solution is not suitable for all cases (for example very fast
> analysis or very big data). It is suited to very complex (targeted) analysis.
> It can be too slow and memory-consuming, but well done pre-processing of the
> log data can help a lot.
>
> ---
> Esa Heikkinen
>
> 28.4.2016, 14:44, Michael Segel wrote:
>> I don’t.
>>
>> I believe that there have been a couple of hackathons, like one done in
>> Chicago a few years back, using public transportation data.
>>
>> The first question is what sort of data do you get from the city?
>>
>> I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).
>> Or they could provide more information. Like last stop, distance to next
>> stop, avg current velocity…
>>
>> Then there is the frequency of the updates. Every second? Every 3 seconds? 5
>> or 6 seconds…
>>
>> This will determine how much work you have to do.
>>
>> Maybe they provide the routes of the buses via a different API call since
>> it’s relatively static.
>>
>> This will drive your solution more than the underlying technology.
>>
>> Oh, and while I focused on buses, there are also rail and other modes of
>> public transportation like light rail, trains, etc.
>>
>> HTH
>>
>> -Mike
>>
>>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>
>>> Do you know any good examples of how to use Spark Streaming for tracking
>>> public transportation systems?
>>>
>>> Or an example with Storm or some other tool?
>>>
>>> Regards
>>> Esa Heikkinen
>>>
>>> 28.4.2016, 3:16, Michael Segel wrote:
>>>> Uhm…
>>>> I think you need to clarify a couple of things…
>>>>
>>>> First there is this thing called analog signal processing… Is that
>>>> continuous enough for you?
>>>>
>>>> But more to the point, Spark Streaming does micro batching, so if you’re
>>>> processing a continuous stream of tick data, you will have more than 50K
>>>> ticks per second while markets are open and trading. Even at 50K a
>>>> second, that would mean one tick every 0.02 ms, or 50 ticks per ms.
>>>>
>>>> And you don’t want to wait until you have a batch to start processing;
>>>> you want to process when the data hits the queue and pull it from the
>>>> queue as quickly as possible.
>>>>
>>>> Spark Streaming will be able to pull batches in as little as 500ms. So if
>>>> you pull a batch at t0 and immediately have a tick in your queue, you
>>>> won’t process that data until t0+500ms. And said batch would contain
>>>> 25,000 entries.
>>>>
>>>> Depending on what you are doing… that 500ms delay can be enough to be
>>>> fatal to your trading process.
>>>>
>>>> If you don’t like stock data, there are other examples, mainly when
>>>> pulling data from real time embedded systems.
>>>>
>>>> If you go back and read what I said, if your data flow is >> (much
>>>> slower) than 500ms, and/or the time to process is >> 500ms (much longer),
>>>> you could use Spark Streaming. If not… and there are applications which
>>>> require that type of speed… then you shouldn’t use Spark Streaming.
>>>>
>>>> If you do have that constraint, then you can look at systems like
>>>> Storm / Flink / Samza / whatever, where you have a continuous queue and
>>>> listener and no micro batch delays.
>>>> Then for each bolt (Storm) you can have a Spark context for processing
>>>> the data. (Depending on what sort of processing you want to do.)
>>>>
>>>> To put this in perspective… if you’re using Spark Streaming / Akka /
>>>> Storm / etc. to handle real time requests from the web, 500ms of added
>>>> delay can be a long time.
>>>>
>>>> Choose the right tool.
>>>>
>>>> For the OP’s problem: sure, tracking public transportation could be done
>>>> using Spark Streaming. It could also be done using half a dozen other
>>>> tools, because the rate of data generation is much slower than 500ms.
>>>>
>>>> HTH
>>>>
>>>>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> A couple of things.
>>>>>
>>>>> There is no such thing as Continuous Data Streaming, just as there is no
>>>>> such thing as Continuous Availability.
>>>>>
>>>>> There is such a thing as Discrete Data Streaming and High Availability,
>>>>> but they reduce the finite unavailability to a minimum. In terms of
>>>>> business needs, 5 SIGMA is good enough and acceptable. Even the candles
>>>>> set to a predefined time interval, say 2, 4, 15 seconds, overlap. No
>>>>> FX-savvy trader makes a sell or buy decision on the basis of a 2-second
>>>>> candlestick.
>>>>>
>>>>> The calculation itself in measurements is subject to finite error as
>>>>> defined by their Confidence Level (CL) using the Standard Deviation
>>>>> function.
>>>>>
>>>>> OK, so far I have never noticed a tool that requires that level of
>>>>> granularity. That stuff from Flink etc. is in practical terms of little
>>>>> value and does not make commercial sense.
>>>>>
>>>>> Now with regard to your needs, Spark micro batching is perfectly adequate.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>> LinkedIn
>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>> On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> Thanks for the answer.
>>>>>
>>>>> I have developed a log file analyzer for an RTPIS (Real Time Passenger
>>>>> Information System), where buses drive lines and the system tries to
>>>>> estimate the arrival times at the bus stops. There are many different log
>>>>> files (and events), and the analysis can be very complex. Spatial data
>>>>> can also be included in the log data.
>>>>>
>>>>> The analyzer also has a query (or analysis) language, which describes an
>>>>> expected behavior. This can be a requirement of the system. The analyzer
>>>>> can also be thought of as a test oracle.
>>>>>
>>>>> I have published a paper (SPLST'15 conference) about my analyzer and its
>>>>> language. The paper is maybe too technical, but it can be found at:
>>>>> http://ceur-ws.org/Vol-1525/paper-19.pdf
>>>>>
>>>>> I do not know yet where it belongs. I think it could be some kind of
>>>>> "CEP with delays". Or do you know better?
>>>>> My analyzer can also do somewhat more complex and time-consuming
>>>>> analyses. There is no need for real time.
>>>>>
>>>>> And is it possible to do "CEP with delays" reasonably with some existing
>>>>> analyzer (for example Spark)?
>>>>>
>>>>> Regards
>>>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>>>> Esa Heikkinen
>>>>>
>>>>> 27.4.2016, 15:51, Michael Segel wrote:
>>>>>> Spark and CEP? It depends…
>>>>>>
>>>>>> Ok, I know that’s not the answer you want to hear, but it’s a bit more
>>>>>> complicated…
>>>>>>
>>>>>> If you consider Spark Streaming, you have some issues.
>>>>>> Spark Streaming isn’t a Real Time solution because it is a micro batch
>>>>>> solution. The smallest window is 500ms. This means that if your compute
>>>>>> time is >> 500ms and/or your event flow is >> 500ms, this could work.
>>>>>> (E.g. 'real time' image processing on a system that is capturing 60FPS,
>>>>>> because the processing time is >> 500ms.)
>>>>>>
>>>>>> So Spark Streaming wouldn’t be the best solution…
>>>>>>
>>>>>> However, you can combine Spark with other technologies like Storm, Akka,
>>>>>> etc., where you have continuous streaming.
>>>>>> So you could instantiate a Spark context per worker in Storm…
>>>>>>
>>>>>> I think if there are no class collisions between Akka and Spark, you
>>>>>> could use Akka, which may have better potential for communication
>>>>>> between workers.
>>>>>> So here you can handle CEP events.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>> Please see my other reply.
>>>>>>>
>>>>>>> Dr Mich Talebzadeh
>>>>>>>
>>>>>>> On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>> I have followed with interest the discussion about CEP and Spark. It is
>>>>>>> quite close to my research, which is complex analysis of log files
>>>>>>> and "history" data (not actually of real time streams).
>>>>>>>
>>>>>>> I have a few questions:
>>>>>>>
>>>>>>> 1) Is CEP only for (real time) stream data and not for "history" data?
>>>>>>>
>>>>>>> 2) Is it possible to search "backward" (upstream) with CEP within a
>>>>>>> given time window, that is, when the start time of the time window is
>>>>>>> earlier than the current stream time?
>>>>>>>
>>>>>>> 3) Do you know any good tools or software for using CEP on log data?
>>>>>>>
>>>>>>> 4) Do you know any good (scientific) papers I should read about CEP?
>>>>>>>
>>>>>>> Regards
>>>>>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>>>>>> Esa Heikkinen
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>
>>>>>>
>>>>>> The opinions expressed here are mine, while they may reflect a cognitive
>>>>>> thought, that is purely accidental.
>>>>>> Use at your own risk.
>>>>>> Michael Segel
>>>>>> michael_segel (AT) hotmail.com
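
For what it's worth, the five-step targeted lookup described in the Apr 29 message can be sketched as a plain batch job over the three CSV logs, with no streaming engine involved. This is only a rough illustration, not Esa's analyzer: the column names (bus, line, stop, lat, lon, eta), the ISO timestamps, the 30 m circular stop area, the sample rows, and the rule "take the bus named in the latest estimate for the stop" are all assumptions made up for the example.

```python
import csv
import io
import math
from datetime import datetime

def read_log(text):
    """Parse a CSV log whose header line names the columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["timestamp"] = datetime.fromisoformat(row["timestamp"])
    return rows

def distance_m(lat1, lon1, lat2, lon2):
    """Approximate distance in meters (equirectangular, fine over short ranges)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = math.radians(lon2 - lon1) * math.cos(mean_lat)
    dy = math.radians(lat2 - lat1)
    return 6371000.0 * math.hypot(dx, dy)

def arrival_error(log_a, log_b, log_c, stops, line, stop, terminal, radius_m=30.0):
    """Walk the logs in the order C, A, B, A, C; raises ValueError if a step finds nothing."""
    def inside(fix, stop_id):
        lat, lon = stops[stop_id]
        return distance_m(float(fix["lat"]), float(fix["lon"]), lat, lon) <= radius_m

    # Step 1 (LOG C): take the bus named in the latest estimate for this stop (assumption).
    ref = max((r for r in log_c if r["stop"] == stop), key=lambda r: r["timestamp"])
    bus = ref["bus"]
    fixes = [f for f in log_a if f["bus"] == bus]
    # Step 2 (LOG A, looking backward): last fix still inside the terminal area.
    departed = max(f["timestamp"] for f in fixes
                   if f["timestamp"] <= ref["timestamp"] and inside(f, terminal))
    # Step 3 (LOG B, looking backward): latest login to the line at or before departure.
    login = max(r["timestamp"] for r in log_b
                if r["bus"] == bus and r["line"] == line and r["timestamp"] <= departed)
    # Step 4 (LOG A, looking forward): first fix inside the target stop area.
    arrived = min(f["timestamp"] for f in fixes
                  if f["timestamp"] > departed and inside(f, stop))
    # Step 5 (LOG C): last estimate sent before arrival; error = actual - estimated.
    last = max((r for r in log_c if r["bus"] == bus and r["stop"] == stop
                and r["timestamp"] <= arrived), key=lambda r: r["timestamp"])
    error_s = (arrived - datetime.fromisoformat(last["eta"])).total_seconds()
    return {"bus": bus, "login": login, "departed": departed,
            "arrived": arrived, "error_s": error_s}

# Made-up sample data: stop-area metadata plus one short trip of bus 12 on line 3.
stops = {"T1": (61.4980, 23.7600), "S5": (61.5000, 23.7700)}
log_a = read_log("""\
timestamp,bus,lat,lon
2016-04-29T08:00:00,12,61.4980,23.7600
2016-04-29T08:00:30,12,61.4980,23.7600
2016-04-29T08:01:00,12,61.4987,23.7630
2016-04-29T08:01:30,12,61.4994,23.7670
2016-04-29T08:02:00,12,61.5000,23.7700
2016-04-29T08:02:30,12,61.5005,23.7730
""")
log_b = read_log("""\
timestamp,bus,line
2016-04-29T07:59:50,12,3
""")
log_c = read_log("""\
timestamp,bus,stop,eta
2016-04-29T08:00:40,12,S5,2016-04-29T08:02:40
2016-04-29T08:01:10,7,S5,2016-04-29T08:05:00
2016-04-29T08:01:20,12,S5,2016-04-29T08:02:20
""")

result = arrival_error(log_a, log_b, log_c, stops, line="3", stop="S5", terminal="T1")
print(result["bus"], result["arrived"], result["error_s"])
```

Everything is held in memory and filtered per step, which matches the "targeted, not fast" trade-off from the thread. With 30-second fixes the arrival time is only known to within one update interval, so the error figure carries that uncertainty.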