If you’re getting the logs, then it really isn’t CEP unless you consider the 
event to be the log from the bus. 
This doesn’t sound like there is a time constraint. 

Your bus schedule is fairly fixed and changes infrequently. 
Your bus stops are relatively fixed points (within a couple of meters). 

So you’re taking bus A, which is scheduled to drive route 123, and you want to 
compare its nearest reported location to the bus stop at time T and see how 
close it is to the scheduled route. 
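A minimal sketch of that comparison, assuming the logs give a GPS fix for the bus and the stop's coordinates are known metadata (all coordinates and names below are made up for illustration):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical data: the bus's reported position at time T, and the stop's
# location from metadata.
bus_pos = (61.4981, 23.7610)
stop_pos = (61.4978, 23.7605)

dist = haversine_m(*bus_pos, *stop_pos)
print(f"bus is {dist:.0f} m from the stop")
```

If the distance stays within the "couple of meters" tolerance of the stop area at the scheduled time, the bus can be considered on schedule.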


Or am I missing something? 

-Mike

> On Apr 29, 2016, at 3:54 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> 
> wrote:
> 
> 
> Hi
> 
> I'll try to explain my case.
> 
> The situation with my logs and solution is not so simple. There are many types 
> of logs, and they come from many sources.
> They are in CSV format, and a header line gives the names of the columns.
> 
> This is a simplified description of the input logs.
> 
> LOG A's: bus coordinate logs (every bus has own log):
> - timestamp
> - bus number
> - coordinates
> 
> LOG B: bus login/logout (to/from line) message log:
> - timestamp
> - bus number
> - line number
> 
> LOG C:  log from central computers:
> - timestamp
> - bus number
> - bus stop number
> - estimated arrival time to bus stop
> 
> The LOG A files are updated every 30 seconds (I also have another system with a 
> 1-second interval). LOG B is updated when a bus starts from the terminal bus 
> stop and when it arrives at the final bus stop of a line. LOG C is updated when 
> the central computer sends a new arrival time estimate for a bus stop.
> 
> I also need metadata for the logs (and the analyzer), for example the 
> coordinates of the bus stop areas.
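As a small illustration of the CSV layout described above (the rows and column names below are invented, and Python's csv module stands in for whatever parser the analyzer actually uses):

```python
import csv
import io

# Hypothetical LOG A content: CSV with a header line naming the columns,
# as described in the mail.
log_a = """timestamp,bus_number,lat,lon
2016-04-27T08:00:00,1042,61.4981,23.7610
2016-04-27T08:00:30,1042,61.4975,23.7601
"""

# DictReader uses the header line as column names, so each record becomes
# a dict keyed by column name -- convenient for many differently-shaped logs.
records = list(csv.DictReader(io.StringIO(log_a)))
for rec in records:
    print(rec["timestamp"], rec["bus_number"], rec["lat"], rec["lon"])
```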
> 
> The main purpose of the analysis is to check the accuracy (error) of the 
> estimated arrival times at bus stops.
> 
> Because there are many buses and lines, it is too time-consuming to check all 
> of them, so I check only specific lines with specific bus stops. Many buses 
> (logged in to lines) arrive at any one bus stop, and I am interested in only a 
> certain bus.
> 
> To do that, I have to read the logs partly out of time order (upstream), in 
> this sequence:
> 1. From LOG C, find the bus number.
> 2. From LOG A, find when the bus left the terminal bus stop.
> 3. From LOG B, find when the bus sent a login to the line.
> 4. From LOG A, find when the bus arrived at the bus stop.
> 5. From LOG C, find the last estimated arrival time for the bus stop, and 
> calculate the error between the real and estimated values.
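The backward ("upstream") lookup and the error calculation could be sketched as below; the log contents, field names, and helper function are hypothetical, and only steps 1, 4 and 5 are shown (steps 2-3 would use the same kind of lookups against LOG A and LOG B):

```python
from datetime import datetime

def t(s):
    return datetime.fromisoformat(s)

# Hypothetical in-memory versions of the logs, already parsed from CSV.
log_a = [  # bus coordinate log, reduced here to "which stop area is the bus in"
    {"ts": t("2016-04-27T08:00:00"), "bus": "1042", "at_stop": "terminal"},
    {"ts": t("2016-04-27T08:14:30"), "bus": "1042", "at_stop": "stop_7"},
]
log_c = [  # central computer's arrival time estimates
    {"ts": t("2016-04-27T08:10:00"), "bus": "1042", "stop": "stop_7",
     "eta": t("2016-04-27T08:13:00")},
]

def last_before(log, ts, **match):
    """Search a log backward ("upstream") for the last event at or before ts
    whose fields match the given values."""
    for ev in reversed(log):
        if ev["ts"] <= ts and all(ev.get(k) == v for k, v in match.items()):
            return ev
    return None

# Step 1: pick the bus of interest from LOG C.
bus = log_c[-1]["bus"]
# Step 4: from LOG A, when did that bus arrive at the stop?
arrival = next(ev for ev in log_a
               if ev["bus"] == bus and ev["at_stop"] == "stop_7")
# Step 5: last estimate sent before the arrival, and the estimation error.
last_est = last_before(log_c, arrival["ts"], bus=bus, stop="stop_7")
error = arrival["ts"] - last_est["eta"]
print(f"bus {bus}: estimation error {error.total_seconds():.0f} s")
```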
> 
> In my understanding, (almost) all log file analyzers read all the data (lines) 
> from the log files in time order. I need only a specific part of the log 
> (lines). To achieve that, my solution is to read the logs in an arbitrary 
> order (within a given time window).
> 
> I know this solution is not suitable for all cases (for example very fast 
> analysis or very big data). It is suitable for very complex (targeted) 
> analysis. It can be slow and memory-consuming, but well-done pre-processing of 
> the log data can help a lot.
> 
> ---
> Esa Heikkinen
> 
> On 28.4.2016 at 14:44, Michael Segel wrote:
>> I don’t.
>> 
>> I believe there have been a couple of hack-a-thons, like one done in Chicago 
>> a few years back, using public transportation data.
>> 
>> The first question is what sort of data do you get from the city? 
>> 
>> I mean it could be as simple as time_stamp, bus_id, route and GPS (x,y).   
>> Or they could provide more information. Like last stop, distance to next 
>> stop, avg current velocity… 
>> 
>> Then there is the frequency of the updates. Every second? Every 3 seconds? 5 
>> or 6 seconds…
>> 
>> This will determine how much work you have to do. 
>> 
>> Maybe they provide the bus routes via a different API call, since that data 
>> is relatively static.
>> 
>> This will drive your solution more than the underlying technology. 
>> 
>> Oh, and while I focused on buses, there are also other modes of public 
>> transportation like light rail, trains, etc. 
>> 
>> HTH
>> 
>> -Mike
>> 
>> 
>>> On Apr 28, 2016, at 4:10 AM, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>> 
>>> 
>>> Do you know any good examples how to use Spark streaming in tracking public 
>>> transportation systems ?
>>> 
>>> Or Storm or some other tool example ?
>>> 
>>> Regards
>>> Esa Heikkinen
>>> 
>>> On 28.4.2016 at 3:16, Michael Segel wrote:
>>>> Uhm… 
>>>> I think you need to clarify a couple of things…
>>>> 
>>>> First there is this thing called analog signal processing…. Is that 
>>>> continuous enough for you? 
>>>> 
>>>> But more to the point, Spark Streaming does micro-batching, so if you’re 
>>>> processing a continuous stream of tick data, you will have more than 50K 
>>>> ticks per second while the markets are open and trading. Even at 50K a 
>>>> second, that means one tick every 0.02 ms, or 50 ticks per ms. 
>>>> 
>>>> And you don’t want to wait until you have a batch to start processing, but 
>>>> you want to process when the data hits the queue and pull it from the 
>>>> queue as quickly as possible. 
>>>> 
>>>> Spark Streaming can pull batches at intervals as small as 500ms. So if you 
>>>> pull a batch at t0 and immediately have a tick in your queue, you won’t 
>>>> process that data until t0+500ms, and that batch would contain 25,000 
>>>> entries. 
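A quick sanity check of those figures (pure arithmetic, no Spark involved):

```python
# At 50,000 ticks per second, how many events land in one 500 ms micro-batch,
# and what is the worst-case added latency for a tick that arrives just after
# a batch is pulled?
ticks_per_second = 50_000
batch_interval_s = 0.5

per_batch = int(ticks_per_second * batch_interval_s)
ms_per_tick = 1000 / ticks_per_second

print(f"{per_batch} entries per batch")          # 25000
print(f"one tick every {ms_per_tick} ms")        # 0.02
print(f"worst-case wait: {batch_interval_s * 1000:.0f} ms")
```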
>>>> 
>>>> Depending on what you are doing… that 500ms delay can be enough to be 
>>>> fatal to your trading process. 
>>>> 
>>>> If you don’t like stock data, there are other examples mainly when pulling 
>>>> data from real time embedded systems. 
>>>> 
>>>> 
>>>> If you go back and read what I said: if the interval between events is >> 
>>>> 500ms (much slower than the batch interval), and/or the time to process is 
>>>> >> 500ms (much longer), you could use Spark Streaming. If not (and there 
>>>> are applications which require that type of speed), then you shouldn’t use 
>>>> Spark Streaming. 
>>>> 
>>>> If you do have that constraint, then you can look at systems like 
>>>> Storm/Flink/Samza/whatever, where you have a continuous queue and listener 
>>>> and no micro-batch delays.
>>>> Then for each bolt (Storm) you can have a Spark context for processing the 
>>>> data. (Depending on what sort of processing you want to do.) 
>>>> 
>>>> To put this in perspective… if you’re using Spark Streaming / Akka / Storm 
>>>> / etc. to handle real-time requests from the web, 500ms of added delay can 
>>>> be a long time. 
>>>> 
>>>> Choose the right tool. 
>>>> 
>>>> For the OP’s problem: sure, tracking public transportation could be done 
>>>> using Spark Streaming. It could also be done using half a dozen other 
>>>> tools, because the interval between data points is much longer than 500ms. 
>>>> 
>>>> HTH
>>>> 
>>>> 
>>>>> On Apr 27, 2016, at 4:34 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>> 
>>>>> A couple of things.
>>>>> 
>>>>> There is no such thing as continuous data streaming, just as there is no 
>>>>> such thing as continuous availability.
>>>>> 
>>>>> There is such a thing as discrete data streaming and high availability, 
>>>>> which reduce the finite unavailability to a minimum. In terms of business 
>>>>> needs, 5 sigma is good enough and acceptable. Even candles set to a 
>>>>> predefined time interval, say 2, 4, or 15 seconds, overlap. No savvy FX 
>>>>> trader makes a sell or buy decision on the basis of a 2-second 
>>>>> candlestick.
>>>>> 
>>>>> The calculation itself in measurements is subject to finite error, as 
>>>>> defined by a confidence level (CL) based on the standard deviation.
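As an illustration of that point, a confidence interval for a measured quantity can be derived from the sample standard deviation; the error values below are invented, and the normal-approximation z of 1.96 is an assumption:

```python
import statistics

# Hypothetical arrival-time errors (seconds) from repeated measurements.
errors = [42.0, 55.0, 38.0, 61.0, 47.0, 50.0, 44.0, 58.0]

mean = statistics.mean(errors)
sd = statistics.stdev(errors)  # sample standard deviation
n = len(errors)

# Approximate 95% confidence interval for the mean (z ~ 1.96),
# assuming the errors are roughly normally distributed.
half_width = 1.96 * sd / (n ** 0.5)
print(f"mean error {mean:.1f} s +/- {half_width:.1f} s (95% CL)")
```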
>>>>> 
>>>>> OK, so far I have never come across a tool that requires that level of 
>>>>> granularity. That stuff from Flink etc. is, in practical terms, of little 
>>>>> value and does not make commercial sense.
>>>>> 
>>>>> Now with regard to your needs, Spark micro batching is perfectly adequate.
>>>>> 
>>>>> HTH
>>>>> 
>>>>> Dr Mich Talebzadeh
>>>>>  
>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> http://talebzadehmich.wordpress.com
>>>>>  
>>>>> 
>>>>> On 27 April 2016 at 22:10, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>>> 
>>>>> Hi
>>>>> 
>>>>> Thanks for the answer.
>>>>> 
>>>>> I have developed a log file analyzer for an RTPIS (Real Time Passenger 
>>>>> Information System), where buses drive lines and the system tries to 
>>>>> estimate the arrival times at the bus stops. There are many different log 
>>>>> files (and events), and the analysis situation can be very complex. 
>>>>> Spatial data can also be included in the log data.
>>>>> 
>>>>> The analyzer also has a query (or analysis) language, which describes an 
>>>>> expected behavior. This can be a requirement of the system. The analyzer 
>>>>> can also be thought of as a test oracle.
>>>>> 
>>>>> I have published a paper (SPLST'15 conference) about my analyzer and its 
>>>>> language. The paper is perhaps too technical, but it can be found at:
>>>>> http://ceur-ws.org/Vol-1525/paper-19.pdf
>>>>> 
>>>>> I do not know yet where it belongs. I think it could be called "CEP with 
>>>>> delays", or do you know a better term?
>>>>> My analyzer can also do somewhat more complex and time-consuming analyses; 
>>>>> there is no need for real time.
>>>>> 
>>>>> And is it possible to do "CEP with delays" reasonably with some existing 
>>>>> analyzer (for example Spark)?
>>>>> 
>>>>> Regards
>>>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>>>> Esa Heikkinen
>>>>> 
>>>>> On 27.4.2016 at 15:51, Michael Segel wrote:
>>>>>> Spark and CEP? It depends… 
>>>>>> 
>>>>>> Ok, I know that’s not the answer you want to hear, but it’s a bit more 
>>>>>> complicated… 
>>>>>> 
>>>>>> If you consider Spark Streaming, you have some issues. 
>>>>>> Spark Streaming isn’t a real-time solution because it is a micro-batch 
>>>>>> solution. The smallest window is 500ms. This means it could work if your 
>>>>>> compute time is >> 500ms and/or your event interval is >> 500ms.
>>>>>> (e.g. 'real time' image processing on a system that is capturing 60FPS, 
>>>>>> where the processing time is >> 500ms.) 
>>>>>> 
>>>>>> So Spark Streaming wouldn’t be the best solution…. 
>>>>>> 
>>>>>> However, you can combine Spark with other technologies like Storm, Akka, 
>>>>>> etc., where you have continuous streaming. 
>>>>>> So you could instantiate a Spark context per worker in Storm… 
>>>>>> 
>>>>>> I think if there are no class collisions between Akka and Spark, you 
>>>>>> could use Akka, which may have a better potential for communication 
>>>>>> between workers. 
>>>>>> So here you can handle CEP events. 
>>>>>> 
>>>>>> HTH
>>>>>> 
>>>>>>> On Apr 27, 2016, at 7:03 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>> 
>>>>>>> please see my other reply
>>>>>>> 
>>>>>>> Dr Mich Talebzadeh
>>>>>>>  
>>>>>>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>> http://talebzadehmich.wordpress.com
>>>>>>> 
>>>>>>> On 27 April 2016 at 10:40, Esa Heikkinen <esa.heikki...@student.tut.fi> wrote:
>>>>>>> Hi
>>>>>>> 
>>>>>>> I have followed the discussion about CEP and Spark with interest. It is 
>>>>>>> quite close to my research, which is complex analysis of log files and 
>>>>>>> "history" data (not actually of real-time streams).
>>>>>>> 
>>>>>>> I have few questions:
>>>>>>> 
>>>>>>> 1) Is CEP only for (real-time) stream data, and not for "history" data?
>>>>>>> 
>>>>>>> 2) Is it possible to search "backward" (upstream) with CEP, given a time 
>>>>>>> window whose start time is earlier than the current stream time?
>>>>>>> 
>>>>>>> 3) Do you know any good tools or software for "CEP" on log data?
>>>>>>> 
>>>>>>> 4) Do you know any good (scientific) papers I should read about CEP?
>>>>>>> 
>>>>>>> 
>>>>>>> Regards
>>>>>>> PhD student at Tampere University of Technology, Finland, www.tut.fi
>>>>>>> Esa Heikkinen
>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> The opinions expressed here are mine; if they reflect a cognitive 
>>>>>> thought, that is purely accidental. 
>>>>>> Use at your own risk. 
>>>>>> Michael Segel
>>>>>> michael_segel (AT) hotmail.com
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
> 
