Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread andy petrella
Actually, for click stream, the user space wouldn't be a continuum, unless the order of users is important or the fact that they arrive in some kind of order can be used by the algo. The purpose of the break or binning function is to package things in a cluster for which we know the properties, bu

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread Mayur Rustagi
Interesting, clickstream data would have its own window concept based on the user's session. I can imagine windows would change across streams, but wouldn't they largely be domain-specific in nature?
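To make the session-based window idea concrete, here is a minimal sketch in plain Python (not Spark, and not taken from the thread). It assumes events are `(user, timestamp)` pairs and that a session ends when a user is idle for longer than a chosen gap; the `sessionize` helper, the `gap` parameter and the sample data are all hypothetical.

```python
from collections import defaultdict


def sessionize(events, gap):
    """Group (user, timestamp) events into per-user sessions.

    Assumption for this sketch: a new session starts whenever the time
    between two consecutive events of the same user exceeds `gap`.
    """
    # Collect timestamps per user; arrival order is not assumed to be sorted.
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for ts in stamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                # Idle for too long: close the current session, open a new one.
                user_sessions.append([ts])
            else:
                user_sessions[-1].append(ts)
        sessions[user] = user_sessions
    return sessions


# Hypothetical click events: (user, timestamp in seconds).
clicks = [("alice", 0), ("bob", 5), ("alice", 10), ("alice", 100), ("bob", 8)]
result = sessionize(clicks, gap=30)
```

Here the window boundaries depend on each user's own activity rather than on a fixed batch interval, which is why such windows tend to be domain-specific.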

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-08-01 Thread andy petrella
Heya, Dunno if these ideas are still in the air or felt in the warp ^^. However, there is a paper on avocado that mentions a way of working with their data (sequence reads) in a windowed manner without n

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread andy petrella
Indeed, these two cases are tightly coupled (the first one is a special case of the second). Actually, these "outliers" could be handled by a dedicated function, which I named outliersManager -- I was not so much inspired ^^, but we could name these outliers "outlaws" and thus the function would be

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread Tathagata Das
I think it makes sense, though without a concrete implementation it's hard to be sure. Applying sorting on the RDD according to the RDDs makes sense, but I can think of two kinds of fundamental problems. 1. How do you deal with ordering across RDD boundaries? Say two consecutive RDDs in the DStream

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread andy petrella
Heya TD, Thanks for the detailed answer! Much appreciated. Regarding order among elements within an RDD, you're definitely right, it'd kill the parallelism and would require synchronization, which is completely avoided in a distributed environment. That's why I won't push this constraint to the RDDs themselve

Re: [brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-15 Thread Tathagata Das
Very interesting ideas, Andy! Conceptually I think it makes sense. In fact, it is true that dealing with time series data, windowing over application time, and windowing over number of events are things that DStream does not natively support. The real challenge is actually mapping the conceptual windo

[brainsotrming] Generalization of DStream, a ContinuousRDD ?

2014-07-15 Thread andy petrella
Dear Sparkers, [sorry for the lengthy email... => head to the gist for a preview :-p] I would like to share some thinking I had due to a use case I faced. Basically, as the subject announces, it's a generalization of the DStream c