I would like to use Spark (and Spark Streaming) to do some processing on time series. I have text files with many lines, where each line contains a timestamp and the values associated with that timestamp. Each timestamp is unique, and I treat the timestamps as keys. The lines in my text files are already ordered by timestamp.
I am looking for a neat way to leverage this order in my Spark programs, and all of my questions relate to this. I am using "sc.textFile(..)" and doing transformations with .map(), .join(), etc. I am able to split my data (e.g. per day) with a custom partitioner. However, at some point the initial ordering is invariably lost. Currently, this forces me to call ".sortByKey()", but I have the impression that this approach is far from optimal. I would prefer to preserve ordering information whenever possible, instead of losing it and recomputing it later.

- Is there a description of which functions lose the order and which preserve it? (As far as I understand, map() should preserve the order, for instance.) I would like to understand when (and why) the order cannot be preserved in Spark.
- I think that many distributed algorithms (e.g. joins) could be much faster if they took advantage of the fact that keys are ordered. Is there a way to specify this in Spark?
- I would like to implement algorithms that traverse my time series data in order, with a sliding window over time, just as with "reduceByWindow()" in Spark Streaming, but taking order into account. I need to compute non-associative functions over these rolling windows. This seems difficult with the current versions of Spark/Spark Streaming, which have no notion of order and therefore limit computable functions to associative ones. Am I missing something here?
- Are there recommended ways to deal with ordered data (and keys), such as time series data, in Spark/Spark Streaming?

Thank you for any hint.

Best regards,
Pierre
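
P.S. To make the sliding-window question more concrete, here is a minimal plain-Python sketch (no Spark involved) of the kind of computation I have in mind. The (timestamp, value) record layout, the names `rolling` and `ewma`, the sample series and the window width are all made up for illustration. The exponentially weighted average is only one example of a function whose result depends on the order of its inputs.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical record layout: (timestamp in seconds, value).
Record = Tuple[int, float]


def rolling(records: Iterable[Record],
            width: int,
            fn: Callable[[List[float]], float]) -> Iterator[Tuple[int, float]]:
    """For each record at time t, apply fn to the values whose timestamps
    fall in (t - width, t]. Assumes records arrive sorted by timestamp."""
    window: deque = deque()
    for t, v in records:
        window.append((t, v))
        # Evict records that have fallen out of the time window.
        while window[0][0] <= t - width:
            window.popleft()
        yield t, fn([value for _, value in window])


def ewma(values: List[float], alpha: float = 0.5) -> float:
    """Exponentially weighted average, folded from oldest to newest value.
    Reordering the inputs changes the result."""
    acc = values[0]
    for v in values[1:]:
        acc = alpha * v + (1 - alpha) * acc
    return acc


# Illustrative data only, already sorted by timestamp.
series = [(0, 1.0), (1, 2.0), (2, 4.0), (5, 8.0)]
result = list(rolling(series, width=3, fn=ewma))
```

On a single machine this is trivial because the input is consumed in order. What I do not see is how to express the same traversal on an RDD or DStream without first re-sorting, and how to handle windows that straddle partition boundaries.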