Re: Use Spark Streaming for Batch?

2015-02-22 Thread Tobias Pfeiffer
Hi,

On Sat, Feb 21, 2015 at 1:05 AM, craigv craigvanderbo...@gmail.com wrote:
  /Might it be possible to perform large-batch processing on HDFS time
  series data using Spark Streaming?/

  1. I understand that there is not currently an InputDStream that could
  do what's needed. I would have to create such a thing.
  2. Time is a problem. I would have to use the timestamps on our events
  for any time-based logic and state management.
  3. The batch duration would become meaningless in this scenario. Could I
  just set it to something really small (say 1 second) and then let it
  fall behind, processing the data as quickly as it could?


So, if it is not an issue for you that everything is processed in one
batch, you can use streamingContext.textFileStream(hdfsDirectory). This
will create a DStream whose first batch contains one huge RDD with all
the data, followed by empty batches. A small batch size should not be a
problem.
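
An untested sketch of that approach, assuming Spark 1.x and a
hypothetical hdfs:///data/events directory; depending on file
modification times you may need the fileStream variant with
newFilesOnly = false to pick up files that already existed when the
context started:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("BatchViaStreaming")
    val ssc = new StreamingContext(conf, Seconds(1)) // small batch duration

    // Everything already in the directory shows up in the first batch;
    // subsequent batches stay empty unless new files arrive.
    val lines = ssc.textFileStream("hdfs:///data/events") // hypothetical path
    lines.foreachRDD { rdd =>
      println(s"got ${rdd.count()} lines") // process the one huge RDD here
    }

    ssc.start()
    ssc.awaitTermination()
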
An alternative would be to write some code that creates one RDD per file
in your HDFS directory, build a Queue of those RDDs, and then use
streamingContext.queueStream(), possibly with the oneAtATime=true
parameter (which processes only one queued RDD per batch); see the
sketch below.
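
A rough sketch of the queue variant, with the same hypothetical
directory; ssc is the StreamingContext from the previous sketch:

    import scala.collection.mutable
    import org.apache.hadoop.fs.{FileSystem, Path}

    // List the input files and wrap each one in its own RDD.
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    val files = fs.listStatus(new Path("hdfs:///data/events"))
      .map(_.getPath.toString)
    val queue = mutable.Queue(files.map(f => ssc.sparkContext.textFile(f)): _*)

    // oneAtATime = true dequeues exactly one RDD (one file) per batch.
    val oneFilePerBatch = ssc.queueStream(queue, oneAtATime = true)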

However, doing window computations etc. with the timestamps embedded *in*
your data will be a major effort: you cannot use the existing windowing
functionality from Spark Streaming, which is driven by the batch clock
rather than by event time. If you want to read more about that, there
have been a number of discussions on this topic on this list; maybe you
can look them up.
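
For illustration only: event-time bucketing can be rebuilt with plain
RDD operations instead of DStream windows. An untested sketch, with a
hypothetical Event type and 5-minute tumbling windows:

    import org.apache.spark.rdd.RDD

    // Hypothetical event type; ts is an epoch-millisecond timestamp
    // embedded in the data, not taken from the batch clock.
    case class Event(ts: Long, key: String, value: Double)

    // Truncate a timestamp to the start of its tumbling window.
    def windowStart(ts: Long, windowMs: Long): Long = ts - (ts % windowMs)

    // Aggregate per (window, key) with an ordinary shuffle.
    def windowed(events: RDD[Event]): RDD[((Long, String), Double)] =
      events
        .map(e => ((windowStart(e.ts, 5 * 60 * 1000L), e.key), e.value))
        .reduceByKey(_ + _)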

Tobias


Re: Use Spark Streaming for Batch?

2015-02-22 Thread Soumitra Kumar
See if https://issues.apache.org/jira/browse/SPARK-3660 helps you. My
patch has been accepted, and the enhancement is scheduled for 1.3.0.

It lets you specify an initial RDD for the updateStateByKey operation.
Let me know if you need any more information.
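
An untested sketch of the new overload, assuming an existing
DStream[(String, Long)] called pairs and a hypothetical file of saved
state; the overload that takes an initial RDD also takes a Partitioner:

    import org.apache.spark.HashPartitioner

    // Hypothetical saved state: "key,count" lines on HDFS.
    val initialState = ssc.sparkContext
      .textFile("hdfs:///data/state") // hypothetical path
      .map { line => val Array(k, v) = line.split(","); (k, v.toLong) }

    def updateFunc(values: Seq[Long], state: Option[Long]): Option[Long] =
      Some(values.sum + state.getOrElse(0L))

    // pairs is assumed to be an existing DStream[(String, Long)].
    val counts = pairs.updateStateByKey[Long](
      updateFunc _,
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      initialState)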

On Sun, Feb 22, 2015 at 5:21 PM, Tobias Pfeiffer t...@preferred.jp wrote:

 Hi,

 On Sat, Feb 21, 2015 at 1:05 AM, craigv craigvanderbo...@gmail.com
 wrote:
  /Might it be possible to perform large-batch processing on HDFS time
  series data using Spark Streaming?/

  1. I understand that there is not currently an InputDStream that could
  do what's needed. I would have to create such a thing.
  2. Time is a problem. I would have to use the timestamps on our events
  for any time-based logic and state management.
  3. The batch duration would become meaningless in this scenario. Could I
  just set it to something really small (say 1 second) and then let it
  fall behind, processing the data as quickly as it could?


 So, if it is not an issue for you that everything is processed in one
 batch, you can use streamingContext.textFileStream(hdfsDirectory). This
 will create a DStream whose first batch contains one huge RDD with all
 the data, followed by empty batches. A small batch size should not be a
 problem.
 An alternative would be to write some code that creates one RDD per file
 in your HDFS directory, build a Queue of those RDDs, and then use
 streamingContext.queueStream(), possibly with the oneAtATime=true
 parameter (which processes only one queued RDD per batch).

 However, doing window computations etc. with the timestamps embedded *in*
 your data will be a major effort: you cannot use the existing windowing
 functionality from Spark Streaming, which is driven by the batch clock
 rather than by event time. If you want to read more about that, there
 have been a number of discussions on this topic on this list; maybe you
 can look them up.

 Tobias




Re: Use Spark Streaming for Batch?

2015-02-21 Thread Sean Owen
I agree with your assessment as to why it *doesn't* just work. I don't
think a small batch duration helps, since all files it sees at the
outset are processed in one batch. Your timestamps are a user-space
concept, not a framework concept.

However, there ought to be a great deal of reusability between the
streaming and the batch versions, so maybe a small refactoring lets you
reuse 95% of the code as-is.

Isn't the core of your job to process an RDD of timestamp+data together
with state to produce new state? If you have the pieces to do that, you
should be able to hook them into Spark Streaming via its timestamp value
and its updateStateByKey, but then just as easily point the same generic
logic at an RDD of historical data and an empty initial state. A sketch
of that split follows.
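
An untested sketch of that factoring, with hypothetical names and types
standing in for yours:

    import org.apache.spark.SparkContext._          // pair RDD implicits (pre-1.3)
    import org.apache.spark.streaming.StreamingContext._ // pair DStream implicits (pre-1.3)
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Shared, framework-agnostic logic: old state + new events => new state.
    def update(events: Seq[Long], state: Option[Long]): Option[Long] =
      Some(events.sum + state.getOrElse(0L))

    // Streaming path: plug the shared logic into updateStateByKey.
    def live(stream: DStream[(String, Long)]): DStream[(String, Long)] =
      stream.updateStateByKey(update _)

    // Batch path: point the same logic at historical data, empty state.
    // (A real replay might fold event-time batches in order instead of
    // feeding everything in at once.)
    def replay(history: RDD[(String, Long)]): RDD[(String, Long)] =
      history
        .groupByKey()
        .mapValues(vs => update(vs.toSeq, None).getOrElse(0L))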

On Sat, Feb 21, 2015 at 1:05 AM, craigv craigvanderbo...@gmail.com wrote:
 We have a sophisticated Spark Streaming application that we have been using
 successfully in production for over a year to process a time series of
 events.  Our application makes novel use of updateStateByKey() for state
 management.

 We now have the need to perform exactly the same processing on input data
 that's not real-time, but has been persisted to disk.  We do not want to
 rewrite our Spark Streaming app unless we have to.

 /Might it be possible to perform large-batch processing on HDFS time
 series data using Spark Streaming?/

 1. I understand that there is not currently an InputDStream that could
 do what's needed. I would have to create such a thing.
 2. Time is a problem. I would have to use the timestamps on our events
 for any time-based logic and state management.
 3. The batch duration would become meaningless in this scenario. Could I
 just set it to something really small (say 1 second) and then let it
 fall behind, processing the data as quickly as it could?

 It all seems possible.  But could Spark Streaming work this way?  If I
 created a DStream that delivered (say) months of events, could Spark
 Streaming effectively process this in a batch fashion?

 Any and all comments/ideas welcome!



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Use Spark Streaming for Batch?

2015-02-20 Thread craigv
We have a sophisticated Spark Streaming application that we have been using
successfully in production for over a year to process a time series of
events.  Our application makes novel use of updateStateByKey() for state
management.

We now have the need to perform exactly the same processing on input data
that's not real-time, but has been persisted to disk.  We do not want to
rewrite our Spark Streaming app unless we have to.

/Might it be possible to perform large-batch processing on HDFS time
series data using Spark Streaming?/

1. I understand that there is not currently an InputDStream that could
do what's needed. I would have to create such a thing.
2. Time is a problem. I would have to use the timestamps on our events
for any time-based logic and state management.
3. The batch duration would become meaningless in this scenario. Could I
just set it to something really small (say 1 second) and then let it
fall behind, processing the data as quickly as it could?

It all seems possible.  But could Spark Streaming work this way?  If I
created a DStream that delivered (say) months of events, could Spark
Streaming effectively process this in a batch fashion?

Any and all comments/ideas welcome!
