Pause Spark Streaming reading or sampling streaming data
Hi, I have a question about sampling Spark Streaming data, i.e. processing only part of it. For every minute, I only want the data read in during the first 10 seconds, and I want to discard everything from the next 50 seconds. Is there any way to pause reading, or to discard the data received during that period? I'm doing this to sample from a very high-volume stream, which saves processing time in a real-time program. Thanks!
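Spark Streaming has no built-in way to pause a receiver mid-stream, but a similar effect can be had by discarding whole batches based on their batch time. Below is a minimal sketch, assuming a 10-second batch interval: Spark stamps each batch with a time that is a multiple of the batch duration, so the batch stamped :10 past each minute holds the data from seconds 0-10, and the other five batches per minute are replaced with empty RDDs. The socket source, host, and port are placeholders; note that the data is still received, it just isn't processed.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

object SampledStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SampledStream")
    // 10-second batches, so each batch lines up with the sampling window
    val ssc = new StreamingContext(conf, Seconds(10))

    // placeholder source; substitute whatever source you actually use
    val lines = ssc.socketTextStream("localhost", 9999)

    val sampled = lines.transform { (rdd, time: Time) =>
      // batch times are multiples of the batch duration; the batch
      // stamped :10 past the minute covers seconds 0-10 of that minute
      if (time.milliseconds % 60000 == 10000) rdd
      else rdd.sparkContext.emptyRDD[String]
    }

    sampled.print()
    ssc.start()
    ssc.awaitTermination()
  }
}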
Multiple operations on same DStream in Spark Streaming
Hi, I'm working with Spark Streaming in Scala and trying to figure out the following problem. In my DStream[(Int, Int)], each record is a pair of ints. For each batch, I would like to filter out all records whose first integer is below the average of the first integers in that batch, and for the remaining records (first integer above the batch average), compute the average of the second integers. What's the best practice for implementing this? I tried it but kept getting an object-not-serializable exception, because it's hard to share values (such as the batch average of the first ints) between the workers and the driver. Any suggestions? Thanks!
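One way to sidestep the serialization trouble, assuming the per-batch logic is as described above: compute the batch average on the driver inside foreachRDD as a plain Double, then capture that Double in the filter closure. A local Double is serializable, so nothing like the StreamingContext gets dragged into the task closure. A sketch (the DStream name pairs is a placeholder):

import org.apache.spark.streaming.dstream.DStream

// pairs: DStream[(Int, Int)] is assumed to exist already
def batchAverages(pairs: DStream[(Int, Int)]): Unit = {
  pairs.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      rdd.cache()  // the batch is traversed more than once below

      // batch-level average of the first ints, computed on the driver
      val avgFirst = rdd.map(_._1.toDouble).mean()

      // keep only records whose first int is above the batch average;
      // avgFirst is a plain Double, so the closure serializes cleanly
      val above = rdd.filter { case (first, _) => first > avgFirst }

      if (!above.isEmpty()) {
        // average of the second ints among those records
        val avgSecond = above.map(_._2.toDouble).mean()
        println(s"avg(first) = $avgFirst, avg(second | first > avg) = $avgSecond")
      }
    }
  }
}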
Create RDD from output of unix command
What's the best practice for creating an RDD from the output of an external unix command? I assume that if the output is large (say, millions of lines), creating the RDD from an in-memory array of all the lines is not a good idea? Thanks!
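Right, collecting millions of lines into a driver-side array and calling parallelize on it would put the whole output in driver memory. Two options that avoid this are sketched below, under the assumption that the command writes plain text to stdout (my-command is a placeholder): rdd.pipe() runs the command on the executors, so its stdout lines become a distributed RDD directly, while redirecting the output to a file (ideally on HDFS) and reading it back with sc.textFile also never materializes the lines on the driver.

import org.apache.spark.{SparkConf, SparkContext}

object CommandToRDD {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CommandToRDD"))

    // Option 1: a one-element seed RDD in a single partition makes
    // pipe() invoke the command once; each line the command prints
    // to stdout becomes one element of the resulting RDD
    val seed = sc.parallelize(Seq(""), numSlices = 1)
    val viaPipe = seed.pipe("my-command")  // placeholder command

    // Option 2: dump the output to a file first, then read it back:
    //   $ my-command > /tmp/output.txt
    val viaFile = sc.textFile("/tmp/output.txt")

    println(s"pipe: ${viaPipe.count()} lines, file: ${viaFile.count()} lines")
    sc.stop()
  }
}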
Spark Streaming reads from stdin or output from command line utility
Hi, I'm new to Spark Streaming, and I want to build an application where Spark Streaming creates a DStream from stdin. Basically, I have a command-line utility that generates stream data, and I'd like to pipe that data into a DStream. What's the best way to do that? I thought rdd.pipe() could help, but it seems that requires an existing RDD in the first place, which doesn't apply here. Thanks!
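rdd.pipe() indeed doesn't fit here, since there is no RDD to start from. One approach, sketched below with an invented CommandReceiver class: write a custom Receiver that launches the utility on an executor and feeds each stdout line to store(). This assumes Scala 2.11+, where scala.sys.process provides lineStream. (If the utility must run on a specific machine instead, piping its output into nc -lk 9999 and reading it with ssc.socketTextStream is a simpler alternative.)

import scala.sys.process._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// hypothetical receiver: runs a shell command on an executor and
// pushes each line of its stdout into the stream
class CommandReceiver(command: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("command-receiver") {
      override def run(): Unit = {
        // lineStream blocks, yielding stdout lines as they are produced
        command.lineStream.foreach(line => store(line))
      }
    }.start()
  }

  def onStop(): Unit = {
    // nothing to clean up: the reader thread ends when stdout closes
  }
}

// usage: val lines = ssc.receiverStream(new CommandReceiver("my-utility"))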