Re: Pause Spark Streaming reading or sampling streaming data

2015-08-06 Thread Dimitris Kouzis - Loukas
Hi - yes, it's great that you wrote the receiver yourself; it means you have more
control. The most efficient point to discard as much data as possible is your
Spark input source itself - or, even better, change your subscription protocol
so that you never receive the other 50 seconds of data at all. Once you deliver
data to the DStream you can filter it as much as you want, but it will still be
subject to garbage collection and potentially to shuffles and HDD checkpoints.
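As a rough sketch of that idea (the helper names below are hypothetical, not part of any Spark API), a custom receiver could gate every record on a simple time predicate, so nothing outside the first 10 seconds of each minute ever reaches the DStream:

```python
def in_sample_window(now_seconds, window_seconds=10, period_seconds=60):
    """True while inside the first `window_seconds` of each
    `period_seconds` cycle, e.g. the first 10 s of every minute."""
    return now_seconds % period_seconds < window_seconds


def maybe_store(record, now_seconds, store):
    """Pass `record` to the receiver's store callback only during the
    sampling window; everything else is dropped before Spark sees it,
    so it never costs GC, shuffles, or checkpoints."""
    if in_sample_window(now_seconds):
        store(record)
```

In a real custom receiver, `store` would be whatever call hands data to Spark, and `now_seconds` would come from the receiver's clock.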

On Thu, Aug 6, 2015 at 1:31 AM, Heath Guo heath...@fb.com wrote:

 Hi Dimitris,

 Thanks for your reply. Just wondering – are you asking about my streaming
 input source? I implemented a custom receiver and have been using that.
 Thanks.






Re: Pause Spark Streaming reading or sampling streaming data

2015-08-06 Thread Dimitris Kouzis - Loukas
Re-reading your description - I guess you could make your input source connect
for 10 seconds, pause for 50, and then reconnect.
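That connect/pause/reconnect cycle could look roughly like this - a sketch under the assumption that the custom receiver can open and close its upstream connection at will; `connect` and `disconnect` are placeholders for whatever the receiver actually does:

```python
import time


def seconds_until_next_window(now_seconds, period_seconds=60):
    """How long to stay disconnected before the next sampling window opens."""
    into_period = now_seconds % period_seconds
    return 0 if into_period == 0 else period_seconds - into_period


def receive_loop(connect, disconnect, window_seconds=10, period_seconds=60):
    """Stay connected for the first `window_seconds` of each period and
    disconnected for the rest, so the unwanted 50 s are never received."""
    while True:
        time.sleep(seconds_until_next_window(time.time(), period_seconds))
        connect()                  # receive data during the sampling window
        time.sleep(window_seconds)
        disconnect()               # drop the remaining 50 s at the source
```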







Pause Spark Streaming reading or sampling streaming data

2015-08-05 Thread foobar
Hi, I have a question about sampling Spark Streaming data, or getting part of
the data. For each minute, I only want the data read in during the first 10
seconds, and I want to discard all data in the following 50 seconds. Is there
any way to pause reading and discard data during that period? I'm doing this to
sample from a stream with a huge amount of data, which saves processing time in
the real-time program. Thanks!








Re: Pause Spark Streaming reading or sampling streaming data

2015-08-05 Thread Heath Guo
Hi Dimitris,

Thanks for your reply. Just wondering – are you asking about my streaming input 
source? I implemented a custom receiver and have been using that. Thanks.

From: Dimitris Kouzis - Loukas look...@gmail.com
Date: Wednesday, August 5, 2015 at 5:27 PM
To: Heath Guo heath...@fb.com
Cc: user@spark.apache.org user@spark.apache.org
Subject: Re: Pause Spark Streaming reading or sampling streaming data

What driver do you use? Sounds like something you should do before the driver...
