Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread Shixiong(Ryan) Zhu
You can create lazily instantiated singleton instances. See http://spark.apache.org/docs/latest/streaming-programming-guide.html#accumulators-broadcast-variables-and-checkpoints for examples of accumulators and broadcast variables. You can use the same approach to create your cached RDD. On Tue,
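The singleton pattern referenced above comes from the streaming programming guide's broadcast-variable example. A minimal sketch of the same idea applied to a cached RDD might look like the following; `loadReferenceData`-style details (the HDFS path, the pair type) are illustrative placeholders, not from the thread, and a `SparkContext` is obtained from the micro-batch RDD rather than captured in the closure:

```scala
// Sketch: lazily instantiated singleton holding a cached reference RDD,
// safe to call from inside DStream processing. Path and types are
// illustrative assumptions.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object ReferenceRDD {
  @volatile private var instance: RDD[(String, String)] = null

  def getInstance(sc: SparkContext): RDD[(String, String)] = {
    if (instance == null) {
      synchronized {
        if (instance == null) {
          // Hypothetical loader; substitute your own source.
          instance = sc.textFile("hdfs:///reference/data")
                       .map(line => (line, line))
                       .cache()
        }
      }
    }
    instance
  }
}

// Inside DStream processing, fetch the singleton via the RDD's context:
// stream.foreachRDD { rdd =>
//   val ref = ReferenceRDD.getInstance(rdd.sparkContext)
//   rdd.map(x => (x, x)).join(ref).count()
// }
```

The double-checked locking plus `@volatile` mirrors the guide's accumulator/broadcast recovery pattern, so the RDD is rebuilt lazily after a restart from checkpoint.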

Re: Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread shyla deshpande
and my cached RDD is not small. If it was, maybe I could materialize and broadcast it. Thanks On Tue, Feb 7, 2017 at 10:28 AM, shyla deshpande wrote: > I have a situation similar to the following and I get SPARK-13758 >

Spark streaming question - SPARK-13758 Need to use an external RDD inside DStream processing...Please help

2017-02-07 Thread shyla deshpande
I have a situation similar to the following and I get SPARK-13758. I understand why I get this error, but I want to know what should be the approach in dealing with these situations. Thanks > var cached =

Combine code for RDD and DStream

2015-08-03 Thread Sidd S
Hello! I am developing a Spark program that uses both batch and streaming (separately). They are both pretty much the exact same programs, except the inputs come from different sources. Unfortunately, RDDs and DStreams define all of their transformations in their own files, and so I have two

Re: Combine code for RDD and DStream

2015-08-03 Thread Sidd S
DStream's transform function helps me solve this issue elegantly. Thanks! On Mon, Aug 3, 2015 at 1:42 PM, Sidd S ssinga...@gmail.com wrote: Hello! I am developing a Spark program that uses both batch and streaming (separately). They are both pretty much the exact same programs, except the
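The reuse pattern being described can be sketched as follows: write the core logic once against RDDs, then lift it into the streaming path with `DStream.transform`. The function body and input sources here are illustrative, not from the thread:

```scala
// Sketch: one RDD-to-RDD function shared by the batch and streaming
// programs. Word count stands in for the real logic.
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def process(rdd: RDD[String]): RDD[(String, Int)] =
  rdd.flatMap(_.split(" "))
     .map(word => (word, 1))
     .reduceByKey(_ + _)

// Batch path:
// val batchResult = process(sc.textFile("hdfs:///input"))

// Streaming path: the same function, applied to each micro-batch RDD.
// val streamResult: DStream[(String, Int)] = lines.transform(process _)
```

Because `transform` hands the function each micro-batch as a plain RDD, any RDD-to-RDD code compiles unchanged for both programs.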

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
evo.efti...@isecc.com wrote: The only way to join / union / cogroup a DStream RDD with a Batch RDD is via the transform method, which returns another DStream RDD and hence it gets discarded at the end of the micro-batch. Is there any way to e.g. union a DStream RDD with a Batch RDD which produces a new

adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
The only way to join / union / cogroup a DStream RDD with a Batch RDD is via the transform method, which returns another DStream RDD and hence it gets discarded at the end of the micro-batch. Is there any way to e.g. union a DStream RDD with a Batch RDD which produces a new Batch RDD containing
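One workaround for the lifetime mismatch described above (a sketch, not something endorsed in this thread) is to reassign a driver-side reference inside `foreachRDD`, so the unioned result survives the micro-batch as a plain batch RDD. The path and names are illustrative; note the lineage grows every interval, which is why the later replies in this thread point toward persistent storage:

```scala
// Sketch: growing a batch RDD from a DStream by reassigning a
// driver-side var inside foreachRDD. Assumes sc: SparkContext and a
// checkpoint directory are already set up; path is illustrative.
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

var batch: RDD[String] = sc.textFile("hdfs:///historical/data")

def accumulate(stream: DStream[String]): Unit =
  stream.foreachRDD { rdd =>
    // union yields a new batch RDD that outlives the micro-batch;
    // cache and periodically checkpoint to truncate the growing lineage.
    batch = batch.union(rdd).cache()
    batch.checkpoint()
  }
```

This keeps everything in one job, at the cost of driver-managed state; writing each micro-batch to storage and reloading, as discussed below in the thread, trades latency for durability.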

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
. a second time moreover after specific period of time -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Wednesday, April 15, 2015 8:14 PM To: Evo Eftimov Cc: user@spark.apache.org Subject: Re: adding new elements to batch RDD from DStream RDD Yes, I mean there's

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
JavaPairDStream&lt;K2,V2&gt; transformToPair(Function&lt;R, JavaPairRDD&lt;K2,V2&gt;&gt; transformFunc) Return a new DStream in which each RDD is generated by applying a function on each RDD of 'this' DStream. As you can see, it ALWAYS returns a DStream, NOT a JavaRDD aka batch RDD. Re the rest of the discussion (re-loading

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
. from HDFS file) batch RDD e.g. JavaRDD - the only way to union / join / cogroup from a DStream RDD to a batch RDD is via the transform method, which always returns a DStream RDD NOT a batch RDD - check the API On a separate note - your suggestion to keep reloading a Batch RDD from a file - it may have

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
to a batch RDD is via the transform method, which always returns a DStream RDD NOT a batch RDD - check the API On a separate note - your suggestion to keep reloading a Batch RDD from a file - it may have some applications in other scenarios, so let's drill down into it - in the context of Spark Streaming

Re: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Sean Owen
DStream RDD Yes, I mean there's nothing to keep you from using them together other than their very different lifetime. That's probably the key here: if you need the streaming data to live a long time it has to live in persistent storage first. I do exactly this and what you describe for the same

RE: adding new elements to batch RDD from DStream RDD

2015-04-15 Thread Evo Eftimov
transformFunc) Return a new DStream in which each RDD is generated by applying a function on each RDD of 'this' DStream. As you can see, it ALWAYS returns a DStream, NOT a JavaRDD aka batch RDD. Re the rest of the discussion (re-loading a batch RDD from file within the Spark streaming context) - let's leave

Re: RDD to DStream

2014-11-12 Thread Jianshi Huang
/windowSize 2) Sort RDD by (group, timestamp) 3) Use toLocalIterator to collect each group/partition 4) Turn each group/partition to RDD and put them in a Queue 5) Use SparkStreamingContext.queueStream to consume the Queue[RDD] as DStream Looks good to me, will try it today. The downside is all data

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
Hi Jianshi, For simulation purposes, I think you can try ConstantInputDStream and QueueInputDStream to convert one RDD or a series of RDDs into a DStream; the first one outputs the same RDD in each batch duration, and the second one just outputs an RDD from a queue in each batch duration. You can take
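The two approaches suggested above can be sketched side by side; the batch interval, data, and the assumption of an existing `sc: SparkContext` are illustrative:

```scala
// Sketch: the two simulation options. ConstantInputDStream repeats one
// RDD every batch; queueStream dequeues one RDD per batch interval.
import scala.collection.mutable
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

val ssc = new StreamingContext(sc, Seconds(1))

// Option 1: the same RDD in every batch duration.
val constant = new ConstantInputDStream(ssc, sc.parallelize(1 to 100))

// Option 2: one queued RDD per batch duration, in order.
val queue = mutable.Queue(sc.parallelize(1 to 50), sc.parallelize(51 to 100))
val queued = ssc.queueStream(queue, oneAtATime = true)
```

`queueStream` is the closer fit for replaying historical data, since each micro-batch consumes a distinct slice; `ConstantInputDStream` suits testing stateless logic against a fixed input.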

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
, For simulation purposes, I think you can try ConstantInputDStream and QueueInputDStream to convert one RDD or a series of RDDs into a DStream; the first one outputs the same RDD in each batch duration, and the second one just outputs an RDD from a queue in each batch duration. You can take a look

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
amount of data back to driver. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Monday, October 27, 2014 2:39 PM To: Shao, Saisai Cc: user@spark.apache.org; Tathagata Das (t...@databricks.com) Subject: Re: RDD to DStream Hi Saisai, I understand it's non-trivial

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
@spark.apache.org; Tathagata Das (t...@databricks.com) *Subject:* Re: RDD to DStream Hi Saisai, I understand it's non-trivial, but the requirement of simulating offline data as stream is also fair. :) I just wrote a prototype, however, I need to do a collect and a bunch of parallelize

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
, October 27, 2014 2:39 PM *To:* Shao, Saisai *Cc:* user@spark.apache.org; Tathagata Das (t...@databricks.com) *Subject:* Re: RDD to DStream Hi Saisai, I understand it's non-trivial, but the requirement of simulating offline data as stream is also fair. :) I just wrote a prototype, however

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
cannot support nested RDD in closure. Thanks Jerry From: Jianshi Huang [mailto:jianshi.hu...@gmail.com] Sent: Monday, October 27, 2014 3:30 PM To: Shao, Saisai Cc: user@spark.apache.org; Tathagata Das (t...@databricks.com) Subject: Re: RDD to DStream Ok, back to Scala code, I'm wondering why I

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
nested RDD in closure. Thanks Jerry *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] *Sent:* Monday, October 27, 2014 3:30 PM *To:* Shao, Saisai *Cc:* user@spark.apache.org; Tathagata Das (t...@databricks.com) *Subject:* Re: RDD to DStream Ok, back to Scala code, I'm wondering

RE: RDD to DStream

2014-10-27 Thread Shao, Saisai
@spark.apache.org; Tathagata Das (t...@databricks.com) Subject: Re: RDD to DStream Yeah, you're absolutely right Saisai. My point is we should allow this kind of logic in RDD, let's say transforming type RDD[(Key, Iterable[T])] to Seq[(Key, RDD[T])]. Make sense? Jianshi On Mon, Oct 27, 2014 at 3:56 PM

Re: RDD to DStream

2014-10-27 Thread Jianshi Huang
[mailto:jianshi.hu...@gmail.com] *Sent:* Monday, October 27, 2014 4:07 PM *To:* Shao, Saisai *Cc:* user@spark.apache.org; Tathagata Das (t...@databricks.com) *Subject:* Re: RDD to DStream Yeah, you're absolutely right Saisai. My point is we should allow this kind of logic in RDD, let's say transforming

Re: RDD to DStream

2014-10-26 Thread Jianshi Huang
. But as you have realized yourself, it is not trivial to cleanly stream an RDD as a DStream. Since RDD operations are defined to be scan based, it is not efficient to define an RDD based on slices of data within a partition of another RDD, using pure RDD transformations. What you have done is a decent

Re: RDD to DStream

2014-08-06 Thread Tathagata Das
Hey Aniket, Great thoughts! I understand the use case. But as you have realized yourself, it is not trivial to cleanly stream an RDD as a DStream. Since RDD operations are defined to be scan based, it is not efficient to define an RDD based on slices of data within a partition of another RDD, using

RDD to DStream

2014-08-01 Thread Aniket Bhatnagar
Sometimes it is useful to convert an RDD into a DStream for testing purposes (generating DStreams from historical data, etc.). Is there an easy way to do this? I could come up with the following inefficient way but am not sure if there is a better way to achieve this. Thoughts? class RDDExtension[T
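The `RDDExtension` prototype is truncated in the archive and not recoverable. A sketch of one admittedly inefficient approach in the same spirit (and matching the collect-then-parallelize self-criticism later in the thread) could be: pull slices through the driver with `toLocalIterator`, re-parallelize each slice, and replay them with `queueStream`. The slice size and method name are illustrative:

```scala
// Hedged sketch (not the author's truncated prototype): replay an RDD
// as a DStream by slicing it on the driver and feeding a queueStream.
import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

def toDStream[T: ClassTag](rdd: RDD[T], ssc: StreamingContext,
                           sliceSize: Int): DStream[T] = {
  // Inefficient: all data flows through the driver one slice at a time.
  val queue = mutable.Queue[RDD[T]]()
  rdd.toLocalIterator.grouped(sliceSize).foreach { slice =>
    queue += ssc.sparkContext.parallelize(slice)
  }
  ssc.queueStream(queue, oneAtATime = true)
}
```

Each batch interval then consumes one `sliceSize`-element slice, which is enough for functional testing even though it defeats distributed execution during the conversion itself.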

Re: RDD to DStream

2014-08-01 Thread Aniket Bhatnagar
August 2014 13:55, Aniket Bhatnagar aniket.bhatna...@gmail.com wrote: Sometimes it is useful to convert an RDD into a DStream for testing purposes (generating DStreams from historical data, etc.). Is there an easy way to do this? I could come up with the following inefficient way but am not sure

Re: RDD to DStream

2014-08-01 Thread Mayur Rustagi
Nice question :) Ideally you should use a queueStream interface to push RDDs into a queue; then Spark Streaming can handle the rest. Though why are you looking to convert RDD to DStream? Another workaround folks use is to source the DStream from folders, moving files that they need reprocessed back

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
I totally agree that if we are able to do this it will be very cool. However, this requires having a common trait / interface between RDD and DStream, which we don't have as of now. It would be very cool though. On my wish list for sure. TD On Thu, Jul 10, 2014 at 11:53 AM, mshah shahmaul
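Lacking that common trait in Spark itself, one user-land approximation (a sketch of the idea, not an API that exists in Spark) is a typeclass that abstracts the shared transformations over both containers, so business logic is written once against the abstraction:

```scala
// Hedged sketch: a user-defined typeclass standing in for the missing
// common interface between RDD and DStream. DistCollection and keepEvens
// are illustrative names, not Spark API.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

trait DistCollection[F[_]] {
  def map[A, B: ClassTag](fa: F[A])(f: A => B): F[B]
  def filter[A](fa: F[A])(p: A => Boolean): F[A]
}

implicit val rddOps: DistCollection[RDD] = new DistCollection[RDD] {
  def map[A, B: ClassTag](fa: RDD[A])(f: A => B) = fa.map(f)
  def filter[A](fa: RDD[A])(p: A => Boolean) = fa.filter(p)
}

implicit val dstreamOps: DistCollection[DStream] = new DistCollection[DStream] {
  def map[A, B: ClassTag](fa: DStream[A])(f: A => B) = fa.map(f)
  def filter[A](fa: DStream[A])(p: A => Boolean) = fa.filter(p)
}

// Shared logic, usable with either an RDD[Int] or a DStream[Int]:
def keepEvens[F[_]](xs: F[Int])(implicit C: DistCollection[F]): F[Int] =
  C.filter(xs)(_ % 2 == 0)
```

This only covers operations with identical signatures on both types; actions like `collect` and output operations like `foreachRDD` have no common shape, which is part of why the unified trait stayed on the wish list.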

Re: Generic Interface between RDD and DStream

2014-07-11 Thread andy petrella
/ interface between RDD and DStream, which we don't have as of now. It would be very cool though. On my wish list for sure. TD On Thu, Jul 10, 2014 at 11:53 AM, mshah shahmaul...@gmail.com wrote: I wanted to get a perspective on how to share code between Spark batch processing and Spark Streaming

Re: Generic Interface between RDD and DStream

2014-07-11 Thread Tathagata Das
this it will be very cool. However, this requires having a common trait / interface between RDD and DStream, which we don't have as of now. It would be very cool though. On my wish list for sure. TD On Thu, Jul 10, 2014 at 11:53 AM, mshah shahmaul...@gmail.com wrote: I wanted to get