Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
Instead of having my rdd processing code called once for each RDD in the batch, is it possible to essentially group all of the RDDs from the batch into a single RDD and single partition, and therefore operate on all of the elements in the batch at once? My goal here is to do an operation exactly once for every batch
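
For reference, a minimal Scala sketch of the usual answer (socket source and the println are illustrative; `sc` is an existing SparkContext): each DStream already produces exactly one RDD per batch interval, so foreachRDD runs once per batch, and coalesce(1) puts the whole batch into a single partition.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { (rdd, batchTime) =>
      // called once per batch; after coalesce(1) every element of the batch sits in one partition
      rdd.coalesce(1).foreachPartition { elems =>
        elems.foreach(println)   // placeholder for the once-per-batch operation
      }
    }

    ssc.start()
    ssc.awaitTermination()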

How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread swetha
Hi, We have a requirement wherein we need to keep RDDs in memory between Spark batch processing that happens every one hour. The idea here is to have RDDs that have active user sessions in memory between two jobs so that once a job processing is done and another job is run after an hour the RDDs
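
As the replies to this thread point out, cached RDDs only live as long as their SparkContext, so the usual options are a long-running shared context (e.g. spark-jobserver) or an external store such as Tachyon. A hedged sketch of the shared-context variant (paths and parsing are illustrative):

    import org.apache.spark.storage.StorageLevel

    // first hourly run, inside the long-lived context
    val sessions = sc.textFile("hdfs:///data/active-sessions")
      .map(_.split("\t"))                               // placeholder parsing
    sessions.setName("activeSessions").persist(StorageLevel.MEMORY_AND_DISK)
    sessions.count()                                     // materialize the cache

    // a job submitted an hour later to the *same* context can reuse the cached RDD
    val stillActive = sessions.filter(_.nonEmpty).count()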

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Akhil Das
Currently RDDs are not encrypted. I think you can go ahead and open a JIRA to add this feature, and maybe in a future release it could be added. Thanks Best Regards On Fri, Jul 31, 2015 at 1:47 PM, Matthew O'Reilly wrote: > Hi, > > I am currently working on the latest version of Apache Spark (1.4.1)

Re: Encryption on RDDs or in-memory/cache on Apache Spark

2015-08-02 Thread Jörn Franke
Matthew O'Reilly wrote: > Hi, > > I am currently working on the latest version of Apache Spark (1.4.1), > pre-built package for Hadoop 2.6+. > > Is there any feature in Spark/Hadoop to encrypt RDDs or in-memory/cache > (something similar is Altibase's HDB: > http

How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Hi: I have been working on a few examples using Zeppelin. I have been trying to find a command that would list all *dataframes/RDDs* that have been created in the current session. Does anyone know if any such command is available? Something similar to SparkSQL to list all temp tables: show

how to get RDD from two different RDDs with cross column

2015-09-21 Thread Zhiliang Zhu
Dear Romi, Priya, Sujt and Shivaram and all, I have spent many days thinking about this issue, but without finding a good enough solution... I would appreciate your kind help. There is an RDD rdd1, and another RDD rdd2 (rdd2 can be a PairRDD, or a DataFrame with two columns). The StringDate colum

Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Subhajit Purkayastha
Can I join 3 different RDDs together in a Spark SQL DF? I can find examples for 2 RDDs but not 3. Thanks

Re: Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Sean Owen
spark-rapids is not part of Spark, so couldn't speak to it, but Spark itself does not use GPUs at all. It does let you configure a task to request a certain number of GPUs, and that would work for RDDs, but it's up to the code being executed to use the GPUs. On Tue, Sep 21, 2021
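
A hedged sketch of what "requesting GPUs for a task" looks like in Spark 3 (submit-time configs shown as comments; the discovery-script path and the doubling work are illustrative). The task body still has to call GPU-aware code itself:

    import org.apache.spark.TaskContext

    // spark-submit --conf spark.executor.resource.gpu.amount=1 \
    //              --conf spark.task.resource.gpu.amount=1 \
    //              --conf spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh
    val doubled = sc.parallelize(1 to 1000, 4).mapPartitions { it =>
      val gpuAddrs = TaskContext.get().resources().get("gpu")
        .map(_.addresses).getOrElse(Array.empty[String])
      // hand gpuAddrs to whatever native/CUDA code this task actually runs
      it.map(_ * 2)
    }.collect()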

Re: Does Apache Spark 3 support GPU usage for Spark RDDs?

2021-09-21 Thread Artemis User
Regards, Abhishek Shakya, Senior Data Scientist, Aganitha Cognitive Solutions <https://aganitha.ai/>

[SparkQL] how are RDDs partitioned and distributed in a standalone cluster?

2018-02-18 Thread prabhastechie
Say I have a main method with the following pseudo-code (to be run on a spark standalone cluster): main(args) { RDD rdd rdd1 = rdd.map(...) // some other statements not using RDD rdd2 = rdd.filter(...) } When executed, will each of the two statements involving RDDs (map and filter) be
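
A short sketch of the point usually made in reply (path and partition count are illustrative): rdd1 and rdd2 are two lineages over the same parent, each action launches its own job over the parent's partitions spread across the executors, and without cache() the input is read twice.

    val rdd  = sc.textFile("hdfs:///input", 8).cache()   // 8 partitions, distributed over the cluster
    val rdd1 = rdd.map(_.toUpperCase)
    val rdd2 = rdd.filter(_.nonEmpty)

    rdd1.count()   // job 1: reads the input, fills the cache, then maps
    rdd2.count()   // job 2: filters the already-cached parent partitions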

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
do sliding window operation on RDDs in Pyspark? Hello, I have 2 text files in the following form and my goal is to calculate the Pearson correlation between them using a sliding window in pyspark: 123.00 -12.00 334.00 . . . First I read these 2 text files and store them in RDD format and then I

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread zakhavan
the sensors and produce them as DStreams. The following code is how I'm getting the data and write them into 2 text files. Do you have any idea how I can use Kafka in this case so that I have DStreams instead of RDDs? from obspy.clients.seedlink.easyseedlink import create_client from obspy impo

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-02 Thread Taylor Cox
window operation on RDDs in Pyspark? Thank you, Taylor for your reply. The second solution doesn't work for my case since my text files are getting updated every second. Actually, my input data is live such that I'm getting 2 streams of data from 2 seismic sensors and then I write them i

RE: How to do sliding window operation on RDDs in Pyspark?

2018-10-04 Thread zakhavan
Thank you. It helps. Zeinab

Trying to find where Spark persists RDDs when run with YARN

2015-01-18 Thread Hemanth Yamijala
Hi, I am trying to find where Spark persists RDDs when we call the persist() api and executed under YARN. This is purely for understanding... In my driver program, I wait indefinitely, so as to avoid any clean up problems. In the actual job, I roughly do the following: JavaRDD lines

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Tathagata Das
You are just setting up the computation here using foreachRDD. You have not even run the streaming context to get any data. On Wed, Feb 25, 2015 at 2:21 PM, Thanigai Vellore < thanigai.vell...@gmail.com> wrote: > I have this function in the driver program which collects the result fr

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Thanigai Vellore
5 at 2:21 PM, Thanigai Vellore < > thanigai.vell...@gmail.com> wrote: > >> I have this function in the driver program which collects the result from >> rdds (in a stream) into an array and return. However, even though the RDDs >> (in the dstream) have data, the function is returning

Re: Spark Streaming - Collecting RDDs into array in the driver program

2015-02-25 Thread Tobias Pfeiffer
Hi, On Thu, Feb 26, 2015 at 11:24 AM, Thanigai Vellore < thanigai.vell...@gmail.com> wrote: > It appears that the function immediately returns even before the > foreachrdd stage is executed. Is that possible? > Sure, that's exactly what happens. foreachRDD() schedules a computation, it does not p
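
To make that concrete, a minimal sketch (socket source and timeout are illustrative): foreachRDD only registers the per-batch work, so the buffer fills only after the context is started.

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))
    val results = ArrayBuffer.empty[String]

    ssc.socketTextStream("localhost", 9999).foreachRDD { rdd =>
      results ++= rdd.collect()          // runs on the driver, once per batch, after start()
    }

    ssc.start()                          // nothing is collected before this point
    ssc.awaitTerminationOrTimeout(30000)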

Re: How to cogroup/join pair RDDs with different key types?

2014-04-14 Thread Andrew Ash
Are your IPRanges all on nice, even CIDR-format ranges? E.g. 192.168.0.0/16 or 10.0.0.0/8? If the range is always an even subnet mask and not split across subnets, I'd recommend flatMapping the ipToUrl RDD to (IPRange, String) and then joining the two RDDs. The expansion would be at most 3

Re: How to cogroup/join pair RDDs with different key types?

2014-04-15 Thread Roger Hoover
partition boundaries 3. Partition ipToUrl and the new ipRangeToZip according to the partitioning scheme from step 1 4. Join matching partitions of these two RDDs I still don't know how to do step 4 though. I see that RDDs have a mapPartitions() operation to let you do whatever you want with a p
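
For step 4, zipPartitions is the operation that hands you the matching partitions of two co-partitioned RDDs at once. A hedged, much simplified sketch (exact-key matching stands in for the real range lookup, and the sample data is made up):

    import org.apache.spark.HashPartitioner

    val part = new HashPartitioner(4)
    val ipToUrl      = sc.parallelize(Seq(167772161L -> "example.com")).partitionBy(part)
    val ipRangeToZip = sc.parallelize(Seq(167772161L -> "94105")).partitionBy(part)

    // the i-th partition of ipToUrl is paired with the i-th partition of ipRangeToZip
    val joined = ipToUrl.zipPartitions(ipRangeToZip) { (urls, zips) =>
      val lookup = zips.toMap
      urls.flatMap { case (ip, url) => lookup.get(ip).map(zip => (ip, (url, zip))) }
    }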

Re: How to cogroup/join pair RDDs with different key types?

2014-04-15 Thread Roger Hoover
rtitioning scheme from step 1 > 4. Join matching partitions of these two RDDs > > I still don't know how to do step 4 though. I see that RDDs have a > mapPartitions() operation to let you do whatever you want with a partition. > What I need is a way to get my hands on two parti

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Roger Hoover
rom step 1 > 4. Join matching partitions of these two RDDs > > I still don't know how to do step 4 though. I see that RDDs have a > mapPartitions() operation to let you do whatever you want with a partition. > What I need is a way to get my hands on two partitions at once, e

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Andrew Ash
;> 1. Choose number of partitions, n, over IP space >> 2. Preprocess the IPRanges, splitting any of them that cross partition >> boundaries >> 3. Partition ipToUrl and the new ipRangeToZip according to the >> partitioning scheme from step 1 >> 4. Join matchi

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Roger Hoover
space >>> 2. Preprocess the IPRanges, splitting any of them that cross partition >>> boundaries >>> 3. Partition ipToUrl and the new ipRangeToZip according to the >>> partitioning scheme from step 1 >>> 4. Join matching partitions of these two RDDs >>> &

Re: What is the recommended way to store state across RDDs?

2014-04-28 Thread Gerard Maas
recommended way to store state across RDDs as you traverse a > DStream and go from 1 RDD to another? > > > > Consider a trivial example of moving average. Between RDDs should the > average be saved in a cache (ie redis) or is there another global var type > available in Spark?
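
The usual answer here is to let Spark itself carry the state with updateStateByKey rather than an external cache. A hedged sketch for a running average ("key,value" input lines and the checkpoint path are assumptions):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("/tmp/streaming-checkpoint")          // required for stateful operations

    val keyed = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(a => (a(0), a(1).toDouble))

    // state = (count, sum) per key; the average is derived from it
    val runningAvg = keyed.updateStateByKey[(Long, Double)] {
      (newValues: Seq[Double], state: Option[(Long, Double)]) =>
        val (n, s) = state.getOrElse((0L, 0.0))
        Some((n + newValues.size, s + newValues.sum))
    }.mapValues { case (n, s) => if (n == 0) 0.0 else s / n }

    runningAvg.print()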

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
Excuse me - the line inside the loop should read: rdd.foreach(myFunc) - not sc.

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread Cheng Lian
generate multiple RDDs for various datasets across a single run and then run an action on it - E.g. MyFunc myFunc = ... //It implements VoidFunction //set some extra variables - all serializable ... for (JavaRDD rdd: rddList) { ... sc.foreach(myFunc); } The problem I'm seeing is that after the

Re: Running an action inside a loop across multiple RDDs + java.io.NotSerializableException

2014-10-16 Thread _soumya_
... + locale + filePathSuffix; jsonContent.setTagName(tagName); jsonContent.setOutputPath(filePath); idLists.get(tagName).foreach( jsonContent ); ... }

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Gerard Maas
Pinging TD -- I'm sure you know :-) -kr, Gerard. On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas wrote: > Hi, > > We have been implementing several Spark Streaming jobs that are basically > processing data and inserting it into Cassandra, sorting it among different > keyspaces. > > We've been fo

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-20 Thread Matt Narrell
http://spark.apache.org/docs/latest/streaming-programming-guide.html foreachRDD is executed on the driver…. mn > On Oct 20, 2014, at 3:07 AM, Gerard Maas wrote: > > Pinging TD -- I'm sure you know :-) > > -kr, Gerard. >

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
Thanks Matt, Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. My question still remains: Is it better to foreachRDD early in the process or do as much Dstream transformations before going

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-22 Thread Gerard Maas
PS: Just to clarify my statement: >>Unlike the feared RDD operations on the driver, it's my understanding that these Dstream ops on the driver are merely creating an execution plan for each RDD. With "feared RDD operations on the driver" I meant to contrast an rdd action like rdd.collect that wou

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-23 Thread Tathagata Das
Here there is another subtle execution difference between - dstream.count(), which produces a DStream of single-element RDDs, the element being the count, and - dstream.foreachRDD(_.count()), which returns the count directly. In the first case, some random worker node is chosen for the reduce; in the other, the driver is chosen for the reduce.
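
Spelled out, the two forms being compared look roughly like this (assuming an already-created StreamingContext ssc and a socket source):

    val lines = ssc.socketTextStream("localhost", 9999)

    // 1) DStream operation: a DStream of single-element RDDs holding the count;
    //    it still needs an output operation such as print()
    lines.count().print()

    // 2) per-RDD operation: the count comes back to the driver once per batch
    lines.foreachRDD { rdd =>
      println(s"elements in this batch: ${rdd.count()}")
    }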

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Gerard Maas
uce): *Here there is another > subtle execution difference between > - dstream.count() which produces a DStream of single-element RDDs, the > element being the count, and > - dstream.foreachRDD(_.count()) which returns the count directly. > > In the first case, some random worker node i

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread Tathagata Das
tly specified and is the same both cases, >> then the performance should be similar. Note that this difference in the >> default numbers are not guaranteed to be like this, it could change in >> future implementations. >> >> *3. Aggregation-like operations (count, redu

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread jay vyas
rforms better. But if >>> the number of reducers is explicitly specified and is the same both cases, >>> then the performance should be similar. Note that this difference in the >>> default numbers are not guaranteed to be like this, it could change in >>> futu

How to map values read from test file to 2 different RDDs

2016-05-23 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the place holders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file, so my function in Scala is able to return 2 different RDDs, with each RDD of
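
A single function cannot hand back two RDDs from one pass, so the usual pattern is to parse once, cache, and derive the two RDDs separately. A hedged sketch (which of the 16 fields belong to Case1 versus Case2 is an assumption):

    case class Case1(a: String, b: String)
    case class Case2(c: String, d: String)

    val rows = sc.textFile("hdfs:///input.txt").map(_.split(",", -1)).cache()

    val rdd1 = rows.map(f => Case1(f(0), f(1)))   // fields needed by Case1
    val rdd2 = rows.map(f => Case2(f(2), f(3)))   // fields needed by Case2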

Is there a way to co-locate partitions from two partitioned RDDs?

2016-01-19 Thread nwali
Hi, I am working with Spark in Java on top of a HDFS cluster. In my code two RDDs are partitioned with the same partitioner (HashPartitioner with the same number of partitions), so they are co-partitioned. Thus the same keys are on the same partition numbers, but that does not mean that both RDD

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Jon Chase
called once for each RDD in the > batch, is it possible to essentially group all of the RDDs from the batch > into a single RDD and single partition and therefore operate on all of the > elements in the batch at once? > > My goal here is to do an operation exactly once for every bat

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread N B
>> Instead of having my rdd processing code called once for each RDD in the >> batch, is it possible to essentially group all of the RDDs from the batch >> into a single RDD and single partition and therefore operate on all of the >> elements in the batch at once?

Re: Possible to combine all RDDs from a DStream batch into one?

2015-07-15 Thread Ted Yu
in my Spark Streaming program >>> (Java): >>> >>> dStream.foreachRDD((rdd, batchTime) -> { >>> log.info("processing RDD from batch {}", batchTime); >>> >>> // my rdd processing code >>>

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread ericacm
Tachyon is one way. Also check out the Spark Job Server <https://github.com/spark-jobserver/spark-jobserver>.

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread ericacm
Actually, I should clarify - Tachyon is a way to keep your data in RAM, but it's not exactly the same as keeping it cached in Spark. Spark Job Server is a way to keep it cached in Spark.

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread harirajaram
I was about to say what the previous post said, so +1 to the previous post. From my understanding (gut feeling) of your requirement, it is very easy to do this with spark-job-server.

Re: How to keep RDDs in memory between two different batch jobs?

2015-07-22 Thread Haoyuan Li
Yes. Tachyon can handle this well: http://tachyon-project.org/ Best, Haoyuan On Wed, Jul 22, 2015 at 10:56 AM, swetha wrote: > Hi, > > We have a requirement wherein we need to keep RDDs in memory between Spark > batch processing that happens every one hour. The idea here is

Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Dhaval Patel
Apologies, I accidentally included the Spark User DL on BCC. The actual email message is below. = Hi: I have been working on a few examples using Zeppelin. I have been trying to find a command that would list all *dataframes/RDDs* that

Re: How to list all dataframes and RDDs available in current session?

2015-08-20 Thread Rishitesh Mishra
I am not sure if you can view all RDDs in a session. Tables are maintained in a catalogue, hence it's easier. However you can see the DAG representation, which lists all the RDDs in a job, with the Spark UI. On 20 Aug 2015 22:34, "Dhaval Patel" wrote: > Apologies > > I ac

Re: How to list all dataframes and RDDs available in current session?

2015-08-21 Thread Raghavendra Pandey
You get the list of all the persisted RDDs using the Spark context... On Aug 21, 2015 12:06 AM, "Rishitesh Mishra" wrote: > I am not sure if you can view all RDDs in a session. Tables are maintained > in a catalogue . Hence its easier. However you can see the DAG > representatio
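
Concretely, in the Spark 1.x Scala shell (only explicitly persisted RDDs show up, and DataFrames are visible only through registered temp tables; sqlContext is assumed):

    val persisted = sc.getPersistentRDDs            // Map[Int, RDD[_]] keyed by RDD id
    persisted.foreach { case (id, rdd) =>
      println(s"RDD $id  name=${rdd.name}  storage=${rdd.getStorageLevel}")
    }

    sqlContext.tableNames().foreach(println)        // registered temp tables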

Re: How to list all dataframes and RDDs available in current session?

2015-08-24 Thread Dhaval Gmail
text... >> On Aug 21, 2015 12:06 AM, "Rishitesh Mishra" >> wrote: >> I am not sure if you can view all RDDs in a session. Tables are maintained >> in a catalogue . Hence its easier. However you can see the DAG >> representation , which lists all the RD

DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread petranidis
Hi all, I am new to Spark and I have written a few Spark programs, mostly around machine learning applications. I am trying to resolve a particular problem where there are two RDDs that should be updated by using elements of each other. More specifically, if the two pair RDDs are called A and B

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Romi Kuntsman
Hi, If I understand correctly: rdd1 contains keys (of type StringDate) rdd2 contains keys and values and rdd3 contains all the keys, and the values from rdd2? I think you should make rdd1 and rdd2 PairRDD, and then use outer join. Does that make sense? On Mon, Sep 21, 2015 at 8:37 PM Zhiliang Zhu
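
A small self-contained sketch of that suggestion (dates and values are made up): key rdd1 so both sides are pair RDDs, then left-outer-join so the result keeps exactly rdd1's keys.

    val rdd1 = sc.parallelize(Seq("2015-09-01", "2015-09-02", "2015-09-03"))
    val rdd2 = sc.parallelize(Seq("2015-09-01" -> 1.0, "2015-09-03" -> 2.5, "2015-09-04" -> 9.9))

    val rdd3 = rdd1.map(d => d -> ())               // make rdd1 a pair RDD so it can be joined
      .leftOuterJoin(rdd2)                          // every rdd1 key survives, with Some/None from rdd2
      .mapValues { case (_, v) => v.getOrElse(0.0) }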

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Zhiliang Zhu
Hi Romi, Yes, you understand it correctly. And rdd1 keys are cross with rdd2 keys, that is, there are lots of same keys between rdd1 and rdd2, there are some keys in rdd1 but not in rdd2, and there are also some keys in rdd2 but not in rdd1. Then rdd3 keys would be the same as rdd1 keys, rdd3 will no

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Richard Eggert
It's the same as joining 2. Join two together, and then join the third one to the result of that. On Oct 11, 2015 2:57 PM, "Subhajit Purkayastha" wrote: > Can I join 3 different RDDs together in a Spark SQL DF? I can find > examples for 2 RDDs but not 3. > > > > Thanks > > >
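
The reply in code, as a hedged Spark 1.5 sketch (the DataFrames and the common "id" column are illustrative):

    import sqlContext.implicits._

    val df1 = Seq((1, "a"), (2, "b")).toDF("id", "x")
    val df2 = Seq((1, "c"), (2, "d")).toDF("id", "y")
    val df3 = Seq((1, "e"), (2, "f")).toDF("id", "z")

    val joined = df1.join(df2, "id").join(df3, "id")   // (df1 join df2) join df3
    joined.show()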

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
A join B join C === (A join B) join C Semantically they are equivalent, right? From: Richard Eggert [mailto:richard.egg...@gmail.com] Sent: Monday, October 12, 2015 5:12 AM To: Subhajit Purkayastha Cc: User Subject: Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? It's the same as join

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Ted Yu
t [mailto:richard.egg...@gmail.com] > *Sent:* Monday, October 12, 2015 5:12 AM > *To:* Subhajit Purkayastha > *Cc:* User > *Subject:* Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? > > > > It's the same as joining 2. Join two together, and then join the third one

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
hih...@gmail.com] Sent: Monday, October 12, 2015 8:37 AM To: Cheng, Hao Cc: Richard Eggert; Subhajit Purkayastha; User Subject: Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? Some weekend reading: http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative Cheers On Sun, Oct 11, 2015 a

Is there any window operation for RDDs in Pyspark? like for DStreams

2018-11-20 Thread zakhavan
Hello, I have two RDDs and my goal is to calculate the Pearson's correlation between them using sliding window. I want to have 200 samples in each window from rdd1 and rdd2 and calculate the correlation between them and then slide the window with 120 samples and calculate the correlation be
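
The thread is about PySpark, but the idea is easiest to show with the sliding helper in spark-mllib's Scala API (a developer API; if it is unavailable, zipWithIndex plus grouping achieves the same). File names are illustrative and the two inputs are assumed to be aligned element-for-element:

    import org.apache.spark.mllib.rdd.RDDFunctions._

    val s1 = sc.textFile("sensor1.txt").map(_.trim.toDouble)
    val s2 = sc.textFile("sensor2.txt").map(_.trim.toDouble)

    // zip assumes equal length and partitioning; otherwise pair via zipWithIndex + join
    val paired = s1.zip(s2)

    // windows of 200 samples, advancing by 120, as described in the question
    val corrPerWindow = paired.sliding(200, 120).map { w =>
      val (xs, ys) = w.unzip
      val n  = xs.length
      val mx = xs.sum / n
      val my = ys.sum / n
      val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
      val sx  = math.sqrt(xs.map(x => (x - mx) * (x - mx)).sum)
      val sy  = math.sqrt(ys.map(y => (y - my) * (y - my)).sum)
      cov / (sx * sy)                                // Pearson correlation for this window
    }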

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-06 Thread k.tham
an RDD cannot contain elements of type RDD. (i.e. you can't nest RDDs within RDDs, in fact, I don't think it makes any sense) I suggest rather than having an RDD of file names, collect those file name strings back on to the driver as a Scala array of file names, and then from there, mak
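
That suggestion in code (file names are illustrative):

    val fileNames = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt", "hdfs:///data/c.txt")

    val perFile = fileNames.map(sc.textFile(_))   // one RDD per file, built on the driver
    val uber    = sc.union(perFile)               // a single union over the whole list

    // for text files specifically, a comma-separated list (or a glob) does it in one call
    val sameThing = sc.textFile(fileNames.mkString(","))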

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread rkgurram
Thank you for the response, sure will try that out. Currently I changed my code such that the first map "files.map" became "files.flatMap", which I guess will do something similar to what you are saying. It gives me a List[] of elements (in this case LabeledPoints; I could also do RDDs), which

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Raghavendra Pandey
You can also use the join function of RDD. This is actually a kind of append function that adds up all the RDDs and creates one uber RDD. On Wed, Jan 7, 2015, 14:30 rkgurram wrote: > Thank you for the response, sure will try that out. > > Currently I changed my code such that the first map &

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Sean Owen
of append > funtion that add up all the rdds and create one uber rdd. > > On Wed, Jan 7, 2015, 14:30 rkgurram wrote: > >> Thank you for the response, sure will try that out. >> >> Currently I changed my code such that the first map "files.map" to >> &quo

Re: How to merge a RDD of RDDs into one uber RDD

2015-01-07 Thread Raghavendra Pandey
a.pan...@gmail.com> wrote: > >> You can also use join function of rdd. This is actually kind of append >> funtion that add up all the rdds and create one uber rdd. >> >> On Wed, Jan 7, 2015, 14:30 rkgurram wrote: >> >>> Thank you for the response, sure will

Re: Trying to find where Spark persists RDDs when run with YARN

2015-01-18 Thread Sean Owen
These will be under the working directory of the YARN container running the executor. I don't have it handy but think it will also be a "spark-local" or similar directory. On Sun, Jan 18, 2015 at 2:50 PM, Hemanth Yamijala wrote: > Hi, > > I am trying to find where Spa

Re: any work around to support nesting of RDDs other than join

2014-04-06 Thread nkd

Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Blind Faith
Let us say I have the following two RDDs, with the following key-value pairs. rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ] and rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ] Now, I want to join them by key values, so for example I want to return the following

Re: DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread Sean Owen
There isn't a cycle in your graph, since although you reuse reference variables in your code called A and B you are in fact creating new RDDs at each operation. You have some other problem, and you'd have to provide detail on why you think something is deadlocked, like a thread dump. O

Re: DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread Petros Nyfantis
ble "GC task thread#7 (ParallelGC)" prio=10 tid=0x7f9adc02d000 nid=0x65d1 runnable "VM Periodic Task Thread" prio=10 tid=0x7f9adc0c1800 nid=0x65d9 waiting on condition JNI global references: 208 On 14/09/15 19:45, Sean Owen wrote: There isn't a cycle

how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Eric Ho
Hi, I've two nested for-loops like this: for all elements in Array A do: for all elements in Array B do: compare a[3] with b[4], see if they 'match', and if they match, return that element. If I were to represent Arrays A and B as 2 separate RDDs, what would my code look like?
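
One way to express that nested loop over two RDDs (a hedged sketch with made-up data; for anything large, keying both sides by the compared field and joining is far cheaper than a full cartesian product):

    val a = sc.parallelize(Seq(Array(0, 0, 0, 42), Array(1, 1, 1, 7)))
    val b = sc.parallelize(Seq(Array(9, 9, 9, 9, 42), Array(8, 8, 8, 8, 5)))

    // literal translation of the two nested loops: every pair gets compared
    val matches = a.cartesian(b).filter { case (x, y) => x(3) == y(4) }

    // shuffle-friendly equivalent when "match" means equality on those fields
    val viaJoin = a.keyBy(_(3)).join(b.keyBy(_(4))).values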

Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi, I'm writing a Spark program where I want to divide a RDD into different groups, but the groups are too big to use groupByKey. To cope with that, since I know in advance the list of keys for each group, I build a map from the keys to the RDDs that result from filtering the input RDD to ge

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Sonal Goyal
Check cogroup. Best Regards, Sonal Founder, Nube Technologies <http://www.nubetech.co> <http://in.linkedin.com/in/sonalgoyal> On Thu, Nov 13, 2014 at 5:11 PM, Blind Faith wrote: > Let us say I have the following two RDDs, with the following key-pair > values. > >

Re: Which function in spark is used to combine two RDDs by keys

2014-11-13 Thread Davies Liu
rdd1.union(rdd2).groupByKey() On Thu, Nov 13, 2014 at 3:41 AM, Blind Faith wrote: > Let us say I have the following two RDDs, with the following key-pair > values. > > rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ] > > and > > rdd2 = [ (key1,
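
Both replies, applied to the data from the question:

    val rdd1 = sc.parallelize(Seq("key1" -> List("value1", "value2"), "key2" -> List("value3", "value4")))
    val rdd2 = sc.parallelize(Seq("key1" -> List("value5", "value6"), "key2" -> List("value7")))

    // union + groupByKey, then flatten the grouped lists per key
    val merged = rdd1.union(rdd2).groupByKey().mapValues(_.flatten.toList)
    // => (key1, List(value1, value2, value5, value6)), (key2, List(value3, value4, value7))

    // cogroup keeps the two sides separate per key
    val cogrouped = rdd1.cogroup(rdd2)
      .mapValues { case (left, right) => (left.flatten.toList, right.flatten.toList) }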

How to map values read from text file to 2 different set of RDDs

2016-05-22 Thread Deepak Sharma
Hi I am reading a text file with 16 fields. All the place holders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file, so my function in Scala is able to return 2 different RDDs, with each RDD of

Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
Hi all, I'm sure this must have been solved already, but I can't see anything obvious. Using Spark Streaming, I'm trying to execute a transform function on a DStream at short batch intervals (e.g. 1 second), but only write the resulting data to disk using saveAsTextFiles in a larger batch after
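
One hedged way to get this effect (source, transform and output path are illustrative): keep the small batch interval for the per-second work, but write through a larger window so files only come out once a minute.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))                // transform every second
    val parsed = ssc.socketTextStream("localhost", 9999).map(_.toUpperCase)

    parsed.foreachRDD(rdd => rdd.count())                         // optional: force work on each 1s batch

    parsed.window(Seconds(60), Seconds(60))                       // write 60s worth of batches at once
      .saveAsTextFiles("hdfs:///out/batched")

    ssc.start()
    ssc.awaitTermination()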

Re: how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Jörn Franke
> compare a[3] with b[4] see if they 'match' and if match, return that element; > If I were to represent Arrays A and B as 2 separate RDDs, how would my code > look like? > I couldn't find any RDD functions that would do this for me efficiently. I don

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Paweł Szulc
< juan.rodriguez.hort...@gmail.com> wrote: > Hi, > > I'm writing a Spark program where I want to divide a RDD into different > groups, but the groups are too big to use groupByKey. To cope with that, > since I know in advance the list of keys for each group, I build a map from &

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi Paweł, Thanks a lot for your answer. I finally got the program to work by using aggregateByKey, but I was wondering why creating thousands of RDDs doesn't work. I think that could be interesting for using methods that work on RDDs like for example JavaDoubleRDD.stats() (

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Sean Owen
At some level, enough RDDs creates hundreds of thousands of tiny partitions of data each of which creates a task for each stage. The raw overhead of all the message passing can slow things down a lot. I would not design something to use an RDD per key. You would generally use key by some value you

Re: Is there a limit to the number of RDDs in a Spark context?

2015-02-18 Thread Juan Rodríguez Hortalá
Hi Sean, Thanks a lot for your answer. That explains it, as I was creating thousands of RDDs, so I guess the communication overhead was the reason why the Spark job was freezing. After changing the code to use RDDs of pairs and aggregateByKey it works just fine, and quite fast. Again, thanks a
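
For later readers, the shape of that fix (the statistics aggregated here, count and mean, are illustrative):

    val data = sc.parallelize(Seq(("k1", 1.0), ("k2", 2.0), ("k1", 3.0)))

    val statsPerKey = data.aggregateByKey((0L, 0.0))(
      (acc, v) => (acc._1 + 1, acc._2 + v),          // fold a value into a per-partition accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2)         // merge accumulators across partitions
    ).mapValues { case (n, sum) => (n, sum / n) }    // (count, mean) per key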

Re: Is there a limit to the number of RDDs in a Spark context?

2015-03-12 Thread Juan Rodríguez Hortalá
Hi, It's been some time since my last message on the subject of using many RDDs in a Spark job, but I have just encountered the same problem again. The thing is that I have an RDD of time-tagged data, that I want to 1) divide into windows according to a timestamp field; 2) compute

Is "Array Of Struct" supported in json RDDs? is it possible to query this?

2014-10-13 Thread shahab
Hello, Given the following structure, is it possible to query, e.g. session[0].id ? In general, is it possible to query "Array Of Struct" in json RDDs? root |-- createdAt: long (nullable = true) |-- id: string (nullable = true) |-- sessions: array (nullable = true) ||

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ashic Mahtab
Hi Ewan, Transforms are definitions of what needs to be done - they don't execute until an action is triggered. For what you want, I think you might need to have an action that writes out RDDs to some sort of buffered writer. -Ashic. From: ewan.le...@realitymine.com To: user@spark.apach

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
good idea too, thanks. Thanks, Ewan From: Ashic Mahtab [mailto:as...@live.com] Sent: 31 December 2015 13:50 To: Ewan Leith ; Apache Spark Subject: RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions Hi Ewan, Transforms are definitions of what n

Re: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2016-01-04 Thread Tathagata Das
utside the window length. A buffered > writer is a good idea too, thanks. > > > > Thanks, > > Ewan > > > > *From:* Ashic Mahtab [mailto:as...@live.com] > *Sent:* 31 December 2015 13:50 > *To:* Ewan Leith ; Apache Spark < > user@spark.apache.org> >

Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-03 Thread lonikar
in df.cache().map{row => ...}? Is it a logical row which maintains an array of columns, and each column in turn is an array of values for batchSize rows?

Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread shahab
Hi, Probably this question has already been answered sometime in the mailing list, but I couldn't find it. Sorry for posting this again. I need to join and apply filtering on three different RDDs; I just wonder which of the following alternatives is more efficient: 1- first join all three

Re: Is "Array Of Struct" supported in json RDDs? is it possible to query this?

2014-10-13 Thread Yin Huai
If you are using HiveContext, it should work in 1.1. Thanks, Yin On Mon, Oct 13, 2014 at 5:08 AM, shahab wrote: > Hello, > > Given the following structure, is it possible to query, e.g. session[0].id > ? > > In general, is it possible to query "Array Of Struct&
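
A sketch of that in the Spark 1.x shell (the JSON literal follows the schema from the question; jsonRDD and registerTempTable are the 1.x APIs, older releases used registerAsTable):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    val events = hc.jsonRDD(sc.parallelize(Seq(
      """{"createdAt": 1, "id": "a", "sessions": [{"id": "s1"}, {"id": "s2"}]}"""
    )))
    events.registerTempTable("events")

    hc.sql("SELECT sessions[0].id FROM events").collect().foreach(println)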

Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-21 Thread wdbaruni
I am new to Spark and I understand that Spark divides the executor memory into the following fractions: *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or .cache() and can be defined by setting spark.storage.memoryFraction (default 0.6) *Shuffle and aggregation buffers

Re: Does Apache Spark maintain a columnar structure when creating RDDs from Parquet or ORC files?

2015-06-07 Thread Cheng Lian

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread Daniel Siegmann
Join causes a shuffle (sending data across the network). I expect it will be better to filter before you join, so you reduce the amount of data which is sent across the network. Note this would be true for *any* transformation which causes a shuffle. It would not be true if you're combining
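
In code, the difference being described (keys, values and the predicate are made up):

    val orders  = sc.parallelize(Seq((1, 100.0), (2, 5.0), (3, 250.0)))
    val users   = sc.parallelize(Seq((1, "u1"), (2, "u2"), (3, "u3")))
    val regions = sc.parallelize(Seq((1, "eu"), (2, "us"), (3, "eu")))

    // join everything first, filter last: every row is shuffled
    val late  = orders.join(users).join(regions)
      .filter { case (_, ((amount, _), _)) => amount > 50.0 }

    // filter first, then join: only the surviving rows are shuffled
    val early = orders.filter { case (_, amount) => amount > 50.0 }
      .join(users).join(regions)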

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-13 Thread shahab
uld be true for *any* transformation which causes a shuffle. > It would not be true if you're combining RDDs with union, since that > doesn't cause a shuffle. > > On Thu, Mar 12, 2015 at 11:04 AM, shahab > wrote: > >> Hi, >> >> Probably this question is alr

Re: Which memory fraction is Spark using to compute RDDs that are not going to be persisted

2015-07-22 Thread Andrew Or
's the most memory intensive. -Andrew 2015-07-21 13:47 GMT-07:00 wdbaruni : > I am new to Spark and I understand that Spark divides the executor memory > into the following fractions: > > *RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or > .ca

[spark-core] Choosing the correct number of partitions while joining two RDDs with partitioner set on one

2017-08-07 Thread Piyush Narang
hi folks, I was debugging a Spark job that ending up with too few partitions during the join step and thought I'd reach out understand if this is the right behavior / what typical workarounds are. I have two RDDs that I'm joining. One with a lot of partitions (5K+) and one with m
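
A hedged sketch of the usual workaround (sizes and partition counts are illustrative): by default the join reuses a partitioner it finds on an input, so the output can collapse to the smaller side's partition count; passing a partitioner (or numPartitions) to join pins it explicitly.

    import org.apache.spark.HashPartitioner

    val big   = sc.parallelize(1 to 1000000, 5000).map(i => (i % 100000, i))
    val small = sc.parallelize(1 to 1000).map(i => (i, i.toString))
      .partitionBy(new HashPartitioner(16))         // the side that already has a partitioner

    val joined = big.join(small, new HashPartitioner(2000))   // or: big.join(small, 2000)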

Pyspark, references to different rdds being overwritten to point to the same rdd, different results when using .cache()

2014-07-09 Thread nimbus
Discovered this in ipynb, and I haven't yet checked to see if it happens elsewhere. Here's a simple example (code and output omitted in the archive preview); it produces output which is not what I wanted. Alarmingly, if I call .cache() on these RDDs, it changes the result and I get what I wanted.

withColumn gives "Can only zip RDDs with same number of elements in each partition" but not with a LIMIT on the dataframe

2016-12-20 Thread Jack Wenger
0 failed 4 times, most recent failure: Lost task 18.3 in stage 2.0 (TID 186, lxpbda25.ra1.intra.groupama.fr): org.apache.spark.SparkException: Can only zip RDDs with same number of elements in each partition at org.apache.spark.rdd.RDD$$anonfun$zip$1$$anonfun$apply$27$$anon$1.hasNext(RDD.scala:83

Re: withColumn gives "Can only zip RDDs with same number of elements in each partition" but not with a LIMIT on the dataframe

2016-12-20 Thread Richard Startin
while calling o36.sql. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 18 in stage 2.0 failed 4 times, most recent failure: Lost task 18.3 in stage 2.0 (TID 186, lxpbda25.ra1.intra.groupama.fr): org.apache.spark.Spark
