Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
well the "hadoop" way is to save to a/b and a/c and read from a/* :) On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam wrote: > Hi Spark users and developers, > > anyone knows how to union two RDDs without the overhead of it? > > say rdd1.union(rdd2).saveTextFile(..) > Th

Union of RDDs without the overhead of Union

2016-02-02 Thread Jerry Lam
Hi Spark users and developers, does anyone know how to union two RDDs without the overhead of it? say rdd1.union(rdd2).saveAsTextFile(..) This requires a stage to union the 2 rdds before saveAsTextFile (2 stages). Is there a way to skip the union step but have the contents of the two rdds save to the

Re: mapWithState / stateSnapshots() yielding empty rdds?

2016-01-29 Thread Sebastian Piu
Just saw I'm not calling state.update() in my trackState function. I guess that is the issue! On Fri, Jan 29, 2016 at 9:36 AM, Sebastian Piu wrote: > Hi All, > > I'm playing with the new mapWithState functionality but I can't get it > quite to work yet. > > I'm doing two print() calls on the
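A minimal Spark 1.6 sketch with the state.update() call in place (socket source, checkpoint path and word-count state are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("map-with-state"), Seconds(1))
    ssc.checkpoint("/tmp/checkpoint")   // mapWithState needs a checkpoint directory

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Without the state.update() call the snapshots stay empty.
    def trackState(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
      val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
      state.update(newCount)            // persist the new count for this key
      (key, newCount)
    }

    val stateful = pairs.mapWithState(StateSpec.function(trackState _))
    stateful.print()                    // per-batch updates
    stateful.stateSnapshots().print()   // full (key, state) view each batch

    ssc.start()
    ssc.awaitTermination()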

mapWithState / stateSnapshots() yielding empty rdds?

2016-01-29 Thread Sebastian Piu
Hi All, I'm playing with the new mapWithState functionality but I can't get it quite to work yet. I'm doing two print() calls on the stream: 1. after mapWithState() call, first batch shows results - next batches yield empty 2. after stateSnapshots(), always yields an empty RDD Any pointers on wh

Is there a way to co-locate partitions from two partitioned RDDs?

2016-01-19 Thread nwali
Hi, I am working with Spark in Java on top of a HDFS cluster. In my code two RDDs are partitioned with the same partitioner (HashPartitioner with the same number of partitions), so they are co-partitioned. Thus same keys are on the same partitions' number but that does not mean that both RDD
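For reference, a minimal sketch of co-partitioning two pair RDDs with one HashPartitioner (sample data is illustrative; co-partitioning avoids a shuffle on join but does not by itself guarantee that partitions live on the same node):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("co-partition"))
    val partitioner = new HashPartitioner(8)

    // The same partitioner puts key k in partition hash(k) % 8 of both RDDs.
    val left  = sc.parallelize(Seq(1 -> "a", 2 -> "b")).partitionBy(partitioner).cache()
    val right = sc.parallelize(Seq(1 -> "x", 3 -> "y")).partitionBy(partitioner).cache()

    // The join can reuse the existing partitioning, so no extra shuffle is
    // needed, but Spark still decides where each partition is physically placed.
    val joined = left.join(right)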

Re: Sending large objects to specific RDDs

2016-01-17 Thread Daniel Imberman
> a.reduceByKey(generateInvertedIndex) >>>>>>> vectors:RDD.mapPartitions{ >>>>>>> iter => >>>>>>> val invIndex = invertedIndexes(samePartitionKey) >>>>>>> iter.map(i

Re: Sending large objects to specific RDDs

2016-01-16 Thread Ted Yu
>>>>>> ) >>>>>> } >>>>>> >>>>>> How could I go about setting up the Partition such that the specific >>>>>> data >>>>>> structure I need will be pres

Re: Sending large objects to specific RDDs

2016-01-16 Thread Daniel Imberman
slow >> down the process a lot) >> >> Another thought I am currently exploring is whether there is some way I >> can >> create a custom Partition or Partitioner that could hold the data >> structure >> (Although that might get too complicated and become pr

Re: Sending large objects to specific RDDs

2016-01-16 Thread Daniel Imberman
> the >>>>> extra overhead of sending over all values (which would happen if I >>>>> were to >>>>> make a broadcast variable). >>>>> >>>>> One thought I have been having is to store the objects in HDFS but I'm >>

Re: Sending large objects to specific RDDs

2016-01-16 Thread Koert Kuipers
; (Although that might get too complicated and become problematic) > > Any thoughts on how I could attack this issue would be highly appreciated. > > thank you for your help! > > > > -- > View this message in context: > htt

Re: Sending large objects to specific RDDs

2016-01-15 Thread Ted Yu
I were >>>> to >>>> make a broadcast variable). >>>> >>>> One thought I have been having is to store the objects in HDFS but I'm >>>> not >>>> sure if that would be a suboptimal solution (It seems like it could slow >

Re: Sending large objects to specific RDDs

2016-01-14 Thread Daniel Imberman
; down the process a lot) >>> >>> Another thought I am currently exploring is whether there is some way I >>> can >>> create a custom Partition or Partitioner that could hold the data >>> structure >>> (Although that might get too complicated

Re: Sending large objects to specific RDDs

2016-01-13 Thread Daniel Imberman
tition or Partitioner that could hold the data >> structure >> (Although that might get too complicated and become problematic) >> >> Any thoughts on how I could attack this issue would be highly appreciated. >> >> thank you for your help! >> >> >&g

Re: automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Andrew Or
that has been automatically cleaned, then you will run into problems like shuffle fetch failures or broadcast variable not found etc, which may fail your job. Alternatively, Spark already automatically cleans up all variables that have been garbage collected, including RDDs, shuffle dependencies

automatically unpersist RDDs which are not used for 24 hours?

2016-01-13 Thread Alexander Pivovarov
Is it possible to automatically unpersist RDDs which are not used for 24 hours?

Re: Sending large objects to specific RDDs

2016-01-13 Thread Ted Yu
ted and become problematic) > > Any thoughts on how I could attack this issue would be highly appreciated. > > thank you for your help! > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp

Sending large objects to specific RDDs

2016-01-13 Thread Daniel Imberman
s issue would be highly appreciated. thank you for your help! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sending-large-objects-to-specific-RDDs-tp25967.html Sent from the Apache Spark User List mailing lis

Re: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2016-01-04 Thread Tathagata Das
utside the window length. A buffered > writer is a good idea too, thanks. > > > > Thanks, > > Ewan > > > > *From:* Ashic Mahtab [mailto:as...@live.com] > *Sent:* 31 December 2015 13:50 > *To:* Ewan Leith ; Apache Spark < > user@spark.apache.org> >

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
good idea too, thanks. Thanks, Ewan From: Ashic Mahtab [mailto:as...@live.com] Sent: 31 December 2015 13:50 To: Ewan Leith ; Apache Spark Subject: RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions Hi Ewan, Transforms are definitions of what n

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ashic Mahtab
Hi Ewan, Transforms are definitions of what needs to be done - they don't execute until an action is triggered. For what you want, I think you might need to have an action that writes out rdds to some sort of buffered writer. -Ashic. From: ewan.le...@realitymine.com To: user@spark.apach

Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
Hi all, I'm sure this must have been solved already, but I can't see anything obvious. Using Spark Streaming, I'm trying to execute a transform function on a DStream at short batch intervals (e.g. 1 second), but only write the resulting data to disk using saveAsTextFiles in a larger batch after
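One possible shape for this, sketched with an illustrative source, transform and output path (not necessarily the thread's final solution): force the per-second work with a cheap output operation and write through a larger window:

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(new SparkConf().setAppName("windowed-output"), Seconds(1))

    val transformed = ssc.socketTextStream("localhost", 9999)
      .map(_.toUpperCase)                       // stand-in for the real transform
      .persist(StorageLevel.MEMORY_ONLY_SER)

    // A per-batch action makes the transform run every second; the persisted
    // results are then reused by the windowed write below.
    transformed.foreachRDD(rdd => rdd.count())

    // Write the last 60 seconds of already-computed results in one larger batch.
    transformed
      .window(Seconds(60), Seconds(60))
      .saveAsTextFiles("hdfs:///data/stream-output")

    ssc.start()
    ssc.awaitTermination()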

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-30 Thread Chris Fregly
e on the use case? It looks a little >> bit like an abuse of Spark in general . Interactive queries that are not >> suitable for in-memory batch processing might be better supported by ignite >> that has in-memory indexes, concept of hot, warm, cold data etc. or hive on >> tez

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Annabel Melongo
12/28/2015 5:22 PM (GMT-05:00) To: Richard Eggert Cc: Daniel Siegmann , Divya Gehlot , "user @spark" Subject: Re: DataFrame Vs RDDs ... Which one to use When ? here's a good article that sums it up, in my opinion:  https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Darren Govoni
/2015 5:22 PM (GMT-05:00) To: Richard Eggert Cc: Daniel Siegmann , Divya Gehlot , "user @spark" Subject: Re: DataFrame Vs RDDs ... Which one to use When ? here's a good article that sums it up, in my opinion:  https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-byte

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Chris Fregly
here's a good article that sums it up, in my opinion: https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/ basically, building apps with RDDs is like building apps with primitive JVM bytecode. haha. @richard: remember that even if you're current

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Richard Eggert
ements or JSON, then Spark SQL can automatically figure out how to convert it into an RDD of Record objects (and therefore a DataFrame), but there's no way to automatically go the other way (from DataFrame/Record back to custom types). In general, you can ultimately do more with RDDs than DataFr

Re: DataFrame Vs RDDs ... Which one to use When ?

2015-12-28 Thread Daniel Siegmann
DataFrames are a higher level API for working with tabular data - RDDs are used underneath. You can use either and easily convert between them in your code as necessary. DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're us
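A minimal Spark 1.x sketch of converting back and forth (case class and data are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Person(name: String, age: Int)

    val sc = new SparkContext(new SparkConf().setAppName("rdd-vs-df"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // RDD -> DataFrame: the case class supplies the schema.
    val rdd = sc.parallelize(Seq(Person("Ann", 31), Person("Bob", 25)))
    val df  = rdd.toDF()

    // DataFrame -> RDD[Row]: drop back down when arbitrary functions are needed.
    val rowsBack = df.filter($"age" > 30).rdd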

DataFrame Vs RDDs ... Which one to use When ?

2015-12-27 Thread Divya Gehlot
Hi, I am a newbie to Spark and a bit confused about RDDs and DataFrames in Spark. Can somebody explain, with use cases, which one to use when? Would really appreciate the clarification. Thanks, Divya

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Saisai Shao
ll be one RDD in each batch. >> >> You could refer to the implementation of DStream#getOrCompute. >> >> >> On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel >> wrote: >> >>> It may be simple question...But, I am struggling to understand this >>&

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Arun Patel
run Patel > wrote: > >> It may be simple question...But, I am struggling to understand this >> >> DStream is a sequence of RDDs created in a batch window. So, how do I >> know how many RDDs are created in a batch? >> >> I am clear about the number of part

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Saisai Shao
Normally there will be one RDD in each batch. You could refer to the implementation of DStream#getOrCompute. On Mon, Dec 21, 2015 at 11:04 AM, Arun Patel wrote: > It may be simple question...But, I am struggling to understand this > > DStream is a sequence of RDDs created i

Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Arun Patel
It may be simple question...But, I am struggling to understand this DStream is a sequence of RDDs created in a batch window. So, how do I know how many RDDs are created in a batch? I am clear about the number of partitions created which is Number of Partitions = (Batch Interval

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
ap . > > > On 14 Dec 2015, at 17:19, Krishna Rao wrote: > > > > Hi all, > > > > What's the best way to run ad-hoc queries against a cached RDDs? > > > > For example, say I have an RDD that has been processed, and persisted to > memory-only. I want

Re: Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Jörn Franke
tez+llap . > On 14 Dec 2015, at 17:19, Krishna Rao wrote: > > Hi all, > > What's the best way to run ad-hoc queries against a cached RDDs? > > For example, say I have an RDD that has been processed, and persisted to > memory-only. I want to be

Run ad-hoc queries at runtime against cached RDDs

2015-12-14 Thread Krishna Rao
Hi all, What's the best way to run ad-hoc queries against a cached RDDs? For example, say I have an RDD that has been processed, and persisted to memory-only. I want to be able to run a count (actually "countApproxDistinct") after filtering by an, at compile time, unknown (spe
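A minimal sketch of that kind of runtime query against a cached RDD (record type and predicate are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    case class Event(userId: String, country: String)

    val sc = new SparkContext(new SparkConf().setAppName("adhoc-queries"))

    // Processed once and kept in memory for repeated queries.
    val events = sc.parallelize(Seq(Event("u1", "IE"), Event("u2", "IE"), Event("u3", "FR")))
      .persist(StorageLevel.MEMORY_ONLY)

    // The filter value only arrives at runtime (e.g. from a request handler).
    def distinctUsers(country: String): Long =
      events.filter(_.country == country)
        .map(_.userId)
        .countApproxDistinct(relativeSD = 0.05)

    println(distinctUsers("IE"))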

Re: Saving RDDs in Tachyon

2015-12-09 Thread Calvin Jia
Hi Mark, Were you able to successfully store the RDD with Akhil's method? When you read it back as an objectFile, you will also need to specify the correct type. You can find more information about integrating Spark and Tachyon on this page: http://tachyon-project.org/documentation/Running-Spark-

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-08 Thread Ewan Higgs
t each new model which is trained caches a set of RDDs and eventually the executors run out of memory. Is there any way in Pyspark to unpersist() these RDDs after each iteration? The names of the RDDs which I gather from the UI is: itemInBlocks itemOutBlocks Products ratingBlocks userInBlocks user

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Sean Owen
with various model configurations. Ideally I would like > to be able to run one job that trains the recommendation model with many > different configurations to try to optimize for performance. A sample code > in python is copied below. > > The issue I have is that each new model

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-07 Thread Ewan Higgs
us model configurations. Ideally I would like to be able to run one job that trains the recommendation model with many different configurations to try to optimize for performance. A sample code in python is copied below. The issue I have is that each new model which is trained caches a set of

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Jacek Laskowski
On Tue, Dec 1, 2015 at 10:57 AM, Shams ul Haque wrote: > Thanks for the suggestion, i am going to try union. ...and please report your findings back. > And what is your opinion on 2nd question. Dunno. If you find a solution, let us know. Jacek

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sushrut Ikhar
har?promo=email_sig> On Tue, Dec 1, 2015 at 3:34 PM, Sonal Goyal wrote: > I think you should be able to join different rdds with same key. Have you > tried that? > On Dec 1, 2015 3:30 PM, "Praveen Chundi" wrote: > >> cogroup could be useful to you, since all t

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sonal Goyal
I think you should be able to join different rdds with same key. Have you tried that? On Dec 1, 2015 3:30 PM, "Praveen Chundi" wrote: > cogroup could be useful to you, since all three are PairRDD's. > > > https://spark.apache.org/docs/

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Praveen Chundi
cogroup could be useful to you, since all three are PairRDD's. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions Best Regards, Praveen On 01.12.2015 10:47, Shams ul Haque wrote: Hi All, I have made 3 RDDs of 3 different dataset, all RDD
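A minimal Scala sketch of a three-way cogroup keyed by customer id (the String values stand in for the actual beans):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("three-way-cogroup"))

    val orders   = sc.parallelize(Seq(1L -> "order-1", 1L -> "order-2", 2L -> "order-3"))
    val visits   = sc.parallelize(Seq(1L -> "visit-1", 3L -> "visit-2"))
    val profiles = sc.parallelize(Seq(1L -> "profile-1", 2L -> "profile-2"))

    // One pass groups all three datasets by key:
    // RDD[(Long, (Iterable[String], Iterable[String], Iterable[String]))]
    val merged = orders.cogroup(visits, profiles)

    merged.collect().foreach { case (customerId, (ords, vsts, profs)) =>
      println(s"$customerId: ${ords.size} orders, ${vsts.size} visits, ${profs.toList}")
    }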

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Shams ul Haque
help in your case. > > def union[T](rdds: Seq[RDD[T]])(implicit arg0: ClassTag[T]): RDD[T] > > > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext > > Pozdrawiam, > Jacek > > -- > Jacek Laskowski | https://medium.com/@jace

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Jacek Laskowski
Hi, Never done it before, but just yesterday I found out about SparkContext.union method that could help in your case. def union[T](rdds: Seq[RDD[T]])(implicit arg0: ClassTag[T]): RDD[T] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext Pozdrawiam, Jacek
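A minimal sketch of that method in use (the sample RDDs are illustrative; note that all inputs must share one element type):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("sc-union"))

    val rdds: Seq[RDD[(Long, String)]] = Seq(
      sc.parallelize(Seq(1L -> "a")),
      sc.parallelize(Seq(2L -> "b")),
      sc.parallelize(Seq(3L -> "c"))
    )

    // One union over the whole sequence; the result keeps every partition.
    val all: RDD[(Long, String)] = sc.union(rdds)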

merge 3 different types of RDDs in one

2015-12-01 Thread Shams ul Haque
Hi All, I have made 3 RDDs of 3 different datasets, all RDDs are grouped by CustomerID in which 2 RDDs have value of Iterable type and one has a single bean. All RDDs have id of Long type as CustomerId. Below are the models for the 3 RDDs: JavaPairRDD> JavaPairRDD> JavaPairRDD Now, I have to mer

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Mark Hamstra
> > In the near future, I guess GUI interfaces of Spark will be available > soon. Spark users (e.g, CEOs) might not need to know what are RDDs at all. > They can analyze their data by clicking a few buttons, instead of writing > the programs. : ) That's not in the future.

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Xiao Li
compiler (i.e., Catalyst) will optimize your programs. For most users, the SQL, DataFrame and Dataset APIs are good enough to satisfy most requirements. In the near future, I guess GUI interfaces of Spark will be available soon. Spark users (e.g, CEOs) might not need to know what are RDDs at all. They

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Thanks Michael, that helped me a lot! On 23 November 2015 at 17:47, Michael Armbrust wrote: > Here is how I view the relationship between the various components of > Spark: > > - *RDDs - *a low level API for expressing DAGs that will be executed in > parallel by Spark workers

Re: Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Michael Armbrust
Here is how I view the relationship between the various components of Spark: - *RDDs - *a low level API for expressing DAGs that will be executed in parallel by Spark workers - *Catalyst -* an internal library for expressing trees that we use to build relational algebra and expression

Relation between RDDs, DataFrames and Project Tungsten

2015-11-23 Thread Jakob Odersky
Hi everyone, I'm doing some reading-up on all the newer features of Spark such as DataFrames, DataSets and Project Tungsten. This got me a bit confused on the relation between all these concepts. When starting to learn Spark, I read a book and the original paper on RDDs, this lead

Corelation between 2 consecutive RDDs in Dstream

2015-11-20 Thread anshu shukla
1- Is there any way to either make the pair of RDDs from a Dstream- Dstream ---> Dstream so that I can use the already defined correlation function in Spark. *Aim is to find the auto-correlation value in Spark. (As per my knowledge Spark Streaming does not support this.)* -- Thanks &

passing RDDs/DataFrames as arguments to functions - what happens?

2015-11-08 Thread Kristina Rogale Plazonic
Hi, I thought I understood RDDs and DataFrames, but one noob thing is bugging me (because I'm seeing weird errors involving joins): *What does Spark do when you pass a big dataframe as an argument to a function? * Are these dataframes included in the closure of the function, and is ther

Re: What is the efficient way to Join two RDDs?

2015-11-06 Thread swetha kasireddy
I think they are roughly of equal size. On Fri, Nov 6, 2015 at 3:45 PM, Ted Yu wrote: > Can you tell us a bit more about your use case ? > > Are the two RDDs expected to be of roughly equal size or, to be of vastly > different sizes ? > > Thanks > > On Fri, Nov 6, 2015 a

Re: What is the efficient way to Join two RDDs?

2015-11-06 Thread Ted Yu
Can you tell us a bit more about your use case ? Are the two RDDs expected to be of roughly equal size or, to be of vastly different sizes ? Thanks On Fri, Nov 6, 2015 at 3:21 PM, swetha wrote: > Hi, > > What is the efficient way to join two RDDs? Would converting both the

What is the efficient way to Join two RDDs?

2015-11-06 Thread swetha
Hi, What is the efficient way to join two RDDs? Would converting both the RDDs to IndexedRDDs be of any help? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-efficient-way-to-Join-two-RDDs-tp25310.html Sent from the Apache Spark

Re: Serializers problems maping RDDs to objects again

2015-11-06 Thread Iker Perez de Albeniz
I have seen that the problem is in the Geohash class, which cannot be pickled.. but in groupByKey I use another custom class and there is no problem... 2015-11-06 13:44 GMT+01:00 Iker Perez de Albeniz : > Hi All, > > I am new to this list. Before sending this mail I have searched the archive > but

Serializers problems maping RDDs to objects again

2015-11-06 Thread Iker Perez de Albeniz
Hi All, I am new to this list. Before sending this mail I have searched the archive but have not found a solution. I am using Spark to process user locations based on RSSI. My Spark script looks like this: text_files = sc.textFile(','.join(files[startime])) result = text_files.f

Re: Programmatically create RDDs based on input

2015-11-02 Thread amit tewari
tu > > > On Sat, Oct 31, 2015 at 11:18 PM, ayan guha wrote: > >> My java knowledge is limited, but you may try with a hashmap and put RDDs >> in it? >> >> On Sun, Nov 1, 2015 at 4:34 AM, amit tewari >> wrote: >> >>> Thanks Ayan thats somethi

Re: Split RDD into multiple RDDs using filter-transformation

2015-11-02 Thread Deng Ching-Mallete
Hi, You should perform an action (e.g. count, take, saveAs*, etc. ) in order for your RDDs to be cached since cache/persist are lazy functions. You might also want to do coalesce instead of repartition to avoid shuffling. Thanks, Deng On Mon, Nov 2, 2015 at 5:53 PM, Sushrut Ikhar wrote: >
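A minimal sketch of that advice applied to a three-way filter split like the one described in the original post (path, predicates and partition count are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("filter-split"))

    val input = sc.textFile("hdfs:///data/input")
      // coalesce(n) merges partitions without a shuffle; use repartition(n)
      // only if the data really has to be rebalanced across executors.
      .coalesce(64)
      .cache()

    input.count()   // cache()/persist() are lazy; an action materialises them

    // Each filter now reads the cached data instead of re-reading the source.
    val small  = input.filter(_.length < 10)
    val medium = input.filter(l => l.length >= 10 && l.length < 100)
    val large  = input.filter(_.length >= 100)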

Split RDD into multiple RDDs using filter-transformation

2015-11-02 Thread Sushrut Ikhar
Hi, I need to split an RDD into 3 different RDDs using a filter transformation. I have cached the original RDD before using filter. The input is lopsided, leaving some executors with a heavy load while others with less; so I have repartitioned it. *DAG-lineage I expected:* I/P RDD --> MAP RDD --> SHUF

Re: Programmatically create RDDs based on input

2015-10-31 Thread Natu Lauchande
") ) .. ; Natu On Sat, Oct 31, 2015 at 11:18 PM, ayan guha wrote: > My java knowledge is limited, but you may try with a hashmap and put RDDs > in it? > > On Sun, Nov 1, 2015 at 4:34 AM, amit tewari > wrote: > >> Thanks Ayan thats something similar to what I am lo

Re: Programmatically create RDDs based on input

2015-10-31 Thread ayan guha
My java knowledge is limited, but you may try with a hashmap and put RDDs in it? On Sun, Nov 1, 2015 at 4:34 AM, amit tewari wrote: > Thanks Ayan thats something similar to what I am looking at but trying the > same in Java is giving compile error: > > JavaRDD jRDD[] = n

Re: Programmatically create RDDs based on input

2015-10-31 Thread amit tewari
> # In Driver > fileList=["/file1.txt","/file2.txt"] > rdds = [] > for f in fileList: > rdd = jsc.textFile(f) > rdds.append(rdd) > > > On Sat, Oct 31, 2015 at 11:14 PM, ayan guha wrote: > >> Yes, this can be done. quick

Re: Programmatically create RDDs based on input

2015-10-31 Thread ayan guha
Yes, this can be done. quick python equivalent: # In Driver fileList=["/file1.txt","/file2.txt"] rdds = [] for f in fileList: rdd = jsc.textFile(f) rdds.append(rdd) On Sat, Oct 31, 2015 at 11:09 PM, amit tewari wrote: > Hi > > I need the abil

Programmatically create RDDs based on input

2015-10-31 Thread amit tewari
Hi I need the ability to create RDDs programmatically inside my program (e.g. based on a variable number of input files). Can this be done? I need this as I want to run the following statement inside an iteration: JavaRDD rdd1 = jsc.textFile("/file1.txt"); Thanks Amit

Re: Saving RDDs in Tachyon

2015-10-30 Thread Akhil Das
I guess you can do a .saveAsObjectFiles and read it back as sc.objectFile Thanks Best Regards On Fri, Oct 23, 2015 at 7:57 AM, mark wrote: > I have Avro records stored in Parquet files in HDFS. I want to read these > out as an RDD and save that RDD in Tachyon for any spark job that wants the >
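A minimal sketch of that round trip (the Tachyon URI and record type are illustrative; on an RDD the method is saveAsObjectFile, and the element type must be given when reading back):

    import org.apache.spark.{SparkConf, SparkContext}

    case class AvroRecord(id: Long, payload: String)

    val sc = new SparkContext(new SparkConf().setAppName("tachyon-objects"))

    val records = sc.parallelize(Seq(AvroRecord(1L, "a"), AvroRecord(2L, "b")))

    // Write the RDD out as serialized objects on Tachyon.
    records.saveAsObjectFile("tachyon://tachyon-master:19998/spark/records")

    // Any later job can read it back; note the explicit element type.
    val reloaded = sc.objectFile[AvroRecord]("tachyon://tachyon-master:19998/spark/records")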

Re: [Spark Streaming] Why are some uncached RDDs are growing?

2015-10-28 Thread Tathagata Das
UpdateStateByKey automatically caches its RDDs. On Tue, Oct 27, 2015 at 8:05 AM, diplomatic Guru wrote: > > Hello All, > > When I checked my running Stream job on WebUI, I can see that some RDDs > are being listed that were not requested to be cached. What more is that > they

[Spark Streaming] Why are some uncached RDDs are growing?

2015-10-27 Thread diplomatic Guru
Hello All, When I checked my running Streaming job on the WebUI, I can see that some RDDs are being listed that were not requested to be cached. What's more, they are growing! I've not asked them to be cached. What are they? Are they the state (UpdateStateByKey)? Only the rows in white are

Saving RDDs in Tachyon

2015-10-22 Thread mark
I have Avro records stored in Parquet files in HDFS. I want to read these out as an RDD and save that RDD in Tachyon for any spark job that wants the data. How do I save the RDD in Tachyon? What format do I use? Which RDD 'saveAs...' method do I want? Thanks

Inner Joins on Cassandra RDDs

2015-10-21 Thread Priya Ch
Hello All, I have two Cassandra RDDs. I am using joinWithCassandraTable, which is doing a cartesian join, because of which we are getting unwanted rows. How can I perform an inner join on Cassandra RDDs? If I intend to use a normal join, I have to read the entire table, which is costly. Is there any

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
hih...@gmail.com] Sent: Monday, October 12, 2015 8:37 AM To: Cheng, Hao Cc: Richard Eggert; Subhajit Purkayastha; User Subject: Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? Some weekend reading: http://stackoverflow.com/questions/20022196/are-left-outer-joins-associative Cheers On Sun, Oct 11, 2015 a

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Ted Yu
t [mailto:richard.egg...@gmail.com] > *Sent:* Monday, October 12, 2015 5:12 AM > *To:* Subhajit Purkayastha > *Cc:* User > *Subject:* Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? > > > > It's the same as joining 2. Join two together, and then join the third one

RE: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Cheng, Hao
A join B join C === (A join B) join C Semantically they are equivalent, right? From: Richard Eggert [mailto:richard.egg...@gmail.com] Sent: Monday, October 12, 2015 5:12 AM To: Subhajit Purkayastha Cc: User Subject: Re: Spark 1.5 - How to join 3 RDDs in a SQL DF? It's the same as join

Re: Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Richard Eggert
It's the same as joining 2. Join two together, and then join the third one to the result of that. On Oct 11, 2015 2:57 PM, "Subhajit Purkayastha" wrote: > Can I join 3 different RDDs together in a Spark SQL DF? I can find > examples for 2 RDDs but not 3. > > > > Thanks > > >
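A minimal DataFrame sketch of that chaining (schemas and join key are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("three-way-join"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val a = sc.parallelize(Seq((1, "a1"))).toDF("id", "colA")
    val b = sc.parallelize(Seq((1, "b1"))).toDF("id", "colB")
    val c = sc.parallelize(Seq((1, "c1"))).toDF("id", "colC")

    // Join two, then join the third to the result: (A join B) join C.
    val abc = a.join(b, "id").join(c, "id")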

Spark 1.5 - How to join 3 RDDs in a SQL DF?

2015-10-11 Thread Subhajit Purkayastha
Can I join 3 different RDDs together in a Spark SQL DF? I can find examples for 2 RDDs but not 3. Thanks

Re: ExecutorLostFailure when working with RDDs

2015-10-09 Thread Ivan Héda
The solution is to set 'spark.shuffle.io.preferDirectBufs' to 'false'. Then it is working. Cheers! On Fri, Oct 9, 2015 at 3:13 PM, Ivan Héda wrote: > Hi, > > I'm facing an issue with PySpark (1.5.1, 1.6.0-SNAPSHOT) running over Yarn > (2.6.0-cdh5.4.4). Everything seems fine when working with d
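For reference, a minimal sketch of applying that setting when building the context (spark-submit's --conf flag works equally well):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("rdd-over-yarn")
      // Tell the Netty shuffle transport to prefer heap buffers over direct
      // (off-heap) ones, which worked around the executor losses above.
      .set("spark.shuffle.io.preferDirectBufs", "false")

    val sc = new SparkContext(conf)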

ExecutorLostFailure when working with RDDs

2015-10-09 Thread Ivan Héda
Hi, I'm facing an issue with PySpark (1.5.1, 1.6.0-SNAPSHOT) running over Yarn (2.6.0-cdh5.4.4). Everything seems fine when working with dataframes, but when I need RDDs the workers start to fail, as in the following code: table1 = sqlContext.table('someTable') table1.count() ## OK ## cca 500 millions

Best practises to clean up RDDs for old applications

2015-10-08 Thread Jens Rantil
Hi, I have a couple of old application RDDs under /var/lib/spark/rdd that haven't been properly cleaned up after themselves. Example: # du -shx /var/lib/spark/rdd/* 44K /var/lib/spark/rdd/liblz4-java1011984124691611873.so 48K /var/lib/spark/rdd/snappy-1.0.5-libsnappyjava.so 2.3G /var/lib/

Re: [cache eviction] partition recomputation in big lineage RDDs

2015-10-01 Thread Hemant Bhanawat
r 30, 2015 4:07 PM > *To:* user@spark.apache.org > *Subject:* Re: partition recomputation in big lineage RDDs > > > Hi, > > In fact, my RDD will get a new version (a new RDD assigned to the same > var) quite frequently, by merging bulks of 1000 events of events of last >

Re: [cache eviction] partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
rg Subject: Re: partition recomputation in big lineage RDDs Hi, In fact, my RDD will get a new version (a new RDD assigned to the same var) quite frequently, by merging bulks of 1000 events of events of last 10s. But recomputation would be more efficient to do not by reading initial RDD partit

Re: partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Subject: partition recomputation in big lineage RDDs Hi, If I implement a manner to have an up-to-date version of my RDD by ingesting some new events, called RDD_inc (from increment), and I provide a "merge" function m(RDD, RDD_inc), which returns the RDD_new, it looks like I can e

partition recomputation in big lineage RDDs

2015-09-30 Thread Nicolae Marasoiu
Hi, If I implement a manner to have an up-to-date version of my RDD by ingesting some new events, called RDD_inc (from increment), and I provide a "merge" function m(RDD, RDD_inc), which returns the RDD_new, it looks like I can evolve the state of my RDD by constructing new RDDs al

Re: How to subtract two RDDs with same size

2015-09-23 Thread Zhiliang Zhu
: Hi Zhiliang, How about doing something like this? val rdd3 = rdd1.zip(rdd2).map(p =>     p._1.zip(p._2).map(z => z._1 - z._2)) The first zip will join the two RDDs and produce an RDD of (Array[Float], Array[Float]) pairs. On each pair, we zip the two Array[Float] components together t

Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Tathagata Das
>> >> >> On Tue, Sep 22, 2015 at 4:20 AM, Uthayan Suthakar < >> uthayan.sutha...@gmail.com> wrote: >> >>> >>> Hello All, >>> >>> We have a Spark Streaming job that reads data from DB (three tables) and >>> cache them into memor

Re: How to subtract two RDDs with same size

2015-09-23 Thread Sujit Pal
Hi Zhiliang, How about doing something like this? val rdd3 = rdd1.zip(rdd2).map(p => p._1.zip(p._2).map(z => z._1 - z._2)) The first zip will join the two RDDs and produce an RDD of (Array[Float], Array[Float]) pairs. On each pair, we zip the two Array[Float] components together to f
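Expanded into a self-contained sketch (sample rows are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("elementwise-subtract"))

    // Two "matrices" with the same number of rows and columns.
    val rdd1 = sc.parallelize(Seq(Array(1.0f, 2.0f), Array(3.0f, 4.0f)))
    val rdd2 = sc.parallelize(Seq(Array(0.5f, 0.5f), Array(1.0f, 1.0f)))

    // zip pairs rows up by position (both RDDs need identical partitioning and
    // element counts); the inner zip then pairs up the columns of each row.
    val rdd3 = rdd1.zip(rdd2).map { case (row1, row2) =>
      row1.zip(row2).map { case (x, y) => x - y }
    }

    rdd3.collect().foreach(row => println(row.mkString(", ")))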

Re: Why RDDs are being dropped by Executors?

2015-09-23 Thread Uthayan Suthakar
cache them into memory ONLY at the start then it will happily carry out the >> incremental calculation with the new data. What we've noticed occasionally >> is that one of the RDDs caches only 90% of the data. Therefore, at each >> execution time the remaining 10% had to be r

Re: How to subtract two RDDs with same size

2015-09-23 Thread Zhiliang Zhu
there is a matrix add API; we might map each row element of rdd2 to be negative, then call add on rdd1 and rdd2? Or some more ways ... On Wednesday, September 23, 2015 3:11 PM, Zhiliang Zhu wrote: Hi All, There are two RDDs : RDD> rdd1, and RDD> rdd2, that is to say, rd

How to subtract two RDDs with same size

2015-09-23 Thread Zhiliang Zhu
Hi All, There are two RDDs: RDD> rdd1, and RDD> rdd2, that is to say, rdd1 and rdd2 are similar to a DataFrame, or a Matrix with the same row and column numbers. I would like to get RDD> rdd3, where each element in rdd3 is the difference between the elements of rdd1 and rdd2 at the same position, which i

Re: Why RDDs are being dropped by Executors?

2015-09-22 Thread Tathagata Das
) and > cache them into memory ONLY at the start then it will happily carry out the > incremental calculation with the new data. What we've noticed occasionally > is that one of the RDDs caches only 90% of the data. Therefore, at each > execution time the remaining 10% had to be r

Re: Partitions on RDDs

2015-09-22 Thread Yashwanth Kumar
one during actions stage. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-on-RDDs-tp24775p24779.html Sent from the Apache Spark User List mailing list archive at Nabble.com. - To u

Re: Partitions on RDDs

2015-09-22 Thread Richard Eggert
In general, RDDs get partitioned automatically without programmer intervention. You generally don't need to worry about them unless you need to adjust the size/number of partitions or the partitioning scheme according to the needs of your application. Partitions get redistributed among

Partitions on RDDs

2015-09-22 Thread XIANDI
I'm always confused by the partitions. We may have many RDDs in the code. Do we need to partition on all of them? Do the rdds get rearranged among all the nodes whenever we do a partition? What is a wise way of doing partitions? -- View this message in context: http://apache-spark-user

Why RDDs are being dropped by Executors?

2015-09-22 Thread Uthayan Suthakar
Hello All, We have a Spark Streaming job that reads data from DB (three tables) and cache them into memory ONLY at the start then it will happily carry out the incremental calculation with the new data. What we've noticed occasionally is that one of the RDDs caches only 90% of the data. Ther

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Zhiliang Zhu
Hi Romi, Yes, you understand it correctly. And rdd1 keys overlap with rdd2 keys, that is, there are lots of keys shared between rdd1 and rdd2, there are some keys in rdd1 but not in rdd2, and there are also some keys in rdd2 but not in rdd1. Then rdd3 keys would be the same as rdd1 keys, rdd3 will no

Re: how to get RDD from two different RDDs with cross column

2015-09-21 Thread Romi Kuntsman
Hi, If I understand correctly: rdd1 contains keys (of type StringDate) rdd2 contains keys and values and rdd3 contains all the keys, and the values from rdd2? I think you should make rdd1 and rdd2 PairRDD, and then use outer join. Does that make sense? On Mon, Sep 21, 2015 at 8:37 PM Zhiliang Zhu
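A minimal sketch of that suggestion (dates kept as plain strings and Double values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("key-outer-join"))

    // rdd1: just the keys we care about; rdd2: keys with values.
    val rdd1 = sc.parallelize(Seq("2015-09-01", "2015-09-02", "2015-09-03")).map(d => (d, ()))
    val rdd2 = sc.parallelize(Seq("2015-09-01" -> 1.5, "2015-09-04" -> 2.0))

    // Keep every key of rdd1; attach rdd2's value where one exists.
    val rdd3 = rdd1.leftOuterJoin(rdd2).map { case (date, (_, value)) =>
      (date, value.getOrElse(0.0))
    }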

how to get RDD from two different RDDs with cross column

2015-09-21 Thread Zhiliang Zhu
Dear Romi, Priya, Sujit and Shivaram and all, I have taken lots of days to think about this issue, however, without any good enough solution... I shall appreciate all your kind help. There is an RDD rdd1, and another RDD rdd2, (rdd2 can be PairRDD, or DataFrame with two columns as ).StringDate colum

Re: DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread Petros Nyfantis
ble "GC task thread#7 (ParallelGC)" prio=10 tid=0x7f9adc02d000 nid=0x65d1 runnable "VM Periodic Task Thread" prio=10 tid=0x7f9adc0c1800 nid=0x65d9 waiting on condition JNI global references: 208 On 14/09/15 19:45, Sean Owen wrote: There isn't a cycle

Re: DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread Sean Owen
There isn't a cycle in your graph, since although you reuse reference variables in your code called A and B you are in fact creating new RDDs at each operation. You have some other problem, and you'd have to provide detail on why you think something is deadlocked, like a thread dump. O

DAG Scheduler deadlock when two RDDs reference each other, force Stages manually?

2015-09-14 Thread petranidis
Hi all, I am new to Spark and I have written a few Spark programs, mostly around machine learning applications. I am trying to resolve a particular problem where there are two RDDs that should be updated by using elements of each other. More specifically, if the two pair RDDs are called A and B
