Fwd:

2015-04-02 Thread Himanish Kushary
On Wed, Mar 25, 2015 at 3:13 PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: You're generating all possible pairs? In that case, why not just generate the sequential pairs you want from the start? On Wed, Mar 25, 2015 at 3:11 PM, Himanish Kushary himan...@gmail.com wrote

[no subject]

2015-03-25 Thread Himanish Kushary
Hi, I have an RDD of pairs of strings like below: (A,B) (B,C) (C,D) (A,D) (E,F) (B,F). I need to transform/filter this into an RDD of pairs that does not repeat a string once it has been used. So something like (A,B) (C,D) (E,F). (B,C) is out because B has already been used in (A,B), (A,D)
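
The rule described above can be sketched in plain Python as a single ordered pass over the pairs. This is only an illustration of the filtering logic, not a distributed Spark solution: it assumes the pairs fit in memory and are processed in a fixed order, and the function name `filter_unused_pairs` is hypothetical.

```python
def filter_unused_pairs(pairs):
    """Keep a pair only if neither of its strings appears in an earlier kept pair."""
    seen = set()   # strings already used by a kept pair
    kept = []
    for left, right in pairs:
        if left in seen or right in seen:
            continue  # one of the strings was already used, drop the pair
        seen.update((left, right))
        kept.append((left, right))
    return kept


pairs = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D"), ("E", "F"), ("B", "F")]
result = filter_unused_pairs(pairs)
print(result)  # [('A', 'B'), ('C', 'D'), ('E', 'F')]
```

The result depends on the order in which pairs are visited, so a different ordering of the same input can keep a different set of pairs.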

Re:

2015-03-25 Thread Himanish Kushary
PM, Nathan Kronenfeld nkronenfeld@uncharted.software wrote: What would it do with the following dataset? (A, B) (A, C) (B, D) On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have an RDD of pairs of strings like below: (A,B) (B,C) (C,D) (A,D) (E,F

Re: Tools to manage workflows on Spark

2015-03-01 Thread Himanish Kushary
We are running our Spark jobs on Amazon AWS and are using AWS Data Pipeline to orchestrate the different Spark jobs. AWS Data Pipeline provides automatic EMR cluster provisioning, retry on failure, SNS notifications, etc. out of the box and works well for us. On Sun, Mar 1, 2015 at 7:02 PM,

Re: High CPU usage in Driver

2015-02-27 Thread Himanish Kushary
the settings for the parameters spark.akka.frameSize (= 500), spark.akka.timeout, spark.akka.askTimeout and spark.core.connection.ack.wait.timeout to get rid of any insufficient frame size and timeout errors. Thanks, Himanish On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary himan...@gmail.com

Re: Filter data from one RDD based on data from another RDD

2015-02-25 Thread Himanish Kushary
a join (or a variant like cogroup, leftOuterJoin, subtractByKey, etc., found in PairRDDFunctions). The downside is that this requires a shuffle of both your RDDs. On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary himan...@gmail.com wrote: Hi, I have two RDDs with CSV data as below: RDD-1
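
To illustrate what key-based filtering of one dataset against another does, here is a minimal plain-Python sketch of the semantics. It does not use Spark and involves no shuffle. The sample records and the function names `subtract_by_key` and `keep_matching_keys` are hypothetical, since the data and the exact filter condition in the original question are cut off. In PySpark, the corresponding operations on key-value RDDs would be `subtractByKey` and `join`.

```python
def subtract_by_key(left, right):
    """Keep (key, value) records from left whose key does NOT occur in right."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key not in right_keys]


def keep_matching_keys(left, right):
    """Keep (key, value) records from left whose key DOES occur in right."""
    right_keys = {key for key, _ in right}
    return [(key, value) for key, value in left if key in right_keys]


# Hypothetical sample data: the first field of each record acts as the key.
rdd1 = [("k1", "a"), ("k2", "b"), ("k3", "c")]
rdd2 = [("k2", "x")]

only_in_rdd1 = subtract_by_key(rdd1, rdd2)
in_both = keep_matching_keys(rdd1, rdd2)
print(only_in_rdd1)  # [('k1', 'a'), ('k3', 'c')]
print(in_both)       # [('k2', 'b')]
```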

Filter data from one RDD based on data from another RDD

2015-02-19 Thread Himanish Kushary
Hi, I have two RDDs with CSV data as below: RDD-1 101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643 101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645 101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647