On Wed, Mar 25, 2015 at 3:13 PM, Nathan Kronenfeld
nkronenfeld@uncharted.software wrote:
You're generating all possible pairs?
In that case, why not just generate the sequential pairs you want from the
start?
On Wed, Mar 25, 2015 at 3:11 PM, Himanish Kushary himan...@gmail.com
wrote:
Hi,
I have an RDD of pairs of strings like below:
(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)
I need to transform/filter this into an RDD of pairs that does not repeat a
string once it has been used. So something like:
(A,B)
(C,D)
(E,F)
(B,C) is out because B has already been used in (A,B); (A,D) is out because A has already been used in (A,B).
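Since which pairs survive depends on the order in which they are seen, one way to express this filtering is a greedy sequential scan that tracks already-used strings. A minimal plain-Python sketch of the logic (outside Spark, since the scan is inherently sequential; the function name is hypothetical):

```python
def filter_unused_pairs(pairs):
    """Keep a pair only if neither element has appeared in an
    already-kept pair -- a greedy, order-dependent scan."""
    used = set()
    kept = []
    for a, b in pairs:
        if a not in used and b not in used:
            kept.append((a, b))
            used.update((a, b))
    return kept

pairs = [("A", "B"), ("B", "C"), ("C", "D"),
         ("A", "D"), ("E", "F"), ("B", "F")]
print(filter_unused_pairs(pairs))  # [('A', 'B'), ('C', 'D'), ('E', 'F')]
```

Note the result depends on input order: feeding the pairs in a different order can keep a different set, which is why the greedy scan does not parallelize naturally across RDD partitions.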
PM, Nathan Kronenfeld
nkronenfeld@uncharted.software wrote:
What would it do with the following dataset?
(A, B)
(A, C)
(B, D)
On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary himan...@gmail.com
wrote:
Hi,
I have an RDD of pairs of strings like below:
(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)
We are running our Spark jobs on Amazon AWS and are using AWS Data Pipeline
for orchestration of the different Spark jobs. AWS Data Pipeline provides
automatic EMR cluster provisioning, retry on failure, SNS notification, etc.
out of the box and works well for us.
On Sun, Mar 1, 2015 at 7:02 PM,
the settings for the parameters spark.akka.frameSize (= 500),
spark.akka.timeout, spark.akka.askTimeout and
spark.core.connection.ack.wait.timeout
to get rid of any insufficient frame size and timeout errors
Thanks
Himanish
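For reference, these parameters can be set in spark-defaults.conf (or via SparkConf). Only the frame size value comes from the thread; the timeout values below are illustrative placeholders, not recommendations:

```
# from the thread: avoids "insufficient frame size" errors
spark.akka.frameSize                      500
# illustrative placeholder values only -- tune for your workload
spark.akka.timeout                        300
spark.akka.askTimeout                     300
spark.core.connection.ack.wait.timeout    300
```

These akka.* settings apply to the Spark 1.x releases discussed here; later Spark versions replaced the Akka transport and retired these keys.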
On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary himan...@gmail.com
a join (or a variant like cogroup,
leftOuterJoin, subtractByKey, etc., found in PairRDDFunctions);
the downside is that this requires a shuffle of both your RDDs
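To illustrate the shape a pair-RDD join produces, here is a plain-Python sketch standing in for Spark's join: for each key present in both inputs, it emits (key, (leftValue, rightValue)). The function name and sample data are hypothetical:

```python
from collections import defaultdict

def pair_join(left, right):
    """Inner join of two (key, value) sequences, mimicking the
    (key, (leftValue, rightValue)) output shape of an RDD join."""
    by_key = defaultdict(list)
    for k, v in right:
        by_key[k].append(v)
    # keys present only in one side are dropped, as in an inner join
    return [(k, (lv, rv)) for k, lv in left for rv in by_key.get(k, [])]

left = [("A", 1), ("B", 2)]
right = [("A", "x"), ("C", "y")]
print(pair_join(left, right))  # [('A', (1, 'x'))]
```

In Spark the analogous step groups records by key across the cluster, which is exactly the shuffle cost mentioned above: both RDDs must be repartitioned by key before matching values can be paired.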
On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary himan...@gmail.com
wrote:
Hi,
I have two RDDs with csv data as below:
RDD-1
101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
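A common preparation step before joining such data is mapping each csv line to a (key, value) pair on the first column. A small plain-Python sketch of that step (the helper name is hypothetical, and treating the first column as the join key is an assumption):

```python
def to_pair(line):
    """Split a csv line into (first-column key, rest-of-row value)."""
    key, _, rest = line.partition(",")
    return key, rest

lines = [
    "101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643",
    "101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645",
    "101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647",
]
pairs = [to_pair(l) for l in lines]
print(pairs[0][0])  # 101970_5854301840
```

Note that the key 101970_5854301839 appears twice in this sample, so a join on that column would emit one output record per matching combination of left and right values.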