How to join two RDDs with mutually exclusive keys

2014-11-20 Thread Blind Faith
Say I have two RDDs with the following values: x = [(1, 3), (2, 4)] and y = [(3, 5), (4, 7)], and I want to have z = [(1, 3), (2, 4), (3, 5), (4, 7)]. How can I achieve this? I know you can use outerJoin followed by map to achieve this, but is there a more direct way?
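
A minimal PySpark sketch of the direct approach, assuming an existing SparkContext named sc: since the key sets are mutually exclusive, union simply concatenates the two RDDs with no join at all.

    # assumes an existing SparkContext named sc
    x = sc.parallelize([(1, 3), (2, 4)])
    y = sc.parallelize([(3, 5), (4, 7)])
    z = x.union(y)          # plain concatenation, no shuffle or join needed
    print(z.collect())      # [(1, 3), (2, 4), (3, 5), (4, 7)]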

Is there a way to create keys based on counts in Spark

2014-11-18 Thread Blind Faith
As it is difficult to explain this, I will show what I want. Let us say I have an RDD A with the following value: A = [word1, word2, word3]. I want to have an RDD with the following value: B = [(1, word1), (2, word2), (3, word3)]. That is, it gives a unique number to each entry as its key.
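
A minimal PySpark sketch, assuming an existing SparkContext named sc: zipWithIndex pairs each element with its 0-based position, so a small map swaps the pair and shifts to 1-based keys as in the example above.

    # assumes an existing SparkContext named sc
    A = sc.parallelize(["word1", "word2", "word3"])
    # zipWithIndex yields (element, 0-based index); swap and add 1 for 1-based keys
    B = A.zipWithIndex().map(lambda pair: (pair[1] + 1, pair[0]))
    print(B.collect())      # [(1, 'word1'), (2, 'word2'), (3, 'word3')]

If the keys only need to be unique rather than consecutive, zipWithUniqueId avoids the extra pass that zipWithIndex makes over multi-partition RDDs.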

How can I apply an inner join in Spark (Scala/Python)

2014-11-17 Thread Blind Faith
So let us say I have RDDs A and B with the following values: A = [ (1, 2), (2, 4), (3, 6) ] and B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ]. I want to apply an inner join, such that I get the following as a result: C = [ (1, (2, 3)), (2, (4, 5)), (3, (6, 6)) ]. That is, those keys which are not present in both RDDs are dropped.
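
A minimal PySpark sketch, assuming an existing SparkContext named sc: join on pair RDDs is already an inner join.

    # assumes an existing SparkContext named sc
    A = sc.parallelize([(1, 2), (2, 4), (3, 6)])
    B = sc.parallelize([(1, 3), (2, 5), (3, 6), (4, 5), (5, 6)])
    C = A.join(B)                 # inner join; keys missing from either side are dropped
    print(sorted(C.collect()))    # [(1, (2, 3)), (2, (4, 5)), (3, (6, 6))]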

Which function in Spark is used to combine two RDDs by key

2014-11-13 Thread Blind Faith
Let us say I have the following two RDDs, with the following key-value pairs: rdd1 = [ (key1, [value1, value2]), (key2, [value3, value4]) ] and rdd2 = [ (key1, [value5, value6]), (key2, [value7]) ]. Now, I want to combine them by key, so that for each key I get the values from both RDDs together.
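
A minimal PySpark sketch, assuming an existing SparkContext named sc, showing the two usual options: join keeps the two value lists separate per key, while union followed by reduceByKey concatenates them. Which one fits depends on the shape you want for the result.

    # assumes an existing SparkContext named sc
    rdd1 = sc.parallelize([("key1", ["value1", "value2"]), ("key2", ["value3", "value4"])])
    rdd2 = sc.parallelize([("key1", ["value5", "value6"]), ("key2", ["value7"])])

    # keep the two lists separate per key:
    joined = rdd1.join(rdd2)
    # [('key1', (['value1', 'value2'], ['value5', 'value6'])),
    #  ('key2', (['value3', 'value4'], ['value7']))]

    # concatenate the lists per key:
    merged = rdd1.union(rdd2).reduceByKey(lambda a, b: a + b)
    # [('key1', ['value1', 'value2', 'value5', 'value6']),
    #  ('key2', ['value3', 'value4', 'value7'])]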

How to change the default delimiter for the textFile function

2014-11-11 Thread Blind Faith
I am a newbie to Spark, and I program in Python. I use the textFile function to make an RDD from a file. I notice that the default delimiter is the newline character. However, I want to change this default delimiter to something else. After searching the web, I came to know about textinputformat.record.delimiter.
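
One commonly suggested approach is to read the file through the new Hadoop API so that textinputformat.record.delimiter takes effect. A sketch, assuming an existing SparkContext named sc, a hypothetical input path data.txt, a custom delimiter of "||", and a Hadoop version whose TextInputFormat honors this property:

    # assumes an existing SparkContext named sc; "data.txt" and "||" are placeholders
    delim_conf = {"textinputformat.record.delimiter": "||"}
    records = sc.newAPIHadoopFile(
        "data.txt",
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=delim_conf,
    ).map(lambda kv: kv[1])   # drop the byte-offset key, keep the record text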

Does Spark work on multicore systems?

2014-11-08 Thread Blind Faith
I am a Spark newbie and I use Python (PySpark). I am trying to run a program on a 64-core system, but no matter what I do, it always uses 1 core. It doesn't matter if I run it with spark-submit --master local[64] run.sh or call x.repartition(64) on an RDD in my code; the Spark program always uses only a single core.
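
local[64] is the right master setting, and repartition(64) only changes the number of partitions, not the number of cores Spark is allowed to use. One thing worth checking is what gets submitted: for PySpark, spark-submit expects the Python file itself rather than a shell wrapper. A minimal sketch, assuming a hypothetical script run.py submitted with spark-submit --master "local[64]" run.py (or run directly, since the master is also set in code):

    # hypothetical run.py; setting the master in code is an alternative to --master
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local[64]").setAppName("multicore-check")
    sc = SparkContext(conf=conf)

    print(sc.defaultParallelism)                # should report 64 under local[64]

    rdd = sc.parallelize(range(1000000), 64)    # 64 partitions so every core gets work
    print(rdd.map(lambda x: x * x).sum())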