Re: How to minimize shuffling on Spark dataframe Join?

2015-08-19 Thread Romi Kuntsman
If you create a PairRDD from the DataFrame, using dataFrame.toRDD().mapToPair(), then you can call partitionBy(someCustomPartitioner), which will partition the RDD by the key (of the pair). Operations on it (like joining with another RDD) will then take this partitioning into account. I'm not sure that

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-12 Thread Abdullah Anwar
Hi Hemant, Thank you for your reply. I think the source of my dataframe is not partitioned on the key; it's an Avro file where 'id' is a field, but I don't know how to read a file and at the same time configure the partition key. I couldn't find anything on SQLContext.read.load where you can set

Fwd: How to minimize shuffling on Spark dataframe Join?

2015-08-11 Thread Abdullah Anwar
I have two dataframes like this: student_rdf = (studentid, name, ...) and student_result_rdf = (studentid, gpa, ...). We need to join these two dataframes. We are now doing it like this: student_rdf.join(student_result_rdf, student_result_rdf[studentid] == student_rdf[studentid]). So it is simple.

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-11 Thread Hemant Bhanawat
Is the source of your dataframe partitioned on the key? From your mail, it looks like it is not. If that is the case, you will have to shuffle the data anyway in order to partition it. The other part of your question is: how to co-group data from two dataframes based on a key? I think for RDDs