Greetings,
I have created an RDD of 600,000 rows and then joined it with itself. For
some reason Spark consumes all of my free disk space, which is more than
20 GB! Is this the expected behavior of Spark, or am I doing something
wrong here? The code is shown below (in Java). I tried to cache the RDD,
but then I got a Java heap space OutOfMemoryError. Is there a way around
this? Note that the input file is only 150 MB.
// Initializing Spark
JavaSparkContext sc = new JavaSparkContext(conf);

// Read a file that has 600,000 rows and transform it into a
// JavaPairRDD<Integer, Row>. The key is a hash code over the similar
// attributes of a row (java.lang.String.hashCode), so that similar rows
// map to the same key.
JavaPairRDD<Integer, Row> rdd1 = sc.textFile(filePath1, 7).mapToPair(
        new PairFunction<String, Integer, Row>() {
            @Override
            public Tuple2<Integer, Row> call(String arg0) throws Exception {
                Row row = new Row(arg0, true);
                return new Tuple2<Integer, Row>(row.getHashCode(), row);
            }
        });
// Join rdd1 with itself to pair similar rows with each other.
JavaPairRDD<Integer, Tuple2<Row, Row>> i = rdd1.join(rdd1);
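One thing I wonder about: if I understand join correctly, every key that appears c times produces c * c output pairs in a self-join, so duplicate keys would blow the output up quadratically. Here is the back-of-the-envelope arithmetic I tried (plain Java, no Spark; the key distribution below is completely made up, just to see the effect):

```java
import java.util.HashMap;
import java.util.Map;

public class JoinSizeEstimate {

    // Estimate the number of output rows of a self-join on these keys:
    // each key occurring c times contributes c * c pairs.
    public static long selfJoinOutputSize(int[] keys) {
        Map<Integer, Long> counts = new HashMap<>();
        for (int k : keys) {
            counts.merge(k, 1L, Long::sum);
        }
        long total = 0;
        for (long c : counts.values()) {
            total += c * c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical skew: 600,000 rows hashed into 1,000 distinct keys,
        // i.e. 600 rows per key on average.
        int[] keys = new int[600_000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i % 1_000;
        }
        // 1,000 keys * 600 * 600 pairs each = 360,000,000 output rows.
        System.out.println(selfJoinOutputSize(keys));
    }
}
```

With that made-up distribution the join output would be 600x the input row count, which might explain why a 150 MB file turns into tens of GB on disk during the shuffle — but please correct me if my understanding of join is wrong.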
Your help is highly appreciated.
Regards,
Hasan