Greetings,
I have created an RDD of 600,000 rows and then joined it with itself. For
some reason Spark consumes all of my free disk space, which is more than
20 GB! Is this the expected behavior of Spark, or am I doing something
wrong here? The code is shown below (in Java). I tried to cache the RDD,
but then I got a Java heap space OutOfMemoryError. Is there a way around
this? Note that the input file is only 150 MB.
// Initializing Spark
JavaSparkContext sc = new JavaSparkContext(conf);

// Read a file that has 600,000 rows and transform it into a
// JavaPairRDD<Integer, Row>. The key is a hash code over the similar
// attributes of a row (java.lang.String.hashCode), so that similar rows
// map to the same key.
JavaPairRDD<Integer, Row> rdd1 = sc.textFile(filePath1, 7).mapToPair(
        new PairFunction<String, Integer, Row>() {
            @Override
            public Tuple2<Integer, Row> call(String arg0) throws Exception {
                Row row = new Row(arg0, true);
                return new Tuple2<Integer, Row>(row.getHashCode(), row);
            }
        });
// Join rdd1 with itself to pair similar rows with each other.
JavaPairRDD<Integer, Tuple2<Row, Row>> i = rdd1.join(rdd1);
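One thing I wonder about: if I understand join correctly, every key that appears c times produces c * c output pairs in a self-join, so duplicate keys would blow the output up quadratically. Here is the back-of-the-envelope arithmetic I tried (plain Java, no Spark; the key distribution below is completely made up, just to see the effect):

```java
import java.util.HashMap;
import java.util.Map;

public class JoinSizeEstimate {

    // Estimate the number of output rows of a self-join on these keys:
    // each key occurring c times contributes c * c pairs.
    public static long selfJoinOutputSize(int[] keys) {
        Map<Integer, Long> counts = new HashMap<>();
        for (int k : keys) {
            counts.merge(k, 1L, Long::sum);
        }
        long total = 0;
        for (long c : counts.values()) {
            total += c * c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical skew: 600,000 rows hashed into 1,000 distinct keys,
        // i.e. 600 rows per key on average.
        int[] keys = new int[600_000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i % 1_000;
        }
        // 1,000 keys * 600 * 600 pairs each = 360,000,000 output rows.
        System.out.println(selfJoinOutputSize(keys));
    }
}
```

With that made-up distribution the join output would be 600x the input row count, which might explain why a 150 MB file turns into tens of GB on disk during the shuffle — but please correct me if my understanding of join is wrong.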
Your help is highly appreciated.
Regards,
Hasan