Hi everyone, I have been working with Spark for a little while now and have run into a strange problem with the JavaRDD.subtract method that is giving me headaches. Consider the following piece of code:
public static void main(String[] args) {
    // context is of type JavaSparkContext; FILE is the file path to my input file
    JavaRDD<String> rawTestSet = context.textFile(FILE);
    JavaRDD<String> rawTestSet2 = context.textFile(FILE);

    // Gives 0 every time -> correct
    System.out.println("rawTestSetMinusRawTestSet2 = " + rawTestSet.subtract(rawTestSet2).count());

    // SearchData is a custom POJO that holds my data
    JavaRDD<SearchData> testSet = convert(rawTestSet);
    JavaRDD<SearchData> testSet2 = convert(rawTestSet);
    JavaRDD<SearchData> testSet3 = convert(rawTestSet2);

    // These calls give numbers != 0 in cluster mode -> incorrect
    System.out.println("testSetMinusTestSet2 = " + testSet.subtract(testSet2).count());
    System.out.println("testSetMinusTestSet3 = " + testSet.subtract(testSet3).count());
    System.out.println("testSet2MinusTestSet3 = " + testSet2.subtract(testSet3).count());
}

private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
    return input.filter(new Matches("myRegex"))
                .map(new DoSomething())
                .map(new Split("mySplitParam"))
                .map(new ToMap())
                .map(new Clean())
                .map(new ToSearchData());
}

In this code, I read a file (usually from HDFS, but the same applies to local disk) and then convert the Strings into custom objects that hold the data, using a chain of filter and map operations. These objects are simple POJOs with overridden hashCode() and equals() methods. I then apply the subtract method to several JavaRDDs that contain exactly the same data.

Note: I have omitted the POJO code and the filter and map functions to keep the code concise, but I can post them later if the need arises.

The main method shown above contains several calls to subtract, all of which should return empty RDDs because the data in all RDDs should be exactly the same. This works for Spark in local mode. However, when I execute the code on a cluster, the second block of subtract calls does not produce empty sets, which tells me that it is a more complicated issue.
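Since the POJO itself is omitted, here is a minimal sketch of the kind of class I mean. It is not the real SearchData: the fields `query` and `resultCount` are hypothetical and only there for illustration. As far as I understand, subtract shuffles elements by their hash code before comparing them, so the sketch derives equals() and hashCode() purely from field values, which should give the same result in every JVM.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical stand-in for the omitted SearchData POJO; the field names are assumptions.
final class SearchData implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String query;     // assumed field, not from the original code
    private final int resultCount;  // assumed field, not from the original code

    SearchData(String query, int resultCount) {
        this.query = query;
        this.resultCount = resultCount;
    }

    // Two instances are equal when all of their field values are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof SearchData)) return false;
        SearchData that = (SearchData) other;
        return resultCount == that.resultCount && Objects.equals(query, that.query);
    }

    // Computed only from field values (no identity hash), so it does not depend on the JVM.
    @Override
    public int hashCode() {
        return Objects.hash(query, resultCount);
    }

    // Small local check that equal field values give equal objects and equal hashes.
    public static void main(String[] args) {
        SearchData a = new SearchData("spark", 3);
        SearchData b = new SearchData("spark", 3);
        System.out.println("equal: " + a.equals(b));
        System.out.println("same hash: " + (a.hashCode() == b.hashCode()));
    }
}
```

A hashCode() that falls back on Object.hashCode() anywhere, for example through an array or an enum field, can differ between executors even when equals() returns true, so that is one thing I would check in the real class.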
The input data on local and cluster mode was exactly the same. Can someone shed some light on this issue? I feel like I'm overlooking something rather obvious.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-JavaRDD-subtract-JavaRDD-method-in-local-vs-cluster-mode-tp24099.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.