Can you call collect() and log the output to get a better idea of what is left over? Which Spark release are you using?
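One guess, in case it helps (this is an assumption, not something your mail confirms): RDD.subtract hash-partitions elements by their hashCode(), so a hashCode() that is not derived purely from value fields can agree within a single local JVM but disagree across executors, which would match your local-vs-cluster symptom. A minimal plain-Java sketch of the contract subtract relies on, with a hypothetical SearchData standing in for yours:

```java
import java.util.*;

public class SubtractContract {
    // Hypothetical POJO standing in for the SearchData class in the thread.
    static final class SearchData {
        final String term;
        final Map<String, String> fields;

        SearchData(String term, Map<String, String> fields) {
            this.term = term;
            this.fields = fields;
        }

        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof SearchData)) return false;
            SearchData other = (SearchData) o;
            return term.equals(other.term) && fields.equals(other.fields);
        }

        @Override public int hashCode() {
            // Derived only from value fields, so it is identical in every JVM.
            // A default (identity) hashCode would differ per JVM and per run.
            return Objects.hash(term, fields);
        }
    }

    public static void main(String[] args) {
        // Two distinct instances holding equal values, as you would get when
        // parsing the same input file twice.
        SearchData a = new SearchData("foo", Map.of("k", "v"));
        SearchData b = new SearchData("foo", Map.of("k", "v"));

        // Plain set difference, which is what subtract computes per partition:
        // equal-valued elements must collide on hashCode to cancel out.
        Set<SearchData> left = new HashSet<>(List.of(a));
        left.removeAll(Set.of(b));
        System.out.println(left.size()); // prints 0 when the contract holds
    }
}
```

If your hashCode() reads anything non-deterministic (identity hashes, iteration order of an unordered structure, a transient field left null after deserialization), the per-partition sets stop matching on the cluster while still matching locally.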
Cheers

On Fri, Jul 31, 2015 at 9:01 AM, Warfish <sebastian.ka...@gmail.com> wrote:
> Hi everyone,
>
> I have worked with Spark for a little while now and have encountered a
> strange problem that gives me headaches, which has to do with the
> JavaRDD.subtract method. Consider the following piece of code:
>
> public static void main(String[] args) {
>     // context is of type JavaSparkContext; FILE is the path to my input file
>     JavaRDD<String> rawTestSet  = context.textFile(FILE);
>     JavaRDD<String> rawTestSet2 = context.textFile(FILE);
>
>     // Gives 0 every time -> correct
>     System.out.println("rawTestSetMinusRawTestSet2 = "
>             + rawTestSet.subtract(rawTestSet2).count());
>
>     // SearchData is a custom POJO that holds my data
>     JavaRDD<SearchData> testSet  = convert(rawTestSet);
>     JavaRDD<SearchData> testSet2 = convert(rawTestSet);
>     JavaRDD<SearchData> testSet3 = convert(rawTestSet2);
>
>     // These calls give numbers != 0 in cluster mode -> incorrect
>     System.out.println("testSetMinusTestSet2 = "
>             + testSet.subtract(testSet2).count());
>     System.out.println("testSetMinusTestSet3 = "
>             + testSet.subtract(testSet3).count());
>     System.out.println("testSet2MinusTestSet3 = "
>             + testSet2.subtract(testSet3).count());
> }
>
> private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
>     return input.filter(new Matches("myRegex"))
>                 .map(new DoSomething())
>                 .map(new Split("mySplitParam"))
>                 .map(new ToMap())
>                 .map(new Clean())
>                 .map(new ToSearchData());
> }
>
> In this code, I read a file (usually from HDFS, but this applies to local
> disk as well) and then convert the Strings into custom objects that hold
> the data, using a chain of filter and map operations. These objects are
> simple POJOs with overridden hashCode() and equals() methods. I then apply
> the subtract method to several JavaRDDs that contain exactly equal data.
>
> Note: I have omitted the POJO code and the filter and map functions to
> keep the code concise, but I can post them later if the need arises.
>
> The main method shown above contains several calls to the subtract method,
> all of which should yield empty RDDs as results, because the data in all
> RDDs should be exactly the same. This works for Spark in local mode;
> however, when the code is executed on a cluster, the second block of
> subtract calls does not produce empty sets, which tells me that it is a
> more complicated issue. The input data in local and cluster mode was
> exactly the same.
>
> Can someone shed some light on this issue? I feel like I'm overlooking
> something rather obvious.
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-JavaRDD-subtract-JavaRDD-method-in-local-vs-cluster-mode-tp24099.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.