Re: Enum values in custom objects mess up RDD operations
Thanks a lot Igor, the following hashCode() function is stable:

    @Override
    public int hashCode() {
        int hash = 5;
        hash = 41 * hash + this.myEnum.ordinal();
        return hash;
    }

For anyone having the same problem: http://tech.technoflirt.com/2014/08/21/issues-with-java-enum-hashcode/

Cheers,
Sebastian

Igor Berman igor.ber...@gmail.com wrote on Thu., 6 Aug 2015 at 10:59:

> An enum's hashCode() is JVM-instance-specific (i.e. different JVM instances will give you different values for the same constant), so use the enum's ordinal in your hashCode() computation, or hash the ordinal as part of that computation.

On 6 August 2015 at 11:41, Warfish sebastian.ka...@gmail.com wrote:

Hi everyone,

I have been working with Spark for a little while now and have encountered a very strange behaviour that caused me a lot of headaches: I have written my own POJOs to encapsulate my data, and this data is held in some JavaRDDs. Part of these POJOs is a member variable of a custom enum type. Whenever I perform operations on these RDDs such as subtract, groupByKey, reduce or similar, the results are inconsistent and nonsensical. However, this happens only when the application runs in standalone cluster mode (10 nodes); when running locally on my developer machine, the code executes just fine. If you want to reproduce this behaviour, here (http://apache-spark-user-list.1001560.n3.nabble.com/file/n24149/SparkTest.zip) is the complete Maven project that you can run out of the box.
I am running Spark 1.4.0 and submitting the application using:

    /usr/local/spark-1.4.0-bin-hadoop2.4/bin/spark-submit --class de.spark.test.Main ./SparkTest-1.0-SNAPSHOT.jar

Consider the following code for my custom object:

    package de.spark.test;

    import java.io.Serializable;
    import java.util.Objects;

    public class MyObject implements Serializable {

        private MyEnum myEnum;

        public MyObject(MyEnum myEnum) {
            this.myEnum = myEnum;
        }

        public MyEnum getMyEnum() {
            return myEnum;
        }

        public void setMyEnum(MyEnum myEnum) {
            this.myEnum = myEnum;
        }

        @Override
        public int hashCode() {
            int hash = 5;
            hash = 41 * hash + Objects.hashCode(this.myEnum);
            return hash;
        }

        @Override
        public boolean equals(Object obj) {
            if (obj == null) {
                return false;
            }
            if (getClass() != obj.getClass()) {
                return false;
            }
            final MyObject other = (MyObject) obj;
            if (this.myEnum != other.myEnum) {
                return false;
            }
            return true;
        }

        @Override
        public String toString() {
            return "MyObject{" + "myEnum=" + myEnum + '}';
        }
    }

As you can see, I have overridden equals() and hashCode() (both are auto-generated).
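[Editor's note: applying the fix from the top of this thread to the class above amounts to replacing Objects.hashCode(this.myEnum) with the enum's ordinal. A minimal sketch of the corrected class (equals() unchanged, getters/setters and toString() omitted; the null guard is an addition not present in the thread):

```java
import java.io.Serializable;

// MyEnum as defined later in the thread.
enum MyEnum implements Serializable { VALUE1, VALUE2 }

public class MyObject implements Serializable {

    private MyEnum myEnum;

    public MyObject(MyEnum myEnum) {
        this.myEnum = myEnum;
    }

    @Override
    public int hashCode() {
        int hash = 5;
        // ordinal() is fixed by declaration order, so every JVM instance
        // (driver and each executor) computes the same hash for equal objects.
        hash = 41 * hash + (myEnum == null ? 0 : myEnum.ordinal());
        return hash;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj == null || getClass() != obj.getClass()) {
            return false;
        }
        return this.myEnum == ((MyObject) obj).myEnum;
    }
}
```

With this hashCode(), equal objects hash to the same partition on every executor, which is what subtract and distinct rely on.]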
The enum is given as follows:

    package de.spark.test;

    import java.io.Serializable;

    public enum MyEnum implements Serializable {
        VALUE1, VALUE2
    }

The main() method is defined by:

    package de.spark.test;

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class Main {

        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("Spark Test")
                                            .setMaster("myMaster");
            JavaSparkContext jsc = new JavaSparkContext(conf);

            System.out.println("/// Object generation");
            List<MyObject> l1 = new ArrayList<>();
            for (int i = 0; i < 1000; i++) {
                l1.add(new MyObject(MyEnum.VALUE1));
            }

            JavaRDD<MyObject> myObjectRDD1 = jsc.parallelize(l1);
            JavaRDD<MyObject> myObjectRDD2 = jsc.parallelize(l1);
            System.out.println("myObjectRDD1 count = " + myObjectRDD1.count());
            System.out.println("myObjectRDD2 count = " + myObjectRDD2.count());

            System.out.println("/// Distinct");
            JavaRDD<MyObject> myObjectRDD1Distinct = myObjectRDD1.distinct();
            JavaRDD<MyObject> myObjectRDD2Distinct = myObjectRDD2.distinct();
            System.out.println("myObjectRDD1Distinct count = " + myObjectRDD1Distinct.count());
            System.out.println("myObjectRDD2Distinct count = " + myObjectRDD2Distinct.count());

            System.out.println("/// Subtract");
            JavaRDD<MyObject> myObjectRDD1Minus1 = myObjectRDD1.subtract(myObjectRDD1);
            JavaRDD<MyObject> myObjectRDD1Minus2 = myObjectRDD1.subtract(myObjectRDD2);
            JavaRDD<MyObject> myObjectRDD2Minus1 = myObjectRDD2.subtract(myObjectRDD1);
            System.out.println("myObjectRDD1Minus1 count = " + myObjectRDD1Minus1.count());
            System.out.println("myObjectRDD1Minus2 count = " + myObjectRDD1Minus2.count());
            System.out.println("myObjectRDD2Minus1 count = " + myObjectRDD2Minus1.count());
        }
    }
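[Editor's note: Igor's point can be made concrete with a small self-contained sketch (not from the thread; Colour and the multiplier constants are illustrative). Enum.hashCode() is inherited from Object and is an identity hash, which generally differs between JVM instances, whereas ordinal() and name() are part of the enum's definition and are identical on every JVM:

```java
// Colour is a stand-in for any enum field inside a POJO used in an RDD.
enum Colour { RED, GREEN }

public class EnumHashDemo {
    // Unstable across JVMs: Enum.hashCode() is Object's identity hash,
    // so two JVM instances (e.g. two Spark executors) disagree on it.
    static int unstableHash(Colour c) {
        return 41 * 5 + c.hashCode();
    }

    // Stable across JVMs: ordinal() is fixed by the declaration order.
    static int stableHash(Colour c) {
        return 41 * 5 + c.ordinal();
    }

    public static void main(String[] args) {
        // Same in every JVM run: RED.ordinal() == 0, so 41 * 5 + 0 == 205.
        System.out.println("stable   = " + stableHash(Colour.RED));
        // This value will usually differ from one JVM run to the next.
        System.out.println("unstable = " + unstableHash(Colour.RED));
    }
}
```

Running the class twice on different JVM instances typically prints the same "stable" value but different "unstable" values, which is exactly why the cluster-mode subtract and distinct results above were inconsistent.]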
Re: Issues with JavaRDD.subtract(JavaRDD) method in local vs. cluster mode
Thanks for the quick reply. I will be unable to collect more data until Monday, but I will update the thread accordingly. I am using Spark 1.4.0. Were there any related issues reported? I wasn't able to find any, but I may have overlooked something. I have also updated the original question to include the relevant Java files; maybe the issue is hidden there somewhere.

Ted Yu yuzhih...@gmail.com wrote on Fri., 31 Jul 2015 at 18:09:

> Can you call collect() and log the output to get more of a clue about what is left over? Which Spark release are you using? Cheers

On Fri, Jul 31, 2015 at 9:01 AM, Warfish sebastian.ka...@gmail.com wrote:

Hi everyone,

I have been working with Spark for a little while now and have encountered a strange problem that gives me headaches, which has to do with the JavaRDD.subtract() method. Consider the following piece of code:

    public static void main(String[] args) {
        // context is of type JavaSparkContext; FILE is the file path to my input file
        JavaRDD<String> rawTestSet  = context.textFile(FILE);
        JavaRDD<String> rawTestSet2 = context.textFile(FILE);

        // Gives 0 every time - correct
        System.out.println("rawTestSetMinusRawTestSet2 = "
                + rawTestSet.subtract(rawTestSet2).count());

        // SearchData is a custom POJO that holds my data
        JavaRDD<SearchData> testSet  = convert(rawTestSet);
        JavaRDD<SearchData> testSet2 = convert(rawTestSet);
        JavaRDD<SearchData> testSet3 = convert(rawTestSet2);

        // These calls give numbers != 0 in cluster mode - incorrect
        System.out.println("testSetMinusTestSet2 = "  + testSet.subtract(testSet2).count());
        System.out.println("testSetMinusTestSet3 = "  + testSet.subtract(testSet3).count());
        System.out.println("testSet2MinusTestSet3 = " + testSet2.subtract(testSet3).count());
    }

    private static JavaRDD<SearchData> convert(JavaRDD<String> input) {
        return input.filter(new Matches(myRegex))
                    .map(new DoSomething())
                    .map(new Split(mySplitParam))
                    .map(new ToMap())
                    .map(new Clean())
                    .map(new ToSearchData());
    }

In this code, I read a file (usually from HDFS, but this applies to reading from disk as well) and
then convert the Strings into custom objects that hold the data, using a chain of filter and map operations. These objects are simple POJOs with overridden hashCode() and equals() methods. I then apply the subtract() method to several JavaRDDs that contain exactly equal data.

Note: I have omitted the POJO code and the filter and map functions to keep the code concise, but I can post them later if the need arises.

In the main method shown above there are several calls to subtract(), all of which should produce empty RDDs as results, because the data in all RDDs should be exactly the same. This works for Spark in local mode; however, when executing the code on a cluster, the second block of subtract calls does not result in empty sets, which tells me that it is a more complicated issue. The input data in local and cluster mode was exactly the same. Can someone shed some light on this issue? I feel like I'm overlooking something rather obvious.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Issues-with-JavaRDD-subtract-JavaRDD-method-in-local-vs-cluster-mode-tp24099.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
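[Editor's note: with hindsight from the first thread above, the culprit here was an enum field inside the SearchData POJO. The real SearchData fields were never posted, so the following is a hypothetical stand-in (class name Record, fields kind and term are invented) illustrating the general pattern: hash enum fields via ordinal() or name(), never via the enum constant itself.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical stand-in for a POJO like SearchData. The key point:
// mix the enum's ordinal() into hashCode(), not the enum constant,
// so all JVM instances in the cluster compute the same hash.
public class Record implements Serializable {
    enum Kind { QUERY, CLICK }

    private final Kind kind;   // enum field: hash its ordinal
    private final String term; // String.hashCode() is already JVM-stable

    public Record(Kind kind, String term) {
        this.kind = kind;
        this.term = term;
    }

    @Override
    public int hashCode() {
        // Objects.hash over JVM-stable components is itself JVM-stable.
        return Objects.hash(kind == null ? -1 : kind.ordinal(), term);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Record)) return false;
        Record other = (Record) o;
        return kind == other.kind && Objects.equals(term, other.term);
    }
}
```

With this pattern, subtract() on RDDs built from equal input collapses to the empty set in cluster mode as well, because equal objects land in the same shuffle partition on every executor.]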