Enum values in custom objects mess up RDD operations

    ...()); System.out.println("/// End"); } }

Both RDDs contain 1000 exactly equal objects, so one would expect each call to distinct() to yield a single element and each call to subtract(JavaRDD<MyObject>) to yield an empty RDD. However, here is some sample output:

    /// Object generation
    myObjectRDD1 count = 1000
    myObjectRDD2 count = 1000
    /// Distinct
    myObjectRDD1Distinct count = 1
    myObjectRDD2Distinct count = 2
    /// Subtract
    myObjectRDD1Minus1 count = 500
    myObjectRDD1Minus2 count = 0
    myObjectRDD2Minus1 count = 0
    /// End

And this is a new run, directly following the previous one:

    /// Object generation
    myObjectRDD1 count = 1000
    myObjectRDD2 count = 1000
    /// Distinct
    myObjectRDD1Distinct count = 2
    myObjectRDD2Distinct count = 1
    /// Subtract
    myObjectRDD1Minus1 count = 500
    myObjectRDD1Minus2 count = 500
    myObjectRDD2Minus1 count = 0
    /// End

Some thoughts/observations:

As soon as I take the enum value out of the hashCode() method of MyObject, the code works just fine, i.e. the new hashCode() method becomes:

    @Override
    public int hashCode() {
        int hash = 5;
        //hash = 41 * hash + Objects.hashCode(this.myEnum);
        return hash;
    }

Additionally, the code executes fine on a local machine and only behaves strangely on a cluster. These two observations make me believe that Spark uses the hashCode of each object to distribute the objects between worker nodes, and that the enum value somehow results in inconsistent hash codes across those nodes. Can someone help me out here?
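For reference, here is a minimal, self-contained sketch of the kind of class and driver I am describing (the enum MyEnum, its constants, and the SparkConf setup are illustrative placeholders, not my exact code):

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Objects;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class EnumHashTest {

        public enum MyEnum { FOO, BAR }  // illustrative placeholder

        public static class MyObject implements Serializable {
            private final MyEnum myEnum;

            public MyObject(MyEnum myEnum) { this.myEnum = myEnum; }

            @Override
            public int hashCode() {
                int hash = 5;
                // The problematic line: Objects.hashCode(enum) falls back to
                // the enum constant's identity hash.
                hash = 41 * hash + Objects.hashCode(this.myEnum);
                return hash;
            }

            @Override
            public boolean equals(Object obj) {
                if (this == obj) return true;
                if (!(obj instanceof MyObject)) return false;
                return this.myEnum == ((MyObject) obj).myEnum;
            }
        }

        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("EnumHashTest"));

            // 1000 objects that are all equal to one another
            List<MyObject> data = new ArrayList<>();
            for (int i = 0; i < 1000; i++) data.add(new MyObject(MyEnum.FOO));

            System.out.println("/// Object generation");
            JavaRDD<MyObject> myObjectRDD1 = sc.parallelize(data);
            JavaRDD<MyObject> myObjectRDD2 = sc.parallelize(data);
            System.out.println("myObjectRDD1 count = " + myObjectRDD1.count());
            System.out.println("myObjectRDD2 count = " + myObjectRDD2.count());

            System.out.println("/// Distinct");
            System.out.println("myObjectRDD1Distinct count = " + myObjectRDD1.distinct().count());
            System.out.println("myObjectRDD2Distinct count = " + myObjectRDD2.distinct().count());

            System.out.println("/// Subtract");
            System.out.println("myObjectRDD1Minus1 count = " + myObjectRDD1.subtract(myObjectRDD1).count());
            System.out.println("myObjectRDD1Minus2 count = " + myObjectRDD1.subtract(myObjectRDD2).count());
            System.out.println("myObjectRDD2Minus1 count = " + myObjectRDD2.subtract(myObjectRDD1).count());

            System.out.println("/// End");
            sc.stop();
        }
    }

In local mode all tasks run in one JVM, so even an identity-based hash is at least self-consistent, which would explain why the problem only shows up on the cluster.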
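If that suspicion is right, the likely culprit is that java.lang.Enum.hashCode() is final and identity-based, so the same enum constant can hash differently in different executor JVMs. A hashCode() derived from a stable property of the constant should then be consistent everywhere; a sketch of a drop-in replacement for the method in MyObject above:

    @Override
    public int hashCode() {
        int hash = 5;
        // name().hashCode() is specified by String and therefore identical in
        // every JVM, unlike the identity-based Enum.hashCode(); ordinal()
        // would work just as well.
        hash = 41 * hash + (this.myEnum == null ? 0 : this.myEnum.name().hashCode());
        return hash;
    }

With a deterministic hash like this, the distinct() and subtract() counts should come out the same on the cluster as they do locally.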
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Enum-values-in-custom-objects-mess-up-RDD-operations-tp24149.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.