Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Sebastian Kalix
());


 System.out.println("/// End");
   }
 }


 Both RDDs contain 1000 identical objects, so one would expect each call of
 distinct() to return a single element and each call of subtract(JavaRDD<MyObject>)
 to return an empty RDD. However, here is some sample output:


 /// Object generation
 myObjectRDD1 count  = 1000
 myObjectRDD2 count  = 1000
 /// Distinct
 myObjectRDD1Distinct count  = 1
 myObjectRDD2Distinct count  = 2
 /// Subtract
 myObjectRDD1Minus1 count= 500
 myObjectRDD1Minus2 count= 0
 myObjectRDD2Minus1 count= 0
 /// End


 And this is a new run, directly following the previous one:

 /// Object generation
 myObjectRDD1 count  = 1000
 myObjectRDD2 count  = 1000
 /// Distinct
 myObjectRDD1Distinct count  = 2
 myObjectRDD2Distinct count  = 1
 /// Subtract
 myObjectRDD1Minus1 count= 500
 myObjectRDD1Minus2 count= 500
 myObjectRDD2Minus1 count= 0
 /// End
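
 To make the setup concrete, here is a minimal, self-contained sketch of the kind of
 driver code that produces output like the above. The original listing is not fully
 quoted, so the class layout, the enum, and all names below are assumed rather than
 taken from the actual program:

 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.List;
 import java.util.Objects;

 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 public class EnumHashCodeTest {

     public enum MyEnum { FIRST, SECOND }

     // Value class with an enum field; equality is based on the enum value.
     public static class MyObject implements Serializable {
         private final MyEnum myEnum;

         public MyObject(MyEnum myEnum) { this.myEnum = myEnum; }

         @Override
         public boolean equals(Object o) {
             if (this == o) return true;
             if (!(o instanceof MyObject)) return false;
             return this.myEnum == ((MyObject) o).myEnum;
         }

         @Override
         public int hashCode() {
             int hash = 5;
             // Hashing the enum constant delegates to Object.hashCode(),
             // an identity hash that differs between JVM instances.
             hash = 41 * hash + Objects.hashCode(this.myEnum);
             return hash;
         }
     }

     public static void main(String[] args) {
         SparkConf conf = new SparkConf().setAppName("enum-hashcode-test");
         try (JavaSparkContext sc = new JavaSparkContext(conf)) {
             // 1000 logically identical objects in each RDD.
             List<MyObject> data = new ArrayList<>();
             for (int i = 0; i < 1000; i++) {
                 data.add(new MyObject(MyEnum.FIRST));
             }
             JavaRDD<MyObject> myObjectRDD1 = sc.parallelize(data);
             JavaRDD<MyObject> myObjectRDD2 = sc.parallelize(data);

             System.out.println("/// Object generation");
             System.out.println("myObjectRDD1 count  = " + myObjectRDD1.count());
             System.out.println("myObjectRDD2 count  = " + myObjectRDD2.count());

             System.out.println("/// Distinct");
             System.out.println("myObjectRDD1Distinct count  = " + myObjectRDD1.distinct().count());
             System.out.println("myObjectRDD2Distinct count  = " + myObjectRDD2.distinct().count());

             System.out.println("/// Subtract");
             System.out.println("myObjectRDD1Minus1 count= " + myObjectRDD1.subtract(myObjectRDD1).count());
             System.out.println("myObjectRDD1Minus2 count= " + myObjectRDD1.subtract(myObjectRDD2).count());
             System.out.println("myObjectRDD2Minus1 count= " + myObjectRDD2.subtract(myObjectRDD1).count());

             System.out.println("/// End");
         }
     }
 }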


 Some thoughts/observations: As soon as I take the enum value out of the
 hashCode() function of MyObject, the code works just fine, i.e. the new
 hashCode() function becomes

 @Override
 public int hashCode() {
 int hash = 5;
 //hash = 41 * hash + Objects.hashCode(this.myEnum);
 return hash;
 }

 Additionally, the code executes fine on a local machine and only behaves
 strangely on a cluster. These two observations make me believe that Spark
 uses the hashCode of each object to distribute the objects between worker
 nodes and somehow the enum value results in inconsistent hash codes.
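
 If that hypothesis is right, it would be consistent with how Java enums behave:
 enum constants do not override hashCode(), so they fall back to the identity hash
 from Object.hashCode(), which differs in every JVM instance. On a cluster, two
 executors can therefore compute different hash codes for logically equal objects,
 which would explain why hash-partitioned operations like distinct() and subtract()
 misbehave. As a rough sketch of a workaround (field name assumed from the snippet
 above), hashing a stable property of the enum such as its name() gives the same
 hash code on every JVM:

 @Override
 public int hashCode() {
     int hash = 5;
     // Enum.name() returns the declared constant name, and String.hashCode() is
     // fully specified by the language, so this value does not depend on the JVM
     // instance, unlike Objects.hashCode(this.myEnum), which uses the constant's
     // identity hash.
     hash = 41 * hash + (this.myEnum == null ? 0 : this.myEnum.name().hashCode());
     return hash;
 }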

 Can someone help me out here?









Enum values in custom objects mess up RDD operations

2015-08-06 Thread Warfish


And this is a new run, directly following the previous one:

/// Object generation
myObjectRDD1 count  = 1000
myObjectRDD2 count  = 1000
/// Distinct
myObjectRDD1Distinct count  = 2
myObjectRDD2Distinct count  = 1
/// Subtract
myObjectRDD1Minus1 count= 500
myObjectRDD1Minus2 count= 500
myObjectRDD2Minus1 count= 0
/// End


Some thoughts/observations: As soon as I take the enum value out of the
hashCode() function of MyObject, the code works just fine, i.e. the new
hashCode() function becomes

@Override
public int hashCode() {
int hash = 5;
//hash = 41 * hash + Objects.hashCode(this.myEnum);
return hash;
}

Additionally, the code executes fine on a local machine and only behaves
strangely on a cluster. These two observations make me believe that Spark
uses the hashCode of each object to distribute the objects between worker
nodes and somehow the enum value results in inconsistent hash codes.

Can someone help me out here?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Enum-values-in-custom-objects-mess-up-RDD-operations-tp24149.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Igor Berman
 /// Object generation
 myObjectRDD1 count  = 1000
 myObjectRDD2 count  = 1000
 /// Distinct
 myObjectRDD1Distinct count  = 1
 myObjectRDD2Distinct count  = 2
 /// Subtract
 myObjectRDD1Minus1 count= 500
 myObjectRDD1Minus2 count= 0
 myObjectRDD2Minus1 count= 0
 /// End


 And this is a new run, directly following the previous one:

 /// Object generation
 myObjectRDD1 count  = 1000
 myObjectRDD2 count  = 1000
 /// Distinct
 myObjectRDD1Distinct count  = 2
 myObjectRDD2Distinct count  = 1
 /// Subtract
 myObjectRDD1Minus1 count= 500
 myObjectRDD1Minus2 count= 500
 myObjectRDD2Minus1 count= 0
 /// End


 Some thoughts/observations: As soon as I take the enum value out of the
 hashCode() function of MyObject, the code works just fine, i.e. the new
 hashCode() function becomes

 @Override
 public int hashCode() {
 int hash = 5;
 //hash = 41 * hash + Objects.hashCode(this.myEnum);
 return hash;
 }

 Additionally, the code executes fine on a local machine and only behaves
 strangely on a cluster. These two observations make me believe that Spark
 uses the hashCode of each object to distribute the objects between worker
 nodes and somehow the enum value results in inconsistent hash codes.

 Can someone help me out here?



