Thank you for your investigation into this!

Just for completeness, I've confirmed it's a problem only in the REPL, not in 
compiled Spark programs.

But within the REPL, a direct consequence of the classes not being the same 
after serialization/deserialization is that lookup() doesn't work either:

scala> class C(val s:String) extends Serializable {
     |   override def equals(o: Any) = if (o.isInstanceOf[C]) o.asInstanceOf[C].s == s else false
     |   override def toString = s
     | }
defined class C

scala> val r = sc.parallelize(Array((new C("a"),11),(new C("a"),12)))
r: org.apache.spark.rdd.RDD[(C, Int)] = ParallelCollectionRDD[3] at parallelize at <console>:14

scala> r.lookup(new C("a"))
<console>:17: error: type mismatch;
 found   : C
 required: C
              r.lookup(new C("a"))
                       ^
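
As a workaround sketch (not a fix for the underlying issue), one can avoid the 
typed lookup() signature and filter by equals(), which accepts Any and so never 
hits the type mismatch; something along these lines (untested in this exact 
session):

r.filter(_._1.equals(new C("a"))).map(_._2).collect()

Note this can only return matches if equals() itself survives the 
serialization round trip, i.e. with the isInstanceOf[] form above rather than 
the match{} form.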



On Tuesday, May 13, 2014 3:05 PM, Anand Avati <av...@gluster.org> wrote:

On Tue, May 13, 2014 at 8:26 AM, Michael Malak <michaelma...@yahoo.com> wrote:

Reposting here on dev since I didn't see a response on user:
>
>I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In 
>the Spark Shell, equals() fails when I use the canonical equals() pattern of 
>match{}, but works when I substitute isInstanceOf[]. I am using Spark 
>0.9.0/Scala 2.10.3.
>
>Is this a bug?
>
>Spark Shell (equals uses match{})
>=================================
>
>class C(val s:String) extends Serializable {
>  override def equals(o: Any) = o match {
>    case that: C => that.s == s
>    case _ => false
>  }
>}
>
>val x = new C("a")
>val bos = new java.io.ByteArrayOutputStream()
>val out = new java.io.ObjectOutputStream(bos)
>out.writeObject(x);
>val b = bos.toByteArray();
>out.close
>bos.close
>val y = new java.io.ObjectInputStream(new 
>java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
>x.equals(y)
>
>res: Boolean = false
>
>Spark Shell (equals uses isInstanceOf[])
>========================================
>
>class C(val s:String) extends Serializable {
>  override def equals(o: Any) = if (o.isInstanceOf[C]) (o.asInstanceOf[C].s == 
>s) else false
>}
>
>val x = new C("a")
>val bos = new java.io.ByteArrayOutputStream()
>val out = new java.io.ObjectOutputStream(bos)
>out.writeObject(x);
>val b = bos.toByteArray();
>out.close
>bos.close
>val y = new java.io.ObjectInputStream(new 
>java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
>x.equals(y)
>
>res: Boolean = true
>
>Scala Shell (equals uses match{})
>=================================
>
>class C(val s:String) extends Serializable {
>  override def equals(o: Any) = o match {
>    case that: C => that.s == s
>    case _ => false
>  }
>}
>
>val x = new C("a")
>val bos = new java.io.ByteArrayOutputStream()
>val out = new java.io.ObjectOutputStream(bos)
>out.writeObject(x);
>val b = bos.toByteArray();
>out.close
>bos.close
>val y = new java.io.ObjectInputStream(new 
>java.io.ByteArrayInputStream(b)).readObject().asInstanceOf[C]
>x.equals(y)
>
>res: Boolean = true
>


Hmm. I see that this can be reproduced without Spark in Scala 2.11, with and 
without the -Yrepl-class-based command-line flag to the REPL. Spark's REPL has 
the effective behavior of 2.11's -Yrepl-class-based flag. Inspecting the 
generated byte code, it appears -Yrepl-class-based results in the creation of 
an "$outer" field in the generated classes (including class C). The first case 
match in equals() results in code along the lines of (simplified):

if (o.isInstanceOf[C] && this.$outer == o.asInstanceOf[C].$outer) {
  // do string compare
}

$outer is the synthetic field pointing to the outer object in which the 
instance was created (in this case, the REPL environment). Now obviously, when 
x is taken through the byte stream and deserialized, it will have a new $outer 
(it may have been deserialized in a different JVM or machine for all we know). 
So the $outers mismatching is expected. However, I'm still trying to 
understand why the outers need to be the same for the case match to succeed.
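
To make the outer test concrete outside the REPL, here is a minimal sketch (my 
own Outer/Inner names, no Spark or serialization involved) of how a type 
pattern on an inner class compares $outer references:

class Outer {
  class Inner(val s: String) {
    override def equals(o: Any) = o match {
      case that: Inner => that.s == s  // scalac also inserts an $outer comparison here
      case _ => false
    }
  }
}

val o1 = new Outer
val o2 = new Outer
new o1.Inner("a") == new o1.Inner("a")  // same $outer: equal
new o1.Inner("a") == new o2.Inner("a")  // same s, different $outer: not equal

In the REPL-class-based setup, class C is effectively such an inner class of 
the wrapper object, and a deserialized instance carries a freshly deserialized 
$outer that is not reference-equal to the original.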
