[ https://issues.apache.org/jira/browse/SPARK-39140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534633#comment-17534633 ]

Kris Mok commented on SPARK-39140:
----------------------------------

This isn't a "bug" in Spark's {{JavaSerializer}} per se, but rather behavior that 
works by design and is easy for users of this class to overlook.

The underlying issue can be demonstrated with a simple tweak to the code 
snippet in the issue description:
{code:scala}
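// the only change from the snippet in the description: AA now extends Serializable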
abstract class AA extends Serializable {
  val ts = System.nanoTime()
}
case class BB(x: Int) extends AA {
}
{code}
Now run the rest of the original code snippet and you should see that 
{{input.ts}} and {{obj1.ts}} are now the same, i.e. the field is correctly 
serialized and then deserialized.

What does this tell us? In Scala, a {{case class}} automatically extends the 
{{scala.Serializable}} trait, which has the same effect as {{java.io.Serializable}}. 
{{JavaSerializer}} uses the Java standard library's built-in serialization 
mechanism, which specifies that only fields declared on {{Serializable}} classes 
are serialized by default. Effectively, you can have a subclass that is 
serializable, but the fields declared on its non-serializable supertypes will 
simply be ignored.
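
To see that this is plain JDK behavior rather than anything Spark-specific, here 
is a minimal sketch using only {{ObjectOutputStream}}/{{ObjectInputStream}} (the 
{{Base}}/{{Child}} names are made up for illustration):
{code:scala}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Base is deliberately NOT Serializable; Child is (case classes always are).
abstract class Base {
  val ts: Long = System.nanoTime()
}
case class Child(x: Int) extends Base

val original = Child(1)

val buffer = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(buffer)
oos.writeObject(original)
oos.close()

val restored = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  .readObject().asInstanceOf[Child]

// Child's own field survives; Base's does not. readObject never reads Base's
// fields from the stream -- it re-runs Base's no-arg constructor instead.
println(s"x:  ${original.x} -> ${restored.x}")   // 1 -> 1
println(s"ts: ${original.ts} -> ${restored.ts}") // two different values
{code}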

There are ways to customize the Java serialization mechanism so that a subclass 
also writes out the fields of its superclass, but there is no built-in way to do 
it automatically -- you have to roll your own code. Scala doesn't help here either.
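
One way to roll it yourself, sketched here with the same made-up {{Base}}/{{Child}} 
classes: the serializable subclass defines {{writeObject}}/{{readObject}} and 
explicitly saves and restores the superclass state (the field is turned into a 
{{var}} so it can be reassigned on read):
{code:scala}
import java.io.{ObjectInputStream, ObjectOutputStream}

abstract class Base {               // still not Serializable
  var ts: Long = System.nanoTime()
}

case class Child(x: Int) extends Base {
  // Java serialization invokes these reflectively; the subclass takes on the
  // responsibility of carrying the superclass field across the stream.
  private def writeObject(out: ObjectOutputStream): Unit = {
    out.defaultWriteObject()        // writes Child's own fields (x)
    out.writeLong(ts)               // manually append the superclass field
  }
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()          // restores Child's own fields
    ts = in.readLong()              // manually restore the superclass field
  }
}
{code}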

Reflection-based third-party serializers, like Kryo in this example, usually 
ignore the language-level serialization markers and simply serialize every field 
-- except those explicitly marked as transient (i.e. excluded from 
serialization). That's how {{KryoSerializer}} "works" here.
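
For example, here is a rough sketch assuming Spark's default Kryo configuration 
(no registration required, so unregistered classes fall back to Kryo's 
field-level serializer, which skips transient fields and doesn't re-run 
constructors); {{CC}} and {{DD}} are hypothetical names:
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

abstract class CC {                                 // not Serializable -- Kryo doesn't care
  val kept: Long = System.nanoTime()                // copied field by field anyway
  @transient val skipped: Long = System.nanoTime()  // marked transient: left out of the stream
}
case class DD(x: Int) extends CC

val original = DD(1)
val kryoInstance = new KryoSerializer(new SparkConf()).newInstance()
val restored = kryoInstance.deserialize[DD](kryoInstance.serialize[DD](original))

println(restored.kept == original.kept) // true: the superclass field round-trips
println(restored.skipped)               // 0: never written, and constructors aren't re-run
{code}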

https://docs.oracle.com/javase/8/docs/platform/serialization/spec/serial-arch.html#a4176
{quote}Special handling is required for arrays, enum constants, and objects of 
type Class, ObjectStreamClass, and String. Other objects must implement either 
the Serializable or the Externalizable interface to be saved in or restored 
from a stream.{quote}

and 
https://docs.oracle.com/javase/8/docs/platform/serialization/spec/output.html#a861
{quote}Each subclass of a serializable object may define its own writeObject 
method. If a class does not implement the method, the default serialization 
provided by defaultWriteObject will be used. When implemented, the class is 
only responsible for writing its own fields, not those of its supertypes or 
subtypes.{quote}

and https://docs.oracle.com/javase/8/docs/api/java/io/ObjectOutputStream.html
{quote}Serialization does not write out the fields of any object that does not 
implement the java.io.Serializable interface. Subclasses of Objects that are 
not serializable can be serializable. In this case the non-serializable class 
must have a no-arg constructor to allow its fields to be initialized. In this 
case it is the responsibility of the subclass to save and restore the state of 
the non-serializable class. It is frequently the case that the fields of that 
class are accessible (public, package, or protected) or that there are get and 
set methods that can be used to restore the state.{quote}

> JavaSerializer doesn't serialize the fields of superclass
> ---------------------------------------------------------
>
>                 Key: SPARK-39140
>                 URL: https://issues.apache.org/jira/browse/SPARK-39140
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Gengliang Wang
>            Priority: Major
>
> To reproduce:
>  
> {code:java}
> abstract class AA {
>   val ts = System.nanoTime()
> }
> case class BB(x: Int) extends AA {
> }
> val input = BB(1)
> println("original ts: " + input.ts)
> val javaSerializer = new JavaSerializer(new SparkConf())
> val javaInstance = javaSerializer.newInstance()
> val bytes1 = javaInstance.serialize[BB](input)
> val obj1 = javaInstance.deserialize[BB](bytes1)
> println("deserialization result from java: " + obj1.ts)
> val kryoSerializer = new KryoSerializer(new SparkConf())
> val kryoInstance = kryoSerializer.newInstance()
> val bytes2 = kryoInstance.serialize[BB](input)
> val obj2 = kryoInstance.deserialize[BB](bytes2)
> println("deserialization result from kryo: " + obj2.ts) {code}
>  
>  
> The output is
>  
> {code:java}
> original ts: 115014173658666
> deserialization result from java: 115014306794333
> deserialization result from kryo: 115014173658666{code}
>  
> We can see that the fields from the superclass AA are not serialized with 
> JavaSerializer. When switching to KryoSerializer, it works.
> This caused bugs in SPARK-38615: TreeNode.origin with the actual information 
> is not serialized to executors when a plan can't be executed with 
> whole-stage codegen.
> It could also lead to bugs in serializing the lambda functions passed to RDD 
> APIs like mapPartitions/mapPartitionsWithIndex/etc.
>  


