On Sat, Mar 22, 2014 at 3:45 PM, Michael Armbrust <mich...@databricks.com> wrote:

> >
> > From my experience, covariance often becomes a pain when dealing with
> > serialization/deserialization (I've experienced a few cases while
> > developing play-json & datomisca).
> > Moreover, if you have implicits, variance often becomes a headache...
>
>
> This is exactly the kind of feedback I was hoping for!  Can you be any more
> specific about the kinds of problems you ran into here?
>

I've been rethinking this topic since writing my first mail.

The problem I was talking about is when you try to use typeclass converters
and make them contravariant/covariant for input/output. Something like:

trait Reader[-I, +O] { def read(i: I): O }

Doing this, you soon have implicit collisions and philosophical concerns
about what it means to serialize/deserialize a Parent class and a Child
class...

For example, if you have a Reader[I, Dog], you also have a Reader[I, Mammal]
by covariance.
Then you use this Reader[I, Mammal] to read a Cat, because a Cat is a Mammal.
But it fails, as the original Reader expects the representation of a full
Dog, not only the part of it that corresponds to the Mammal...
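
To make the Dog/Mammal case concrete, here is a minimal sketch. The Fields
map used as the input representation, the field names, and the concrete
readers are all assumptions made up for illustration; they are not how
play-json or datomisca actually model things.

```scala
object ReaderVarianceSketch {
  sealed trait Mammal { def name: String }
  final case class Dog(name: String, breed: String) extends Mammal
  final case class Cat(name: String, indoor: Boolean) extends Mammal

  // Contravariant in its input, covariant in its output.
  trait Reader[-I, +O] { def read(i: I): O }

  // Hypothetical serialized form: a flat map of field name to value.
  type Fields = Map[String, String]

  // A reader that needs the full Dog representation.
  val dogReader: Reader[Fields, Dog] = new Reader[Fields, Dog] {
    def read(i: Fields): Dog = Dog(i("name"), i("breed"))
  }

  // Compiles only because Reader is covariant in O.
  val mammalReader: Reader[Fields, Mammal] = dogReader

  // The representation of a Cat: it has no "breed" field.
  val catFields: Fields = Map("name" -> "Tom", "indoor" -> "true")

  // The types line up, but the lookup of "breed" throws at runtime.
  val result: scala.util.Try[Mammal] =
    scala.util.Try(mammalReader.read(catFields))
}
```

The compiler accepts everything here, yet result is a Failure: the reader
that was widened to Reader[Fields, Mammal] still requires Dog-only fields.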

So you see here that the problem lies in the serialization/deserialization
mechanism itself.

In your case, you don't have this kind of concern, as JavaSerializer and
KryoSerializer are more about basic marshalling that operates on the
low-level class representation, and you don't rely on implicit typeclasses...

So let's consider what you really want, RDD[+T], and see whether it would
have any bad impact.

If you do:

val rddChild: RDD[Child] = sc.parallelize(Seq(Child(...), Child(...), ...))

If you perform map/reduce ops on this rddChild, then when remoting objects,
the Spark context will serialize all sequence elements as Child.

But if you then do:
val rddParent: RDD[Parent] = rddChild

If you perform ops on rddParent, I believe that the serializer would
serialize elements as Parent elements, certainly losing some data from
Child.
On the remote node, they would be deserialized as Parent too, so they
would no longer be Child elements.

So, here, if it works as I say (I'm not sure), it would mean the following:
you have created an RDD from some data and, just by invoking covariance, you
might have lost data through the remoting mechanism.

Is it something bad in your opinion? (I'm thinking aloud)
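
One way to test that assumption outside Spark is a small experiment with
plain java.io serialization. The sketch below is only that: Box[+T] is a
hypothetical stand-in for a covariant RDD[+T], Parent/Child and roundTrip are
made-up names, and nothing here goes through Spark's own serializers.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

object CovarianceSerializationSketch {
  class Parent(val id: Int) extends Serializable
  class Child(id: Int, val extra: String) extends Parent(id)

  // Hypothetical covariant container; not Spark's RDD.
  final case class Box[+T](items: Seq[T])

  // Serialize to bytes and read back with standard Java serialization.
  def roundTrip(value: AnyRef): AnyRef = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject() finally in.close()
  }

  val boxChild: Box[Child] = Box(Seq(new Child(1, "payload")))
  val boxParent: Box[Parent] = boxChild // allowed by covariance

  // The element is only known as a Parent at this point.
  val element: Parent = boxParent.items.head
  val restored: AnyRef = roundTrip(element)
}
```

In this sketch restored is still a Child with its extra field intact, because
ObjectOutputStream records the runtime class of the object rather than its
static type. Whether the same holds for every serializer configuration in
Spark is not something this experiment shows.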

Pascal
