Re: Making RDDs Covariant

2014-03-22 Thread Pascal Voitot Dev
Hi,

Covariance always seems like a good idea at first, but you must be really
careful, as it always has unexpected consequences...
From my experience, covariance often becomes a pain when dealing with
serialization/deserialization (I've experienced a few cases while
developing play-json & datomisca).
Moreover, if you have implicits, variance often becomes a headache...

I'm not necessarily against it, but my experience has proved it isn't always
such a good idea. The solution is to test in depth...

Pascal



On Sat, Mar 22, 2014 at 6:31 AM, Jey Kottalam j...@cs.berkeley.edu wrote:

 That would be awesome. I support this!

 On Fri, Mar 21, 2014 at 7:28 PM, Michael Armbrust
 mich...@databricks.com wrote:
  Hey Everyone,
 
  Here is a pretty major (but source compatible) change we are considering
  making to the RDD API for 1.0.  Java and Python APIs would remain the same,
  but users of Scala would likely need to use fewer casts.  This would be
  especially true for libraries whose functions take RDDs as parameters.  Any
  comments would be appreciated!
 
  https://spark-project.atlassian.net/browse/SPARK-1296
 
  Michael



Re: Making RDDs Covariant

2014-03-22 Thread Michael Armbrust

 From my experience, covariance often becomes a pain when dealing with
 serialization/deserialization (I've experienced a few cases while
 developing play-json & datomisca).
 Moreover, if you have implicits, variance often becomes a headache...


This is exactly the kind of feedback I was hoping for!  Can you be any more
specific about the kinds of problems you ran into here?


Re: Making RDDs Covariant

2014-03-22 Thread Koert Kuipers
I believe Kryo serialization uses the runtime class, not the declared class.
We have no issues serializing covariant Scala lists.
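
A minimal, self-contained sketch of that behaviour using plain Java
serialization (which Spark's JavaSerializer wraps); the Animal/Dog classes
are hypothetical:

import java.io._

trait Animal extends Serializable
case class Dog(name: String, breed: String) extends Animal

object RuntimeClassDemo extends App {
  // The declared type is the parent; the runtime class is the child.
  val animal: Animal = Dog("Rex", "Beagle")

  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(animal) // writes the runtime class descriptor: Dog
  out.close()

  val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
  println(in.readObject()) // Dog(Rex,Beagle) -- nothing was sliced down to Animal
}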


On Sat, Mar 22, 2014 at 11:59 AM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:

 On Sat, Mar 22, 2014 at 3:45 PM, Michael Armbrust mich...@databricks.com
 wrote:

  
   From my experience, covariance often becomes a pain when dealing with
   serialization/deserialization (I've experienced a few cases while
   developing play-json & datomisca).
   Moreover, if you have implicits, variance often becomes a headache...
 
 
  This is exactly the kind of feedback I was hoping for!  Can you be any more
  specific about the kinds of problems you ran into here?
 

 I've been rethinking this topic since writing my first mail.

 The problem I was talking about is when you try to use typeclass converters
 and make them contravariant in input / covariant in output. Something like:

 trait Reader[-I, +O] { def read(i: I): O }

 Doing this, you soon have implicit collisions and philosophical concerns
 about what it means to serialize/deserialize a Parent class and a Child
 class...

 For example, if you have a Reader[I, Dog], you also have a Reader[I, Mammal]
 by covariance.
 Then you use this Reader[I, Mammal] to read a Cat, because a Cat is a Mammal.
 But it fails, as the original Reader expects the representation of a full
 Dog, not only the part of it corresponding to Mammal...
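
 A minimal sketch of that failure mode (the wire format and the Dog/Cat
 classes here are hypothetical):

 trait Reader[-I, +O] { def read(i: I): O }

 sealed trait Mammal
 case class Dog(name: String, breed: String) extends Mammal
 case class Cat(name: String, lives: Int) extends Mammal

 // A reader that expects the full wire representation of a Dog.
 val dogReader = new Reader[Map[String, String], Dog] {
   def read(m: Map[String, String]): Dog = Dog(m("name"), m("breed"))
 }

 // Covariance silently forgets that this reader can only produce Dogs...
 val mammalReader: Reader[Map[String, String], Mammal] = dogReader

 // ...so feeding it a Cat's representation compiles but fails at runtime:
 // mammalReader.read(Map("name" -> "Felix", "lives" -> "9"))
 //   => java.util.NoSuchElementException: key not found: breed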

 So you see here that the problem lies in the serialization/deserialization
 mechanism itself.

 In your case, you don't have this kind of concern, as JavaSerializer and
 KryoSerializer do basic marshalling that operates on the low-level class
 representation, and you don't rely on implicit typeclasses...

 So let's consider what you really want, RDD[+T], and see whether it would
 have bad impacts.

 If you do:

 val rddChild: RDD[Child] = sc.parallelize(Seq(Child(...), Child(...), ...))

 If you perform map/reduce ops on this rddChild, when remoting objects, the
 Spark context will serialize all sequence elements as Child.

 But if you do this:
 val rddParent: RDD[Parent] = rddChild

 If you perform ops on rddParent, I believe the serializer would serialize
 elements as Parent elements, certainly losing some data from Child.
 On the remote node, they would be deserialized as Parent too, so they
 wouldn't be Child elements anymore.

 So, here, if it works as I say (I'm not sure it does), it would mean the
 following: you have created an RDD from some data and, just by invoking
 covariance, you might have lost data through the remoting mechanism.

 Is it something bad in your opinion? (I'm thinking aloud)

 Pascal



Re: Making RDDs Covariant

2014-03-22 Thread andy petrella
Dear all,
I'm pretty much following Pascal's advice, since I've myself encountered
some problems with implicits (when playing the same kind of game with my
Neo4J Scala API).

Nevertheless, one remark regarding serialization: the loss of data shouldn't
occur when implicit typeclasses aren't involved. Of course, using typeclasses
means that the instance is chosen at compile time. Without them, it behaves
like the classical use cases, where the serializer does the dirty work at
runtime using the actual class :/.

Now, IMHO, I'd be interested in having RDD covariant on the content type,
because I have an API (which I should be able to share with you soon or
sooner) where we are trying to bind the two worlds (RDD+SparkCtx and
DStream+StreamingCtx) and also to combine and chain job components.
In a nutshell, it will be able to define a Source, Process and Sink of
Containers of Wagons (RDDs/DStreams themselves) to compose a Job using a
(to be defined) DSL.
So without covariance I cannot, for now, define a generic no-op Sink.
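
A minimal sketch of that last point (Rdd is a covariant stand-in for Spark's
class; Sink is part of the not-yet-published API idea):

class Rdd[+T](val elems: Seq[T]) // stand-in, covariant as proposed

trait Sink[C] { def drain(c: C): Unit }

// One no-op sink that accepts any Rdd, whatever its element type:
val noop: Sink[Rdd[Any]] = new Sink[Rdd[Any]] {
  def drain(c: Rdd[Any]): Unit = () // discard everything
}

val strings = new Rdd[String](Seq("a", "b"))
noop.drain(strings) // compiles only because Rdd[String] <: Rdd[Any]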

My 0.02c
Andy

Sent from Tab, sorry for the typos...
 On 22 Mar 2014 17:00, Pascal Voitot Dev pascal.voitot@gmail.com
wrote:

 On Sat, Mar 22, 2014 at 3:45 PM, Michael Armbrust mich...@databricks.com
 wrote:

  
   From my experience, covariance often becomes a pain when dealing with
   serialization/deserialization (I've experienced a few cases while
   developing play-json & datomisca).
   Moreover, if you have implicits, variance often becomes a headache...
 
 
  This is exactly the kind of feedback I was hoping for!  Can you be any more
  specific about the kinds of problems you ran into here?
 

 I've been rethinking this topic since writing my first mail.

 The problem I was talking about is when you try to use typeclass converters
 and make them contravariant in input / covariant in output. Something like:

 trait Reader[-I, +O] { def read(i: I): O }

 Doing this, you soon have implicit collisions and philosophical concerns
 about what it means to serialize/deserialize a Parent class and a Child
 class...

 For example, if you have a Reader[I, Dog], you also have a Reader[I, Mammal]
 by covariance.
 Then you use this Reader[I, Mammal] to read a Cat, because a Cat is a Mammal.
 But it fails, as the original Reader expects the representation of a full
 Dog, not only the part of it corresponding to Mammal...

 So you see here that the problem lies in the serialization/deserialization
 mechanism itself.

 In your case, you don't have this kind of concern, as JavaSerializer and
 KryoSerializer do basic marshalling that operates on the low-level class
 representation, and you don't rely on implicit typeclasses...

 So let's consider what you really want, RDD[+T], and see whether it would
 have bad impacts.

 If you do:

 val rddChild: RDD[Child] = sc.parallelize(Seq(Child(...), Child(...), ...))

 If you perform map/reduce ops on this rddChild, when remoting objects, the
 Spark context will serialize all sequence elements as Child.

 But if you do this:
 val rddParent: RDD[Parent] = rddChild

 If you perform ops on rddParent, I believe the serializer would serialize
 elements as Parent elements, certainly losing some data from Child.
 On the remote node, they would be deserialized as Parent too, so they
 wouldn't be Child elements anymore.

 So, here, if it works as I say (I'm not sure it does), it would mean the
 following: you have created an RDD from some data and, just by invoking
 covariance, you might have lost data through the remoting mechanism.

 Is it something bad in your opinion? (I'm thinking aloud)

 Pascal



Re: Making RDDs Covariant

2014-03-22 Thread David Hall
On Sat, Mar 22, 2014 at 8:59 AM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:

 The problem I was talking about is when you try to use typeclass converters
 and make them contravariant in input / covariant in output. Something like:

 trait Reader[-I, +O] { def read(i: I): O }

 Doing this, you soon have implicit collisions and philosophical concerns
 about what it means to serialize/deserialize a Parent class and a Child
 class...



You should (almost) never make a typeclass param contravariant. It's almost
certainly not what you want:

https://issues.scala-lang.org/browse/SI-2509
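
A minimal sketch of the surprise that ticket describes (Show and both
instances are hypothetical; this is the behaviour on Scala 2.12 and earlier):

trait Show[-A] { def show(a: A): String }

implicit val showAny: Show[Any] = new Show[Any] { def show(a: Any) = "some value" }
implicit val showInt: Show[Int] = new Show[Int] { def show(a: Int) = "Int(" + a + ")" }

def display[A](a: A)(implicit s: Show[A]): String = s.show(a)

// Contravariance makes Show[Any] a subtype of Show[Int], so implicit
// resolution deems showAny *more specific* and silently prefers it:
display(42) // "some value", not "Int(42)"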

-- David


Re: Making RDDs Covariant

2014-03-22 Thread Pascal Voitot Dev
On Sat, Mar 22, 2014 at 8:38 PM, David Hall d...@cs.berkeley.edu wrote:

 On Sat, Mar 22, 2014 at 8:59 AM, Pascal Voitot Dev 
 pascal.voitot@gmail.com wrote:

  The problem I was talking about is when you try to use typeclass converters
  and make them contravariant in input / covariant in output. Something like:

  trait Reader[-I, +O] { def read(i: I): O }
 
  Doing this, you soon have implicit collisions and philosophical concerns
  about what it means to serialize/deserialize a Parent class and a Child
  class...
 


 You should (almost) never make a typeclass param contravariant. It's almost
 certainly not what you want:

 https://issues.scala-lang.org/browse/SI-2509

 -- David


I confirm that it's a pain. I must say I never do it myself, but I've
inherited historical code that did it :)


Re: Making RDDs Covariant

2014-03-22 Thread Michael Armbrust
Hi Pascal,

Thanks for the input.  I think we are going to be okay here since, as Koert
said, the current serializers use runtime type information.  We could also
keep a ClassTag around for the original type the RDD was created with.
Good things to be aware of, though.
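
A minimal sketch of that second idea (Rdd here is a hypothetical stand-in,
not the real class):

import scala.reflect.ClassTag

class Rdd[+T] private (val elems: Seq[T], val createdAs: ClassTag[_])

object Rdd {
  // Capture the element type's ClassTag at creation time.
  def apply[T: ClassTag](elems: Seq[T]): Rdd[T] =
    new Rdd(elems, implicitly[ClassTag[T]])
}

trait Mammal
case class Dog(name: String) extends Mammal

val dogs: Rdd[Dog] = Rdd(Seq(Dog("Rex")))
val mammals: Rdd[Mammal] = dogs // the upcast is free with +T
println(mammals.createdAs)      // still prints Dog: the original type survives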

Michael

On Sat, Mar 22, 2014 at 12:42 PM, Pascal Voitot Dev 
pascal.voitot@gmail.com wrote:

 On Sat, Mar 22, 2014 at 8:38 PM, David Hall d...@cs.berkeley.edu wrote:

  On Sat, Mar 22, 2014 at 8:59 AM, Pascal Voitot Dev 
  pascal.voitot@gmail.com wrote:
 
   The problem I was talking about is when you try to use typeclass converters
   and make them contravariant in input / covariant in output. Something like:

   trait Reader[-I, +O] { def read(i: I): O }
  
   Doing this, you soon have implicit collisions and philosophical concerns
   about what it means to serialize/deserialize a Parent class and a Child
   class...
  
 
 
  You should (almost) never make a typeclass param contravariant. It's almost
  certainly not what you want:
 
  https://issues.scala-lang.org/browse/SI-2509
 
  -- David
 

 I confirm that it's a pain. I must say I never do it myself, but I've
 inherited historical code that did it :)



Re: Making RDDs Covariant

2014-03-21 Thread Jey Kottalam
That would be awesome. I support this!

On Fri, Mar 21, 2014 at 7:28 PM, Michael Armbrust
mich...@databricks.com wrote:
 Hey Everyone,

 Here is a pretty major (but source compatible) change we are considering
 making to the RDD API for 1.0.  Java and Python APIs would remain the same,
 but users of Scala would likely need to use fewer casts.  This would be
 especially true for libraries whose functions take RDDs as parameters.  Any
 comments would be appreciated!

 https://spark-project.atlassian.net/browse/SPARK-1296
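
 A minimal sketch of the casts this would remove (InvariantRdd and
 CovariantRdd are hypothetical stand-ins, not Spark classes):

 class InvariantRdd[T](val elems: Seq[T])  // like today's invariant RDD
 class CovariantRdd[+T](val elems: Seq[T]) // like the proposal

 trait Vehicle
 case class Car(model: String) extends Vehicle

 def processInvariant(data: InvariantRdd[Vehicle]): Int = data.elems.size
 def processCovariant(data: CovariantRdd[Vehicle]): Int = data.elems.size

 val cars = Seq(Car("2CV"))
 // processInvariant(new InvariantRdd(cars))       // does not compile: an Rdd[Car] is not an Rdd[Vehicle]
 processInvariant(new InvariantRdd[Vehicle](cars)) // today: widen (or cast) explicitly
 processCovariant(new CovariantRdd(cars))          // with +T: just works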

 Michael