On Fri, May 30, 2014 at 2:11 PM, Patrick Wendell <pwend...@gmail.com> wrote:

> Hey guys, thanks for the insights. Also, I realize Hadoop has gotten
> way better about this with 2.2+ and I think it's great progress.
>
> We have well defined API levels in Spark and also automated checking
> of API violations for new pull requests. When doing code reviews we
> always enforce the narrowest possible visibility:
>
> 1. private
> 2. private[spark]
> 3. @Experimental or @DeveloperApi
> 4. public
>
> Our automated checks exclude 1-3. Anything that breaks 4 will trigger
> a build failure.
>
>
That's really excellent.  Great job.

I like the private[spark] visibility level. It sounds like this is another
way Scala has greatly improved on Java.
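
For anyone following along who hasn't seen it, here is roughly what those
four levels look like in Scala (the class and member names below are made
up, purely for illustration):

    package org.apache.spark.scheduler

    import org.apache.spark.annotation.DeveloperApi

    // Hypothetical class, just to show the four visibility levels.
    class Widget {
      private def internalState(): Int = 0           // 1. this class only
      private[spark] def resetForTests(): Unit = ()   // 2. anything under org.apache.spark
      @DeveloperApi
      def lowLevelHook(): Unit = ()                   // 3. bytecode-public, but flagged unstable
      def stop(): Unit = ()                           // 4. stable public API, covered by the checks
    }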

> The Scala compiler prevents anyone external from using 1 or 2. We do
> have "bytecode public but annotated" (3) APIs that we might change.
> We spent a lot of time looking into whether these can offer compiler
> warnings, but we haven't found a way to do this and do not see a
> better alternative at this point.
>

It would be nice if the production build could strip this stuff out.
 Otherwise, it feels a lot like a @private, @unstable Hadoop API... and we
know how those turned out.
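
Just to make that concern concrete, nothing stops a downstream app from
compiling against a @DeveloperApi member today, with no warning at all.
Using the invented Widget class from the sketch above:

    // Downstream user code, outside org.apache.spark.  This compiles
    // cleanly even though lowLevelHook() is only meant for framework
    // developers and may change between releases.
    import org.apache.spark.scheduler.Widget

    object MyJob {
      def main(args: Array[String]): Unit = {
        new Widget().lowLevelHook()   // no compiler warning here
      }
    }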


> Regarding Scala compatibility, Scala 2.11+ is "source code
> compatible", meaning we'll be able to cross-compile Spark for
> different versions of Scala. We've already been in touch with Typesafe
> about this and they've offered to integrate Spark into their
> compatibility test suite. They've also committed to patching 2.11 with
> a minor release if bugs are found.
>

Thanks, I hadn't heard about this plan.  Hopefully we can get everyone on
2.11 ASAP.
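
For what it's worth, the cross-building side looks pretty painless already;
something along these lines in an sbt build (the version strings are just
placeholders) should be all a project needs once 2.11 artifacts exist:

    // build.sbt sketch; versions are placeholders, not a recommendation.
    scalaVersion := "2.10.4"
    crossScalaVersions := Seq("2.10.4", "2.11.1")

    // "sbt +package" / "sbt +publish" then runs the task once per Scala
    // version, producing artifacts with the _2.10 / _2.11 suffixes.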


> Anyways, my point is we've actually thought a lot about this already.
>
> The CLASSPATH thing is different than API stability, but indeed also a
> form of compatibility. This is something where I'd also like to see
> Spark have better isolation of user classes from Spark's own
> execution...
>
>
I think the best thing to do is just "shade" all the dependencies.  Then
they will be in a different namespace, and clients can have their own
versions of whatever dependencies they like without conflicting.  As
Marcelo mentioned, there might be a few edge cases where this breaks
reflection, but I don't think that's an issue for most libraries.  So in the
worst case we could end up needing apps to follow us in lockstep for Kryo
or maybe Akka, but not the whole kit and caboodle like with Hadoop.
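
As a rough sketch of what I have in mind, sbt-assembly's shade rules (or
the Maven shade plugin's relocations) can rewrite a dependency into a
Spark-private namespace; the target package name below is just an example:

    // build.sbt sketch, assuming a recent sbt-assembly; the shaded
    // package name is arbitrary.
    assemblyShadeRules in assembly := Seq(
      // Relocate Guava so user apps can bring their own version.
      ShadeRule.rename("com.google.common.**" -> "org.spark_shaded.guava.@1").inAll
    )

Relocation rewrites references in the bytecode, but not strings handed to
Class.forName(), which is exactly the reflection edge case Marcelo
mentioned.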

best,
Colin


> - Patrick
>
>
>
> On Fri, May 30, 2014 at 12:30 PM, Marcelo Vanzin <van...@cloudera.com>
> wrote:
> > On Fri, May 30, 2014 at 12:05 PM, Colin McCabe <cmcc...@alumni.cmu.edu>
> > wrote:
> >> I don't know if Scala provides any mechanisms to do this beyond what
> >> Java provides.
> >
> > In fact it does. You can say something like "private[foo]" and the
> > annotated element will be visible for all classes under "foo" (where
> > "foo" is any package in the hierarchy leading up to the class). That's
> > used a lot in Spark.
> >
> > I haven't fully looked at how the @DeveloperApi is used, but I agree
> > with you - annotations are not a good way to do this. The Scala
> > feature above would be much better, but it might still leak things at
> > the Java bytecode level (don't know how Scala implements it under the
> > covers, but I assume it's not by declaring the element as a Java
> > "private").
> >
> > Another thing is that in Scala the default visibility is public, which
> > makes it very easy to inadvertently add things to the API. I'd like to
> > see more care in making things have the proper visibility - I
> > generally declare things private first, and relax that as needed.
> > Using @VisibleForTesting would be great too, when the Scala
> > private[foo] approach doesn't work.
> >
> >> Does Spark also expose its CLASSPATH in
> >> this way to executors?  I was under the impression that it did.
> >
> > If you're using the Spark assemblies, yes, there is a lot of things
> > that your app gets exposed to. For example, you can see Guava and
> > Jetty (and many other things) there. This is something that has always
> > bugged me, but I don't really have a good suggestion of how to fix it;
> > shading goes a certain way, but it also breaks code that uses
> > reflection (e.g. Class.forName()-style class loading).
> >
> > What is worse is that Spark doesn't even agree with the Hadoop code it
> > depends on; e.g., Spark uses Guava 14.x while Hadoop is still in Guava
> > 11.x. So when you run your Scala app, what gets loaded?
> >
> >> At some point we will also have to confront the Scala version issue.
> >> Will there be flag days where Spark jobs need to be upgraded to a new,
> >> incompatible version of Scala to run on the latest Spark?
> >
> > Yes, this could be an issue - I'm not sure Scala has a policy towards
> > this, but updates (at least minor, e.g. 2.9 -> 2.10) tend to break
> > binary compatibility.
> >
> > Scala also makes some API updates tricky - e.g., adding a new named
> > argument to a Scala method is not a binary compatible change (while,
> > e.g., adding a new keyword argument in a python method is just fine).
> > The use of implicits and other Scala features make this even more
> > opaque...
> >
> > Anyway, not really any solutions in this message, just a few comments
> > I wanted to throw out there. :-)
> >
> > --
> > Marcelo
>
