Re: Calling external classes added by sc.addJar needs to be through reflection

Sandy Ryza Wed, 21 May 2014 14:28:30 -0700

Is that an assumption we can make?  I think we'd run into an issue in this
situation:


*In primary jar:*
def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()

*In app code:*
sc.addJar("dynamicjar.jar")
...
rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))

It might be fair to say that the user should make sure to use the context
classloader when instantiating dynamic classes, but I think it's weird that
this code would work on Spark standalone but not on YARN.
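
For concreteness, here is a minimal sketch of what "use the context
classloader" could look like. It is written in Java, like the standalone
repro quoted further down, and it is an illustration under assumptions
rather than anything taken from Spark: ContextLoaderSketch and
RecordingLoader are hypothetical names, and RecordingLoader only stands in
for the executor's custom classloader by recording which class names it is
asked for.

```java
import java.util.ArrayList;
import java.util.List;

public class ContextLoaderSketch {

    // Hypothetical counterpart of makeDynamicObject above: resolves the class
    // through the thread's context classloader instead of the loader that
    // defined this helper (which is what one-argument Class.forName uses).
    public static Object makeDynamicObject(String className) throws Exception {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            // No context loader set: fall back to this class's own loader.
            loader = ContextLoaderSketch.class.getClassLoader();
        }
        return Class.forName(className, true, loader)
                .getDeclaredConstructor()
                .newInstance();
    }

    // Stand-in (assumption) for the custom loader that knows about added jars.
    // It records every requested name, then delegates as ClassLoader normally does.
    public static class RecordingLoader extends ClassLoader {
        public final List<String> requested = new ArrayList<>();

        public RecordingLoader(ClassLoader parent) {
            super(parent);
        }

        @Override
        public Class<?> loadClass(String name) throws ClassNotFoundException {
            requested.add(name);
            return super.loadClass(name);
        }
    }

    public static void main(String[] args) throws Exception {
        RecordingLoader custom = new RecordingLoader(ClassLoader.getSystemClassLoader());
        Thread.currentThread().setContextClassLoader(custom);

        // java.util.ArrayList is used only so the sketch runs without an extra jar.
        Object obj = makeDynamicObject("java.util.ArrayList");
        System.out.println(obj.getClass().getName());
        System.out.println(custom.requested.contains("java.util.ArrayList"));
    }
}
```

If the context classloader is the one that holds the dynamically added jar,
a helper written this way in the primary jar would look up
some.class.from.DynamicJar through it, regardless of which loader defined
the helper itself.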

-Sandy


On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <men...@gmail.com> wrote:

> I think adding jars dynamically should work as long as the primary jar
> and the secondary jars do not depend on dynamically added jars, which
> should be the correct logic. -Xiangrui
>
> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <dbt...@stanford.edu> wrote:
> > This will be another separate story.
> >
> > Since in the YARN deployment, as Sandy said, the app.jar will always be
> > in the system classloader, any object instantiated in app.jar will have
> > the system classloader as its parent loader instead of the custom one.
> > As a result, the custom classloader in YARN will never work without
> > explicitly using reflection.
> >
> > One solution is to not use the system classloader in the classloader
> > hierarchy, and to add all the resources in the system one into the custom
> > one. This is the approach Tomcat takes.
> >
> > Or we can directly overwrite the system classloader by calling the
> > protected method `addURL`, which will not work and will throw an exception
> > if the code runs under a security manager.
> >
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >
> >> This will solve the issue for jars added upon application submission, but,
> >> on top of this, we need to make sure that anything dynamically added
> >> through sc.addJar works as well.
> >>
> >> To do so, we need to make sure that any jars retrieved via the driver's
> >> HTTP server are loaded by the same classloader that loads the jars given on
> >> app submission.  To achieve this, we need to either use the same
> >> classloader for both system jars and user jars, or make sure that the user
> >> jars given on app submission are under the same classloader used for
> >> dynamically added jars.
> >>
> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <men...@gmail.com> wrote:
> >>
> >> > Talked with Sandy and DB offline. I think the best solution is sending
> >> > the secondary jars to the distributed cache of all containers rather
> >> > than just the master, and setting the classpath to include the Spark jar,
> >> > primary app jar, and secondary jars before the executor starts. This
> >> > way, the user only needs to specify secondary jars via --jars instead of
> >> > calling sc.addJar inside the code. It also solves the scalability
> >> > problem of serving all the jars via HTTP.
> >> >
> >> > If this solution sounds good, I can try to make a patch.
> >> >
> >> > Best,
> >> > Xiangrui
> >> >
> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <dbt...@stanford.edu> wrote:
> >> > > In 1.0, there is a new option for users to choose which classloader has
> >> > > higher priority via spark.files.userClassPathFirst; I decided to submit
> >> > > the PR for 0.9 first. We use this patch in our lab, and we can use the
> >> > > jars added by sc.addJar without reflection.
> >> > >
> >> > > https://github.com/apache/spark/pull/834
> >> > >
> >> > > Can anyone comment if it's a good approach?
> >> > >
> >> > > Thanks.
> >> > >
> >> > >
> >> > > Sincerely,
> >> > >
> >> > > DB Tsai
> >> > > -------------------------------------------------------
> >> > > My Blog: https://www.dbtsai.com
> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >
> >> > >
> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <dbt...@stanford.edu> wrote:
> >> > >
> >> > >> Good summary! We fixed it in branch 0.9 since our production is still on
> >> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit a PR for 1.0
> >> > >> tonight.
> >> > >>
> >> > >>
> >> > >> Sincerely,
> >> > >>
> >> > >> DB Tsai
> >> > >> -------------------------------------------------------
> >> > >> My Blog: https://www.dbtsai.com
> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >>
> >> > >>
> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >> > >>
> >> > >>> It just hit me why this problem is showing up on YARN and not on
> >> > >>> standalone.
> >> > >>>
> >> > >>> The relevant difference between YARN and standalone is that, on YARN,
> >> > >>> the app jar is loaded by the system classloader instead of Spark's
> >> > >>> custom URL classloader.
> >> > >>>
> >> > >>> On YARN, the system classloader knows about [the classes in the Spark
> >> > >>> jars, the classes in the primary app jar].  The custom classloader
> >> > >>> knows about [the classes in secondary app jars] and has the system
> >> > >>> classloader as its parent.
> >> > >>>
> >> > >>> A few relevant facts (mostly redundant with what Sean pointed out):
> >> > >>> * Every class has a classloader that loaded it.
> >> > >>> * When an object of class B is instantiated inside of class A, the
> >> > >>>   classloader used for loading B is the classloader that was used for
> >> > >>>   loading A.
> >> > >>> * When a classloader fails to load a class, it lets its parent
> >> > >>>   classloader try.  If its parent succeeds, its parent becomes the
> >> > >>>   "classloader that loaded it".
> >> > >>>
> >> > >>> So suppose class B is in a secondary app jar and class A is in the
> >> > >>> primary app jar:
> >> > >>> 1. The custom classloader will try to load class A.
> >> > >>> 2. It will fail, because it only knows about the secondary jars.
> >> > >>> 3. It will delegate to its parent, the system classloader.
> >> > >>> 4. The system classloader will succeed, because it knows about the
> >> > >>>    primary app jar.
> >> > >>> 5. A's classloader will be the system classloader.
> >> > >>> 6. A tries to instantiate an instance of class B.
> >> > >>> 7. B will be loaded with A's classloader, which is the system
> >> > >>>    classloader.
> >> > >>> 8. Loading B will fail, because A's classloader, which is the system
> >> > >>>    classloader, doesn't know about the secondary app jars.
> >> > >>>
> >> > >>> In Spark standalone, A and B are both loaded by the custom classloader,
> >> > >>> so this issue doesn't come up.
> >> > >>>
> >> > >>> -Sandy
> >> > >>>
> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> >> > >>>
> >> > >>> > Having a user define a custom class inside of an added jar and
> >> > >>> > instantiate it directly inside of an executor is definitely supported
> >> > >>> > in Spark and has been for a really long time (several years). This is
> >> > >>> > something we do all the time in Spark.
> >> > >>> >
> >> > >>> > DB - I'd hold off on a re-architecting of this until we identify
> >> > >>> > exactly what is causing the bug you are running into.
> >> > >>> >
> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the executor,
> >> > >>> > it will ask the driver for the class over HTTP using a custom
> >> > >>> > classloader. Something in that pipeline is breaking here, possibly
> >> > >>> > related to the YARN deployment stuff.
> >> > >>> >
> >> > >>> >
> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com> wrote:
> >> > >>> > > I don't think a custom classloader is necessary.
> >> > >>> > >
> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop, Tomcat, etc.
> >> > >>> > > all run custom user code that creates new user objects without
> >> > >>> > > reflection. I should go see how that's done. Maybe it's totally valid
> >> > >>> > > to set the thread's context classloader for just this purpose, and I
> >> > >>> > > am not thinking clearly.
> >> > >>> > >
> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <and...@andrewash.com> wrote:
> >> > >>> > >> Sounds like the problem is that classloaders always look in their
> >> > >>> > >> parents before themselves, and Spark users want executors to pick up
> >> > >>> > >> classes from their custom code before the ones in Spark plus its
> >> > >>> > >> dependencies.
> >> > >>> > >>
> >> > >>> > >> Would a custom classloader that delegates to the parent after first
> >> > >>> > >> checking itself fix this up?
> >> > >>> > >>
> >> > >>> > >>
> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <dbt...@stanford.edu> wrote:
> >> > >>> > >>
> >> > >>> > >>> Hi Sean,
> >> > >>> > >>>
> >> > >>> > >>> It's true that the issue here is the classloader, and due to the
> >> > >>> > >>> classloader delegation model, users have to use reflection in the
> >> > >>> > >>> executors to pick up the classloader in order to use classes added
> >> > >>> > >>> through the sc.addJar API. However, this is very inconvenient for
> >> > >>> > >>> users, and it is not documented in Spark.
> >> > >>> > >>>
> >> > >>> > >>> I'm working on a patch to solve it by calling the protected method
> >> > >>> > >>> addURL in URLClassLoader to update the current default classloader, so
> >> > >>> > >>> there is no customClassLoader anymore. I wonder if this is a good way
> >> > >>> > >>> to go.
> >> > >>> > >>>
> >> > >>> > >>>   import java.io.IOException
> >> > >>> > >>>   import java.lang.reflect.Method
> >> > >>> > >>>   import java.net.{URL, URLClassLoader}
> >> > >>> > >>>
> >> > >>> > >>>   // Reflectively calls the protected URLClassLoader.addURL on `loader`.
> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader): Unit = {
> >> > >>> > >>>     try {
> >> > >>> > >>>       val method: Method =
> >> > >>> > >>>         classOf[URLClassLoader].getDeclaredMethod("addURL", classOf[URL])
> >> > >>> > >>>       method.setAccessible(true)
> >> > >>> > >>>       method.invoke(loader, url)
> >> > >>> > >>>     } catch {
> >> > >>> > >>>       case t: Throwable =>
> >> > >>> > >>>         // Keep the original failure as the cause.
> >> > >>> > >>>         throw new IOException("Error, could not add URL to system classloader", t)
> >> > >>> > >>>     }
> >> > >>> > >>>   }
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>> Sincerely,
> >> > >>> > >>>
> >> > >>> > >>> DB Tsai
> >> > >>> > >>> -------------------------------------------------------
> >> > >>> > >>> My Blog: https://www.dbtsai.com
> >> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
> >> > >>> > >>>
> >> > >>> > >>>
> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <so...@cloudera.com> wrote:
> >> > >>> > >>>
> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue here is
> >> > >>> > >>> > not reflection or the source of the JAR, but the ClassLoader. The
> >> > >>> > >>> > basic rules are this.
> >> > >>> > >>> >
> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This is usually
> >> > >>> > >>> > the ClassLoader that loaded whatever it is that first referenced Foo
> >> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader holding your
> >> > >>> > >>> > other app classes.
> >> > >>> > >>> >
> >> > >>> > >>> > ClassLoaders can have a parent-child relationship. ClassLoaders
> >> > >>> > >>> > always look in their parent before themselves.
> >> > >>> > >>> >
> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where your app is
> >> > >>> > >>> > loaded in a child ClassLoader, and you reference a class that Hadoop
> >> > >>> > >>> > or Tomcat also has (like a lib class) you will get the container's
> >> > >>> > >>> > version!)
> >> > >>> > >>> >
> >> > >>> > >>> > When you load an external JAR it has a separate ClassLoader which
> >> > >>> > >>> > does not necessarily bear any relation to the one containing your
> >> > >>> > >>> > app classes, so yeah it is not generally going to make "new Foo"
> >> > >>> > >>> > work.
> >> > >>> > >>> >
> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
> >> > >>> > >>> >
> >> > >>> > >>> > I would not call setContextClassLoader.
> >> > >>> > >>> >
> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
> >> > >>> > >>> > > I spoke with DB offline about this a little while ago and he confirmed
> >> > >>> > >>> > > that he was able to access the jar from the driver.
> >> > >>> > >>> > >
> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't directly
> >> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
> >> > >>> > >>> > >
> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
> >> > >>> > >>> > > ---
> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new URLClassLoader(
> >> > >>> > >>> > >         new URL[] { new File("myotherjar.jar").toURI().toURL() }, null);
> >> > >>> > >>> > >     Thread.currentThread().setContextClassLoader(urlClassLoader);
> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new MyClassFromMyOtherJar();
> >> > >>> > >>> > > ---
> >> > >>> > >>> > >
> >> > >>> > >>> > > I was able to load the class with reflection.
> >> > >>> > >>> >
> >> > >>> > >>>
> >> > >>> >
> >> > >>>
> >> > >>
> >> > >>
> >> >
> >>
>
