Hey I just looked at the fix here:
https://github.com/apache/spark/pull/848

Given that this is quite simple, maybe it's best to just go with this
and just explain that we don't support adding jars dynamically in YARN
in Spark 1.0. That seems like a reasonable thing to do.

On Wed, May 21, 2014 at 3:15 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> Of these two solutions I'd definitely prefer 2 in the short term. I'd
> imagine the fix is very straightforward (it would mostly just be
> remove code), and we'd be making this more consistent with the
> standalone mode which makes things way easier to reason about.
>
> In the long term we'll definitely want to exploit the distributed
> cache more, but at this point it's premature optimization at a high
> complexity cost. Writing stuff to HDFS through is so slow anyways I'd
> guess that serving it directly from the driver is still faster in most
> cases (though for very large jar sizes or very large clusters, yes,
> we'll need the distributed cache).
>
> - Patrick
>
> On Wed, May 21, 2014 at 2:41 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> That's a good example. If we really want to cover that case, there are
>> two solutions:
>>
>> 1. Follow DB's patch, adding jars to the system classloader. Then we
>> cannot put a user class in front of an existing class.
>> 2. Do not send the primary jar and secondary jars to executors'
>> distributed cache. Instead, add them to "spark.jars" in SparkSubmit
>> and serve them via http by called sc.addJar in SparkContext.
>>
>> What is your preference?
>>
>> On Wed, May 21, 2014 at 2:27 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>> Is that an assumption we can make?  I think we'd run into an issue in this
>>> situation:
>>>
>>> *In primary jar:*
>>> def makeDynamicObject(clazz: String) = Class.forName(clazz).newInstance()
>>>
>>> *In app code:*
>>> sc.addJar("dynamicjar.jar")
>>> ...
>>> rdd.map(x => makeDynamicObject("some.class.from.DynamicJar"))
>>>
>>> It might be fair to say that the user should make sure to use the context
>>> classloader when instantiating dynamic classes, but I think it's weird that
>>> this code would work on Spark standalone but not on YARN.
>>>
>>> -Sandy
>>>
>>>
>>> On Wed, May 21, 2014 at 2:10 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>
>>>> I think adding jars dynamically should work as long as the primary jar
>>>> and the secondary jars do not depend on dynamically added jars, which
>>>> should be the correct logic. -Xiangrui
>>>>
>>>> On Wed, May 21, 2014 at 1:40 PM, DB Tsai <dbt...@stanford.edu> wrote:
>>>> > This will be another separate story.
>>>> >
>>>> > Since in the yarn deployment, as Sandy said, the app.jar will be always
>>>> in
>>>> > the systemclassloader which means any object instantiated in app.jar will
>>>> > have parent loader of systemclassloader instead of custom one. As a
>>>> result,
>>>> > the custom class loader in yarn will never work without specifically
>>>> using
>>>> > reflection.
>>>> >
>>>> > Solution will be not using system classloader in the classloader
>>>> hierarchy,
>>>> > and add all the resources in system one into custom one. This is the
>>>> > approach of tomcat takes.
>>>> >
>>>> > Or we can directly overwirte the system class loader by calling the
>>>> > protected method `addURL` which will not work and throw exception if the
>>>> > code is wrapped in security manager.
>>>> >
>>>> >
>>>> > Sincerely,
>>>> >
>>>> > DB Tsai
>>>> > -------------------------------------------------------
>>>> > My Blog: https://www.dbtsai.com
>>>> > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >
>>>> >
>>>> > On Wed, May 21, 2014 at 1:13 PM, Sandy Ryza <sandy.r...@cloudera.com>
>>>> wrote:
>>>> >
>>>> >> This will solve the issue for jars added upon application submission,
>>>> but,
>>>> >> on top of this, we need to make sure that anything dynamically added
>>>> >> through sc.addJar works as well.
>>>> >>
>>>> >> To do so, we need to make sure that any jars retrieved via the driver's
>>>> >> HTTP server are loaded by the same classloader that loads the jars
>>>> given on
>>>> >> app submission.  To achieve this, we need to either use the same
>>>> >> classloader for both system jars and user jars, or make sure that the
>>>> user
>>>> >> jars given on app submission are under the same classloader used for
>>>> >> dynamically added jars.
>>>> >>
>>>> >> On Tue, May 20, 2014 at 5:59 PM, Xiangrui Meng <men...@gmail.com>
>>>> wrote:
>>>> >>
>>>> >> > Talked with Sandy and DB offline. I think the best solution is sending
>>>> >> > the secondary jars to the distributed cache of all containers rather
>>>> >> > than just the master, and set the classpath to include spark jar,
>>>> >> > primary app jar, and secondary jars before executor starts. In this
>>>> >> > way, user only needs to specify secondary jars via --jars instead of
>>>> >> > calling sc.addJar inside the code. It also solves the scalability
>>>> >> > problem of serving all the jars via http.
>>>> >> >
>>>> >> > If this solution sounds good, I can try to make a patch.
>>>> >> >
>>>> >> > Best,
>>>> >> > Xiangrui
>>>> >> >
>>>> >> > On Mon, May 19, 2014 at 10:04 PM, DB Tsai <dbt...@stanford.edu>
>>>> wrote:
>>>> >> > > In 1.0, there is a new option for users to choose which classloader
>>>> has
>>>> >> > > higher priority via spark.files.userClassPathFirst, I decided to
>>>> submit
>>>> >> > the
>>>> >> > > PR for 0.9 first. We use this patch in our lab and we can use those
>>>> >> jars
>>>> >> > > added by sc.addJar without reflection.
>>>> >> > >
>>>> >> > > https://github.com/apache/spark/pull/834
>>>> >> > >
>>>> >> > > Can anyone comment if it's a good approach?
>>>> >> > >
>>>> >> > > Thanks.
>>>> >> > >
>>>> >> > >
>>>> >> > > Sincerely,
>>>> >> > >
>>>> >> > > DB Tsai
>>>> >> > > -------------------------------------------------------
>>>> >> > > My Blog: https://www.dbtsai.com
>>>> >> > > LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >> > >
>>>> >> > >
>>>> >> > > On Mon, May 19, 2014 at 7:42 PM, DB Tsai <dbt...@stanford.edu>
>>>> wrote:
>>>> >> > >
>>>> >> > >> Good summary! We fixed it in branch 0.9 since our production is
>>>> still
>>>> >> in
>>>> >> > >> 0.9. I'm porting it to 1.0 now, and hopefully will submit PR for
>>>> 1.0
>>>> >> > >> tonight.
>>>> >> > >>
>>>> >> > >>
>>>> >> > >> Sincerely,
>>>> >> > >>
>>>> >> > >> DB Tsai
>>>> >> > >> -------------------------------------------------------
>>>> >> > >> My Blog: https://www.dbtsai.com
>>>> >> > >> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >> > >>
>>>> >> > >>
>>>> >> > >> On Mon, May 19, 2014 at 7:38 PM, Sandy Ryza <
>>>> sandy.r...@cloudera.com
>>>> >> > >wrote:
>>>> >> > >>
>>>> >> > >>> It just hit me why this problem is showing up on YARN and not on
>>>> >> > >>> standalone.
>>>> >> > >>>
>>>> >> > >>> The relevant difference between YARN and standalone is that, on
>>>> YARN,
>>>> >> > the
>>>> >> > >>> app jar is loaded by the system classloader instead of Spark's
>>>> custom
>>>> >> > URL
>>>> >> > >>> classloader.
>>>> >> > >>>
>>>> >> > >>> On YARN, the system classloader knows about [the classes in the
>>>> spark
>>>> >> > >>> jars,
>>>> >> > >>> the classes in the primary app jar].   The custom classloader
>>>> knows
>>>> >> > about
>>>> >> > >>> [the classes in secondary app jars] and has the system
>>>> classloader as
>>>> >> > its
>>>> >> > >>> parent.
>>>> >> > >>>
>>>> >> > >>> A few relevant facts (mostly redundant with what Sean pointed
>>>> out):
>>>> >> > >>> * Every class has a classloader that loaded it.
>>>> >> > >>> * When an object of class B is instantiated inside of class A, the
>>>> >> > >>> classloader used for loading B is the classloader that was used
>>>> for
>>>> >> > >>> loading
>>>> >> > >>> A.
>>>> >> > >>> * When a classloader fails to load a class, it lets its parent
>>>> >> > classloader
>>>> >> > >>> try.  If its parent succeeds, its parent becomes the "classloader
>>>> >> that
>>>> >> > >>> loaded it".
>>>> >> > >>>
>>>> >> > >>> So suppose class B is in a secondary app jar and class A is in the
>>>> >> > primary
>>>> >> > >>> app jar:
>>>> >> > >>> 1. The custom classloader will try to load class A.
>>>> >> > >>> 2. It will fail, because it only knows about the secondary jars.
>>>> >> > >>> 3. It will delegate to its parent, the system classloader.
>>>> >> > >>> 4. The system classloader will succeed, because it knows about the
>>>> >> > primary
>>>> >> > >>> app jar.
>>>> >> > >>> 5. A's classloader will be the system classloader.
>>>> >> > >>> 6. A tries to instantiate an instance of class B.
>>>> >> > >>> 7. B will be loaded with A's classloader, which is the system
>>>> >> > classloader.
>>>> >> > >>> 8. Loading B will fail, because A's classloader, which is the
>>>> system
>>>> >> > >>> classloader, doesn't know about the secondary app jars.
>>>> >> > >>>
>>>> >> > >>> In Spark standalone, A and B are both loaded by the custom
>>>> >> > classloader, so
>>>> >> > >>> this issue doesn't come up.
>>>> >> > >>>
>>>> >> > >>> -Sandy
>>>> >> > >>>
>>>> >> > >>> On Mon, May 19, 2014 at 7:07 PM, Patrick Wendell <
>>>> pwend...@gmail.com
>>>> >> >
>>>> >> > >>> wrote:
>>>> >> > >>>
>>>> >> > >>> > Having a user add define a custom class inside of an added jar
>>>> and
>>>> >> > >>> > instantiate it directly inside of an executor is definitely
>>>> >> supported
>>>> >> > >>> > in Spark and has been for a really long time (several years).
>>>> This
>>>> >> is
>>>> >> > >>> > something we do all the time in Spark.
>>>> >> > >>> >
>>>> >> > >>> > DB - I'd hold off on a re-architecting of this until we identify
>>>> >> > >>> > exactly what is causing the bug you are running into.
>>>> >> > >>> >
>>>> >> > >>> > In a nutshell, when the bytecode "new Foo()" is run on the
>>>> >> executor,
>>>> >> > >>> > it will ask the driver for the class over HTTP using a custom
>>>> >> > >>> > classloader. Something in that pipeline is breaking here,
>>>> possibly
>>>> >> > >>> > related to the YARN deployment stuff.
>>>> >> > >>> >
>>>> >> > >>> >
>>>> >> > >>> > On Mon, May 19, 2014 at 12:29 AM, Sean Owen <so...@cloudera.com
>>>> >
>>>> >> > wrote:
>>>> >> > >>> > > I don't think a customer classloader is necessary.
>>>> >> > >>> > >
>>>> >> > >>> > > Well, it occurs to me that this is no new problem. Hadoop,
>>>> >> Tomcat,
>>>> >> > etc
>>>> >> > >>> > > all run custom user code that creates new user objects without
>>>> >> > >>> > > reflection. I should go see how that's done. Maybe it's
>>>> totally
>>>> >> > valid
>>>> >> > >>> > > to set the thread's context classloader for just this purpose,
>>>> >> and
>>>> >> > I
>>>> >> > >>> > > am not thinking clearly.
>>>> >> > >>> > >
>>>> >> > >>> > > On Mon, May 19, 2014 at 8:26 AM, Andrew Ash <
>>>> >> and...@andrewash.com>
>>>> >> > >>> > wrote:
>>>> >> > >>> > >> Sounds like the problem is that classloaders always look in
>>>> >> their
>>>> >> > >>> > parents
>>>> >> > >>> > >> before themselves, and Spark users want executors to pick up
>>>> >> > classes
>>>> >> > >>> > from
>>>> >> > >>> > >> their custom code before the ones in Spark plus its
>>>> >> dependencies.
>>>> >> > >>> > >>
>>>> >> > >>> > >> Would a custom classloader that delegates to the parent after
>>>> >> > first
>>>> >> > >>> > >> checking itself fix this up?
>>>> >> > >>> > >>
>>>> >> > >>> > >>
>>>> >> > >>> > >> On Mon, May 19, 2014 at 12:17 AM, DB Tsai <
>>>> dbt...@stanford.edu>
>>>> >> > >>> wrote:
>>>> >> > >>> > >>
>>>> >> > >>> > >>> Hi Sean,
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> It's true that the issue here is classloader, and due to the
>>>> >> > >>> > classloader
>>>> >> > >>> > >>> delegation model, users have to use reflection in the
>>>> executors
>>>> >> > to
>>>> >> > >>> > pick up
>>>> >> > >>> > >>> the classloader in order to use those classes added by
>>>> >> sc.addJars
>>>> >> > >>> APIs.
>>>> >> > >>> > >>> However, it's very inconvenience for users, and not
>>>> documented
>>>> >> in
>>>> >> > >>> > spark.
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> I'm working on a patch to solve it by calling the protected
>>>> >> > method
>>>> >> > >>> > addURL
>>>> >> > >>> > >>> in URLClassLoader to update the current default
>>>> classloader, so
>>>> >> > no
>>>> >> > >>> > >>> customClassLoader anymore. I wonder if this is an good way
>>>> to
>>>> >> go.
>>>> >> > >>> > >>>
>>>> >> > >>> > >>>   private def addURL(url: URL, loader: URLClassLoader){
>>>> >> > >>> > >>>     try {
>>>> >> > >>> > >>>       val method: Method =
>>>> >> > >>> > >>> classOf[URLClassLoader].getDeclaredMethod("addURL",
>>>> >> classOf[URL])
>>>> >> > >>> > >>>       method.setAccessible(true)
>>>> >> > >>> > >>>       method.invoke(loader, url)
>>>> >> > >>> > >>>     }
>>>> >> > >>> > >>>     catch {
>>>> >> > >>> > >>>       case t: Throwable => {
>>>> >> > >>> > >>>         throw new IOException("Error, could not add URL to
>>>> >> system
>>>> >> > >>> > >>> classloader")
>>>> >> > >>> > >>>       }
>>>> >> > >>> > >>>     }
>>>> >> > >>> > >>>   }
>>>> >> > >>> > >>>
>>>> >> > >>> > >>>
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> Sincerely,
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> DB Tsai
>>>> >> > >>> > >>> -------------------------------------------------------
>>>> >> > >>> > >>> My Blog: https://www.dbtsai.com
>>>> >> > >>> > >>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>>> >> > >>> > >>>
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> On Sun, May 18, 2014 at 11:57 PM, Sean Owen <
>>>> >> so...@cloudera.com>
>>>> >> > >>> > wrote:
>>>> >> > >>> > >>>
>>>> >> > >>> > >>> > I might be stating the obvious for everyone, but the issue
>>>> >> > here is
>>>> >> > >>> > not
>>>> >> > >>> > >>> > reflection or the source of the JAR, but the ClassLoader.
>>>> The
>>>> >> > >>> basic
>>>> >> > >>> > >>> > rules are this.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > "new Foo" will use the ClassLoader that defines Foo. This
>>>> is
>>>> >> > >>> usually
>>>> >> > >>> > >>> > the ClassLoader that loaded whatever it is that first
>>>> >> > referenced
>>>> >> > >>> Foo
>>>> >> > >>> > >>> > and caused it to be loaded -- usually the ClassLoader
>>>> holding
>>>> >> > your
>>>> >> > >>> > >>> > other app classes.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > ClassLoaders can have a parent-child relationship.
>>>> >> ClassLoaders
>>>> >> > >>> > always
>>>> >> > >>> > >>> > look in their parent before themselves.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > (Careful then -- in contexts like Hadoop or Tomcat where
>>>> your
>>>> >> > app
>>>> >> > >>> is
>>>> >> > >>> > >>> > loaded in a child ClassLoader, and you reference a class
>>>> that
>>>> >> > >>> Hadoop
>>>> >> > >>> > >>> > or Tomcat also has (like a lib class) you will get the
>>>> >> > container's
>>>> >> > >>> > >>> > version!)
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > When you load an external JAR it has a separate
>>>> ClassLoader
>>>> >> > which
>>>> >> > >>> > does
>>>> >> > >>> > >>> > not necessarily bear any relation to the one containing
>>>> your
>>>> >> > app
>>>> >> > >>> > >>> > classes, so yeah it is not generally going to make "new
>>>> Foo"
>>>> >> > work.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > Reflection lets you pick the ClassLoader, yes.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > I would not call setContextClassLoader.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>> > On Mon, May 19, 2014 at 12:00 AM, Sandy Ryza <
>>>> >> > >>> > sandy.r...@cloudera.com>
>>>> >> > >>> > >>> > wrote:
>>>> >> > >>> > >>> > > I spoke with DB offline about this a little while ago
>>>> and
>>>> >> he
>>>> >> > >>> > confirmed
>>>> >> > >>> > >>> > that
>>>> >> > >>> > >>> > > he was able to access the jar from the driver.
>>>> >> > >>> > >>> > >
>>>> >> > >>> > >>> > > The issue appears to be a general Java issue: you can't
>>>> >> > directly
>>>> >> > >>> > >>> > > instantiate a class from a dynamically loaded jar.
>>>> >> > >>> > >>> > >
>>>> >> > >>> > >>> > > I reproduced it locally outside of Spark with:
>>>> >> > >>> > >>> > > ---
>>>> >> > >>> > >>> > >     URLClassLoader urlClassLoader = new
>>>> URLClassLoader(new
>>>> >> > >>> URL[] {
>>>> >> > >>> > new
>>>> >> > >>> > >>> > > File("myotherjar.jar").toURI().toURL() }, null);
>>>> >> > >>> > >>> > >
>>>> >> > >>> Thread.currentThread().setContextClassLoader(urlClassLoader);
>>>> >> > >>> > >>> > >     MyClassFromMyOtherJar obj = new
>>>> >> MyClassFromMyOtherJar();
>>>> >> > >>> > >>> > > ---
>>>> >> > >>> > >>> > >
>>>> >> > >>> > >>> > > I was able to load the class with reflection.
>>>> >> > >>> > >>> >
>>>> >> > >>> > >>>
>>>> >> > >>> >
>>>> >> > >>>
>>>> >> > >>
>>>> >> > >>
>>>> >> >
>>>> >>
>>>>

Reply via email to