Hi scrypso,

Thanks for the help so far, and I think you're definitely on to something here. I tried loading the class as you suggested with the code below:
    try {
        Thread.currentThread().getContextClassLoader()
              .loadClass(MyS3ClientFactory.class.getCanonicalName());
        logger.info("Loaded custom class");
    } catch (ClassNotFoundException e) {
        logger.error("Unable to load class", e);
    }
    return spark.read().option("mode", "DROPMALFORMED").format("avro").load(<paths>);

I am able to load the custom class as above:

*2022-12-14 04:12:34,158 INFO [Driver] utils.S3Reader - Loaded custom class*

But the spark.read call below it tries to initialize the S3 client and is not able to load the same class. I tried adding *--conf spark.executor.extraClassPath=myjar* as well, but no luck :-(

Thanks again!

On Tue, Dec 13, 2022 at 10:09 PM scrypso <scry...@gmail.com> wrote:

> I'm on my phone, so can't compare with the Spark source, but that looks to
> me like it should be well after the ctx loader has been set. You could try
> printing the classpath of the loader returned by
> Thread.currentThread().getContextClassLoader(), or try to load your class
> from it yourself to see if you get the same error.
>
> Can you see which thread is throwing the exception? If it is a different
> thread than the "main" application thread, it might not have the thread ctx
> loader set correctly. I can't see any of your classes in the stacktrace - I
> assume that is because of your scrubbing, but it could also be because this
> is run in a separate thread without the ctx loader set.
>
> It also looks like Hadoop caches FileSystem instances somehow - perhaps you
> can create the S3A filesystem yourself and hope it picks that up? (Wild
> guess, no idea if that works or how hard it would be.)
>
> On Tue, Dec 13, 2022, 17:29 Hariharan <hariharan...@gmail.com> wrote:
>
>> Thanks for the response, scrypso! I will try adding the extraClassPath
>> option.
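The diagnostic scrypso suggests (print the context classloader chain and attempt the load by hand) can be sketched as a minimal standalone Java program. The class name `foo.bar.MyS3ClientFactory` is taken from the stack trace in this thread; run outside the driver, where that class is not on any classpath, the sketch takes the failure branch:

```java
public class LoaderDiag {
    public static void main(String[] args) {
        // Walk the context classloader chain. This is the loader that
        // Hadoop's Configuration typically consults (it captures the
        // thread context classloader when it is constructed).
        ClassLoader ctx = Thread.currentThread().getContextClassLoader();
        for (ClassLoader cl = ctx; cl != null; cl = cl.getParent()) {
            System.out.println("loader: " + cl);
        }
        // Attempt the same lookup Hadoop performs for the factory class.
        try {
            Class<?> c = Class.forName("foo.bar.MyS3ClientFactory", false, ctx);
            System.out.println("visible: " + c.getName());
        } catch (ClassNotFoundException e) {
            System.out.println("not visible to context classloader");
        }
    }
}
```

Dropping the same two steps into the driver right before the spark.read call would show whether the factory class is visible to the loader in effect on that thread at that point.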
Meanwhile, please find the full stack trace below (I have
>> masked/removed references to proprietary code):
>>
>> java.lang.RuntimeException: java.lang.RuntimeException:
>> java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
>>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
>>     at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
>>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
>>     at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
>>     at scala.collection.immutable.List.map(List.scala:293)
>>     at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
>>     at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
>>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>>     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>>     at scala.Option.getOrElse(Option.scala:189)
>>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>>
>> Thanks again!
>>
>> On Tue, Dec 13, 2022 at 9:52 PM scrypso <scry...@gmail.com> wrote:
>>
>>> Two ideas you could try:
>>>
>>> You can try spark.driver.extraClassPath as well.
>>> Spark loads the user's jar in a child classloader, so Spark/YARN/Hadoop
>>> can only see your classes reflectively. Hadoop's Configuration should use
>>> the thread ctx classloader, and Spark should set that to the loader that
>>> loads your jar. The extraClassPath option just adds jars directly to the
>>> Java command that creates the driver/executor.
>>>
>>> I can't immediately tell how your error might arise, unless there is
>>> some timing issue with the Spark and Hadoop setup. Can you share the full
>>> stacktrace of the ClassNotFoundException? That might tell us when Hadoop
>>> is looking up this class.
>>>
>>> Good luck!
>>> - scrypso
>>>
>>> On Tue, Dec 13, 2022, 17:05 Hariharan <hariharan...@gmail.com> wrote:
>>>
>>>> I forgot to mention this above, but just to add: the error is coming
>>>> from the driver. I tried using *--driver-class-path /path/to/my/jar* as
>>>> well, but no luck.
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Dec 12, 2022 at 4:21 PM Hariharan <hariharan...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello folks,
>>>>>
>>>>> I have a Spark app with a custom implementation of
>>>>> *fs.s3a.s3.client.factory.impl*, which is packaged into the same jar.
>>>>> Output of *jar tf*:
>>>>>
>>>>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>>>>>
>>>>> However, when I run my Spark app with spark-submit in cluster mode,
>>>>> it fails with the following error:
>>>>>
>>>>> *java.lang.RuntimeException: java.lang.RuntimeException:
>>>>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
>>>>> found*
>>>>>
>>>>> I tried:
>>>>> 1. Passing the jar to the *--jars* option (with a local path)
>>>>> 2. Passing the jar to the *spark.yarn.jars* option with an HDFS path
>>>>>
>>>>> but still the same error.
>>>>>
>>>>> Any suggestions on what I'm missing?
>>>>>
>>>>> Other pertinent details:
>>>>> Spark version: 3.3.0
>>>>> Hadoop version: 3.3.4
>>>>>
>>>>> Command used to run the app:
>>>>>
>>>>> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
>>>>> --master yarn --conf spark.executor.instances=6 /path/to/my/jar*
>>>>>
>>>>> TIA!
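For reference, the options discussed across this thread can be combined into a single spark-submit invocation. This is a sketch, not a confirmed fix: the jar paths are placeholders, and in cluster mode the extraClassPath values must be paths valid on the cluster hosts (with --jars on YARN, the jar is shipped into each container's working directory, so a bare relative name is commonly used there):

```shell
/spark/bin/spark-submit \
  --class MyMainClass \
  --deploy-mode cluster \
  --master yarn \
  --conf spark.executor.instances=6 \
  --conf spark.driver.extraClassPath=my.jar \
  --conf spark.executor.extraClassPath=my.jar \
  --jars /path/to/my/jar \
  /path/to/my/jar
```

Separately, if Hadoop's FileSystem cache is the culprit (scrypso's "wild guess" above), the Hadoop property fs.s3a.impl.disable.cache=true makes each lookup initialize a fresh S3AFileSystem instead of reusing a cached instance, which can help rule the cache in or out.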