Thanks for the response, scrypso! I will try adding the extraClassPath
option. Meanwhile, please find the full stack trace below (I have
masked/removed references to proprietary code).

java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
        at scala.collection.immutable.List.map(List.scala:293)
        at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
        at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
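
In case it's useful, here is roughly the command I plan to try next (the
paths below are placeholders; the only change from my current invocation is
the added spark.driver.extraClassPath conf):

*/spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
--master yarn --conf spark.driver.extraClassPath=/path/to/my/jar
--conf spark.executor.instances=6 /path/to/my/jar*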

Thanks again!

On Tue, Dec 13, 2022 at 9:52 PM scrypso <scry...@gmail.com> wrote:

> Two ideas you could try:
>
> You can try spark.driver.extraClassPath as well. Spark loads the user's
> jar in a child classloader, so Spark/Yarn/Hadoop can only see your classes
> reflectively. Hadoop's Configuration should use the thread context
> classloader, and Spark should set that to the loader that loads your jar.
> The extraClassPath option simply adds jars directly to the Java command
> line that launches the driver/executor, so your classes end up on the
> system classpath instead.
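>
> As a quick sanity check, you could also run something like this early in
> your driver (a hypothetical diagnostic, with the class name taken from
> your earlier mail) to see whether the class is visible from the thread
> context classloader:
>
> // Diagnostic sketch only: tries to resolve the factory class via the
> // thread context classloader, which Hadoop's Configuration falls back
> // to when no explicit classloader has been set.
> try {
>     Class.forName("aws.utils.MyS3ClientFactory", false,
>         Thread.currentThread().getContextClassLoader());
>     System.out.println("factory visible from context classloader");
> } catch (ClassNotFoundException e) {
>     System.out.println("factory NOT visible from context classloader");
> }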
>
> I can't immediately tell how your error might arise, unless there is some
> timing issue with the Spark and Hadoop setup. Can you share the full stack
> trace of the ClassNotFoundException? That might tell us when Hadoop is
> looking up this class.
>
> Good luck!
> - scrypso
>
>
> On Tue, Dec 13, 2022, 17:05 Hariharan <hariharan...@gmail.com> wrote:
>
>> I forgot to mention this above, but just to add: the error is coming from
>> the driver. I tried using *--driver-class-path /path/to/my/jar* as well,
>> but no luck.
>>
>> Thanks!
>>
>> On Mon, Dec 12, 2022 at 4:21 PM Hariharan <hariharan...@gmail.com> wrote:
>>
>>> Hello folks,
>>>
>>> I have a Spark app with a custom implementation of
>>> *fs.s3a.s3.client.factory.impl*, packaged into the same jar as the app
>>> itself. The output of *jar tf* confirms that the class is present:
>>>
>>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
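>>>
>>> For context, the factory is registered roughly like this (a simplified
>>> sketch, assuming a SparkSession named *spark*; the actual proprietary
>>> code differs):
>>>
>>> // Simplified sketch: register the custom factory on the Hadoop
>>> // configuration that Spark hands down to the S3A filesystem.
>>> spark.sparkContext().hadoopConfiguration().set(
>>>     "fs.s3a.s3.client.factory.impl", "aws.utils.MyS3ClientFactory");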
>>>
>>> However, when I run my Spark app with spark-submit in cluster mode, it
>>> fails with the following error:
>>>
>>> *java.lang.RuntimeException: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
>>> found*
>>>
>>> I tried:
>>> 1. Passing the jar to the *--jars* option (with the local path)
>>> 2. Passing the jar to the *spark.yarn.jars* option with an HDFS path
>>>
>>> but I still get the same error in both cases.
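>>>
>>> Concretely, the two variants looked roughly like this (the paths are
>>> placeholders for the real ones):
>>>
>>> */spark/bin/spark-submit ... --jars /path/to/my/jar ...*
>>> */spark/bin/spark-submit ... --conf spark.yarn.jars=hdfs:///jars/my.jar ...*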
>>>
>>> Any suggestions on what I'm missing?
>>>
>>> Other pertinent details:
>>> Spark version: 3.3.0
>>> Hadoop version: 3.3.4
>>>
>>> Command used to run the app:
>>> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
>>> --master yarn --conf spark.executor.instances=6 /path/to/my/jar*
>>>
>>> TIA!
>>>
>>
