Thanks for the response, scrypso! I will try adding the extraClassPath option. Meanwhile, please find the full stack trace below (I have masked/removed references to proprietary code):
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
        at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
        at scala.collection.immutable.List.map(List.scala:293)
        at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
        at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
        at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)

Thanks again!

On Tue, Dec 13, 2022 at 9:52 PM scrypso <scry...@gmail.com> wrote:

> Two ideas you could try:
>
> You can try spark.driver.extraClassPath as well. Spark loads the user's
> jar in a child classloader, so Spark/Yarn/Hadoop can only see your classes
> reflectively. Hadoop's Configuration should use the thread ctx classloader,
> and Spark should set that to the loader that loads your jar.
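For the archive, the extraClassPath suggestion above would look something like the sketch below on the submit command. This is only a sketch, not a confirmed fix: the jar paths are hypothetical, and in cluster mode the path given to extraClassPath must exist at that location on the node where the driver/executor actually runs.

```shell
# Sketch only -- /path/on/node/my.jar is a hypothetical location; the jar
# must be present at this path on every node that runs a driver or executor.
# extraClassPath puts the jar on the JVM's own classpath (not a child
# classloader), so Hadoop's Configuration can resolve the class directly.
/spark/bin/spark-submit \
  --class MyMainClass \
  --deploy-mode cluster \
  --master yarn \
  --conf spark.executor.instances=6 \
  --conf spark.driver.extraClassPath=/path/on/node/my.jar \
  --conf spark.executor.extraClassPath=/path/on/node/my.jar \
  /path/to/my/jar
```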
> The extraClassPath option just adds jars directly to the Java command that
> creates the driver/executor.
>
> I can't immediately tell how your error might arise, unless there is some
> timing issue with the Spark and Hadoop setup. Can you share the full
> stacktrace of the ClassNotFound exception? That might tell us when Hadoop
> is looking up this class.
>
> Good luck!
> - scrypso
>
>
> On Tue, Dec 13, 2022, 17:05 Hariharan <hariharan...@gmail.com> wrote:
>
>> Missed mentioning it above, but just to add: the error is coming from the
>> driver. I tried using *--driver-class-path /path/to/my/jar* as well, but
>> no luck.
>>
>> Thanks!
>>
>> On Mon, Dec 12, 2022 at 4:21 PM Hariharan <hariharan...@gmail.com> wrote:
>>
>>> Hello folks,
>>>
>>> I have a Spark app with a custom implementation of
>>> *fs.s3a.s3.client.factory.impl*, which is packaged into the same jar.
>>> Output of *jar tf*:
>>>
>>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>>>
>>> However, when I run my Spark app with spark-submit in cluster mode,
>>> it fails with the following error:
>>>
>>> *java.lang.RuntimeException: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
>>> found*
>>>
>>> I tried:
>>> 1. Passing the jar to the *--jars* option (with the local path)
>>> 2. Passing the jar to the *spark.yarn.jars* option with an HDFS path
>>>
>>> but still got the same error.
>>>
>>> Any suggestions on what I'm missing?
>>>
>>> Other pertinent details:
>>> Spark version: 3.3.0
>>> Hadoop version: 3.3.4
>>>
>>> Command used to run the app:
>>> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
>>> --master yarn --conf spark.executor.instances=6 /path/to/my/jar*
>>>
>>> TIA!
>>