Hi scrypso,

Thanks for the help so far, and I think you're definitely on to something here. I tried loading the class as you suggested with the code below:
    try {
        Thread.currentThread().getContextClassLoader()
              .loadClass(MyS3ClientFactory.class.getCanonicalName());
        logger.info("Loaded custom class");
    } catch (ClassNotFoundException e) {
        logger.error("Unable to load class", e);
    }
    return spark.read().option("mode", "DROPMALFORMED").format("avro").load(<paths>);

I am able to load the custom class as above:

*2022-12-14 04:12:34,158 INFO [Driver] utils.S3Reader - Loaded custom class*

But the spark.read call below it tries to initialize the S3 client and is not able to load the same class. I tried adding *--conf spark.executor.extraClassPath=myjar* as well, but no luck :-(

Thanks again!

On Tue, Dec 13, 2022 at 10:09 PM scrypso <scry...@gmail.com> wrote:

> I'm on my phone, so can't compare with the Spark source, but that looks to
> me like it should be well after the ctx loader has been set. You could try
> printing the classpath of the loader returned by
> Thread.currentThread().getContextClassLoader(), or try to load your class
> from it yourself to see if you get the same error.
>
> Can you see which thread is throwing the exception? If it is a different
> thread than the "main" application thread, it might not have the thread ctx
> loader set correctly. I can't see any of your classes in the stacktrace - I
> assume that is because of your scrubbing, but it could also be because this
> is run in a separate thread without the ctx loader set.
>
> It also looks like Hadoop caches FileSystem instances somehow - perhaps you
> can create the S3A filesystem yourself and hope it picks that up? (Wild
> guess, no idea if that works or how hard it would be.)
>
> On Tue, Dec 13, 2022, 17:29 Hariharan <hariharan...@gmail.com> wrote:
>
>> Thanks for the response, scrypso! I will try adding the extraClassPath
>> option.
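The diagnostic scrypso suggests (print the context classloader chain and attempt the load by hand) can be sketched as a minimal standalone Java program. The class name `foo.bar.MyS3ClientFactory` is taken from the stack trace in this thread; run outside the driver, where that class is not on any classpath, the sketch takes the failure branch:

```java
public class LoaderDiag {
    public static void main(String[] args) {
        // Walk the context classloader chain. This is the loader that
        // Hadoop's Configuration typically consults (it captures the
        // thread context classloader when it is constructed).
        ClassLoader ctx = Thread.currentThread().getContextClassLoader();
        for (ClassLoader cl = ctx; cl != null; cl = cl.getParent()) {
            System.out.println("loader: " + cl);
        }
        // Attempt the same lookup Hadoop performs for the factory class.
        try {
            Class<?> c = Class.forName("foo.bar.MyS3ClientFactory", false, ctx);
            System.out.println("visible: " + c.getName());
        } catch (ClassNotFoundException e) {
            System.out.println("not visible to context classloader");
        }
    }
}
```

Dropping the same two steps into the driver right before the spark.read call would show whether the factory class is visible to the loader in effect on that thread at that point.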
Meanwhile, please find the full stack trace below (I have
>> masked/removed references to proprietary code):
>>
>> java.lang.RuntimeException: java.lang.RuntimeException:
>> java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
>>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
>>     at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
>>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
>>     at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
>>     at scala.collection.immutable.List.map(List.scala:293)
>>     at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
>>     at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
>>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>>     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>>     at scala.Option.getOrElse(Option.scala:189)
>>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>>
>> Thanks again!
>>
>> On Tue, Dec 13, 2022 at 9:52 PM scrypso <scry...@gmail.com> wrote:
>>
>>> Two ideas you could try:
>>>
>>> You can try spark.driver.extraClassPath as well.
>>> Spark loads the user's jar in a child classloader, so Spark/YARN/Hadoop
>>> can only see your classes reflectively. Hadoop's Configuration should use
>>> the thread ctx classloader, and Spark should set that to the loader that
>>> loads your jar. The extraClassPath option just adds jars directly to the
>>> Java command that creates the driver/executor.
>>>
>>> I can't immediately tell how your error might arise, unless there is
>>> some timing issue with the Spark and Hadoop setup. Can you share the full
>>> stacktrace of the ClassNotFoundException? That might tell us when Hadoop
>>> is looking up this class.
>>>
>>> Good luck!
>>> - scrypso
>>>
>>> On Tue, Dec 13, 2022, 17:05 Hariharan <hariharan...@gmail.com> wrote:
>>>
>>>> I forgot to mention this above, but just to add: the error is coming
>>>> from the driver. I tried using *--driver-class-path /path/to/my/jar* as
>>>> well, but no luck.
>>>>
>>>> Thanks!
>>>>
>>>> On Mon, Dec 12, 2022 at 4:21 PM Hariharan <hariharan...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello folks,
>>>>>
>>>>> I have a Spark app with a custom implementation of
>>>>> *fs.s3a.s3.client.factory.impl*, which is packaged into the same jar.
>>>>> Output of *jar tf*:
>>>>>
>>>>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>>>>>
>>>>> However, when I run my Spark app with spark-submit in cluster mode,
>>>>> it fails with the following error:
>>>>>
>>>>> *java.lang.RuntimeException: java.lang.RuntimeException:
>>>>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory not
>>>>> found*
>>>>>
>>>>> I tried:
>>>>> 1. Passing the jar to the *--jars* option (with a local path)
>>>>> 2. Passing the jar to the *spark.yarn.jars* option with an HDFS path
>>>>>
>>>>> but still the same error.
>>>>>
>>>>> Any suggestions on what I'm missing?
>>>>>
>>>>> Other pertinent details:
>>>>> Spark version: 3.3.0
>>>>> Hadoop version: 3.3.4
>>>>>
>>>>> Command used to run the app:
>>>>>
>>>>> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
>>>>> --master yarn --conf spark.executor.instances=6 /path/to/my/jar*
>>>>>
>>>>> TIA!
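For reference, the options discussed across this thread can be combined into a single spark-submit invocation. This is a sketch, not a confirmed fix: the jar paths are placeholders, and in cluster mode the extraClassPath values must be paths valid on the cluster hosts (with --jars on YARN, the jar is shipped into each container's working directory, so a bare relative name is commonly used there):

```shell
/spark/bin/spark-submit \
  --class MyMainClass \
  --deploy-mode cluster \
  --master yarn \
  --conf spark.executor.instances=6 \
  --conf spark.driver.extraClassPath=my.jar \
  --conf spark.executor.extraClassPath=my.jar \
  --jars /path/to/my/jar \
  /path/to/my/jar
```

Separately, if Hadoop's FileSystem cache is the culprit (scrypso's "wild guess" above), the Hadoop property fs.s3a.impl.disable.cache=true makes each lookup initialize a fresh S3AFileSystem instead of reusing a cached instance, which can help rule the cache in or out.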