Hmm, did you mean spark.*driver*.extraClassPath? That is very odd then -
if you check the logs directory for the driver (on the cluster), I think
there should be a launch container log, where you can see the exact
command used to start the JVM (at the very end) and a line starting
"export CLASSPATH". You can double-check there that your jar is included
correctly. If it is, I think you have a really "interesting" issue on
your hands!
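
(If log aggregation is enabled, something along these lines should dump
that launch context for you - the application ID is a placeholder for
your own, and I'm not 100% sure the launch script is always included in
the aggregated logs:

    yarn logs -applicationId <app-id> | grep "export CLASSPATH"

Otherwise I believe it sits on the node itself as launch_container.sh
under the container's directory.)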
- scrypso

On Wed, Dec 14, 2022, 05:17 Hariharan <hariharan...@gmail.com> wrote:

> Hi scrypso,
>
> Thanks for the help so far, and I think you're definitely on to something
> here. I tried loading the class as you suggested with the code below:
>
> try {
>     Thread.currentThread().getContextClassLoader()
>         .loadClass(MyS3ClientFactory.class.getCanonicalName());
>     logger.info("Loaded custom class");
> } catch (ClassNotFoundException e) {
>     logger.error("Unable to load class", e);
> }
> return spark.read().option("mode", "DROPMALFORMED")
>     .format("avro").load(<paths>);
>
> I am able to load the custom class as above:
>
> *2022-12-14 04:12:34,158 INFO [Driver] utils.S3Reader - Loaded custom
> class*
>
> But the spark.read code below it tries to initialize the S3 client and
> is not able to load the same class.
>
> I tried adding
>
> *--conf spark.executor.extraClassPath=myjar*
>
> as well, but no luck :-(
>
> Thanks again!
>
> On Tue, Dec 13, 2022 at 10:09 PM scrypso <scry...@gmail.com> wrote:
>
>> I'm on my phone, so I can't compare with the Spark source, but that
>> looks to me like it should be well after the ctx loader has been set.
>> You could try printing the classpath of the loader
>> Thread.currentThread().getContextClassLoader(), or try to load your
>> class from it yourself to see if you get the same error.
>>
>> Can you see which thread is throwing the exception? If it is a
>> different thread than the "main" application thread, it might not have
>> the thread ctx loader set correctly. I can't see any of your classes in
>> the stacktrace - I assume that is because of your scrubbing, but it
>> could also be because this is run in a separate thread without the ctx
>> loader set.
>>
>> It also looks like Hadoop is caching the FileSystems somehow - perhaps
>> you can create the S3A filesystem yourself and hope it picks that up?
>> (Wild guess, no idea if that works or how hard it would be.)
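>>
>> Something like this is roughly what I had in mind - a completely
>> untested sketch; the config key is the one from your first mail (which
>> you presumably set already), and the bucket name is a placeholder:
>>
>>     import java.net.URI;
>>     import org.apache.hadoop.conf.Configuration;
>>     import org.apache.hadoop.fs.FileSystem;
>>
>>     // In the driver, before the spark.read() call:
>>     Configuration conf = spark.sparkContext().hadoopConfiguration();
>>     conf.set("fs.s3a.s3.client.factory.impl",
>>             MyS3ClientFactory.class.getName());
>>     // FileSystem.get() caches instances per scheme and authority, so
>>     // if this succeeds here (on a thread whose context classloader can
>>     // see your jar), the later read may reuse the cached instance
>>     // instead of re-initializing the S3A filesystem.
>>     FileSystem fs = FileSystem.get(URI.create("s3a://your-bucket/"), conf);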
>>
>> On Tue, Dec 13, 2022, 17:29 Hariharan <hariharan...@gmail.com> wrote:
>>
>>> Thanks for the response, scrypso! I will try adding the extraClassPath
>>> option. Meanwhile, please find the full stack trace below (I have
>>> masked/removed references to proprietary code):
>>>
>>> java.lang.RuntimeException: java.lang.RuntimeException:
>>> java.lang.ClassNotFoundException: Class foo.bar.MyS3ClientFactory not found
>>>     at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2720)
>>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:888)
>>>     at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:542)
>>>     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
>>>     at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
>>>     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
>>>     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
>>>     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
>>>     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
>>>     at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:752)
>>>     at scala.collection.immutable.List.map(List.scala:293)
>>>     at org.apache.spark.sql.execution.datasources.DataSource$.checkAndGlobPathIfNecessary(DataSource.scala:750)
>>>     at org.apache.spark.sql.execution.datasources.DataSource.checkAndGlobPathIfNecessary(DataSource.scala:579)
>>>     at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
>>>     at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>>>     at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>>>     at scala.Option.getOrElse(Option.scala:189)
>>>     at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>>>
>>> Thanks again!
>>>
>>> On Tue, Dec 13, 2022 at 9:52 PM scrypso <scry...@gmail.com> wrote:
>>>
>>>> Two ideas you could try:
>>>>
>>>> You can try spark.driver.extraClassPath as well. Spark loads the
>>>> user's jar in a child classloader, so Spark/YARN/Hadoop can only see
>>>> your classes reflectively. Hadoop's Configuration should use the
>>>> thread ctx classloader, and Spark should set that to the loader that
>>>> loads your jar. The extraClassPath option just adds jars directly to
>>>> the Java command that creates the driver/executor.
>>>>
>>>> I can't immediately tell how your error might arise, unless there is
>>>> some timing issue with the Spark and Hadoop setup. Can you share the
>>>> full stacktrace of the ClassNotFoundException? That might tell us
>>>> when Hadoop is looking up this class.
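>>>>
>>>> For example, something along these lines (paths and names are
>>>> placeholders) - with --jars the file should be localized into the
>>>> container's working directory under its base name, so I'd expect the
>>>> relative path to resolve, but worth double-checking:
>>>>
>>>>     spark-submit --class MyMainClass --deploy-mode cluster --master yarn \
>>>>         --jars /path/to/myjar.jar \
>>>>         --conf spark.driver.extraClassPath=myjar.jar \
>>>>         --conf spark.executor.extraClassPath=myjar.jar \
>>>>         /path/to/myjar.jar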
>>>>
>>>> Good luck!
>>>> - scrypso
>>>>
>>>> On Tue, Dec 13, 2022, 17:05 Hariharan <hariharan...@gmail.com> wrote:
>>>>
>>>>> Missed mentioning it above, but just to add: the error is coming
>>>>> from the driver. I tried using *--driver-class-path /path/to/my/jar*
>>>>> as well, but no luck.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Mon, Dec 12, 2022 at 4:21 PM Hariharan <hariharan...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello folks,
>>>>>>
>>>>>> I have a spark app with a custom implementation of
>>>>>> *fs.s3a.s3.client.factory.impl*, which is packaged into the same
>>>>>> jar. Output of *jar tf*:
>>>>>>
>>>>>> *2620 Mon Dec 12 11:23:00 IST 2022 aws/utils/MyS3ClientFactory.class*
>>>>>>
>>>>>> However, when I run my spark app with spark-submit in cluster mode,
>>>>>> it fails with the following error:
>>>>>>
>>>>>> *java.lang.RuntimeException: java.lang.RuntimeException:
>>>>>> java.lang.ClassNotFoundException: Class aws.utils.MyS3ClientFactory
>>>>>> not found*
>>>>>>
>>>>>> I tried:
>>>>>> 1. Passing the jar to the *--jars* option (with the local path)
>>>>>> 2. Passing the jar to the *spark.yarn.jars* option with an HDFS path
>>>>>>
>>>>>> but still got the same error.
>>>>>>
>>>>>> Any suggestions on what I'm missing?
>>>>>>
>>>>>> Other pertinent details:
>>>>>> Spark version: 3.3.0
>>>>>> Hadoop version: 3.3.4
>>>>>>
>>>>>> Command used to run the app:
>>>>>> */spark/bin/spark-submit --class MyMainClass --deploy-mode cluster
>>>>>> --master yarn --conf spark.executor.instances=6 /path/to/my/jar*
>>>>>>
>>>>>> TIA!