Hi Mich,

If I have understood your problem correctly, the spark-sql-kafka jar is being shadowed at run time by the Kafka client jar installed on the cluster. I have been in that situation before, and I can recommend resolving it with the Maven Shade plugin. The example below is for pom.xml; I am sure you will find an equivalent for sbt. The shade plugin relocates the classes while packaging, producing an uber jar:

<relocations>
  <relocation>
    <pattern>org.apache.kafka</pattern>
    <shadedPattern>shade.org.apache.kafka</shadedPattern>
  </relocation>
</relocations>
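Once the uber jar is built, you can quickly confirm the relocation took effect by listing the jar contents (the artifact name below is just a placeholder for yours):

    jar tf target/your-app-uber.jar | grep '^shade/org/apache/kafka' | head

If the Kafka classes show up under the shade/ prefix, the plugin has done its job and the cluster's own kafka-clients jar can no longer shadow them.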
Hope this helps.

Regards
Amit Joshi

On Wed, Apr 7, 2021 at 8:14 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> I did some tests. The concern is the SSS (Spark Structured Streaming) job
> running under YARN.
>
> Scenario 1) use spark-sql-kafka-0-10_2.12-3.1.0.jar
>
>    - Removed spark-sql-kafka-0-10_2.12-3.1.0.jar from anywhere on
>    CLASSPATH including $SPARK_HOME/jars
>    - Added the said jar file to spark-submit in client mode (the only
>    mode available to PySpark) with --jars
>    - spark-submit --master yarn --deploy-mode client --conf
>    spark.pyspark.virtualenv.enabled=true .. bla bla.. --driver-memory 4G
>    --executor-memory 4G --num-executors 2 --executor-cores 2 --jars
>    $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar xyz.py
>
> This works fine.
>
> Scenario 2) use spark-sql-kafka-0-10_2.12-3.1.1.jar in spark-submit
>
>    - spark-submit --master yarn --deploy-mode client --conf
>    spark.pyspark.virtualenv.enabled=true ..bla bla.. --driver-memory 4G
>    --executor-memory 4G --num-executors 2 --executor-cores 2 --jars
>    $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.1.jar xyz.py
>
> It failed with:
>
>    - Caused by: java.lang.NoSuchMethodError:
>    org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>
> Scenario 3) use the package as per Structured Streaming + Kafka
> Integration Guide (Kafka broker version 0.10.0 or higher) - Spark 3.1.1
> Documentation (apache.org)
> <https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying>
>
>    - spark-submit --master yarn --deploy-mode client --conf
>    spark.pyspark.virtualenv.enabled=true ..bla bla.. --driver-memory 4G
>    --executor-memory 4G --num-executors 2 --executor-cores 2 --packages
>    org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py
>
> It failed with:
>
>    - Caused by: java.lang.NoSuchMethodError:
>    org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
>
>
> On Wed, 7 Apr 2021 at 13:20, Gabor Somogyi <gabor.g.somo...@gmail.com>
> wrote:
>
>> +1 on Sean's opinion
>>
>> On Wed, Apr 7, 2021 at 2:17 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> You shouldn't be modifying your cluster install. You may at this point
>>> have conflicting, excess JARs in there somewhere. I'd start it over if
>>> you can.
>>>
>>> On Wed, Apr 7, 2021 at 7:15 AM Gabor Somogyi
>>> <gabor.g.somo...@gmail.com> wrote:
>>>
>>>> Not sure what you mean by "not working". You've added 3.1.1 to
>>>> packages, which uses:
>>>> * kafka-clients 2.6.0:
>>>> https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/pom.xml#L136
>>>> * commons-pool2 2.6.2:
>>>> https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/pom.xml#L183
>>>>
>>>> I think it is worth an end-to-end dependency-tree analysis of what is
>>>> really happening on the cluster...
>>>>
>>>> G
>>>>
>>>>
>>>> On Wed, Apr 7, 2021 at 11:11 AM Mich Talebzadeh
>>>> <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Gabor et al.,
>>>>>
>>>>> To be honest, I am not convinced this package --packages
>>>>> org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 is really working!
>>>>>
>>>>> I know for certain that spark-sql-kafka-0-10_2.12-3.1.0.jar works
>>>>> fine. I reported the package as working before because under
>>>>> $SPARK_HOME/jars on all nodes there was a copy of the 3.0.1 jar file.
>>>>> Also in $SPARK_HOME/conf we had the following entries:
>>>>>
>>>>> spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar
>>>>> spark.driver.extraClassPath $SPARK_HOME/jars/*.jar
>>>>> spark.executor.extraClassPath $SPARK_HOME/jars/*.jar
>>>>>
>>>>> So that jar file was picked up first anyway.
>>>>>
>>>>> The concern I have is that the package uses older versions of the jar
>>>>> files, namely the following in .ivy2/jars:
>>>>>
>>>>> -rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
>>>>> -rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
>>>>> -rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
>>>>> -rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
>>>>> -rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
>>>>> -rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
>>>>> -rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
>>>>> -rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
>>>>> -rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
>>>>>
>>>>> So I am not sure. Hence I want someone to verify this independently,
>>>>> in anger.
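P.S. Further to Gabor's dependency-tree suggestion: a crude but effective check is to search every jar the job can see for the class named in the NoSuchMethodError and see which copy wins. A sketch, with illustrative paths:

    # list every jar that contains KafkaTokenUtil
    for j in $SPARK_HOME/jars/*.jar $HOME/jars/*.jar ~/.ivy2/jars/*.jar; do
      unzip -l "$j" 2>/dev/null | grep -q 'org/apache/spark/kafka010/KafkaTokenUtil' \
        && echo "$j"
    done

If more than one copy of spark-token-provider-kafka-0-10 turns up (the spark-libs.jar archive behind spark.yarn.archive is worth unpacking and checking as well), an older copy being loaded ahead of 3.1.1 could explain the needTokenUpdate signature mismatch.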