Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Mich Talebzadeh Wed, 07 Apr 2021 07:43:52 -0700

Did some tests. The concern is SSS job running under YARN


*Scenario 1)*  use spark-sql-kafka-0-10_2.12-3.1.0.jar

   - Removed spark-sql-kafka-0-10_2.12-3.1.0.jar from anywhere on CLASSPATH
   including $SPARK_HOME/jars
   - Added the said jar file to spark-submit in client mode (the only mode
   available to PySpark) with --jars
   - spark-submit --master yarn --deploy-mode client --conf
   spark.pyspark.virtualenv.enabled=true .. bla bla..  --driver-memory 4G
   --executor-memory 4G --num-executors 2 --executor-cores 2 *--jars
   $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar *xyz.py

This works fine


*Scenario 2)* use spark-sql-kafka-0-10_2.12-3.1.1.jar in spark-submit



   -  spark-submit --master yarn --deploy-mode client --conf
   spark.pyspark.virtualenv.enabled=true ..bla bla.. --driver-memory 4G
   --executor-memory 4G --num-executors 2 --executor-cores 2 *--jars
   $HOME/jars/spark-sql-kafka-0-10_2.12-*3.1.1*.jar *xyz.py

it failed with



   - Caused by: java.lang.NoSuchMethodError:
   
org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z

Scenario 3) use the package as per Structured Streaming + Kafka Integration
Guide (Kafka broker version 0.10.0 or higher) - Spark 3.1.1 Documentation
(apache.org)
<https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying>


   - spark-submit --master yarn --deploy-mode client --conf
   spark.pyspark.virtualenv.enabled=true ..bla bla.. --driver-memory 4G
   --executor-memory 4G --num-executors 2 --executor-cores 2 *--packages
   org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 *xyz.py

it failed with

   - Caused by: java.lang.NoSuchMethodError:
   
org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z


HTH


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 7 Apr 2021 at 13:20, Gabor Somogyi <[email protected]>
wrote:

> +1 on Sean's opinion
>
> On Wed, Apr 7, 2021 at 2:17 PM Sean Owen <[email protected]> wrote:
>
>> You shouldn't be modifying your cluster install. You may at this point
>> have conflicting, excess JARs in there somewhere. I'd start it over if you
>> can.
>>
>> On Wed, Apr 7, 2021 at 7:15 AM Gabor Somogyi <[email protected]>
>> wrote:
>>
>>> Not sure what you mean not working. You've added 3.1.1 to packages which
>>> uses:
>>> * 2.6.0 kafka-clients:
>>> https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/pom.xml#L136
>>> * 2.6.2 commons pool:
>>> https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/pom.xml#L183
>>>
>>> I think it worth an end-to-end dep-tree analysis what is really
>>> happening on the cluster...
>>>
>>> G
>>>
>>>
>>> On Wed, Apr 7, 2021 at 11:11 AM Mich Talebzadeh <
>>> [email protected]> wrote:
>>>
>>>> Hi Gabor et. al.,
>>>>
>>>> To be honest I am not convinced this package --packages
>>>> org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 is really working!
>>>>
>>>> I know for definite that spark-sql-kafka-0-10_2.12-3.1.0.jar works
>>>> fine. I reported the package working before because under $SPARK_HOME/jars
>>>> on all nodes there was a copy 3.0.1 jar file. Also in $SPARK_HOME/conf we
>>>> had the following entries:
>>>>
>>>> spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar
>>>> spark.driver.extraClassPath        $SPARK_HOME/jars/*.jar
>>>> spark.executor.extraClassPath      $SPARK_HOME/jars/*.jar
>>>>
>>>> So the jar file was picked up first anyway.
>>>>
>>>> The concern I have is that that the package uses older version of jar
>>>> files, namely: the following in .ivy2/jars
>>>>
>>>> -rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14
>>>> com.github.luben_zstd-jni-1.4.8-1.jar
>>>> -rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019
>>>> org.apache.commons_commons-pool2-2.6.2.jar
>>>> -rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020
>>>> org.apache.kafka_kafka-clients-2.6.0.jar
>>>> -rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57
>>>> org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
>>>> -rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58
>>>> org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
>>>> -rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020
>>>> org.lz4_lz4-java-1.7.1.jar
>>>> -rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019
>>>> org.slf4j_slf4j-api-1.7.30.jar
>>>> -rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014
>>>> org.spark-project.spark_unused-1.0.0.jar
>>>> -rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10
>>>> org.xerial.snappy_snappy-java-1.1.8.2.jar
>>>>
>>>>
>>>> So I am not sure. Hence I want someone to verify this independently in
>>>> anger
>>>>
>>>>

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

Reply via email to