Hi Amit,

Many thanks for your suggestion.

My problem is that I am using PySpark in this particular case, and there is
no SBT or Maven build, as there is with Scala, to produce an uber jar with
shading.

So, regrettably, the only way I could resolve the problem was to add the
jar file to spark-submit via --jars.
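
For reference, this is the invocation that worked (scenario 1 in the thread
below, trimmed to just the relevant flags):

    spark-submit --master yarn --deploy-mode client \
      --jars $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar \
      xyz.py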

Regards

Mich



On Wed, 7 Apr 2021 at 17:02, Amit Joshi <mailtojoshia...@gmail.com> wrote:

> Hi Mich,
>
> If I understood your problem correctly, the spark-kafka jar is shadowed
> at run time by the installed kafka-clients jar.
> I have been in that situation before.
> I can recommend resolving the issue with the shade plugin. The example I
> am pasting here works in pom.xml.
> I am quite sure you will find something similar for sbt as well.
> This is the maven-shade-plugin relocation config: it renames the kafka
> classes while packaging, so the resulting uber jar carries its own,
> non-conflicting copy. It goes inside the plugin's <configuration> block
> in pom.xml:
>
> <relocations>
>     <relocation>
>         <pattern>org.apache.kafka</pattern>
>         <shadedPattern>shade.org.apache.kafka</shadedPattern>
>     </relocation>
> </relocations>
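>
> Once the shade goal is bound to the package phase, a normal mvn package
> produces the relocated uber jar, which is then submitted as usual. A
> sketch (artifact name and main class are hypothetical):
>
>     # hypothetical artifact and main class
>     mvn package
>     spark-submit --master yarn --class com.example.Main target/myapp-1.0.jar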
>
> Hope this helps.
>
> Regards
> Amit Joshi
>
> On Wed, Apr 7, 2021 at 8:14 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>>
>> Did some tests. The concern is a Spark Structured Streaming (SSS) job
>> running under YARN.
>>
>>
>> Scenario 1) Use spark-sql-kafka-0-10_2.12-3.1.0.jar
>>
>>    - Removed spark-sql-kafka-0-10_2.12-3.1.0.jar from anywhere on the
>>    CLASSPATH, including $SPARK_HOME/jars
>>    - Added the said jar file to spark-submit in client mode (the only
>>    mode available to PySpark) with --jars
>>    - spark-submit --master yarn --deploy-mode client --conf
>>    spark.pyspark.virtualenv.enabled=true .. bla bla.. --driver-memory 4G
>>    --executor-memory 4G --num-executors 2 --executor-cores 2 --jars
>>    $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.0.jar xyz.py
>>
>> This works fine
>>
>>
>> Scenario 2) Use spark-sql-kafka-0-10_2.12-3.1.1.jar in spark-submit
>>
>>    - spark-submit --master yarn --deploy-mode client --conf
>>    spark.pyspark.virtualenv.enabled=true ..bla bla.. --driver-memory 4G
>>    --executor-memory 4G --num-executors 2 --executor-cores 2 --jars
>>    $HOME/jars/spark-sql-kafka-0-10_2.12-3.1.1.jar xyz.py
>>
>> It failed with:
>>
>>    - Caused by: java.lang.NoSuchMethodError:
>>      org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>>
>> Scenario 3) Use the package as per the Structured Streaming + Kafka
>> Integration Guide (Kafka broker version 0.10.0 or higher) - Spark 3.1.1
>> Documentation (apache.org)
>> <https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying>
>>
>>    - spark-submit --master yarn --deploy-mode client --conf
>>    spark.pyspark.virtualenv.enabled=true ..bla bla.. --driver-memory 4G
>>    --executor-memory 4G --num-executors 2 --executor-cores 2 --packages
>>    org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 xyz.py
>>
>> It failed with the same error:
>>
>>    - Caused by: java.lang.NoSuchMethodError:
>>      org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
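>>
>> The missing method lives in spark-token-provider-kafka-0-10, so one
>> plausible cause is an older copy of that jar elsewhere on the cluster
>> classpath winning over the 3.1.1 one. A sketch of a spark-submit that
>> supplies the connector and its direct dependencies explicitly (assuming
>> the jars have been downloaded to $HOME/jars; versions as pinned by Spark
>> 3.1.1):
>>
>>    # assumption: all four jars downloaded to $HOME/jars
>>    # (versions per Spark 3.1.1: kafka-clients 2.6.0, commons-pool2 2.6.2)
>>    JARS=$HOME/jars/spark-sql-kafka-0-10_2.12-3.1.1.jar
>>    JARS+=,$HOME/jars/spark-token-provider-kafka-0-10_2.12-3.1.1.jar
>>    JARS+=,$HOME/jars/kafka-clients-2.6.0.jar
>>    JARS+=,$HOME/jars/commons-pool2-2.6.2.jar
>>    spark-submit --master yarn --deploy-mode client --jars "$JARS" xyz.py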
>>
>>
>> HTH
>>
>>
>>
>> On Wed, 7 Apr 2021 at 13:20, Gabor Somogyi <gabor.g.somo...@gmail.com>
>> wrote:
>>
>>> +1 on Sean's opinion
>>>
>>> On Wed, Apr 7, 2021 at 2:17 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> You shouldn't be modifying your cluster install. You may at this point
>>>> have conflicting, excess JARs in there somewhere. I'd start it over if you
>>>> can.
>>>>
>>>> On Wed, Apr 7, 2021 at 7:15 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
>>>> wrote:
>>>>
>>>>> Not sure what you mean by "not working". You've added 3.1.1 to
>>>>> --packages, which uses:
>>>>> * 2.6.0 kafka-clients:
>>>>> https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/pom.xml#L136
>>>>> * 2.6.2 commons pool:
>>>>> https://github.com/apache/spark/blob/1d550c4e90275ab418b9161925049239227f3dc9/pom.xml#L183
>>>>>
>>>>> I think it is worth an end-to-end dependency-tree analysis of what is
>>>>> really happening on the cluster...
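>>>>>
>>>>> A rough sketch of such an inventory, covering each place a kafka jar
>>>>> can come from (archive path taken from the spark-defaults entries
>>>>> quoted below):
>>>>>
>>>>>    # jars shipped with the Spark install on each node
>>>>>    ls $SPARK_HOME/jars | grep -iE 'kafka|pool2'
>>>>>    # jars pulled in by --packages
>>>>>    ls ~/.ivy2/jars | grep -iE 'kafka|pool2'
>>>>>    # jars baked into the spark.yarn.archive bundle
>>>>>    hdfs dfs -get hdfs://rhes75:9000/jars/spark-libs.jar /tmp/ &&
>>>>>      jar tf /tmp/spark-libs.jar | grep -iE 'kafka|pool2'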
>>>>>
>>>>> G
>>>>>
>>>>>
>>>>> On Wed, Apr 7, 2021 at 11:11 AM Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hi Gabor et al.,
>>>>>>
>>>>>> To be honest, I am not convinced that the package (--packages
>>>>>> org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1) is really working!
>>>>>>
>>>>>> I know for certain that spark-sql-kafka-0-10_2.12-3.1.0.jar works
>>>>>> fine. I reported the package as working before because under
>>>>>> $SPARK_HOME/jars on all nodes there was a copy of the 3.0.1 jar file.
>>>>>> Also, in $SPARK_HOME/conf we had the following entries:
>>>>>>
>>>>>> spark.yarn.archive=hdfs://rhes75:9000/jars/spark-libs.jar
>>>>>> spark.driver.extraClassPath        $SPARK_HOME/jars/*.jar
>>>>>> spark.executor.extraClassPath      $SPARK_HOME/jars/*.jar
>>>>>>
>>>>>> So the jar file was picked up first anyway.
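>>>>>>
>>>>>> One way to double-check which copy wins is to list every jar on the
>>>>>> candidate paths that actually contains KafkaTokenUtil (a sketch, run
>>>>>> on a node):
>>>>>>
>>>>>>    for j in $SPARK_HOME/jars/*.jar ~/.ivy2/jars/*.jar; do
>>>>>>      unzip -l "$j" 2>/dev/null | grep -q 'kafka010/KafkaTokenUtil' && echo "$j"
>>>>>>    done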
>>>>>>
>>>>>> The concern I have is that the package uses older versions of the jar
>>>>>> files, namely the following in .ivy2/jars:
>>>>>>
>>>>>> -rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
>>>>>> -rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
>>>>>> -rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
>>>>>> -rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
>>>>>> -rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
>>>>>> -rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
>>>>>> -rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
>>>>>> -rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
>>>>>> -rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
>>>>>>
>>>>>>
>>>>>> So I am not sure. Hence I want someone to verify this independently,
>>>>>> in anger.
>>>>>>
>>>>>>
