Gabor's point is that these are not libraries you typically install in your cluster itself. You package them with your app.
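For example, with the 3.1.1 runtime discussed in this thread, the deployment described in the integration guide Gabor linked comes down to something like this (a sketch; xyz.py stands in for the actual application, and the package version must match the Spark version you submit to):

spark-submit --master yarn --deploy-mode client \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
  xyz.py

--packages resolves spark-sql-kafka-0-10 together with its transitive dependencies (the token provider, kafka-clients, commons-pool2 and so on) at mutually consistent versions, so nothing needs to be copied into $SPARK_HOME/jars.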
On Tue, Apr 6, 2021 at 11:35 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi G,
>
> Thanks for the heads-up.
>
> In a thread on 3rd of March I reported that 3.1.1 works in yarn mode:
>
> Spark 3.1.1 Preliminary results (mainly to do with Spark Structured
> Streaming) (mail-archive.com)
> <https://www.mail-archive.com/user@spark.apache.org/msg75979.html>
>
> From that mail:
>
> The jar files needed for version 3.1.1 to read from Kafka and write to
> BigQuery are as follows, all under $SPARK_HOME/jars on all nodes. These
> are the latest available jar files:
>
>    - commons-pool2-2.9.0.jar
>    - spark-token-provider-kafka-0-10_2.12-3.1.0.jar
>    - spark-sql-kafka-0-10_2.12-3.1.0.jar
>    - kafka-clients-2.7.0.jar
>    - spark-bigquery-latest_2.12.jar
>
> I just tested it, and in local mode (single JVM) it works fine without
> adding --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1, BUT
> with all the above jar files included:
>
> -------------------------------------------
> Batch: 17
> -------------------------------------------
> +--------------------+------+-------------------+------+
> |              rowkey|ticker|         timeissued| price|
> +--------------------+------+-------------------+------+
> |54651f0d-1be0-4d7...|   IBM|2021-04-06 17:17:04| 91.92|
> |8aa1ad79-4792-466...|   SAP|2021-04-06 17:17:04| 34.93|
> |8567f327-cfec-43d...|  TSCO|2021-04-06 17:17:04| 324.5|
> |138a1278-2f54-45b...|   VOD|2021-04-06 17:17:04| 241.4|
> |e02793c3-8e78-47e...|  ORCL|2021-04-06 17:17:04|  17.6|
> |0ab456fb-bd22-465...|  SBRY|2021-04-06 17:17:04|350.45|
> |74588e92-a3e2-48c...|  MSFT|2021-04-06 17:17:04| 44.58|
> |1e7203c6-6938-4ea...|    BP|2021-04-06 17:17:04| 588.0|
> |1e55021a-148d-4aa...|   MRW|2021-04-06 17:17:04|171.21|
> |229ad6f9-e4ed-475...|   MKS|2021-04-06 17:17:04|439.17|
> +--------------------+------+-------------------+------+
>
> However, if I exclude the jar file spark-sql-kafka-0-10_2.12-3.1.0.jar
> and include the packages as suggested in the link:
>
> spark-submit --master local[4] \
>   --conf spark.pyspark.virtualenv.enabled=true \
>   --conf spark.pyspark.virtualenv.type=native \
>   --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt \
>   --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv \
>   --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 \
>   --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 \
>   xyz.py
>
> it cannot fetch the data:
>
> root
>  |-- parsed_value: struct (nullable = true)
>  |    |-- rowkey: string (nullable = true)
>  |    |-- ticker: string (nullable = true)
>  |    |-- timeissued: timestamp (nullable = true)
>  |    |-- price: float (nullable = true)
>
> {'message': 'Initializing sources', 'isDataAvailable': False, 'isTriggerActive': False}
> -------------------------------------------
> Batch: 0
> -------------------------------------------
> +------+------+----------+-----+
> |rowkey|ticker|timeissued|price|
> +------+------+----------+-----+
> +------+------+----------+-----+
>
> 2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
> java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>   at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.getOrRetrieveConsumer(KafkaDataConsumer.scala:549)
>   at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.$anonfun$get$1(KafkaDataConsumer.scala:291)
>   at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:604)
>   at org.apache.spark.sql.kafka010.consumer.KafkaDataConsumer.get(KafkaDataConsumer.scala:287)
>   at org.apache.spark.sql.kafka010.KafkaBatchPartitionReader.next(KafkaBatchPartitionReader.scala:63)
>   at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:79)
>   at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:112)
>   at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
>   at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:413)
>   at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1473)
>   at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:452)
>   at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:360)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:131)
>   at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 2021-04-06 17:20:11,492 ERROR util.Utils: Aborting task
> java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>
> Now I deleted the ~/.ivy2 directory and ran the job again:
>
> Ivy Default Cache set to: /home/hduser/.ivy2/cache
> The jars for the packages stored in: /home/hduser/.ivy2/jars
> org.apache.spark#spark-sql-kafka-0-10_2.12 added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent-2bab6bd2-3136-4783-b044-810f0800ef0e;1.0
>
> Let us go and have a look at the directory .ivy2/jars:
>
> /home/hduser/.ivy2/jars> ltr
> total 13108
> -rw-r--r-- 1 hduser hadoop    2777 Oct 22  2014 org.spark-project.spark_unused-1.0.0.jar
> -rw-r--r-- 1 hduser hadoop  129174 Apr  6  2019 org.apache.commons_commons-pool2-2.6.2.jar
> -rw-r--r-- 1 hduser hadoop   41472 Dec 16  2019 org.slf4j_slf4j-api-1.7.30.jar
> -rw-r--r-- 1 hduser hadoop  649950 Jan 18  2020 org.lz4_lz4-java-1.7.1.jar
> -rw-r--r-- 1 hduser hadoop 3754508 Jul 28  2020 org.apache.kafka_kafka-clients-2.6.0.jar
> -rw-r--r-- 1 hduser hadoop 1969177 Nov 28 18:10 org.xerial.snappy_snappy-java-1.1.8.2.jar
> -rw-r--r-- 1 hduser hadoop 6407352 Dec 19 13:14 com.github.luben_zstd-jni-1.4.8-1.jar
> -rw-r--r-- 1 hduser hadoop  387494 Feb 22 03:57 org.apache.spark_spark-sql-kafka-0-10_2.12-3.1.1.jar
> -rw-r--r-- 1 hduser hadoop   55766 Feb 22 03:58 org.apache.spark_spark-token-provider-kafka-0-10_2.12-3.1.1.jar
> drwxr-xr-x 4 hduser hadoop    4096 Apr  6 17:25 ..
> drwxr-xr-x 2 hduser hadoop    4096 Apr  6 17:25 .
>
> Strangely, jar files like org.apache.kafka_kafka-clients-2.6.0.jar and
> org.apache.commons_commons-pool2-2.6.2.jar seem to be out of date.
>
> Very confusing. It sounds as if something has changed in the cluster: as
> reported on 3rd of March it used to work with those jar files, and now it
> does not.
>
> So in summary, *without those jar files added to $SPARK_HOME/jars* it
> fails totally, even with the packages added.
>
> Cheers
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Tue, 6 Apr 2021 at 15:44, Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>
>> > Anyway I unzipped the tarball for Spark-3.1.1 and there is
>> > no spark-sql-kafka-0-10_2.12-3.0.1.jar even
>>
>> Please see how a Structured Streaming app with Kafka needs to be deployed
>> here:
>> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying
>> I don't see the --packages option...
>>
>> G
>>
>> On Tue, Apr 6, 2021 at 2:40 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> OK, thanks for that.
>>>
>>> I am using spark-submit with PySpark as follows:
>>>
>>> spark-submit --version
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
>>>       /_/
>>>
>>> Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_201
>>> Branch HEAD
>>> Compiled by user ubuntu on 2021-02-22T01:33:19Z
>>>
>>> spark-submit --master yarn --deploy-mode client \
>>>   --conf spark.pyspark.virtualenv.enabled=true \
>>>   --conf spark.pyspark.virtualenv.type=native \
>>>   --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt \
>>>   --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv \
>>>   --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 \
>>>   --driver-memory 16G --executor-memory 8G --num-executors 4 --executor-cores 2 \
>>>   xyz.py
>>>
>>> enabling it with the virtual environment.
>>>
>>> That works fine in client mode with any job that does not do Structured
>>> Streaming.
>>>
>>> Running on the local node with:
>>>
>>> spark-submit --master local[4] \
>>>   --conf spark.pyspark.virtualenv.enabled=true \
>>>   --conf spark.pyspark.virtualenv.type=native \
>>>   --conf spark.pyspark.virtualenv.requirements=/home/hduser/dba/bin/python/requirements.txt \
>>>   --conf spark.pyspark.virtualenv.bin.path=/usr/src/Python-3.7.3/airflow_virtualenv \
>>>   --conf spark.pyspark.python=/usr/src/Python-3.7.3/airflow_virtualenv/bin/python3 \
>>>   xyz.py
>>>
>>> works fine with the same Spark version and $SPARK_HOME/jars.
>>>
>>> Cheers
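For reference, the shape of the streaming job being submitted is roughly the following (a minimal PySpark sketch reconstructed from the parsed_value schema printed above; the broker address, topic name and app name are placeholders, not details taken from this thread):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (FloatType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("md_streaming").getOrCreate()

# Schema matching the printSchema() output quoted earlier in the thread
schema = StructType([
    StructField("rowkey", StringType(), True),
    StructField("ticker", StringType(), True),
    StructField("timeissued", TimestampType(), True),
    StructField("price", FloatType(), True),
])

# format("kafka") is what pulls in spark-sql-kafka at runtime, which in
# turn needs the token provider, kafka-clients and commons-pool2 jars
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder
      .option("subscribe", "md")                          # placeholder topic
      .load()
      .select(from_json(col("value").cast("string"), schema)
              .alias("parsed_value")))

(df.select("parsed_value.*")
   .writeStream
   .outputMode("append")
   .format("console")
   .start()
   .awaitTermination())

The point at issue in the thread is that every jar this pulls in must come from one consistent Spark release, whether it arrives via $SPARK_HOME/jars or via --packages.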
>>> On Tue, 6 Apr 2021 at 13:20, Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> You may be compiling your app against 3.0.1 JARs but submitting to
>>>> 3.1.1. You do not in general modify the Spark libs. You need to package
>>>> libs like this with your app at the correct version.
>>>>
>>>> On Tue, Apr 6, 2021 at 6:42 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Thanks Gabor.
>>>>>
>>>>> All nodes are running Spark spark-3.1.1-bin-hadoop3.2, so
>>>>> $SPARK_HOME/jars contains all the required jars on all nodes, including
>>>>> commons-pool2-2.9.0.jar. They are installed identically on all nodes.
>>>>>
>>>>> I have looked at the Spark environment for the classpath. Still I don't
>>>>> see why Spark 3.1.1 fails with spark-sql-kafka-0-10_2.12-3.1.1.jar but
>>>>> works OK with spark-sql-kafka-0-10_2.12-3.1.0.jar.
>>>>>
>>>>> Anyway, I unzipped the tarball for Spark 3.1.1 and there is no
>>>>> spark-sql-kafka-0-10_2.12-3.0.1.jar even.
>>>>>
>>>>> I had to add spark-sql-kafka-0-10_2.12-3.0.1.jar to make it work. Then
>>>>> I checked Maven for the availability of a newer version, which pointed
>>>>> to spark-sql-kafka-0-10_2.12-3.1.1.jar.
>>>>>
>>>>> So to confirm, Spark out of the tarball does not have any:
>>>>>
>>>>> ltr spark-sql-kafka-*
>>>>> ls: cannot access spark-sql-kafka-*: No such file or directory
>>>>>
>>>>> For SSS (Spark Structured Streaming), I had to add these:
>>>>>
>>>>>    - commons-pool2-2.9.0.jar (the one shipped is commons-pool-1.5.4.jar!)
>>>>>    - kafka-clients-2.7.0.jar (did not have any)
>>>>>    - spark-sql-kafka-0-10_2.12-3.0.1.jar (did not have any)
>>>>>
>>>>> I gather from your second mail that there seems to be an issue with
>>>>> spark-sql-kafka-0-10_2.12-3.1.1.jar?
>>>>>
>>>>> HTH
>>>>>
>>>>> On Tue, 6 Apr 2021 at 11:54, Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
>>>>>
>>>>>> Since you've not shared too many details, I presume you've updated the
>>>>>> spark-sql-kafka jar only. KafkaTokenUtil is in the token provider jar.
>>>>>>
>>>>>> As a general note, if I'm right, please update Spark as a whole on all
>>>>>> nodes and not just jars independently.
>>>>>>
>>>>>> BR,
>>>>>> G
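A quick way to check Gabor's point, i.e. whether the connector and the token provider jar on the classpath come from the same release (a sketch, assuming the default ivy cache location shown earlier in the thread):

ls $SPARK_HOME/jars ~/.ivy2/jars | grep -E 'kafka|commons-pool2'

If spark-sql-kafka-0-10 and spark-token-provider-kafka-0-10 turn up with different Spark versions across the two locations, whichever copy is loaded first wins, and that mix is exactly the situation described in the original question below.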
>>>>>> On Tue, Apr 6, 2021 at 10:21 AM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Any chance of someone testing the latest
>>>>>>> spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark? It throws:
>>>>>>>
>>>>>>> java.lang.NoSuchMethodError:
>>>>>>> org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z
>>>>>>>
>>>>>>> However, the previous version spark-sql-kafka-0-10_2.12-3.0.1.jar
>>>>>>> works fine.
>>>>>>>
>>>>>>> Thanks
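Pulling the thread together: a plausible reading of the failure is that the 3.1.1 connector calls KafkaTokenUtil.needTokenUpdate(Map, Option), a method that lives in spark-token-provider-kafka-0-10, while the spark-token-provider-kafka-0-10_2.12-3.1.0.jar sitting in $SPARK_HOME/jars exposes a different signature and shadows the 3.1.1 copy that --packages fetched into the ivy cache; hence the JVM's NoSuchMethodError. The signatures can be compared directly (a sketch using javap; run it once per token-provider jar found on the classpath):

javap -cp spark-token-provider-kafka-0-10_2.12-3.1.1.jar 'org.apache.spark.kafka010.KafkaTokenUtil$' | grep needTokenUpdate

If the 3.1.0 and 3.1.1 jars print different needTokenUpdate signatures, that confirms the mismatch, and the remedy is the one given at the top of the thread: ship both kafka-0-10 artifacts (and their dependencies) with the app at the version of the Spark runtime, or upgrade Spark as a whole, rather than swapping individual jars under $SPARK_HOME/jars.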