Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
OK, we found the root cause of this issue. We were writing to Redis from Spark and had downloaded a recently compiled Redis jar built with Scala 2.12, spark-redis_2.12-2.4.1-SNAPSHOT-jar-with-dependencies.jar. It was giving grief. We removed that one. So the job runs with either
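For reference, writes to Redis from Spark typically go through the spark-redis connector mentioned above. A minimal PySpark sketch follows; the package coordinates, host, and column names are illustrative assumptions, not details from the thread, and the artifact must be built for the same Scala version as the cluster (2.12 here):

from pyspark.sql import SparkSession

# Sketch of a DataFrame write through the spark-redis connector
# (https://github.com/RedisLabs/spark-redis). Coordinates below are
# hypothetical; pick the artifact matching your Spark/Scala build.
spark = (
    SparkSession.builder
    .appName("redis-write-sketch")
    .config("spark.jars.packages", "com.redislabs:spark-redis_2.12:2.6.0")
    .config("spark.redis.host", "localhost")
    .config("spark.redis.port", "6379")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

(df.write
   .format("org.apache.spark.sql.redis")  # spark-redis DataFrame source
   .option("table", "users")              # key prefix used in Redis
   .option("key.column", "id")            # column used as the Redis key
   .mode("overwrite")
   .save())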

Re: Spark performance over S3

2021-04-06 Thread Gourav Sengupta
Hi Tzahi, that is a huge cost. So that I can understand the question before answering it: 1. What Spark version are you using? 2. What SQL code are you using to read and write? There are several other pertinent questions, but the above will be a great starting

Spark performance over S3

2021-04-06 Thread Tzahi File
Hi All, We have a Spark cluster on AWS EC2 with 60 x i3.4xlarge instances. The Spark job running on that cluster reads from an S3 bucket and writes back to the same bucket; the bucket and the EC2 instances are in the same region. As part of our efforts to reduce the runtime of our Spark jobs, we found there's serious
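S3 throughput problems of this kind usually come down to the s3a connector and the output committer. A hedged sketch of settings commonly experimented with in this situation; values and paths are illustrative, not from the thread, and the cloud committer classes assume the spark-hadoop-cloud module is on the classpath:

from pyspark.sql import SparkSession

# Sketch: S3 read/write through s3a with the "magic" committer, which
# avoids the slow rename-based commit on S3 object storage.
spark = (
    SparkSession.builder
    .appName("s3-io-sketch")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    # Connection pool sized for many concurrent S3 requests.
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .getOrCreate()
)

df = spark.read.parquet("s3a://my-bucket/input/")          # hypothetical path
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")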

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
Fine. Just to clarify, please: with SBT assembly and Scala I would create an uber jar file and use that with spark-submit. As I understand it (and stand to be corrected), with PySpark one can only run spark-submit in client mode by directly supplying a .py file? Hence spark-submit --master local[4]
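For what it's worth, the PySpark counterpart of the uber-jar workflow is a driver script plus zipped dependencies; a minimal sketch, with file names and the spark-submit line being illustrative:

# app.py -- a self-contained PySpark entry point. Instead of an uber
# jar, the script is handed to spark-submit and extra Python modules
# ride along via --py-files, e.g. (hypothetical invocation):
#
#   spark-submit --master local[4] --py-files deps.zip app.py
#
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("pyspark-submit-sketch").getOrCreate()
    spark.range(10).show()   # trivial action to prove the session works
    spark.stop()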

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Sean Owen
Gabor's point is that these are not libraries you typically install in your cluster itself. You package them with your app.
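A sketch of what "package them with your app" can look like from PySpark, resolving the Kafka integration at submit time rather than copying jars into the cluster; the coordinates follow the thread's Spark 3.1.1 / Scala 2.12 versions:

from pyspark.sql import SparkSession

# Pull spark-sql-kafka (and its transitive dependencies, e.g. the
# token provider jar and commons-pool2) as an app dependency.
spark = (
    SparkSession.builder
    .appName("kafka-packaged-deps")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
    .getOrCreate()
)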

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
Hi G, Thanks for the heads-up. In a thread on the 3rd of March I reported that 3.1.1 works in YARN mode: Spark 3.1.1 Preliminary results (mainly to do with Spark Structured Streaming) (mail-archive.com). From that mail: The needed

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
> Anyway I unzipped the tarball for Spark 3.1.1 and there is no spark-sql-kafka-0-10_2.12-3.0.1.jar even. Please see how a Structured Streaming app with Kafka needs to be deployed here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying I don't see the

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
OK thanks for that. I am using spark-submit with PySpark as follows: spark-submit --version prints the Spark welcome banner (ASCII art omitted), reporting version 3.1.1, Using Scala version 2.12.9, Java HotSpot(TM) 64-Bit

Re: Tuning spark job to make count faster.

2021-04-06 Thread Sean Owen
Hard to say without a lot more info, but 76.5K tasks is a very large number. How big are the tasks / how long do they take? If very short, you should repartition down. Do you end up with 800 executors? If so, why 2 per machine? That is generally a loss at this scale of worker. I'm confused because you have
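"Repartition down" in PySpark terms, as a sketch; the input path and the target of 800 partitions are illustrative, not from the thread:

# coalesce() reduces partition count without a full shuffle, so the
# count() runs as fewer, larger tasks with less scheduling overhead.
df = spark.read.parquet("s3a://my-bucket/data/")   # hypothetical input

print(df.rdd.getNumPartitions())  # e.g. tens of thousands of tiny partitions
print(df.coalesce(800).count())   # narrow dependency, no shuffle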

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Sean Owen
You may be compiling your app against 3.0.1 JARs but submitting to 3.1.1. You do not, in general, modify the Spark libs. You need to package libs like this with your app at the correct version.
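A cheap way to catch this class of mismatch, sketched under the assumption that the same environment can see both the client library and the running cluster:

import pyspark
from pyspark.sql import SparkSession

# Compare the client-side PySpark version with what the cluster runs;
# compiling against 3.0.1 but submitting to 3.1.1 produces exactly the
# kind of NoSuchMethodError reported in this thread.
spark = SparkSession.builder.getOrCreate()
print("client pyspark:", pyspark.__version__)
print("cluster spark :", spark.version)
assert pyspark.__version__ == spark.version, "client/cluster versions differ"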

Dynamic Allocation Backlog Property in Spark on Kubernetes

2021-04-06 Thread Ranju Jain
Hi All, I have enabled dynamic allocation while running Spark on Kubernetes, but new executors are requested only if pending tasks stay backlogged for longer than the duration configured in the property "spark.dynamicAllocation.schedulerBacklogTimeout". My use case is: there are a number of parallel jobs
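For context, these are the dynamic-allocation knobs in play; a sketch with illustrative values (on Kubernetes, shuffle tracking stands in for the external shuffle service):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dyn-alloc-sketch")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    # Request executors once tasks have been pending this long...
    .config("spark.dynamicAllocation.schedulerBacklogTimeout", "10s")
    # ...and keep escalating at this interval while the backlog persists.
    .config("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "10s")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .getOrCreate()
)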

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
Thanks Gabor. All nodes are running Spark /spark-3.1.1-bin-hadoop3.2, so $SPARK_HOME/jars contains all the required jars on all nodes, including commons-pool2-2.9.0.jar. They are installed identically on all nodes. I have looked at the Spark environment for the classpath. Still

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
I've just had a deeper look at the possible issue and here are my findings: * In 3.0.1, KafkaTokenUtil.needTokenUpdate has 3 params. * In 3.1.1, KafkaTokenUtil.needTokenUpdate has 2 params. * I've decompiled spark-token-provider-kafka-0-10_2.12-3.1.1.jar and KafkaTokenUtil.needTokenUpdate has 2 params

Re: jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Gabor Somogyi
Since you've not shared many details, I presume you've updated only the spark-sql-kafka jar. KafkaTokenUtil is in the token provider jar. As a general note, if I'm right, please update Spark as a whole on all nodes and not individual jars independently. BR, G

unsubscribe

2021-04-06 Thread Latha Appanna

Re: Spark Structured Streaming with PySpark throwing error in execution

2021-04-06 Thread Mich Talebzadeh
Hi all, Following the upgrade to 3.1.1, I see a couple of issues. Spark Structured Streaming (SSS) does not seem to work with the newer spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark. It throws java.lang.NoSuchMethodError:
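A minimal sketch of the kind of PySpark Structured Streaming job under discussion: broker address and topic are placeholders, and the packaged Kafka integration must match the running Spark version (3.1.1 here):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sss-kafka-sketch")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1")
    .getOrCreate()
)

# Read a Kafka topic as a stream and echo it to the console.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "my_topic")                   # placeholder
    .load()
)

(df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
   .writeStream
   .format("console")
   .start()
   .awaitTermination())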

jar incompatibility with Spark 3.1.1 for structured streaming with kafka

2021-04-06 Thread Mich Talebzadeh
Hi, Any chance of someone testing the latest spark-sql-kafka-0-10_2.12-3.1.1.jar for Spark? It throws java.lang.NoSuchMethodError: org.apache.spark.kafka010.KafkaTokenUtil$.needTokenUpdate(Ljava/util/Map;Lscala/Option;)Z. However, the previous version, spark-sql-kafka-0-10_2.12-3.0.1.jar

Re: Ordering pushdown for Spark Datasources

2021-04-06 Thread Mich Talebzadeh
Lucene, I came across it years ago. Does Lucene support a JDBC connection at all? How about Solr? HTH

Tuning spark job to make count faster.

2021-04-06 Thread Krishna Chakka
Hi, I am working on a Spark job. The job takes 10 mins just for the count() function. The question is: how can I make it faster? From the above image, what I understood is that 4001 tasks are running in parallel. Total tasks are 76,553. Here are the parameters that I am using for
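Quick arithmetic on the numbers quoted, which is why the replies above focus on task count:

import math

# 76,553 total tasks over 4,001 concurrent slots means the job runs in
# roughly 20 scheduling "waves", paying per-task overhead each wave.
total_tasks, parallel_slots = 76_553, 4_001
print(math.ceil(total_tasks / parallel_slots))  # 20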

Re: Ordering pushdown for Spark Datasources

2021-04-06 Thread Kohki Nishio
The log data is stored in Lucene and I have a custom data source to access it. For example, if the condition is log-level = INFO, it brings in a couple of million records per partition, and there are hundreds of partitions involved in a query. Spark has to go through all the entries to show the
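The query shape at issue, sketched with a placeholder source name and columns; Spark still scans every partition because the ordering is not pushed down to the source:

# filter() may be pushed down to the Lucene reader, but orderBy()/limit()
# are evaluated by Spark, so every matching record must be fetched first.
df = spark.read.format("my.lucene.datasource").load()  # hypothetical source

(df.filter(df["log_level"] == "INFO")
   .orderBy(df["timestamp"].desc())
   .limit(20)
   .show())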