Spark streaming giving error for version 2.4

2021-03-15 Thread Renu Yadav
Hi Team, I have upgraded my Spark Streaming job from 2.2 to 2.4 (spark-streaming-kafka-0-10_2.11, version 2.4.0, Scala 2.11) but am getting the error below. Any idea? Exception in thread "main" java.lang.AbstractMethodError at org.apache.spark.util.ListenerBus$class.$init$(ListenerBus.scala:34)
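An AbstractMethodError like this usually means mixed Spark versions on the classpath (for example, a 2.2-era jar alongside 2.4 core). A minimal build.sbt sketch, assuming sbt and the artifacts named above, that pins every Spark module to the same version and Scala binary:

  // build.sbt (sketch): keep all Spark modules on one version and Scala binary
  scalaVersion := "2.11.12"

  val sparkVersion = "2.4.0"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"                 % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming"            % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
  )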

How default partitioning in spark is deployed

2021-03-15 Thread Renganathan Mutthiah
Hi, I have a question with respect to default partitioning in RDD.

  case class Animal(id: Int, name: String)
  val myRDD = session.sparkContext.parallelize(Array(
    Animal(1, "Lion"), Animal(2, "Elephant"), Animal(3, "Jaguar"),
    Animal(4, "Tiger"), Animal(5, "Chetah")))

Console println
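To see where those five records land, one option (a minimal sketch reusing the RDD above) is getNumPartitions plus glom:

  // defaults to spark.default.parallelism (typically the number of available cores)
  println(myRDD.getNumPartitions)

  // glom() turns each partition into an array so its contents can be inspected
  myRDD.glom().collect().zipWithIndex.foreach { case (part, i) =>
    println(s"partition $i -> ${part.mkString(", ")}")
  }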

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Stephen Coy
Hi there, At the risk of stating the obvious, the first step is to ensure that your Spark application and S3 bucket are colocated in the same AWS region. Steve C On 16 Mar 2021, at 3:31 am, Alchemist <alchemistsrivast...@gmail.com> wrote: How to optimize s3 list S3 file using
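If the job uses the s3a connector, it can also help to point it explicitly at the bucket's regional endpoint; a hedged sketch (the endpoint below is a placeholder for the bucket's actual region):

  // assumes hadoop-aws / s3a; substitute the bucket's real region endpoint
  spark.sparkContext.hadoopConfiguration
    .set("fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")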

Re: Spark Structured Streaming and Kafka message schema evolution

2021-03-15 Thread Jungtaek Lim
If I understand correctly, SQL semantics are strict on column schema. Reading via Kafka data source doesn't require you to specify the schema as it provides the key and value as binary, but once you deserialize them, unless you keep the type as primitive (e.g. String), you'll need to specify the
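A minimal sketch of that "deserialize, then apply an explicit schema" step for JSON values (broker, topic, and field names here are made up for illustration):

  import org.apache.spark.sql.functions.{col, from_json}
  import org.apache.spark.sql.types._

  // hypothetical value schema; this is what has to evolve in lockstep with producers
  val valueSchema = new StructType()
    .add("id", IntegerType)
    .add("name", StringType)

  val parsed = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "my-topic")
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .select(from_json(col("json"), valueSchema).as("data"))
    .select("data.*")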

Re: [k8s] PersistentVolumeClaim support in 3.1.1 on minikube

2021-03-15 Thread Jacek Laskowski
Hi, I think I found it. I should be using the OnDemand claim name so it gets replaced to be unique per executor (?) Regards, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on https://twitter.com/jaceklaskowski
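For reference, a sketch of that per-executor, on-demand PVC configuration in 3.1.1 (the volume name "data", storage class, size, and mount path are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    // "OnDemand" makes Spark create a fresh, uniquely named PVC per executor
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.claimName", "OnDemand")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.storageClass", "standard")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.options.sizeLimit", "1Gi")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.path", "/data")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.data.mount.readOnly", "false")
    .getOrCreate()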

Re: Spark Structured Streaming from GCS files

2021-03-15 Thread Gowrishankar Sunder
Our online services running in GCP collect data from our clients and write it to GCS under time-partitioned folders like /mm/dd/hh/mm (current-time) or similar ones. We need these files to be processed in real-time from Spark. As for the runtime, we plan to run it either on Dataproc or K8s. -
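A minimal sketch of a file-source stream over such folders (bucket, glob, schema, and format are assumptions; file sources need an explicit schema):

  import org.apache.spark.sql.types._

  // hypothetical record schema for the files landing in GCS
  val eventSchema = new StructType()
    .add("ts", TimestampType)
    .add("payload", StringType)

  val events = spark.readStream
    .schema(eventSchema)
    .json("gs://my-bucket/*/*/*/*/*")   // placeholder glob over the time-partitioned folders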

[k8s] PersistentVolumeClaim support in 3.1.1 on minikube

2021-03-15 Thread Jacek Laskowski
Hi, I've been toying with persistent volumes using Spark 3.1.1 on minikube and am wondering whether it's a supported platform. I'd not be surprised if not given all the surprises I've been experiencing lately. Can I use spark-shell or any Spark app in client mode with PVCs with the default 2

Re: Spark Structured Streaming from GCS files

2021-03-15 Thread Mich Talebzadeh
Hi, I looked at the stackoverflow reference. The first question that comes to my mind is how you are populating these gcs buckets? Are you shifting data from on-prem and landing them in the buckets and creating a new folder at the given interval? Where will you be running your Spark Structured

Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Ben Kaylor
Not sure of the answer to this, but I am solving similar issues, so I'm looking for additional feedback on how to do it. My thought is: if it can't be done via Spark and S3 boto commands, then have the apps self-report those changes, where instead of having just mappers discovering the keys, you have services self

Spark Structured Streaming from GCS files

2021-03-15 Thread Gowrishankar Sunder
Hi, We have a use case to stream files from GCS time-partitioned folders and perform structured streaming queries on top of them. I have detailed the use cases and requirements in this Stackoverflow question

How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Alchemist
How to optimize S3 file listing when using wholeTextFile(): We are using wholeTextFile to read data from S3. As per my understanding, wholeTextFile first lists the files under the given path. Since we are using S3 as the input source, listing files in a bucket is single-threaded; the S3 API for listing the
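For context, the call pattern in question, as a sketch (bucket and prefix are placeholders); minPartitions only influences how the content is split into tasks, not how the listing itself is performed:

  // wholeTextFiles returns (path, content) pairs; the input paths are listed
  // on the driver before any tasks run
  val files = sc.wholeTextFiles("s3a://my-bucket/input/", minPartitions = 200)
  println(files.count())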

Spark Structured Streaming and Kafka message schema evolution

2021-03-15 Thread Mich Talebzadeh
This is just a query. In general, Kafka Connect requires a means to register the schema so that producers and consumers understand it. It also allows schema evolution, i.e. changes to the metadata that identifies the structure of the data sent via a topic. When we stream a kafka topic into (Spark

Re: DB Config data update across multiple Spark Streaming Jobs

2021-03-15 Thread forece85
Any suggestions on this? How to update configuration data on all executors without downtime?

Re: spark on k8s driver pod exception

2021-03-15 Thread Attila Zsolt Piros
Sure, that is expected; see the "How it works" section of the "Running Spark on Kubernetes" page, quote: When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and

Re: spark on k8s driver pod exception

2021-03-15 Thread 040840219
When the driver pod throws an exception, is the driver pod still running? kubectl logs wordcount-e3141c7834d3dd68-driver 21/03/15 07:40:19 DEBUG Analyzer$ResolveReferences: Resolving 'value1 to 'value1 Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '`value1`'