using facebook Prophet + pyspark for forecasting - Dataframe has less than 2 non-NaN rows

2023-09-29 Thread karan alang
Hello - Anyone used Prophet + pyspark for forecasting ? I'm trying to backfill forecasts, and running into issues (error - Dataframe has less than 2 non-NaN rows). I'm removing all records with NaN values, yet getting this error. Details are in the stackoverflow link ->
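A minimal sketch of the guard this error calls for, assuming Prophet's conventional `ds`/`y` column names and a per-group fit (e.g. via `applyInPandas`); the empty-frame fallback and the lazy `prophet` import are illustrative assumptions, not the poster's code:

```python
import pandas as pd

def enough_rows_for_prophet(pdf: pd.DataFrame, value_col: str = "y") -> bool:
    # Prophet refuses to fit with fewer than 2 non-NaN rows; check per group,
    # since a group can be empty even after a global dropna
    return bool(pdf[value_col].notna().sum() >= 2)

def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
    if not enough_rows_for_prophet(pdf):
        # return an empty frame for this group instead of letting Prophet raise
        return pd.DataFrame(columns=["ds", "yhat"])
    from prophet import Prophet  # imported lazily; only needed when fitting
    m = Prophet()
    m.fit(pdf[["ds", "y"]].dropna())
    future = m.make_future_dataframe(periods=1)
    return m.predict(future)[["ds", "yhat"]]
```

The key point is that the <2-rows check has to happen inside the per-group function, because the backfill groups are fitted independently.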

Apache Spark with watermark - processing data different LogTypes in same kafka topic

2023-06-24 Thread karan alang
Hello All - I'm using Apache Spark Structured Streaming to read data from Kafka topic, and do some processing. I'm using watermark to account for late-coming records and the code works fine. Here is the working (sample) code: ``` from pyspark.sql import SparkSession from pyspark.sql.functions
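One way to handle several LogTypes arriving on the same topic: read the stream once, then fan out per type so each branch carries its own watermark state. Broker, topic, and column names below are assumptions for the sketch:

```python
# single Kafka source, shared by all LogType branches
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "broker:9092",  # assumed broker
    "subscribe": "logs",                       # assumed topic name
    "startingOffsets": "latest",
}

def branch_by_log_type(df, log_type):
    # filter first, then watermark, so each LogType keeps independent
    # event-time state for its own aggregations
    from pyspark.sql import functions as F
    return (df.filter(F.col("logType") == log_type)
              .withWatermark("eventTime", "10 minutes"))
```

Each branch returned by `branch_by_log_type` can then be aggregated and started as its own streaming query.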

Apache Spark not reading UTC timestamp from MongoDB correctly

2023-06-08 Thread karan alang
ref : https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly Hello All, I've data stored in MongoDB collection and the timestamp column is not being read by Apache Spark correctly. I'm running Apache Spark on GCP Dataproc. Here is sample data :
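A common fix for apparent timestamp shifts, sketched here as an assumption about the setup: pin the Spark session time zone to UTC, since MongoDB stores BSON dates as UTC instants and Spark renders timestamps in the session zone by default:

```python
# standard Spark setting; forces timestamps to display/convert as UTC
TZ_CONF = {
    "spark.sql.session.timeZone": "UTC",
}

def utc_session(app_name: str = "mongo-read"):
    # builds a session that will not re-render BSON UTC instants
    # in the cluster's local zone
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName(app_name)
    for key, value in TZ_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

If the values look shifted by exactly the cluster's UTC offset, the session time zone is usually the culprit rather than the connector.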

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-17 Thread karan alang
e or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Thu, 16 Mar 2023

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-16 Thread karan alang
Fyi .. apache spark version is 3.1.3 On Wed, Mar 15, 2023 at 4:34 PM karan alang wrote: > Hi Mich, this doesn't seem to be working for me .. the watermark seems to > be getting ignored ! > > Here is the data pu

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-15 Thread karan alang
ndow.start").alias("startOfWindowFrame") \ > , F.col("window.end").alias("endOfWindowFrame") \ > , > F.col("avg(temperature)").alias("AVGTemperature")) > > re

Re: Spark StructuredStreaming - watermark not working as expected

2023-03-10 Thread karan alang
perty which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Fri, 10 Mar 2023 at 07:16, karan alang wrote:

Spark StructuredStreaming - watermark not working as expected

2023-03-09 Thread karan alang
Hello All - I've a structured Streaming job which has a trigger of 10 minutes, and I'm using watermark to account for late data coming in. However, the watermark is not working - and instead of a single record with total aggregated value, I see 2 records. Here is the code : ``` 1)
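Seeing two rows instead of one final aggregate is usually an output-mode issue: `update` mode emits the partial aggregate on every trigger, while `append` emits a single final row per window once the watermark passes. A sketch of the aggregation using the column aliases from this thread (trigger and sink wiring assumed):

```python
def windowed_avg(df):
    # 10-minute tumbling window with a 10-minute watermark; with the query
    # started in "append" output mode, each window emits exactly one row
    from pyspark.sql import functions as F
    return (df.withWatermark("timestamp", "10 minutes")
              .groupBy(F.window("timestamp", "10 minutes"))
              .agg(F.avg("temperature").alias("AVGTemperature"))
              .select(F.col("window.start").alias("startOfWindowFrame"),
                      F.col("window.end").alias("endOfWindowFrame"),
                      "AVGTemperature"))
```

The query would then be started with `.outputMode("append")`; in `update` or `complete` mode the duplicate-looking partial rows are expected behavior, not a watermark failure.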

Re: Online classes for spark topics

2023-03-08 Thread karan alang
+1 .. I'm happy to be part of these discussions as well ! On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh wrote: > Hi, > > I guess I can schedule this work over a course of time. I for myself can > contribute plus learn from others. > > So +1 for me. > > Let us see if anyone else is

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-15 Thread karan alang
adeh-ph-d-5205b2/> >> >> >> https://en.everybodywiki.com/Mich_Talebzadeh >> >> >> >> *Disclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which

Re: Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-14 Thread karan alang
cases, you don’t need to set this configuration. > Sent from my iPhone > > On Feb 14, 2023, at 5:43 AM, karan alang wrote: > >  > Hello All, > > I'm trying to run a simple application on GKE (Kubernetes), and it is > failing: > Note : I have spark(bitnami spark c

Running Spark on Kubernetes (GKE) - failing on spark-submit

2023-02-13 Thread karan alang
Hello All, I'm trying to run a simple application on GKE (Kubernetes), and it is failing: Note : I have spark(bitnami spark chart) installed on GKE using helm install Here is what is done : 1. created a docker image using Dockerfile Dockerfile : ``` FROM python:3.7-slim RUN apt-get update &&

Re: Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang
here is the stackoverflow link https://stackoverflow.com/questions/73780259/spark-structured-streaming-stderr-getting-filled-up On Mon, Sep 19, 2022 at 4:41 PM karan alang wrote: > I've created a stackoverflow ticket for this as well > > On Mon, Sep 19, 2022 at 4:37 PM karan ala

Re: Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang
I've created a stackoverflow ticket for this as well On Mon, Sep 19, 2022 at 4:37 PM karan alang wrote: > Hello All, > I've a Spark Structured Streaming job on GCP Dataproc - which picks up > data from Kafka, does processing and pushes data back into kafka topics. > > Couple of

Spark Structured Streaming - stderr getting filled up

2022-09-19 Thread karan alang
Hello All, I've a Spark Structured Streaming job on GCP Dataproc - which picks up data from Kafka, does processing and pushes data back into kafka topics. Couple of questions : 1. Does Spark put all the logs (incl. INFO, WARN etc.) into stderr ? What I notice is that stdout is empty, while all the

Spark Structured Streaming - unable to change max.poll.records (showing as 1)

2022-09-07 Thread karan alang
Hello All, i've a Spark structured streaming job which reads from Kafka, does processing and puts data into Mongo/Kafka/GCP Buckets (i.e. it is processing heavy) I'm consistently seeing the following warnings: ``` 22/09/06 16:55:03 INFO
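Spark's structured-streaming Kafka source manages its own consumer, so setting `max.poll.records` through the reader options is typically ignored; the supported throttle on the Spark side is `maxOffsetsPerTrigger`. A sketch with placeholder broker/topic names:

```python
READ_OPTIONS = {
    "kafka.bootstrap.servers": "broker:9092",  # assumed broker
    "subscribe": "events",                     # assumed topic
    # caps how many offsets each micro-batch consumes, which is the
    # intended way to bound per-trigger load in Structured Streaming
    "maxOffsetsPerTrigger": "10000",
}

def kafka_stream(spark):
    # applies the options to a streaming Kafka reader
    reader = spark.readStream.format("kafka")
    for key, value in READ_OPTIONS.items():
        reader = reader.option(key, value)
    return reader.load()
```

For a processing-heavy job, lowering `maxOffsetsPerTrigger` bounds the work per batch far more reliably than consumer-level poll tuning.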

Structured Streaming - data not being read (offsets not getting committed ?)

2022-08-26 Thread karan alang
Hello All, i've a long-running Apache Spark structured streaming job running in GCP Dataproc, which reads data from Kafka every 10 mins, and does some processing. Kafka topic has 3 partitions, and a retention period of 3 days. The issue i'm facing is that after few hours, the program stops
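With a 3-day retention, a query that stalls and later resumes can find its checkpointed offsets already aged out of the topic, which makes the source stop silently or fail. A sketch of the sink side, with an assumed checkpoint path; `failOnDataLoss=false` trades an exception for skipping expired offsets:

```python
def start_query(df, checkpoint_path: str = "gs://bucket/checkpoints/ingest"):
    # Structured Streaming tracks progress in the checkpoint, not in Kafka's
    # consumer-group offsets; an intact checkpoint resumes from the last
    # committed micro-batch on restart
    return (df.writeStream
              .option("checkpointLocation", checkpoint_path)
              .option("failOnDataLoss", "false")  # tolerate retention-expired offsets
              .format("console")
              .start())
```

When a long-running job "stops reading", comparing the checkpoint's latest committed offsets against the topic's earliest available offsets is the first diagnostic step.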

StructuredStreaming - read from Kafka, writing data into Mongo every 10 minutes

2022-06-22 Thread karan alang
Hello All, I have data in Kafka topic(data published every 10 mins) and I'm planning to read this data using Apache Spark Structured Stream(batch mode) and push it in MongoDB. Pls note : This will be scheduled using Composer/Airflow on GCP - which will create a Dataproc cluster, run the spark

Spark Structured streaming(batch mode) - running dependent jobs concurrently

2022-06-15 Thread karan alang
Hello All, I've a Structured Streaming program running on GCP dataproc which reads data from Kafka every 10 mins, and then does processing. This is a multi-tenant system i.e. the program will read data from multiple customers. In my current code, i'm looping over the customers passing it to the
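One way to avoid the sequential per-customer loop: submit each customer's work from its own thread, since actions launched from separate threads against the same SparkSession run as concurrent jobs. `process_one` here is a stand-in for the existing per-customer logic:

```python
from concurrent.futures import ThreadPoolExecutor

def run_for_customers(process_one, customers, max_workers: int = 4):
    # each thread triggers its own Spark jobs; the scheduler interleaves
    # them, so independent customers no longer wait on each other
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_one, c) for c in customers]
        return [f.result() for f in futures]
```

Pairing this with Spark's fair scheduler pools (one pool per thread) keeps a slow customer from starving the others.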

GCP Dataproc - adding multiple packages(kafka, mongodb) while submitting spark jobs not working

2022-05-24 Thread karan alang
Hello All, I've a Structured Streaming job on GCP Dataproc, and i'm trying to pass multiple packages (kafka, mongoDB) to the dataproc submit command, and that is not working. Command that is working (when i add single dependency eg. Kafka) : ``` gcloud dataproc jobs submit pyspark main.py \
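For multiple dependencies, `spark.jars.packages` (the config behind `--packages`) takes a single comma-separated list of Maven coordinates; with `gcloud dataproc jobs submit`, the same value can be passed via `--properties`. The coordinates and versions below are assumptions for illustration:

```python
# one comma-separated string, not repeated flags
PACKAGES = ",".join([
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.3",      # assumed version
    "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1",    # assumed version
])

def session_with_packages(app_name: str = "multi-pkg"):
    # resolves both connectors at session startup
    from pyspark.sql import SparkSession
    return (SparkSession.builder.appName(app_name)
            .config("spark.jars.packages", PACKAGES)
            .getOrCreate())
```

A frequent failure mode is passing two separate `--packages` flags, in which case the second silently overrides the first.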

Re: StructuredStreaming - processing data based on new events in Kafka topic

2022-03-13 Thread karan alang
ently) One idea is to store the maxOffset read from each batch read from topic_reloadpred, and when the next batch is read - compare that with 'stored' maxOffset, to determine if new records have been added to the topic. What is the best way to fulfill this requirement ? regds, Karan Alang On
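The idea above - keep the last seen max offset and only reprocess when it advances - can be sketched in plain Python; the in-memory store is an assumption, and in practice it would live in a checkpoint, file, or database so it survives restarts:

```python
# assumed topic name, taken from the thread
_last_max_offset = {"topic_reloadpred": -1}

def has_new_records(batch_offsets, topic: str = "topic_reloadpred") -> bool:
    # compare this batch's max offset with the stored one; only report
    # "new" when the reference topic actually received new events
    current = max(batch_offsets) if batch_offsets else _last_max_offset[topic]
    is_new = current > _last_max_offset[topic]
    _last_max_offset[topic] = max(_last_max_offset[topic], current)
    return is_new
```

Inside `foreachBatch`, this check would gate the reload logic so unchanged reference data is not reprocessed every trigger.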

StructuredStreaming - processing data based on new events in Kafka topic

2022-03-11 Thread karan alang
Hello All, I have a structured Streaming program, which reads data from Kafka topic, and does some processing, and finally puts data into target Kafka Topic. Note : the processing is done in function - convertToDictForEachBatch(), which is called using - foreachBatch(convertToDictForEachBatch)

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
ry to use RocksDB which is now natively integrated with SPARK :) KA : Stateful - since i'm using windowing+watermark in the aggregation queries. Also, thnx - will check the links you provided. regds, Karan Alang On Sat, Feb 26, 2022 at 3:31 AM Gourav Sengupta wrote: > Hi, > > Can you pl

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
Hi Mich, thnx .. i'll check the thread you forwarded, and revert back. regds, Karan Alang On Sat, Feb 26, 2022 at 2:44 AM Mich Talebzadeh wrote: > Check the thread I forwarded on how to gracefully shutdown spark > structured streaming > > HTH > > On Fri, 25 Feb 2022 at

Re: StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-27 Thread karan alang
Hi Gabor, i just responded to your comment on stackoverflow. regds, Karan Alang On Sat, Feb 26, 2022 at 3:06 PM Gabor Somogyi wrote: > Hi Karan, > > Plz have a look at the stackoverflow comment I've had 2 days ago > > G > > On Fri, 25 Feb 2022, 23:31 karan alang,

StructuredStreaming error - pyspark.sql.utils.StreamingQueryException: batch 44 doesn't exist

2022-02-25 Thread karan alang
Hello All, I'm running a StructuredStreaming program on GCP Dataproc, which reads data from Kafka, does some processing and puts processed data back into Kafka. The program was running fine, when I killed it (to make minor changes), and then re-started it. It is giving me the error -

Structured Streaming + UDF - logic based on checking if a column is present in the Dataframe

2022-02-23 Thread karan alang
Hello All, I'm using StructuredStreaming, and am trying to use UDF to parse each row. Here is the requirement: - we can get alerts of a particular KPI with type 'major' OR 'critical' - for a KPI, if we get alerts of type 'major' e.g. _major, and we have a critical alert as well _critical,
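The column-presence check can stay in plain Python before any UDF is applied, by inspecting `df.columns`. The `_major`/`_critical` suffix convention follows the thread; the precedence rule (critical wins when both exist) is an assumption about the intended logic:

```python
def pick_alert_column(columns, kpi: str):
    # choose which alert column to parse for a KPI: prefer the critical
    # variant when present, fall back to major, else nothing to parse
    critical, major = f"{kpi}_critical", f"{kpi}_major"
    if critical in columns:
        return critical
    if major in columns:
        return major
    return None
```

Calling this with `pick_alert_column(df.columns, "cpu")` keeps the branching in driver code, so the UDF itself never has to guess whether a column exists.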

Re: StructuredStreaming - foreach/foreachBatch

2022-02-21 Thread karan alang
Thanks, Gourav - will check out the book. regds, Karan Alang On Thu, Feb 17, 2022 at 9:05 AM Gourav Sengupta wrote: > Hi, > > The following excellent documentation may help as well: > https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#u

GCP Dataproc - error in importing KafkaProducer

2022-02-17 Thread karan alang
Hello All, I've a GCP Dataproc cluster, and I'm running a Spark StructuredStreaming job on this. I'm trying to use KafkaProducer to push aggregated data into a Kafka topic, however when i import KafkaProducer (from kafka import KafkaProducer), it gives error ``` Traceback (most recent call

writing a Dataframe (with one of the columns as struct) into Kafka

2022-02-17 Thread karan alang
Hello All, I've a pyspark dataframe which i need to write to Kafka topic. Structure of the DF is : root |-- window: struct (nullable = true) ||-- start: timestamp (nullable = false) ||-- end: timestamp (nullable = false) |-- processedAlarmCnt: integer (nullable = false) |--
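The Kafka sink requires a string or binary `value` column, so the nested `window` struct cannot be written as-is; the usual approach is to serialize the whole row to JSON. A sketch (sink options assumed):

```python
def to_kafka_payload(df):
    # pack every column, including the nested window struct, into a single
    # JSON string column named `value`, which the Kafka sink expects
    from pyspark.sql import functions as F
    return df.select(F.to_json(F.struct("*")).alias("value"))
```

The result can then go to `df.write.format("kafka")` with the usual bootstrap-server and topic options; the struct's `start`/`end` timestamps come out as nested JSON fields.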

Re: SparkStructured Streaming using withWatermark - TypeError: 'module' object is not callable

2022-02-16 Thread karan alang
is taken. From what I understand, the watermark is done within the processing .. since it is done per microbatch pulled with each trigger. Pls let me know if you have comments/suggestions on this approach. thanks, Karan Alang On Wed, Feb 16, 2022 at 12:52 AM Mich Talebzadeh wrote: > OK

SparkStructured Streaming using withWatermark - TypeError: 'module' object is not callable

2022-02-15 Thread karan alang
Hello All, I have a Structured Streaming pyspark program running on GCP Dataproc, which reads data from Kafka, and does some data massaging, and aggregation. I'm trying to use withWatermark(), and it is giving error. py4j.Py4JException: An exception was raised by the Python Proxy. Return

Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Hi Gaurav, All, I'm doing a spark-submit from my local system to a GCP Dataproc cluster .. This is more for dev/testing. I can run a -- 'gcloud dataproc jobs submit' command as well, which is what will be done in Production. Hope that clarifies. regds, Karan Alang On Sat, Feb 12, 2022 at 10:31

Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Hi Holden, when you mention - GS Access jar - which jar is this ? Can you pls clarify ? thanks, Karan Alang On Sat, Feb 12, 2022 at 11:10 AM Holden Karau wrote: > You can also put the GS access jar with your Spark jars — that’s what the > class not found exception is pointing you t

Re: Unable to access Google buckets using spark-submit

2022-02-13 Thread karan alang
Thanks, Mich - will check this and update. regds, Karan Alang On Sat, Feb 12, 2022 at 1:57 AM Mich Talebzadeh wrote: > BTW I also answered you in in stackoverflow : > > > https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit > > HT

Unable to access Google buckets using spark-submit

2022-02-11 Thread karan alang
Hello All, I'm trying to access gcp buckets while running spark-submit from local, and running into issues. I'm getting error : ``` 22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Exception in thread
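Running spark-submit locally against `gs://` paths generally needs the GCS connector jar plus Hadoop filesystem settings; a sketch with placeholder paths (the jar location and keyfile path are assumptions for a local setup):

```python
# configuration a local spark-submit needs to resolve gs:// URIs
GCS_CONF = {
    # local copy of the GCS connector jar (placeholder path)
    "spark.jars": "/path/to/gcs-connector-hadoop3-latest.jar",
    # register the gs:// filesystem implementation
    "spark.hadoop.fs.gs.impl":
        "com.google.cloud.hadoop.fs.GoogleHadoopFileSystem",
    # authenticate with a service-account key (placeholder path)
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
        "/path/to/key.json",
}
```

A `ClassNotFoundException` mentioning `GoogleHadoopFileSystem` usually means the connector jar itself is missing from the driver/executor classpath, independent of the auth settings.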

Re: StructuredStreaming - foreach/foreachBatch

2022-02-09 Thread karan alang
Thanks, Mich .. will check it out regds, Karan Alang On Tue, Feb 8, 2022 at 3:06 PM Mich Talebzadeh wrote: > BTW you can check this Linkedin article of mine on Processing Change Data > Capture with Spark Structured Streaming > <https://www.linkedin.com/pulse/processing-change-

Re: StructuredStreaming - foreach/foreachBatch

2022-02-07 Thread karan alang
.. > > else: > > print("DataFrame is empty") > > *HTH* > > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for

StructuredStreaming - foreach/foreachBatch

2022-02-07 Thread karan alang
Hello All, I'm using StructuredStreaming to read data from Kafka, and need to do transformation on each individual row. I'm trying to use 'foreach' (or foreachBatch), and running into issues. Basic question - how is the row passed to the function when foreach is used ? Also, when I use
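On the basic question: `foreach` invokes a function (or writer object) once per Row, with fields available as attributes on the Row, while `foreachBatch` hands over the whole micro-batch DataFrame plus a batch id. A minimal sketch:

```python
def handle_row(row):
    # `foreach` calls this once per record; `value` here is an assumed
    # field name, accessed as an attribute on the Row
    return getattr(row, "value", None)

def handle_batch(batch_df, batch_id):
    # `foreachBatch` receives a full DataFrame, which suits bulk
    # (per-batch) transformations and writes better than row-at-a-time logic
    batch_df.persist()
    # ... batch-level transformation / write would go here ...
    batch_df.unpersist()
```

These are wired up as `df.writeStream.foreach(handle_row)` or `df.writeStream.foreachBatch(handle_batch)`; for transforming each row as part of the pipeline, a `select` with built-in functions or a UDF is usually preferable to `foreach`.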

Re: GCP Dataproc - Failed to construct kafka consumer, Failed to load SSL keystore dataproc-versa-sase-p12-1.jks of type JKS

2022-02-02 Thread karan alang
re-checking to see if there is any suggestion on this issue. On Wed, Feb 2, 2022 at 3:36 PM karan alang wrote: > Hello All, > > I'm trying to run a Structured Streaming program on GCP Dataproc, which > accesses the data from Kafka and prints it. > > Access to

GCP Dataproc - Failed to construct kafka consumer, Failed to load SSL keystore dataproc-versa-sase-p12-1.jks of type JKS

2022-02-02 Thread karan alang
Hello All, I'm trying to run a Structured Streaming program on GCP Dataproc, which accesses the data from Kafka and prints it. Access to Kafka is using SSL, and the truststore and keystore files are stored in buckets. I'm using Google Storage API to access the bucket, and store the file in the

Re: Structured Streaming on GCP Dataproc - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

2022-02-02 Thread karan alang
content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Wed, 2 Feb 2022 at 06:51, karan alang wrote: > >> Hello All, >> >> I'm running a simple Structu

Structured Streaming on GCP Dataproc - java.lang.NoClassDefFoundError: org/apache/kafka/common/serialization/ByteArraySerializer

2022-02-01 Thread karan alang
Hello All, I'm running a simple Structured Streaming job on GCP, which reads data from Kafka and prints onto console. Command : gcloud dataproc jobs submit pyspark /Users/karanalang/Documents/Technology/gcp/DataProc/StructuredStreaming_Kafka_GCP-Batch-feb1.py --cluster dataproc-ss-poc

Re: Structured Streaming - not showing records on console

2022-02-01 Thread karan alang
uot;DataFrame s empty") > > HTH > > > >view my Linkedin profile > <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> > > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any o

Structured Streaming - not showing records on console

2022-02-01 Thread karan alang
Hello Spark Experts, I've a simple Structured Streaming program, which reads data from Kafka, and writes on the console. This is working in batch mode (i.e. spark.read or df.write), but not working in streaming mode. Details are in the stackoverflow

Databricks notebook - cluster taking a long time to get created, often timing out

2021-08-17 Thread karan alang
Hello - i've been using the Databricks notebook (for pyspark or scala/spark development), and recently have had issues wherein the cluster takes a long time to get created, often timing out. Any ideas on how to resolve this ? Any other alternatives to the Databricks notebook ?

spark-submit not running on macbook pro

2021-08-16 Thread karan alang
Hello Experts, i'm trying to run spark-submit on my macbook pro (commandline or using PyCharm), and it seems to be giving the error -> Exception: Java gateway process exited before sending its port number. I've tried setting values for variables in the program (based on the recommendations by people on

pyspark on pycharm error

2019-05-08 Thread karan alang
Hello - anyone has any ideas on this PySpark/PyCharm error in SO, pls. let me know. https://stackoverflow.com/questions/56028402/java-util-nosuchelementexception-key-not-found-pyspark-driver-callback-host

Jupyter Notebook (Scala, kernel - Apache Toree) with Vegas, Graph not showing data

2019-02-01 Thread karan alang
Anybody used Vegas-viz for charting/graphics on Jupyter Spark/Scala notebook ? I'm looking for input on the following issue - https://stackoverflow.com/questions/54473744/jupyter-notebook-scala-kernel-apache-toree-with-vegas-graph-not-showing-da Pls. let me know. thanks!

java vs scala for Apache Spark - is there a performance difference ?

2018-10-26 Thread karan alang
Hello - is there a "performance" difference when using Java or Scala for Apache Spark ? I understand, there are other obvious differences (less code with scala, easier to focus on logic etc), but wrt performance - i think there would not be much of a difference since both of them are JVM based,

Re: Error - Dropping SparkListenerEvent because no remaining room in event queue

2018-10-24 Thread karan alang
Pls note - Spark version is 2.2.0 On Wed, Oct 24, 2018 at 3:57 PM karan alang wrote: > Hello - > we are running a Spark job, and getting the following error - > > "LiveListenerBus: Dropping SparkListenerEvent because no remaining room in > event queue" > > As per

Error - Dropping SparkListenerEvent because no remaining room in event queue

2018-10-24 Thread karan alang
Hello - we are running a Spark job, and getting the following error - "LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue" As per the recommendation in the Spark Docs - I've increased the value of property spark.scheduler.listenerbus.eventqueue.capacity to
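One detail worth noting: the queue capacity has to be set before the SparkContext is created (e.g. via spark-submit `--conf`); changing it on a running context has no effect. The property name follows the thread, and the value below is an assumed example (the default capacity is 10000 events):

```python
# assumed value; must be applied at submit time, before the context starts
LISTENER_CONF = {
    "spark.scheduler.listenerbus.eventqueue.capacity": "100000",
}

def submit_args(conf: dict) -> list:
    # renders config entries as spark-submit --conf arguments
    return [arg for key, value in conf.items()
            for arg in ("--conf", f"{key}={value}")]
```

Dropped `SparkListenerEvent`s mean the driver cannot drain listener events fast enough; raising the capacity buys headroom, but persistent drops usually point at a slow listener (often the event-log writer or a custom listener).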

modeling timestamp in Avro messages (read using Spark Structured Streaming)

2018-07-29 Thread karan alang
i've a question regarding modeling the timestamp column in Avro messages The options are - ISO 8601 "String" (UTC Time) - "int" 32bit signed UNIX Epoch time - Long (modeled as Logical datatype - timestamp in schema) what would be the best way to model the timestamp ? fyi. we are using Apache

scala question (in spark project)- not able to call getClassSchema method in avro generated class

2018-02-24 Thread karan alang
i’ve an Avro generated class - com.avro.Person which has a method -> getClassSchema. I’m passing the className to a method, and in the method - i need to get the Avro schema. Here is the code i'm trying to use - val pr = Class.forName(productCls) //where productCls = classOf[Product].getName How

not able to read git info from Scala Test Suite

2018-02-13 Thread karan alang
Hello - I'm writing a scala unittest for my Spark project which checks the git information, and somehow it is not working from the Unit Test Added in pom.xml -- pl.project13.maven git-commit-id-plugin 2.2.4

spark-shell not getting launched - Queue's AM resource limit exceeded.

2017-08-06 Thread karan alang
Hello - i've HDP 2.5.x and i'm trying to launch spark-shell .. ApplicationMaster gets launched, but YARN is not able to assign containers. *Command ->* ./bin/spark-shell --master yarn-client --driver-memory 512m --executor-memory 512m *Error ->* [Sun Aug 06 19:33:29 + 2017] Application is

Re: spark-shell - modes

2017-08-06 Thread karan alang
update - seems 'spark-shell' does not support mode -> yarn-cluster (i guess since it is an interactive shell) The only modes supported include -> yarn-client & local Pls let me know if my understanding is incorrect. Thanks! On Sun, Aug 6, 2017 at 10:07 AM, karan alang <karan.al

spark-shell - modes

2017-08-06 Thread karan alang
Hello all - i'd a basic question on the modes in which spark-shell can be run .. when i run the following command, does Spark run in local mode i.e. outside of YARN & using the local cores ? (since '--master' option is missing) ./bin/spark-shell --driver-memory 512m --executor-memory 512m

Re: error in running StructuredStreaming-Kafka integration code (Spark 2.x & Kafka 10)

2017-07-10 Thread karan alang
regds, Karan Alang On Mon, Jul 10, 2017 at 11:48 AM, David Newberger <da...@phdata.io> wrote: > Karen, > > It looks like the Kafka version is incorrect. You mention Kafka 0.10 > however the classpath references Kafka 0.9 > > Thanks, > > David > > On July 10, 2

error in running StructuredStreaming-Kafka integration code (Spark 2.x & Kafka 10)

2017-07-10 Thread karan alang
Hi All, I'm running Spark Streaming - Kafka integration using Spark 2.x & Kafka 10, & seems to be running into issues. I compiled the program using sbt, and the compilation went through fine. I was able to import this into Eclipse & run the program from Eclipse. However, when i run the

Re: Spark-Kafka integration - build failing with sbt

2017-06-19 Thread karan alang
through fine. So, I assume additional libraries are being downloaded when i provide the appropriate packages in libraryDependencies ? which ones would have helped compile this ? On Sat, Jun 17, 2017 at 2:53 PM, karan alang <karan.al...@gmail.com> wrote: > Thanks, Cody .. yes, was ab

Re: Spark-Kafka integration - build failing with sbt

2017-06-17 Thread karan alang
k.apache.org/docs/latest/streaming-kafka-integration.html > > On Fri, Jun 16, 2017 at 6:51 PM, karan alang <karan.al...@gmail.com> > wrote: > > I'm trying to compile kafka & Spark Streaming integration code i.e. > reading > > from Kafka using Spark Streaming, > &

Spark-Kafka integration - build failing with sbt

2017-06-16 Thread karan alang
I'm trying to compile kafka & Spark Streaming integration code i.e. reading from Kafka using Spark Streaming, and the sbt build is failing with error - [error] (*:update) sbt.ResolveException: unresolved dependency: org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found Scala version