Hello - Has anyone used Prophet + PySpark for forecasting?
I'm trying to backfill forecasts, and running into an issue (error:
Dataframe has less than 2 non-NaN rows).
I'm removing all records with NaN values, yet still getting this error.
Details are in the stackoverflow link ->
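Prophet raises that error from a pre-check inside fit() when a group has fewer than 2 usable rows, so dropping NaNs alone isn't enough - some groups may simply be too small after filtering. A minimal sketch of guarding each group before fitting; the helper names (`has_enough_rows`, `safe_fit`) and column names (`ds`, `y`) follow Prophet's conventions but are my own sketch, not the original poster's code:

```python
def has_enough_rows(rows, value_key="y"):
    """Prophet's fit() requires at least 2 non-NaN rows; pre-check a group.
    `rows` is a list of dicts carrying at least a `y` value."""
    non_nan = [r for r in rows
               if r.get(value_key) is not None
               and r[value_key] == r[value_key]]  # NaN != NaN filters float('nan')
    return len(non_nan) >= 2

def safe_fit(pdf):
    """Fit Prophet only when the guard passes. Meant to be called from a
    pandas UDF / applyInPandas over each group (a sketch, not cluster-tested)."""
    from prophet import Prophet  # deferred: only needed when actually fitting
    pdf = pdf.dropna(subset=["y"])
    if len(pdf) < 2:
        return None  # skip groups that would trigger the error
    model = Prophet()
    model.fit(pdf[["ds", "y"]])
    return model
```

Returning None (or an empty frame) for undersized groups keeps the backfill loop running instead of aborting on the first sparse series.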
Hello All -
I'm using Apache Spark Structured Streaming to read data from a Kafka topic,
and do some processing. I'm using a watermark to account for late-arriving
records, and the code works fine.
Here is the working (sample) code:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions
```
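For reference, the minimal shape of such a pipeline - Kafka source, watermark, tumbling-window aggregation - is sketched below. The topic, server, and column names are placeholders of mine, not the original job's; the pure helper mirrors how Spark assigns a tumbling-window frame to an event timestamp.

```python
def window_bounds(epoch_sec, window_sec=600):
    """Compute the tumbling-window frame an event falls into; this mirrors
    what Spark's window() assigns for a 10-minute window."""
    start = epoch_sec - (epoch_sec % window_sec)
    return start, start + window_sec

def build_stream(spark):
    """Sketch of the Kafka -> watermark -> windowed-aggregation wiring
    (placeholder topic/server names; not the original job's code)."""
    from pyspark.sql import functions as F  # deferred pyspark import
    df = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
          .option("subscribe", "input-topic")                   # placeholder
          .load())
    return (df.selectExpr("CAST(value AS STRING)", "timestamp")
            .withWatermark("timestamp", "10 minutes")
            .groupBy(F.window("timestamp", "10 minutes"))
            .count())
```

When debugging late-data behaviour, `window_bounds(epoch)` tells you which frame a given record should land in, which makes it easier to reason about what the watermark should have dropped.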
ref: https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly
Hello All,
I have data stored in a MongoDB collection, and the timestamp column is not
being read correctly by Apache Spark. I'm running Apache Spark on GCP
Dataproc.
Here is sample data :
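One thing worth checking first is the session time zone: MongoDB stores timestamps in UTC, but Spark renders them in the JVM-default zone, which often looks like a "wrong" read. A sketch, assuming the MongoDB Spark Connector v10-style options (the option names should be verified against the connector docs for your version; URI/database/collection are placeholders):

```python
from pyspark.sql import SparkSession

# Sketch: force Spark to interpret/display timestamps as UTC, a common
# fix when Mongo timestamps appear shifted by the local-zone offset.
spark = (SparkSession.builder
         .appName("mongo-read")
         .config("spark.sql.session.timeZone", "UTC")
         .getOrCreate())

# Assumed Mongo Spark Connector v10-style read; option names vary by version.
df = (spark.read.format("mongodb")
      .option("connection.uri", "mongodb://host:27017")  # placeholder URI
      .option("database", "mydb")                        # placeholder
      .option("collection", "mycoll")                    # placeholder
      .load())
```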
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
> On Thu, 16 Mar 2023
Fyi .. apache spark version is 3.1.3
On Wed, Mar 15, 2023 at 4:34 PM karan alang wrote:
> Hi Mich, this doesn't seem to be working for me .. the watermark seems to
> be getting ignored !
>
> Here is the data pu
> F.col("window.start").alias("startOfWindowFrame") \
> , F.col("window.end").alias("endOfWindowFrame") \
> , F.col("avg(temperature)").alias("AVGTemperature"))
>
> re
> On Fri, 10 Mar 2023 at 07:16, karan alang wrote:
Hello All -
I have a Structured Streaming job with a trigger of 10 minutes, and I'm
using a watermark to account for late data coming in. However, the watermark
is not working - instead of a single record with the total aggregated
value, I see 2 records.
Here is the code :
+1 .. I'm happy to be part of these discussions as well !
On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh
wrote:
> Hi,
>
> I guess I can schedule this work over a course of time. I for myself can
> contribute plus learn from others.
>
> So +1 for me.
>
> Let us see if anyone else is
cases, you don’t need to set this configuration.
> Sent from my iPhone
>
> On Feb 14, 2023, at 5:43 AM, karan alang wrote:
>
>
> Hello All,
>
> I'm trying to run a simple application on GKE (Kubernetes), and it is
> failing:
> Note : I have spark(bitnami spark c
Hello All,
I'm trying to run a simple application on GKE (Kubernetes), and it is
failing:
Note: I have Spark (Bitnami Spark chart) installed on GKE using helm
install.
Here is what is done :
1. created a docker image using Dockerfile
Dockerfile :
```
FROM python:3.7-slim
RUN apt-get update &&
```
here is the stackoverflow link
https://stackoverflow.com/questions/73780259/spark-structured-streaming-stderr-getting-filled-up
On Mon, Sep 19, 2022 at 4:41 PM karan alang wrote:
> I've created a stackoverflow ticket for this as well
>
> On Mon, Sep 19, 2022 at 4:37 PM karan ala
I've created a stackoverflow ticket for this as well
On Mon, Sep 19, 2022 at 4:37 PM karan alang wrote:
> Hello All,
> I've a Spark Structured Streaming job on GCP Dataproc - which picks up
> data from Kafka, does processing and pushes data back into kafka topics.
>
> Couple of
Hello All,
I've a Spark Structured Streaming job on GCP Dataproc - which picks up data
from Kafka, does processing, and pushes data back into Kafka topics.
A couple of questions:
1. Does Spark put all the logs (incl. INFO, WARN etc.) into stderr?
What I notice is that stdout is empty, while all the
Hello All,
I've a Spark structured streaming job which reads from Kafka, does
processing, and puts data into Mongo/Kafka/GCP Buckets (i.e. it is
processing heavy).
I'm consistently seeing the following warnings:
```
22/09/06 16:55:03 INFO
```
Hello All,
I've a long-running Apache Spark structured streaming job running in GCP
Dataproc, which reads data from Kafka every 10 mins and does some
processing. The Kafka topic has 3 partitions, and a retention period of 3 days.
The issue I'm facing is that after a few hours, the program stops
Hello All,
I have data in a Kafka topic (data published every 10 mins) and I'm planning
to read this data using Apache Spark Structured Streaming (batch mode) and push
it into MongoDB.
Pls note: this will be scheduled using Composer/Airflow on GCP - which
will create a Dataproc cluster, run the spark
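For a scheduled batch read like this, `spark.read` (rather than `readStream`) with explicit offset bounds is the usual pattern; in the offsets JSON, -2 means "earliest" and -1 means "latest". A sketch - the topic name, partition count, and bootstrap servers are placeholders of mine:

```python
import json

def offsets_json(topic, num_partitions, offset):
    """Build the JSON that Spark's startingOffsets/endingOffsets option
    expects, e.g. {"topic": {"0": -2, "1": -2}}; -2 = earliest, -1 = latest."""
    return json.dumps({topic: {str(p): offset for p in range(num_partitions)}})

def read_batch(spark, topic="events", servers="localhost:9092"):
    """Sketch of a batch (non-streaming) Kafka read; placeholder names."""
    return (spark.read.format("kafka")
            .option("kafka.bootstrap.servers", servers)
            .option("subscribe", topic)
            .option("startingOffsets", offsets_json(topic, 3, -2))
            .option("endingOffsets", offsets_json(topic, 3, -1))
            .load())
```

Writing the result to MongoDB would then be a plain batch `df.write` with the Mongo connector (the exact format name depends on the connector version).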
Hello All,
I've a Structured Streaming program running on GCP dataproc which reads
data from Kafka every 10 mins, and then does processing.
This is a multi-tenant system i.e. the program will read data from multiple
customers.
In my current code, I'm looping over the customers, passing each to the
Hello All,
I've a Structured Streaming job on GCP Dataproc, and I'm trying to pass
multiple packages (Kafka, MongoDB) to the dataproc submit command, and
that is not working.
Command that works (when I add a single dependency, e.g. Kafka):
```
gcloud dataproc jobs submit pyspark main.py \
```
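An alternative to passing packages on the command line is to set `spark.jars.packages` (comma-separated Maven coordinates) on the session itself; the coordinates below are illustrative and must match your Spark/Scala build. On the gcloud side the equivalent is `--properties=spark.jars.packages=...`, where commas inside the property value need gcloud's alternate-delimiter escaping (see `gcloud topic escaping`).

```python
from pyspark.sql import SparkSession

# Sketch: both connectors declared as comma-separated Maven coordinates.
# Versions below are illustrative -- match them to your Spark/Scala build.
packages = ",".join([
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.3",
    "org.mongodb.spark:mongo-spark-connector_2.12:10.1.1",
])

spark = (SparkSession.builder
         .appName("multi-package-job")
         .config("spark.jars.packages", packages)
         .getOrCreate())
```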
ently)
One idea is to store the max offset read from each batch of
topic_reloadpred, and when the next batch is read, compare it with the
'stored' max offset
to determine whether new records have been added to the topic.
What is the best way to fulfill this requirement?
regds,
Karan Alang
On
Hello All,
I have a structured Streaming program, which reads data from Kafka topic,
and does some processing, and finally puts data into target Kafka Topic.
Note : the processing is done in function - convertToDictForEachBatch(),
which is called using - foreachBatch(convertToDictForEachBatch)
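With foreachBatch, Spark hands the function a normal (static) micro-batch DataFrame plus an epoch id, so ordinary batch operations are allowed inside it. A sketch of that shape, keeping the poster's function name but with a small row-to-dict helper of my own:

```python
def to_dict(row):
    """Convert a Spark Row (or any mapping) to a plain dict."""
    return row.asDict(recursive=True) if hasattr(row, "asDict") else dict(row)

def convertToDictForEachBatch(batch_df, epoch_id):
    """Called once per micro-batch; batch_df is a static DataFrame, so
    batch operations (collect, write, joins) are fine here."""
    for row in batch_df.collect():  # fine for small batches; avoid for large ones
        record = to_dict(row)
        # ... process / publish `record` here ...

# Wiring sketch (source DataFrame and checkpoint path are placeholders):
# query = (df.writeStream
#          .foreachBatch(convertToDictForEachBatch)
#          .option("checkpointLocation", "/tmp/ckpt")  # placeholder path
#          .start())
```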
ry to use RocksDB which is now natively
integrated with SPARK :)
KA : Stateful - since i'm using windowing+watermark in the aggregation
queries.
Also, thnx - will check the links you provided.
regds,
Karan Alang
On Sat, Feb 26, 2022 at 3:31 AM Gourav Sengupta
wrote:
> Hi,
>
> Can you pl
Hi Mich,
thnx .. i'll check the thread you forwarded, and revert back.
regds,
Karan Alang
On Sat, Feb 26, 2022 at 2:44 AM Mich Talebzadeh
wrote:
> Check the thread I forwarded on how to gracefully shutdown spark
> structured streaming
>
> HTH
>
> On Fri, 25 Feb 2022 at
Hi Gabor,
i just responded to your comment on stackoverflow.
regds,
Karan Alang
On Sat, Feb 26, 2022 at 3:06 PM Gabor Somogyi
wrote:
> Hi Karan,
>
> Plz have a look at the stackoverflow comment I've had 2 days ago
>
> G
>
> On Fri, 25 Feb 2022, 23:31 karan alang,
Hello All,
I'm running a StructuredStreaming program on GCP Dataproc, which reads data
from Kafka, does some processing and puts processed data back into Kafka.
The program was running fine, when I killed it (to make minor changes), and
then re-started it.
It is giving me the error -
Hello All,
I'm using StructuredStreaming, and am trying to use a UDF to parse each row.
Here is the requirement:
- we can get alerts for a particular KPI with type 'major' OR 'critical'
- for a KPI, if we get alerts of type 'major' (e.g. _major), and we have a
critical alert as well (_critical),
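The precedence rule described above (a critical alert for a KPI supersedes a major one) is easiest to express as a plain Python function first and only then wrap in a UDF; all names below are my own sketch, not the original code:

```python
def effective_severity(alerts):
    """Given (kpi, severity) pairs, keep the highest severity per KPI:
    'critical' wins over 'major'. Returns {kpi: severity}."""
    rank = {"major": 1, "critical": 2}
    out = {}
    for kpi, sev in alerts:
        if sev in rank and rank[sev] > rank.get(out.get(kpi, ""), 0):
            out[kpi] = sev
    return out

# Wrapping in a Spark UDF would look roughly like (sketch):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# parse_alert = udf(lambda kpi, sev: effective_severity([(kpi, sev)]).get(kpi),
#                   StringType())
```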
Thanks, Gourav - will check out the book.
regds,
Karan Alang
On Thu, Feb 17, 2022 at 9:05 AM Gourav Sengupta
wrote:
> Hi,
>
> The following excellent documentation may help as well:
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#u
Hello All,
I've a GCP Dataproc cluster, and I'm running a Spark StructuredStreaming
job on this.
I'm trying to use KafkaProducer to push aggregated data into a Kafka
topic; however, when I import KafkaProducer
(from kafka import KafkaProducer),
it gives an error:
```
Traceback (most recent call
```
Hello All,
I've a pyspark dataframe which I need to write to a Kafka topic.
The structure of the DF is:
root
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = false)
 |    |-- end: timestamp (nullable = false)
 |-- processedAlarmCnt: integer (nullable = false)
 |--
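The usual pattern for Kafka is to serialize the whole row into Kafka's single string `value` column with `to_json(struct(...))`; the expected payload shape can be sanity-checked with plain json. Column names come from the schema above; the helper names and server/topic are mine:

```python
import json

def expected_payload(start, end, cnt):
    """The JSON shape to_json(struct(...)) produces for one row of the
    schema above (window struct + processedAlarmCnt)."""
    return json.dumps({"window": {"start": start, "end": end},
                       "processedAlarmCnt": cnt})

def to_kafka_frame(df):
    """Sketch: collapse all columns into Kafka's single `value` column."""
    from pyspark.sql import functions as F  # deferred pyspark import
    return df.select(F.to_json(F.struct("window", "processedAlarmCnt"))
                     .alias("value"))

# Then (placeholders):
# (to_kafka_frame(df).write.format("kafka")
#  .option("kafka.bootstrap.servers", "localhost:9092")
#  .option("topic", "out-topic")
#  .save())
```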
is taken.
From what I understand, the watermark is applied within the processing,
since it is done per micro-batch pulled with each trigger.
Pls let me know if you have comments/suggestions on this approach.
thanks,
Karan Alang
On Wed, Feb 16, 2022 at 12:52 AM Mich Talebzadeh
wrote:
> OK
Hello All,
I have a Structured Streaming pyspark program running on GCP Dataproc,
which reads data from Kafka, and does some data massaging, and aggregation.
I'm trying to use withWatermark(), and it is giving error.
py4j.Py4JException: An exception was raised by the Python Proxy. Return
Hi Gourav, All,
I'm doing a spark-submit from my local system to a GCP Dataproc cluster ..
This is more for dev/testing.
I can run a -- 'gcloud dataproc jobs submit' command as well, which is what
will be done in Production.
Hope that clarifies.
regds,
Karan Alang
On Sat, Feb 12, 2022 at 10:31
Hi Holden,
when you mention - GS Access jar - which jar is this ?
Can you pls clarify ?
thanks,
Karan Alang
On Sat, Feb 12, 2022 at 11:10 AM Holden Karau wrote:
> You can also put the GS access jar with your Spark jars — that’s what the
> class not found exception is pointing you t
Thanks, Mich - will check this and update.
regds,
Karan Alang
On Sat, Feb 12, 2022 at 1:57 AM Mich Talebzadeh
wrote:
> BTW I also answered you in in stackoverflow :
>
>
> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>
> HT
Hello All,
I'm trying to access GCP buckets while running spark-submit from my local
machine, and running into issues.
I'm getting the error:
```
22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
Exception in thread
```
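The NativeCodeLoader warning itself is harmless; the usual missing piece for gs:// access from a local spark-submit is the GCS connector jar plus its filesystem registration and credentials. A sketch - the jar path and the service-account key path are placeholders:

```python
from pyspark.sql import SparkSession

# Sketch: register the GCS Hadoop connector for local spark-submit runs.
# The jar path and service-account key path below are placeholders.
spark = (SparkSession.builder
         .appName("gcs-access")
         .config("spark.jars", "/path/to/gcs-connector-hadoop3-latest.jar")
         .config("spark.hadoop.fs.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
         .config("spark.hadoop.fs.AbstractFileSystem.gs.impl",
                 "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
         .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
                 "/path/to/key.json")
         .getOrCreate())
```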
Thanks, Mich .. will check it out
regds,
Karan Alang
On Tue, Feb 8, 2022 at 3:06 PM Mich Talebzadeh
wrote:
> BTW you can check this Linkedin article of mine on Processing Change Data
> Capture with Spark Structured Streaming
> <https://www.linkedin.com/pulse/processing-change-
..
>
> else:
>
> print("DataFrame is empty")
>
> *HTH*
>
>
Hello All,
I'm using StructuredStreaming to read data from Kafka, and need to do a
transformation on each individual row.
I'm trying to use 'foreach' (or foreachBatch), and running into issues.
Basic question: how is the row passed to the function when foreach is
used?
Also, when I use
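On the basic question above: with foreach, Spark calls the supplied function once per Row on the executors (or uses an object with open/process/close methods for connection reuse), while foreachBatch passes a whole micro-batch DataFrame. A sketch of both shapes; the function name and column name are mine:

```python
def process_row(row):
    """foreach: invoked once per Row; fields are accessible by name or key
    (a pyspark.sql.Row supports both attribute and [] access)."""
    value = row["value"]
    # ... transform / send `value` somewhere ...
    return value

# Wiring sketch (source DataFrame is a placeholder):
# df.writeStream.foreach(process_row).start()               # per-row
# df.writeStream.foreachBatch(lambda bdf, i: None).start()  # per micro-batch
```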
re-checking to see if there is any suggestion on this issue.
On Wed, Feb 2, 2022 at 3:36 PM karan alang wrote:
> Hello All,
>
> I'm trying to run a Structured Streaming program on GCP Dataproc, which
> accesses the data from Kafka and prints it.
>
> Access to
Hello All,
I'm trying to run a Structured Streaming program on GCP Dataproc, which
accesses the data from Kafka and prints it.
Access to Kafka is using SSL, and the truststore and keystore files are
stored in buckets. I'm using Google Storage API to access the bucket, and
store the file in the
> On Wed, 2 Feb 2022 at 06:51, karan alang wrote:
>
>> Hello All,
>>
>> I'm running a simple Structu
Hello All,
I'm running a simple Structured Streaming on GCP, which reads data from
Kafka and prints onto console.
Command :
gcloud dataproc jobs submit pyspark
/Users/karanalang/Documents/Technology/gcp/DataProc/StructuredStreaming_Kafka_GCP-Batch-feb1.py
--cluster dataproc-ss-poc
print("DataFrame is empty")
>
> HTH
>
>
>
Hello Spark Experts,
I've a simple Structured Streaming program, which reads data from Kafka,
and writes to the console. This works in batch mode (i.e. spark.read or
df.write), but not in streaming mode.
Details are in the stackoverflow
Hello - I've been using the Databricks notebook (for pyspark or scala/spark
development), and recently have had issues wherein cluster creation
takes a long time, often timing out.
Any ideas on how to resolve this?
Any other alternatives to the Databricks notebook?
Hello Experts,
I'm trying to run spark-submit on my MacBook Pro (command line or using
PyCharm), and it seems to be giving the error ->
Exception: Java gateway process exited before sending its port number
I've tried setting values for variables in the program (based on the
recommendations by people on
Hello - anyone has any ideas on this PySpark/PyCharm error in SO, pls. let
me know.
https://stackoverflow.com/questions/56028402/java-util-nosuchelementexception-key-not-found-pyspark-driver-callback-host
Anybody used Vegas-viz for charting/graphics on Jupyter Spark/Scala
notebook ?
I'm looking for input on the following issue -
https://stackoverflow.com/questions/54473744/jupyter-notebook-scala-kernel-apache-toree-with-vegas-graph-not-showing-da
Pls. let me know.
thanks!
Hello
- is there a "performance" difference when using Java or Scala for Apache
Spark?
I understand there are other obvious differences (less code with Scala,
easier to focus on logic, etc.),
but wrt performance - I think there would not be much of a difference, since
both of them are JVM based.
Pls note - Spark version is 2.2.0
On Wed, Oct 24, 2018 at 3:57 PM karan alang wrote:
> Hello -
> we are running a Spark job, and getting the following error -
>
> "LiveListenerBus: Dropping SparkListenerEvent because no remaining room in
> event queue"
>
> As per
Hello -
we are running a Spark job, and getting the following error -
"LiveListenerBus: Dropping SparkListenerEvent because no remaining room in
event queue"
As per the recommendation in the Spark Docs -
I've increased the value of property
spark.scheduler.listenerbus.eventqueue.capacity to
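For reference, the queue capacity (default 10000) is set like any other Spark conf; raising it trades driver memory for fewer dropped events, and sustained drops usually point to a slow listener rather than a too-small queue. A sketch, with an illustrative value:

```python
from pyspark.sql import SparkSession

# Sketch: raise the listener-bus queue capacity (default is 10000).
spark = (SparkSession.builder
         .appName("tuned-listenerbus")
         .config("spark.scheduler.listenerbus.eventqueue.capacity", "20000")
         .getOrCreate())

# Equivalent on the command line:
#   spark-submit --conf spark.scheduler.listenerbus.eventqueue.capacity=20000 ...
```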
I've a question regarding modeling a timestamp column in Avro messages.
The options are:
- ISO 8601 "String" (UTC time)
- "int" 32-bit signed UNIX epoch time
- "long" (modeled as a logical datatype - timestamp - in the schema)
What would be the best way to model the timestamp?
fyi. we are using Apache
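Of the three options, the Avro logical type (long + timestamp-millis) is generally the safest choice: it stays as compact and sortable as raw epoch time, but carries timestamp semantics so code generators map it to a proper timestamp type, and unlike a 32-bit int it doesn't overflow in 2038. A sketch of such a record schema as a Python dict (record and field names are illustrative):

```python
import json

# Sketch: Avro record with the timestamp modeled as a logical type.
# The record/field names below are illustrative.
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "string"},
        # long + timestamp-millis: compact like epoch time, but with
        # timestamp semantics (and no 2038 overflow, unlike a 32-bit int).
        {"name": "event_time",
         "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}

schema_json = json.dumps(schema)
```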
I've an Avro-generated class - com.avro.Person - which has a method ->
getClassSchema
I'm passing the class name to a method, and in the method I need to get the
Avro schema. Here is the code I'm trying to use -
val pr = Class.forName(productCls) // where productCls =
classOf[Product].getName
How
Hello - I'm writing a Scala unit test for my Spark project
which checks the git information, and somehow it is not working from the
unit test.
Added in pom.xml
--
<plugin>
  <groupId>pl.project13.maven</groupId>
  <artifactId>git-commit-id-plugin</artifactId>
  <version>2.2.4</version>
</plugin>
Hello - I've HDP 2.5.x and I'm trying to launch spark-shell ..
The ApplicationMaster gets launched, but YARN is not able to assign containers.
*Command ->*
./bin/spark-shell --master yarn-client --driver-memory 512m
--executor-memory 512m
*Error ->*
[Sun Aug 06 19:33:29 + 2017] Application is
update - it seems 'spark-shell' does not support the yarn-cluster mode (I
guess since it is an interactive shell).
The only supported modes are yarn-client & local.
Pls let me know if my understanding is incorrect.
Thanks!
On Sun, Aug 6, 2017 at 10:07 AM, karan alang <karan.al
Hello all - I'd a basic question on the modes in which spark-shell can be
run ..
when I run the following command,
does Spark run in local mode, i.e. outside of YARN & using the local cores?
(since the '--master' option is missing)
./bin/spark-shell --driver-memory 512m --executor-memory 512m
egds,
Karan Alang
On Mon, Jul 10, 2017 at 11:48 AM, David Newberger <da...@phdata.io> wrote:
> Karen,
>
> It looks like the Kafka version is incorrect. You mention Kafka 0.10
> however the classpath references Kafka 0.9
>
> Thanks,
>
> David
>
> On July 10, 2
Hi All,
I'm running the Spark Streaming - Kafka integration using Spark 2.x & Kafka
0.10, and seem to be running into issues.
I compiled the program using sbt, and the compilation went through fine.
I was able to import this into Eclipse & run the program from Eclipse.
However, when i run the
through fine
So, I assume additional libraries are being downloaded when I provide the
appropriate packages in libraryDependencies?
Which ones would have helped compile this?
On Sat, Jun 17, 2017 at 2:53 PM, karan alang <karan.al...@gmail.com> wrote:
> Thanks, Cody .. yes, was ab
k.apache.org/docs/latest/streaming-kafka-integration.html
>
> On Fri, Jun 16, 2017 at 6:51 PM, karan alang <karan.al...@gmail.com>
> wrote:
> > I'm trying to compile kafka & Spark Streaming integration code i.e.
> reading
> > from Kafka using Spark Streaming,
> &
I'm trying to compile kafka & Spark Streaming integration code i.e. reading
from Kafka using Spark Streaming,
and the sbt build is failing with error -
[error] (*:update) sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
Scala version