Hi All,
I'm running Spark Streaming - Kafka integration using Spark 2.x & Kafka 0.10,
and seem to be running into issues.
I compiled the program using sbt, and the compilation went through fine.
I was able to import this into Eclipse & run the program from Eclipse.
However, when I run the
regds,
Karan Alang
On Mon, Jul 10, 2017 at 11:48 AM, David Newberger <da...@phdata.io> wrote:
> Karen,
>
> It looks like the Kafka version is incorrect. You mention Kafka 0.10
> however the classpath references Kafka 0.9
>
> Thanks,
>
> David
>
> On July 10, 2
Hello all - I have a basic question on the modes in which spark-shell can be
run ..
When I run the following command,
does Spark run in local mode, i.e. outside of YARN & using the local cores ?
(since the '--master' option is missing)
./bin/spark-shell --driver-memory 512m --executor-memory 512m
Update - it seems 'spark-shell' does not support the yarn-cluster mode (I
guess because it is an interactive shell).
The only supported modes are yarn-client & local.
Pls let me know if my understanding is incorrect.
Thanks!
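For reference, a sketch of the relevant invocations (the default master is local[*] when '--master' is omitted):
```
# no --master: local mode, using all local cores (local[*])
./bin/spark-shell --driver-memory 512m --executor-memory 512m

# explicit local mode with a fixed number of cores
./bin/spark-shell --master local[4] --driver-memory 512m --executor-memory 512m

# YARN client mode (cluster deploy mode is not supported for interactive shells)
./bin/spark-shell --master yarn --deploy-mode client --driver-memory 512m --executor-memory 512m
```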
On Sun, Aug 6, 2017 at 10:07 AM, karan alang <karan.al
Hello - I've HDP 2.5.x and I'm trying to launch spark-shell ..
The ApplicationMaster gets launched, but YARN is not able to assign containers.
*Command ->*
./bin/spark-shell --master yarn-client --driver-memory 512m
--executor-memory 512m
*Error ->*
[Sun Aug 06 19:33:29 + 2017] Application is
through fine
So, I assume additional libraries are being downloaded when I provide the
appropriate packages in libraryDependencies ?
Which ones would have helped compile this ?
On Sat, Jun 17, 2017 at 2:53 PM, karan alang <karan.al...@gmail.com> wrote:
> Thanks, Cody .. yes, was ab
I'm trying to compile Kafka & Spark Streaming integration code, i.e. reading
from Kafka using Spark Streaming,
and the sbt build is failing with the error -
[error] (*:update) sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-streaming-kafka_2.11;2.1.0: not found
Scala version
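A likely cause: Spark 2.x does not publish a plain spark-streaming-kafka artifact; the Kafka integration modules are versioned, e.g. spark-streaming-kafka-0-10. A minimal build.sbt sketch, assuming Scala 2.11 and Spark 2.1.0:
```
// %% appends the Scala binary version (_2.11) to the artifact name
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
```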
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
>
> On Fri, Jun 16, 2017 at 6:51 PM, karan alang <karan.al...@gmail.com>
> wrote:
> > I'm trying to compile kafka & Spark Streaming integration code i.e.
> reading
> > from Kafka using Spark Streaming,
> &
Hello - I'm writing a Scala unit test for my Spark project
which checks the git information, and somehow it is not working from the
unit test.
Added in pom.xml -
<plugin>
  <groupId>pl.project13.maven</groupId>
  <artifactId>git-commit-id-plugin</artifactId>
  <version>2.2.4</version>
</plugin>
I’ve an Avro-generated class - com.avro.Person which has a method ->
getClassSchema
I’m passing the className to a method, and in the method - I need to get the
Avro schema. Here is the code I'm trying to use -
val pr = Class.forName(productCls) // where productCls = classOf[Product].getName
// getClassSchema is a static method on the generated class, so (for reference)
// pr.getMethod("getClassSchema").invoke(null) should yield the org.apache.avro.Schema
How
I've a question regarding modeling a timestamp column in Avro messages.
The options are
- ISO 8601 "String" (UTC time)
- "int" 32-bit signed UNIX epoch time
- "long" (modeled as a logical datatype - timestamp - in the schema)
what would be the best way to model the timestamp ?
fyi. we are using Apache
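For reference, the logical-type option would look roughly like this as a record field (the field name is an assumption):
```
{"name": "eventTime", "type": {"type": "long", "logicalType": "timestamp-millis"}}
```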
Hello -
we are running a Spark job, and getting the following error -
"LiveListenerBus: Dropping SparkListenerEvent because no remaining room in
event queue"
As per the recommendation in the Spark Docs -
I've increased the value of property
spark.scheduler.listenerbus.eventqueue.capacity to
Pls note - Spark version is 2.2.0
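A sketch of one way to set that property (the value is an assumption; the default capacity is 10000):
```
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.capacity=20000 \
  your_app.py
```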
On Wed, Oct 24, 2018 at 3:57 PM karan alang wrote:
> Hello -
> we are running a Spark job, and getting the following error -
>
> "LiveListenerBus: Dropping SparkListenerEvent because no remaining room in
> event queue"
>
> As per
Hello
- is there a "performance" difference when using Java or Scala for Apache
Spark ?
I understand there are other obvious differences (less code with Scala,
easier to focus on logic, etc),
but wrt performance - I think there would not be much of a difference, since
both of them are JVM-based.
Anybody used Vegas-viz for charting/graphics on Jupyter Spark/Scala
notebook ?
I'm looking for input on the following issue -
https://stackoverflow.com/questions/54473744/jupyter-notebook-scala-kernel-apache-toree-with-vegas-graph-not-showing-da
Pls. let me know.
thanks!
Hello - if anyone has any ideas on this PySpark/PyCharm error in SO, pls let
me know.
https://stackoverflow.com/questions/56028402/java-util-nosuchelementexception-key-not-found-pyspark-driver-callback-host
Hello Experts,
I'm trying to run spark-submit on my MacBook Pro (command line or using
PyCharm), and it seems to be giving the error ->
Exception: Java gateway process exited before sending its port number
I've tried setting values to variables in the program (based on the
recommendations by people on
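A sketch of the kind of in-program settings usually suggested for this error (the paths are assumptions and must point at a real JDK/Spark install on the machine):
```python
import os

# the gateway typically fails to start when JAVA_HOME / SPARK_HOME are wrong or unset
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home"
os.environ["SPARK_HOME"] = "/usr/local/spark"

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
```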
Hello - I've been using the Databricks notebook (for pyspark or scala/spark
development), and recently have had issues wherein cluster creation
takes a long time, often timing out.
Any ideas on how to resolve this ?
Any other alternatives to databricks notebook ?
Hi Holden,
when you mention - GS Access jar - which jar is this ?
Can you pls clarify ?
thanks,
Karan Alang
On Sat, Feb 12, 2022 at 11:10 AM Holden Karau wrote:
> You can also put the GS access jar with your Spark jars — that’s what the
> class not found exception is pointing you t
Thanks, Mich - will check this and update.
regds,
Karan Alang
On Sat, Feb 12, 2022 at 1:57 AM Mich Talebzadeh
wrote:
> BTW I also answered you in in stackoverflow :
>
>
> https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit
>
> HT
Hi Gaurav, All,
I'm doing a spark-submit from my local system to a GCP Dataproc cluster ..
This is more for dev/testing.
I can run a -- 'gcloud dataproc jobs submit' command as well, which is what
will be done in Production.
Hope that clarifies.
regds,
Karan Alang
On Sat, Feb 12, 2022 at 10:31
ently)
One idea is to store the maxOffset read from each batch read from
topic_reloadpred, and when the next batch is read - compare that with
'stored' maxOffset,
to determine if new records have been added to the topic.
What is the best way to fulfill this requirement ?
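A minimal sketch of that idea (names are assumptions) - the Kafka source exposes an 'offset' column on every record, so each batch's max offset can be compared against the last stored value:
```python
from pyspark.sql import functions as F

def has_new_records(batch_df, stored_max_offset):
    # max offset seen in this batch; None if the batch is empty
    cur_max = batch_df.agg(F.max("offset")).collect()[0][0]
    return cur_max is not None and cur_max > stored_max_offset
```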
regds,
Karan Alang
On
Hello All,
I have a Structured Streaming program, which reads data from a Kafka topic,
does some processing, and finally puts data into a target Kafka topic.
Note : the processing is done in function - convertToDictForEachBatch(),
which is called using - foreachBatch(convertToDictForEachBatch)
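A minimal foreachBatch sketch (broker/topic names and the body of the function are assumptions; df is the streaming DataFrame read from Kafka):
```python
def convertToDictForEachBatch(batch_df, batch_id):
    # batch_df is a plain (non-streaming) DataFrame holding one micro-batch
    out = batch_df.selectExpr("CAST(value AS STRING) AS value")
    out.write.format("kafka") \
       .option("kafka.bootstrap.servers", "broker:9092") \
       .option("topic", "target-topic") \
       .save()

query = df.writeStream.foreachBatch(convertToDictForEachBatch).start()
```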
Thanks, Gourav - will check out the book.
regds,
Karan Alang
On Thu, Feb 17, 2022 at 9:05 AM Gourav Sengupta
wrote:
> Hi,
>
> The following excellent documentation may help as well:
> https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#u
Hello All,
I'm running a Structured Streaming program on GCP Dataproc, which reads data
from Kafka, does some processing and puts the processed data back into Kafka.
The program was running fine until I killed it (to make minor changes) and
re-started it.
It is giving me the error -
Hi Mich,
thnx .. i'll check the thread you forwarded, and revert back.
regds,
Karan Alang
On Sat, Feb 26, 2022 at 2:44 AM Mich Talebzadeh
wrote:
> Check the thread I forwarded on how to gracefully shutdown spark
> structured streaming
>
> HTH
>
> On Fri, 25 Feb 2022 at
ry to use RocksDB which is now natively
integrated with SPARK :)
KA : Stateful - since i'm using windowing+watermark in the aggregation
queries.
Also, thnx - will check the links you provided.
regds,
Karan Alang
On Sat, Feb 26, 2022 at 3:31 AM Gourav Sengupta
wrote:
> Hi,
>
> Can you pl
Hello All,
I've a GCP Dataproc cluster, and I'm running a Spark Structured Streaming
job on it.
I'm trying to use KafkaProducer to push aggregated data into a Kafka
topic; however, when I import KafkaProducer
(from kafka import KafkaProducer),
it gives the error
```
Traceback (most recent call
```
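The usual cause is that the kafka-python package is not installed on the Dataproc nodes. One way to get it there is the pip initialization action at cluster-creation time (cluster name and region are assumptions):
```
gcloud dataproc clusters create my-cluster \
    --region us-central1 \
    --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/python/pip-install.sh \
    --metadata PIP_PACKAGES=kafka-python
```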
Hello All,
I've a pyspark dataframe which I need to write to a Kafka topic.
Structure of the DF is :
root
 |-- window: struct (nullable = true)
 |    |-- start: timestamp (nullable = false)
 |    |-- end: timestamp (nullable = false)
 |-- processedAlarmCnt: integer (nullable = false)
 |--
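A sketch of one way to write such a DF to Kafka - serialize each row to JSON as the Kafka value (broker/topic names are assumptions):
```python
from pyspark.sql import functions as F

(df.select(F.to_json(F.struct(*df.columns)).alias("value"))
   .write.format("kafka")
   .option("kafka.bootstrap.servers", "broker:9092")
   .option("topic", "alarm-counts")
   .save())
```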
is taken.
From what I understand, the watermark is done within the processing ..
since it is done per microbatch pulled with each trigger.
Pls let me know if you have comments/suggestions on this approach.
thanks,
Karan Alang
On Wed, Feb 16, 2022 at 12:52 AM Mich Talebzadeh
wrote:
> OK
Hello All,
I have a Structured Streaming pyspark program running on GCP Dataproc,
which reads data from Kafka and does some data massaging and aggregation.
I'm trying to use withWatermark(), and it is giving an error:
py4j.Py4JException: An exception was raised by the Python Proxy. Return
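For comparison, a minimal withWatermark sketch - the watermark must be declared on an event-time timestamp column of the streaming DF (column names and durations are assumptions):
```python
from pyspark.sql import functions as F

agg = (df.withWatermark("event_ts", "10 minutes")
         .groupBy(F.window(F.col("event_ts"), "10 minutes"))
         .count())
```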
Hello All,
I'm using Structured Streaming, and am trying to use a UDF to parse each row.
Here is the requirement:
- we can get alerts for a particular KPI with type 'major' OR 'critical'
- for a KPI, if we get alerts of type 'major' e.g. _major, and we have a
critical alert as well _critical,
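A sketch of the row-parsing UDF shape (the classification rule here is only an assumption based on the truncated description above):
```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(StringType())
def alert_severity(kpi_name):
    # hypothetical rule: the suffix decides the severity
    return "critical" if kpi_name.endswith("_critical") else "major"

parsed = df.withColumn("severity", alert_severity(F.col("kpi")))
```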
Hello All,
I'm trying to access GCP buckets while running spark-submit from local, and
running into issues.
I'm getting the error :
```
22/02/11 20:06:59 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
Exception in thread
```
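This error pattern usually means the GCS connector jar and credentials are missing from the local spark-submit classpath. A sketch (jar and key paths are assumptions):
```
spark-submit \
  --jars /path/to/gcs-connector-hadoop3-latest.jar \
  --conf spark.hadoop.google.cloud.auth.service.account.enable=true \
  --conf spark.hadoop.google.cloud.auth.service.account.json.keyfile=/path/to/key.json \
  your_app.py
```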
Hi Gabor,
i just responded to your comment on stackoverflow.
regds,
Karan Alang
On Sat, Feb 26, 2022 at 3:06 PM Gabor Somogyi
wrote:
> Hi Karan,
>
> Plz have a look at the stackoverflow comment I've had 2 days ago
>
> G
>
> On Fri, 25 Feb 2022, 23:31 karan alang,
Hello Spark Experts,
I've a simple Structured Streaming program, which reads data from Kafka,
and writes to the console. This works in batch mode (i.e. spark.read or
df.write), but not in streaming mode.
Details are in the stackoverflow
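For reference, the streaming-mode equivalent of a console write looks like this (a sketch; df is the streaming DataFrame):
```python
query = (df.writeStream
           .format("console")
           .outputMode("append")
           .start())
query.awaitTermination()
```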
uot;DataFrame s empty")
>
> HTH
Hello All,
I'm running a simple Structured Streaming on GCP, which reads data from
Kafka and prints onto console.
Command :
gcloud dataproc jobs submit pyspark
/Users/karanalang/Documents/Technology/gcp/DataProc/StructuredStreaming_Kafka_GCP-Batch-feb1.py
--cluster dataproc-ss-poc
Hello All,
I'm trying to run a Structured Streaming program on GCP Dataproc, which
accesses the data from Kafka and prints it.
Access to Kafka is using SSL, and the truststore and keystore files are
stored in buckets. I'm using Google Storage API to access the bucket, and
store the file in the
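A sketch of pulling the SSL stores out of a bucket with the Google Storage API (bucket and object paths are assumptions):
```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-ssl-bucket")
bucket.blob("kafka/truststore.jks").download_to_filename("/tmp/truststore.jks")
bucket.blob("kafka/keystore.jks").download_to_filename("/tmp/keystore.jks")
```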
> On Wed, 2 Feb 2022 at 06:51, karan alang wrote:
>
>> Hello All,
>>
>> I'm running a simple Structu
re-checking to see if there is any suggestion on this issue.
On Wed, Feb 2, 2022 at 3:36 PM karan alang wrote:
> Hello All,
>
> I'm trying to run a Structured Streaming program on GCP Dataproc, which
> accesses the data from Kafka and prints it.
>
> Access to
Hello All,
I'm using Structured Streaming to read data from Kafka, and need to do a
transformation on each individual row.
I'm trying to use 'foreach' (or foreachBatch), and running into issues.
Basic question - how is the row passed to the function when foreach is used ?
Also, when I use
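On the basic question - with foreach, the function is invoked once per Row. A minimal sketch (df is the streaming DataFrame):
```python
def process_row(row):
    # row is a pyspark.sql.Row; fields are accessed by name
    print(row["value"])

query = df.writeStream.foreach(process_row).start()
```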
..
>
> else:
>     print("DataFrame is empty")
>
> *HTH*
Thanks, Mich .. will check it out
regds,
Karan Alang
On Tue, Feb 8, 2022 at 3:06 PM Mich Talebzadeh
wrote:
> BTW you can check this Linkedin article of mine on Processing Change Data
> Capture with Spark Structured Streaming
> <https://www.linkedin.com/pulse/processing-change-
Hello - Anyone used Prophet + pyspark for forecasting ?
I'm trying to backfill forecasts, and running into issues (error -
"Dataframe has less than 2 non-NaN rows").
I'm removing all records with NaN values, yet still getting this error.
details are in stackoverflow link ->
Hello All,
I've a Structured Streaming job on GCP Dataproc, and I'm trying to pass
multiple packages (Kafka, MongoDB) to the dataproc submit command, and
that is not working.
Command that works (when I add a single dependency, e.g. Kafka) :
```
gcloud dataproc jobs submit pyspark main.py \
```
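A sketch of the multi-package form - spark.jars.packages takes a comma-separated list, and the leading ^#^ switches the gcloud properties delimiter so the commas survive (package coordinates are assumptions):
```
gcloud dataproc jobs submit pyspark main.py \
    --cluster dataproc-ss-poc \
    --properties ^#^spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.3,org.mongodb.spark:mongo-spark-connector_2.12:3.0.1
```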
Hello All,
I have data in a Kafka topic (data published every 10 mins) and I'm planning
to read this data using Apache Spark Structured Streaming (batch mode) and
push it into MongoDB.
Pls note : This will be scheduled using Composer/Airflow on GCP - which
will create a Dataproc cluster, run the spark
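A sketch of the batch-mode read and Mongo write (broker, topic and Mongo URI are assumptions; assumes the Mongo Spark connector 3.x is on the classpath):
```python
df = (spark.read.format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "my-topic")
      .option("startingOffsets", "earliest")
      .load())

(df.selectExpr("CAST(value AS STRING) AS value")
   .write.format("mongo")
   .option("uri", "mongodb://host:27017/mydb.mycoll")
   .mode("append")
   .save())
```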
Hello All,
I've a Structured Streaming program running on GCP Dataproc which reads
data from Kafka every 10 mins, and then does processing.
This is a multi-tenant system i.e. the program will read data from multiple
customers.
In my current code, i'm looping over the customers passing it to the
Hello All,
I've a long-running Apache Spark Structured Streaming job running in GCP
Dataproc, which reads data from Kafka every 10 mins, and does some
processing. The Kafka topic has 3 partitions, and a retention period of 3 days.
The issue I'm facing is that after a few hours, the program stops
Hello All,
I've a Spark Structured Streaming job which reads from Kafka, does
processing and puts data into Mongo/Kafka/GCP buckets (i.e. it is
processing-heavy).
I'm consistently seeing the following warnings:
```
22/09/06 16:55:03 INFO
```
I've created a stackoverflow ticket for this as well
On Mon, Sep 19, 2022 at 4:37 PM karan alang wrote:
> Hello All,
> I've a Spark Structured Streaming job on GCP Dataproc - which picks up
> data from Kafka, does processing and pushes data back into kafka topics.
>
> Couple of
Hello All,
I've a Spark Structured Streaming job on GCP Dataproc - which picks up data
from Kafka, does processing and pushes data back into kafka topics.
Couple of questions :
1. Does Spark put all the logs (incl. INFO, WARN etc) into stderr ?
What I notice is that stdout is empty, while all the
here is the stackoverflow link
https://stackoverflow.com/questions/73780259/spark-structured-streaming-stderr-getting-filled-up
On Mon, Sep 19, 2022 at 4:41 PM karan alang wrote:
> I've created a stackoverflow ticket for this as well
>
> On Mon, Sep 19, 2022 at 4:37 PM karan ala
Hello All -
I've a Structured Streaming job which has a trigger of 10 minutes, and I'm
using a watermark to account for late data coming in. However, the watermark
is not working - instead of a single record with the total aggregated
value, I see 2 records.
Here is the code :
```
1)
```
Fyi .. apache spark version is 3.1.3
On Wed, Mar 15, 2023 at 4:34 PM karan alang wrote:
> Hi Mich, this doesn't seem to be working for me .. the watermark seems to
> be getting ignored !
>
> Here is the data pu
> On Thu, 16 Mar 2023
ndow.start").alias("startOfWindowFrame") \
> , F.col("window.end").alias("endOfWindowFrame") \
> ,
> F.col("avg(temperature)").alias("AVGTemperature"))
>
> re
+1 .. I'm happy to be part of these discussions as well !
On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh
wrote:
> Hi,
>
> I guess I can schedule this work over a course of time. I for myself can
> contribute plus learn from others.
>
> So +1 for me.
>
> Let us see if anyone else is
> On Fri, 10 Mar 2023 at 07:16, karan alang wrote:
Hello All,
I'm trying to run a simple application on GKE (Kubernetes), and it is
failing.
Note : I have Spark (bitnami spark chart) installed on GKE using helm
install.
Here is what is done :
1. created a docker image using Dockerfile
Dockerfile :
```
FROM python:3.7-slim
RUN apt-get update &&
```
cases, you don’t need to set this configuration.
> Sent from my iPhone
>
> On Feb 14, 2023, at 5:43 AM, karan alang wrote:
>
>
> Hello All,
>
> I'm trying to run a simple application on GKE (Kubernetes), and it is
> failing:
> Note : I have spark(bitnami spark c
ref :
https://stackoverflow.com/questions/76436159/apache-spark-not-reading-utc-timestamp-from-mongodb-correctly
Hello All,
I've data stored in a MongoDB collection, and the timestamp column is not
being read correctly by Apache Spark. I'm running Apache Spark on GCP
Dataproc.
Here is sample data :
Hello All -
I'm using Apache Spark Structured Streaming to read data from Kafka topic,
and do some processing. I'm using watermark to account for late-coming
records and the code works fine.
Here is the working(sample) code:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions
```