I agree with what is stated. This is the gist of my understanding, having
tested it.
When working with Spark Structured Streaming, each streaming query runs in
its own separate Spark session to ensure isolation and avoid conflicts
between different queries.
So here I have:
def process_data(self, df
hm.
In your logic here
def process_micro_batch(micro_batch_df, batchId):
    micro_batch_df.createOrReplaceTempView("temp_view")
    df = spark.sql("select * from temp_view")
    return df
Is this function called, and if so, do you check whether micro_batch_df
contains rows -> if len(micro_batch_df
Hi everyone! I followed this guide
https://dev.to/mvillarrealb/creating-a-spark-standalone-cluster-with-docker-and-docker-compose-2021-update-6l4
to create a Spark cluster on an Ubuntu server with Docker. However, when I
try to submit my PySpark code to the master, the jobs are registered in the
S
Hi
I just wanted to check whether there is a way to create custom logs in Spark.
I want to write selective/custom log messages to S3, running spark-submit
on EMR.
I would not want all the Spark-generated logs ... I would just need the
log messages that are logged as part of the Spark application.
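One common way to get this kind of selective logging (a sketch, not from this thread: the logger name, bucket, and key below are all illustrative) is to attach a dedicated Python logger to an in-memory buffer, log only your own application messages to it, and upload the buffer to S3 at the end of the job. Spark's own log4j output never flows through this logger, so only your messages land in the file.

```python
import io
import logging

# Dedicated logger for application messages only ("my_app" is a made-up name).
buf = io.StringIO()
app_logger = logging.getLogger("my_app")
app_logger.setLevel(logging.INFO)
app_logger.propagate = False  # keep this output out of the root/Spark loggers

handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
app_logger.addHandler(handler)

# Selective messages logged as part of the Spark application:
app_logger.info("processed batch 42")
app_logger.warning("skipped 3 malformed rows")

# At the end of the job, upload just these messages to S3
# (boto3 is assumed to be available on the EMR cluster; bucket/key are examples):
# import boto3
# boto3.client("s3").put_object(
#     Bucket="my-bucket", Key="logs/app.log", Body=buf.getvalue()
# )
```

This keeps the selective log entirely under your control; Spark's executor and driver logs still go wherever EMR normally sends them.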
Hi,
The streaming query clones the Spark session - when you create a temp view
from a DataFrame, the temp view is created under the cloned session. You will
need to use micro_batch_df.sparkSession to access the cloned session.
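A minimal sketch of that fix, assuming a foreachBatch sink (the function and view names are just illustrative):

```python
# foreachBatch handler: Spark passes in the micro-batch DataFrame and batch id.
def process_micro_batch(micro_batch_df, batch_id):
    micro_batch_df.createOrReplaceTempView("temp_view")
    # Query through the cloned session the view was registered under,
    # not the original `spark` session.
    return micro_batch_df.sparkSession.sql("select * from temp_view")

# Wiring it up on a streaming DataFrame `df` (sketch):
# df.writeStream.foreachBatch(process_micro_batch).start()
```

Querying via `spark.sql(...)` in the handler fails (or sees nothing) because the temp view lives in the cloned session, not the one that started the query.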
Thanks,
Jungtaek Lim (HeartSaVioR)
On Wed, Jan 31, 2024 at 3:29 PM Karthick
Based on this blog post
https://sergei-ivanov.medium.com/why-you-should-not-use-randomsplit-in-pyspark-to-split-data-into-train-and-test-58576d539a36
, I noticed a recommendation against using randomSplit for data splitting,
due to data sorting. Is the information provided in the blog accurate? I