Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Mich Talebzadeh
I agree with what is stated. This is the gist of my understanding, having tested it: when working with Spark Structured Streaming, each streaming query runs in its own Spark session to ensure isolation and avoid conflicts between queries. So here I have: def process_data(self, df
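A minimal sketch of that per-query isolation (the rate source and the sink choices are illustrative assumptions, not from the thread): each start() below launches a separate query, and Spark clones the session for each one.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("isolation-demo").getOrCreate()

    # Illustrative streaming source; each writeStream.start() launches a
    # distinct query, and Spark runs each query in its own cloned session.
    stream_df = spark.readStream.format("rate").load()

    q1 = stream_df.writeStream.format("console").start()
    q2 = stream_df.writeStream.format("memory").queryName("q2").start()

    spark.streams.awaitAnyTermination()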

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Mich Talebzadeh
Hmm. In your logic here: def process_micro_batch(micro_batch_df, batchId): micro_batch_df.createOrReplaceTempView("temp_view") df = spark.sql(f"select * from temp_view") return df. Is this function called, and if so, do you check whether micro_batch_df contains rows -> if len(micro_batch_df
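On the row check Mich raises: len() is not defined for a DataFrame, so a sketch of the guard might use DataFrame.isEmpty() (available since Spark 3.3) or a count() comparison instead (the function and view names are the thread's own):

    def process_micro_batch(micro_batch_df, batchId):
        # A DataFrame has no len(); isEmpty() (Spark 3.3+) or a
        # count() > 0 comparison is the usual emptiness test.
        if micro_batch_df.isEmpty():
            return None
        micro_batch_df.createOrReplaceTempView("temp_view")
        # See Jungtaek's reply below: the view lands in the cloned
        # session, so query it via micro_batch_df.sparkSession.
        return micro_batch_df.sparkSession.sql("select * from temp_view")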

deploy spark as cluster

2024-01-31 Thread ali sharifi
Hi everyone! I followed this guide https://dev.to/mvillarrealb/creating-a-spark-standalone-cluster-with-docker-and-docker-compose-2021-update-6l4 to create a Spark cluster on an Ubuntu server with Docker. However, when I try to submit my PySpark code to the master, the jobs are registered in the S
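One frequent cause with that docker-compose setup (an assumption, since the message is truncated): the driver must be reachable from the workers, so submitting in client mode from outside the Docker network can leave jobs registered but never running. A minimal sketch, assuming the master is exposed as spark://spark-master:7077 on the compose network:

    from pyspark.sql import SparkSession

    # Hypothetical master URL; in the linked guide the master container
    # is typically addressable as spark-master on the compose network.
    spark = (SparkSession.builder
             .appName("cluster-smoke-test")
             .master("spark://spark-master:7077")
             .getOrCreate())

    # A trivial action to confirm the executors actually run work.
    spark.range(1_000_000).selectExpr("sum(id) as total").show()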

Create Custom Logs

2024-01-31 Thread PRASHANT L
Hi, I just wanted to check if there is a way to create custom logs in Spark. I want to write selective/custom log messages to S3, running spark-submit on EMR. I would not want all the Spark-generated logs ... I would just need the log messages that are logged as part of the Spark application
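One common approach (an assumption about intent, since the message is truncated): obtain a named Log4j logger on the driver through the Py4J gateway, then use Log4j appender configuration to route only that logger's output, e.g. to a location shipped to S3. Note that spark._jvm is an internal handle and the logger name here is hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("custom-logs").getOrCreate()

    # Driver-side Log4j logger via the Py4J gateway. Messages from this
    # named logger can be separated from Spark's own logs with a Log4j
    # configuration (e.g. a dedicated appender for "MyApplication").
    log4j = spark._jvm.org.apache.log4j
    logger = log4j.LogManager.getLogger("MyApplication")
    logger.info("selective application-level message")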

Re: Issue in Creating Temp_view in databricks and using spark.sql().

2024-01-31 Thread Jungtaek Lim
Hi, a streaming query clones the Spark session; when you create a temp view from a DataFrame, the temp view is created under the cloned session. You will need to use micro_batch_df.sparkSession to access the cloned session. Thanks, Jungtaek Lim (HeartSaVioR) On Wed, Jan 31, 2024 at 3:29 PM Karthick
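A sketch of that fix applied to the function quoted earlier in the thread (the names are the thread's own):

    def process_micro_batch(micro_batch_df, batchId):
        micro_batch_df.createOrReplaceTempView("temp_view")
        # The temp view is registered in the cloned session that runs
        # this micro-batch, so query through that session rather than
        # the globally created spark object.
        return micro_batch_df.sparkSession.sql("select * from temp_view")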

randomsplit has issue?

2024-01-31 Thread second_co...@yahoo.com.INVALID
Based on this blog post https://sergei-ivanov.medium.com/why-you-should-not-use-randomsplit-in-pyspark-to-split-data-into-train-and-test-58576d539a36 , I noticed a recommendation against using randomSplit for splitting data into train and test sets, due to data sorting. Is the information provided in the blog accurate? I
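The blog's core concern is that randomSplit evaluates the parent DataFrame once per split, so a non-deterministic upstream (e.g. unordered shuffle output) can yield overlapping or missing rows across splits. A common mitigation, sketched here under that assumption, is to pin the rows before splitting:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("split-demo").getOrCreate()
    df = spark.range(10_000)  # stand-in for the real dataset

    # cache() pins the input so the separate passes randomSplit makes
    # over the data see identical rows, keeping train and test disjoint.
    train, test = df.cache().randomSplit([0.8, 0.2], seed=42)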