Spark StructuredStreaming - watermark not working as expected

2023-03-09 Thread karan alang
Hello All - I have a Structured Streaming job with a 10-minute trigger, and I'm using a watermark to account for late-arriving data. However, the watermark is not working: instead of a single record with the total aggregated value, I see 2 records. Here is the code : ``` 1)

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread hueiyuan su
Dear Mich, Sure, that is a good idea. If we had a pause() function, we could temporarily stop streaming and adjust the configuration, maybe from an environment variable. Once these parameters are adjusted, we could restart the streaming to apply the newest parameters without stopping the Spark streaming application.

Re: How to share a dataset file across nodes

2023-03-09 Thread Mich Talebzadeh
Try something like below
1) Put your csv say cities.csv in HDFS as below
hdfs dfs -put cities.csv /data/stg/test
2) Read it into dataframe in PySpark as below
csv_file="hdfs://:PORT/data/stg/test/cities.csv"
# read it in spark
listing_df =
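The steps in this reply can be sketched roughly as follows. This is only an illustration, not the code from the original message: the helper names `hdfs_uri` and `read_cities`, the NameNode host `namenode.example.com`, the port `9000`, and the `header`/`inferSchema` options are all assumptions, and `spark` is taken to be an already-created SparkSession.

```python
def hdfs_uri(host: str, port: int, path: str) -> str:
    """Build an hdfs:// URI. Host and port stand in for the cluster's NameNode address."""
    return f"hdfs://{host}:{port}/{path.lstrip('/')}"


def read_cities(spark, uri: str):
    """Read the CSV from HDFS so the driver and every executor resolve the same path.

    `spark` is an existing SparkSession; header/inferSchema are assumed options.
    """
    return spark.read.csv(uri, header=True, inferSchema=True)


# Hypothetical NameNode address; substitute the cluster's own host and port.
uri = hdfs_uri("namenode.example.com", 9000, "/data/stg/test/cities.csv")
print(uri)
```

With a live session, `listing_df = read_cities(spark, uri)` would then load the file that `hdfs dfs -put` placed under /data/stg/test.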

Re: How to share a dataset file across nodes

2023-03-09 Thread Sean Owen
Put the file on HDFS, if you have a Hadoop cluster?

How to share a dataset file across nodes

2023-03-09 Thread sam smith
Hello, I use YARN client mode to submit my driver program to Hadoop. The dataset I load is from the local file system; when I invoke load("file://path"), Spark complains that the csv file is not found, which I totally understand, since the dataset is not in any of the workers or the

Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
Yeah, that's the right answer!

Re: read a binary file and save in another location

2023-03-09 Thread Mich Talebzadeh
Does this need any action in PySpark? How about importing using the shutil package? https://sparkbyexamples.com/python/how-to-copy-files-in-python/
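A minimal sketch of the shutil suggestion, assuming the source file and the destination directory are both on a filesystem the driver can reach directly (this does not cover HDFS or object stores). The function name `copy_binary` is hypothetical.

```python
import shutil
from pathlib import Path


def copy_binary(src: str, dst_dir: str) -> Path:
    """Copy one file byte-for-byte into dst_dir and return the path of the copy."""
    target_dir = Path(dst_dir)
    # Create the destination directory if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 copies the contents and tries to preserve file metadata.
    return Path(shutil.copy2(src, target_dir))
```

Calling `copy_binary("/path/to/input.bin", "/path/to/backup")` (both paths are placeholders) leaves the original in place and returns the location of the copy. No Spark job is involved, so this runs only where it is called.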

Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html This says "Binary file data source does not support writing a DataFrame back to the original files." which I take to mean this isn't possible... I haven't done this, but going from the docs, it would be:

Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread Mich Talebzadeh
Most probably we will require an additional method pause() (https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.html) to allow us to pause (as opposed to stop()) the streaming process and resume after changing the parameters. The state of streaming
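Since pause() is only a proposal in this thread, the pattern available today is to stop the query and start a new one against the same checkpoint location, which resumes from the recorded progress. The sketch below illustrates that, with assumptions: `restart_with_new_options` and `make_writer` are hypothetical names, `make_writer` is expected to return a fresh DataStreamWriter, and not every option can be changed safely between restarts. There is a gap between stop() and start(), so this is not zero downtime.

```python
def restart_with_new_options(query, make_writer, checkpoint_dir, options):
    """Stop a running StreamingQuery and start a replacement with new writer options.

    query          -- the running pyspark.sql.streaming.StreamingQuery
    make_writer    -- hypothetical callable returning a fresh DataStreamWriter
    checkpoint_dir -- the same checkpoint location the stopped query used
    options        -- dict of writer options to apply on restart
    """
    # Stop the current query before starting its replacement.
    query.stop()
    # Reusing the checkpoint lets the new query pick up where the old one left off.
    writer = make_writer().option("checkpointLocation", checkpoint_dir)
    for key, value in options.items():
        writer = writer.option(key, value)
    return writer.start()
```

The new option values could come from environment variables, as suggested elsewhere in the thread; the caller keeps the returned query handle for the next restart.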

Re: Online classes for spark topics

2023-03-09 Thread neeraj bhadani
I am happy to be a part of this discussion as well. Regards, Neeraj On Wed, 8 Mar 2023 at 22:41, Winston Lai wrote: > +1, any webinar on Spark related topic is appreciated > > Thank You & Best Regards > Winston Lai

eqNullSafe breaks Sorted Merge Bucket Join?

2023-03-09 Thread Thomas Wang
Hi, I have two tables t1 and t2. Both are bucketed and sorted on user_id into 32 buckets. When I use a regular equal join, Spark triggers the expected Sorted Merge Bucket Join. Please see my code and the physical plan below. from pyspark.sql import SparkSession def

Re: [EXTERNAL] Spark Thrift Server - Autoscaling on K8

2023-03-09 Thread Saurabh Gulati
Hey Jayabindu, We use thriftserver for on K8S. May I ask why you are not going for Trino instead? I know it didn't support autoscaling when we tested it in the past but not sure if it does now. Autoscaling also means that users might have to wait for the cluster to autoscale but that usually

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread asma zgolli
Hello spark community, Adding a new topic.
- Spark UI
- Dynamic allocation
- Tuning of jobs
- Collecting spark metrics for monitoring and alerting
- For those who prefer to use Pandas API on Spark since the release of Spark 3.2, What are some important notes for those users?

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread Winston Lai
Hi everyone, I would like to add one topic to Saurabh's list as well.
* Spark UI
* Dynamic allocation
* Tuning of jobs
* Collecting spark metrics for monitoring and alerting
* For those who prefer to use Pandas API on Spark since the release of Spark 3.2, What are some

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread Saurabh Gulati
Hey guys, It's a nice idea and I appreciate the effort you guys are taking. I can add to the list of topics which might be of interest:
* Spark UI
* Dynamic allocation
* Tuning of jobs
* Collecting spark metrics for monitoring and alerting
HTH

read a binary file and save in another location

2023-03-09 Thread second_co...@yahoo.com.INVALID
Any example on how to read a binary file using PySpark and save it in another location (a copy feature)? Thank you, Teoh

Re: Online classes for spark topics

2023-03-09 Thread Mich Talebzadeh
Hi Deepak, The priority list of topics is a very good point. The thread owner mentioned Spark on k8s, Data Science and Spark Structured Streaming. What other topics need to be included, I guess, depends on demand. I suggest we wait a couple of days to see the demand. We just need to create a