Spark StructuredStreaming - watermark not working as expected

2023-03-09 Thread karan alang
Hello All -

I've a structured Streaming job which has a trigger of 10 minutes, and I'm
using watermark to account for late data coming in. However, the watermark
is not working - and instead of a single record with total aggregated
value, I see 2 records.

Here is the code :

```

1) StructuredStreaming - Reading from Kafka every 10 mins


df_stream = self.spark.readStream.format('kafka') \
.option("kafka.security.protocol", "SSL") \
.option("kafka.ssl.truststore.location",
self.ssl_truststore_location) \
.option("kafka.ssl.truststore.password",
self.ssl_truststore_password) \
.option("kafka.ssl.keystore.location",
self.ssl_keystore_location_bandwidth_intermediate) \
.option("kafka.ssl.keystore.password",
self.ssl_keystore_password_bandwidth_intermediate) \
.option("kafka.bootstrap.servers", self.kafkaBrokers) \
.option("subscribe", topic) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss", "false") \
.option("kafka.metadata.max.age.ms", "1000") \
.option("kafka.ssl.keystore.type", "PKCS12") \
.option("kafka.ssl.truststore.type", "PKCS12") \
.load()

2. calling foreachBatch(self.process)
# note - outputMode is set to "update" (tried setting
outputMode = append as well)

# 03/09 ::: outputMode - update instead of append
query = df_stream.selectExpr("CAST(value AS STRING)",
"timestamp", "topic").writeStream \
.outputMode("update") \
.trigger(processingTime='10 minutes') \
.option("truncate", "false") \
.option("checkpointLocation", self.checkpoint) \
.foreachBatch(self.process) \
.start()


self.process - where i do the bulk of the  processing, which calls the
function  'aggIntfLogs'

In function aggIntfLogs - i'm using watermark of 15 mins, and doing
groupBy to calculate the sum of sentOctets & recvdOctets


def aggIntfLogs(ldf):
if ldf and ldf.count() > 0:

ldf = ldf.select('applianceName', 'timeslot',
'sentOctets', 'recvdOctets','ts', 'customer') \
.withColumn('sentOctets',
ldf["sentOctets"].cast(LongType())) \
.withColumn('recvdOctets',
ldf["recvdOctets"].cast(LongType())) \
.withWatermark("ts", "15 minutes")

ldf = ldf.groupBy("applianceName", "timeslot", "customer",

window(col("ts"), "15 minutes")) \
.agg({'sentOctets':"sum", 'recvdOctets':"sum"}) \
.withColumnRenamed('sum(sentOctets)', 'sentOctets') \
.withColumnRenamed('sum(recvdOctets)', 'recvdOctets') \
.fillna(0)
return ldf
return ldf


Dataframe 'ldf' returned from the function aggIntfLogs - is
written to Kafka topic

```

I was expecting that using the watermark will account for late coming data
.. i.e. the sentOctets & recvdOctets are calculated for the consolidated
data
(including late-coming data, since the late coming data comes within 15
mins), however, I'm seeing 2 records for some of the data (i.e. key -
applianceName/timeslot/customer) i.e. the aggregated data is calculated
individually for the records and I see 2 records instead of single record
accounting for late coming data within watermark.

What needs to be done to fix this & make this work as desired?

tia!


Here is the Stackoverflow link as well -

https://stackoverflow.com/questions/75693171/spark-structuredstreaming-watermark-not-working-as-expected


Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread hueiyuan su
Dear Mich,

Sure, that is a good idea. If we have a pause() function, we can
temporarily stop streaming and adjust configuration, maybe from environment
variable.
Once these parameters are adjust, we can restart the streaming to apply the
newest parameter without stop spark streaming application.

Mich Talebzadeh  於 2023年3月10日 週五 上午12:26寫道:

> most probably we will require an  additional method pause()
>
>
> https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.html
>
> to allow us to pause (as opposed to stop()) the streaming process and
> resume after changing the parameters. The state of streaming needs to be
> preserved.
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 7 Mar 2023 at 17:25, Mich Talebzadeh 
> wrote:
>
>> hm interesting proposition. I guess you mean altering one of following
>> parameters in flight
>>
>>
>>   streamingDataFrame = self.spark \
>> .readStream \
>> .format("kafka") \
>> .option("kafka.bootstrap.servers",
>> config['MDVariables']['bootstrapServers'],) \
>> .option("schema.registry.url",
>> config['MDVariables']['schemaRegistryURL']) \
>> .option("group.id", config['common']['appName']) \
>> .option("zookeeper.connection.timeout.ms",
>> config['MDVariables']['zookeeperConnectionTimeoutMs']) \
>> .option("rebalance.backoff.ms",
>> config['MDVariables']['rebalanceBackoffMS']) \
>> .option("zookeeper.session.timeout.ms",
>> config['MDVariables']['zookeeperSessionTimeOutMs']) \
>> .option("auto.commit.interval.ms",
>> config['MDVariables']['autoCommitIntervalMS']) \
>> .option("subscribe", config['MDVariables']['topic']) \
>> .option("failOnDataLoss", "false") \
>> .option("includeHeaders", "true") \
>> .option("startingOffsets", "latest") \
>> .load() \
>> .select(from_json(col("value").cast("string"),
>> schema).alias("parsed_value"))
>>
>> Ok, one secure way of doing it though shutting down the streaming process
>> gracefully without loss of data that impacts consumers. The other method
>> implies inflight changes as suggested by the topic with zeio interruptions.
>> Interestingly one of our clients requested a similar solution. As solutions
>> architect /engineering manager I should come back with few options. I am on
>> the case so to speak. There is a considerable interest in Spark Structured
>> Streaming across the board, especially in trading systems.
>>
>> HTH
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 16 Feb 2023 at 04:12, hueiyuan su  wrote:
>>
>>> *Component*: Spark Structured Streaming
>>> *Level*: Advanced
>>> *Scenario*: How-to
>>>
>>> -
>>> *Problems Description*
>>> I would like to confirm could we directly apply new options of
>>> readStream/writeStream without stopping current running spark structured
>>> streaming applications? For example, if we just want to adjust throughput
>>> properties of readStream with kafka. Do we have method can just adjust it
>>> without stopping application? If you have any ideas, please let me know. I
>>> will be appreciate it and your answer.
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Mars Su
>>> *Phone*: 0988-661-013
>>> *Email*: hueiyua...@gmail.com
>>>
>>

-- 
Best Regards,

Mars Su
*Phone*: 0988-661-013
*Email*: hueiyua...@gmail.com


Re: How to share a dataset file across nodes

2023-03-09 Thread Mich Talebzadeh
Try something like below

1) Put your csv say cities.csv in HDFS as below
hdfs dfs -put cities.csv /data/stg/test
2) Read it into dataframe in PySpark as below
csv_file="hdfs://:PORT/data/stg/test/cities.csv"
# read it in spark
listing_df =
spark.read.format("com.databricks.spark.csv").option("inferSchema",
"true").option("header", "true").load(csv_file)
 listing_df.printSchema()


HTH


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 9 Mar 2023 at 21:07, Sean Owen  wrote:

> Put the file on HDFS, if you have a Hadoop cluster?
>
> On Thu, Mar 9, 2023 at 3:02 PM sam smith 
> wrote:
>
>> Hello,
>>
>> I use Yarn client mode to submit my driver program to Hadoop, the dataset
>> I load is from the local file system, when i invoke load("file://path")
>> Spark complains about the csv file being not found, which i totally
>> understand, since the dataset is not in any of the workers or the
>> applicationMaster but only where the driver program resides.
>> I tried to share the file using the configurations:
>>
>>> *spark.yarn.dist.files* OR *spark.files *
>>
>> but both ain't working.
>> My question is how to share the csv dataset across the nodes at the
>> specified path?
>>
>> Thanks.
>>
>


Re: How to share a dataset file across nodes

2023-03-09 Thread Sean Owen
Put the file on HDFS, if you have a Hadoop cluster?

On Thu, Mar 9, 2023 at 3:02 PM sam smith  wrote:

> Hello,
>
> I use Yarn client mode to submit my driver program to Hadoop, the dataset
> I load is from the local file system, when i invoke load("file://path")
> Spark complains about the csv file being not found, which i totally
> understand, since the dataset is not in any of the workers or the
> applicationMaster but only where the driver program resides.
> I tried to share the file using the configurations:
>
>> *spark.yarn.dist.files* OR *spark.files *
>
> but both ain't working.
> My question is how to share the csv dataset across the nodes at the
> specified path?
>
> Thanks.
>


How to share a dataset file across nodes

2023-03-09 Thread sam smith
Hello,

I use Yarn client mode to submit my driver program to Hadoop, the dataset I
load is from the local file system, when i invoke load("file://path") Spark
complains about the csv file being not found, which i totally understand,
since the dataset is not in any of the workers or the applicationMaster but
only where the driver program resides.
I tried to share the file using the configurations:

> *spark.yarn.dist.files* OR *spark.files *

but both ain't working.
My question is how to share the csv dataset across the nodes at the
specified path?

Thanks.


Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
Yeah, that's the right answer!

Thanks,
Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com Book a time on Calendly



On Thu, Mar 9, 2023 at 10:14 AM Mich Talebzadeh 
wrote:

> Does this need any action in PySpark?
>
>
> How about importing using the shutil package?
>
>
> https://sparkbyexamples.com/python/how-to-copy-files-in-python/
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 9 Mar 2023 at 17:46, Russell Jurney 
> wrote:
>
>> https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html
>>
>> This says "Binary file data source does not support writing a DataFrame
>> back to the original files." which I take to mean this isn't possible...
>>
>> I haven't done this, but going from the docs, it would be:
>>
>> spark.read.format("binaryFile").option("pathGlobFilter", 
>> "*.png").load("/path/to/data").write.format("binaryFile").save("/new/path/to/data")
>>
>> Looking at the DataFrameWriter code on master branch
>> 
>> for DataFrameWriter, let's see if there is a binaryFile format option...
>>
>> At this point I get lost. I can't figure out how this works either, but
>> hopefully I have helped define the problem. The format() method of
>> DataFrameWriter isn't documented
>> 
>> .
>>
>> Russell Jurney @rjurney 
>> russell.jur...@gmail.com LI  FB
>>  datasyndrome.com Book a time on Calendly
>> 
>>
>>
>> On Thu, Mar 9, 2023 at 12:52 AM second_co...@yahoo.com.INVALID
>>  wrote:
>>
>>> any example on how to read a binary file using pySpark and save it in
>>> another location . copy feature
>>>
>>>
>>> Thank you,
>>> Teoh
>>>
>>


Re: read a binary file and save in another location

2023-03-09 Thread Mich Talebzadeh
Does this need any action in PySpark?


How about importing using the shutil package?


https://sparkbyexamples.com/python/how-to-copy-files-in-python/



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 9 Mar 2023 at 17:46, Russell Jurney 
wrote:

> https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html
>
> This says "Binary file data source does not support writing a DataFrame
> back to the original files." which I take to mean this isn't possible...
>
> I haven't done this, but going from the docs, it would be:
>
> spark.read.format("binaryFile").option("pathGlobFilter", 
> "*.png").load("/path/to/data").write.format("binaryFile").save("/new/path/to/data")
>
> Looking at the DataFrameWriter code on master branch
> 
> for DataFrameWriter, let's see if there is a binaryFile format option...
>
> At this point I get lost. I can't figure out how this works either, but
> hopefully I have helped define the problem. The format() method of
> DataFrameWriter isn't documented
> 
> .
>
> Russell Jurney @rjurney 
> russell.jur...@gmail.com LI  FB
>  datasyndrome.com Book a time on Calendly
> 
>
>
> On Thu, Mar 9, 2023 at 12:52 AM second_co...@yahoo.com.INVALID
>  wrote:
>
>> any example on how to read a binary file using pySpark and save it in
>> another location . copy feature
>>
>>
>> Thank you,
>> Teoh
>>
>


Re: read a binary file and save in another location

2023-03-09 Thread Russell Jurney
https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html

This says "Binary file data source does not support writing a DataFrame
back to the original files." which I take to mean this isn't possible...

I haven't done this, but going from the docs, it would be:

spark.read.format("binaryFile").option("pathGlobFilter",
"*.png").load("/path/to/data").write.format("binaryFile").save("/new/path/to/data")

Looking at the DataFrameWriter code on master branch

for DataFrameWriter, let's see if there is a binaryFile format option...

At this point I get lost. I can't figure out how this works either, but
hopefully I have helped define the problem. The format() method of
DataFrameWriter isn't documented

.

Russell Jurney @rjurney 
russell.jur...@gmail.com LI  FB
 datasyndrome.com Book a time on Calendly



On Thu, Mar 9, 2023 at 12:52 AM second_co...@yahoo.com.INVALID
 wrote:

> any example on how to read a binary file using pySpark and save it in
> another location . copy feature
>
>
> Thank you,
> Teoh
>


Re: [Spark Structured Streaming] Could we apply new options of readStream/writeStream without stopping spark application (zero downtime)?

2023-03-09 Thread Mich Talebzadeh
most probably we will require an  additional method pause()

https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.streaming.StreamingQuery.html

to allow us to pause (as opposed to stop()) the streaming process and
resume after changing the parameters. The state of streaming needs to be
preserved.

HTH



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 7 Mar 2023 at 17:25, Mich Talebzadeh 
wrote:

> hm interesting proposition. I guess you mean altering one of following
> parameters in flight
>
>
>   streamingDataFrame = self.spark \
> .readStream \
> .format("kafka") \
> .option("kafka.bootstrap.servers",
> config['MDVariables']['bootstrapServers'],) \
> .option("schema.registry.url",
> config['MDVariables']['schemaRegistryURL']) \
> .option("group.id", config['common']['appName']) \
> .option("zookeeper.connection.timeout.ms",
> config['MDVariables']['zookeeperConnectionTimeoutMs']) \
> .option("rebalance.backoff.ms",
> config['MDVariables']['rebalanceBackoffMS']) \
> .option("zookeeper.session.timeout.ms",
> config['MDVariables']['zookeeperSessionTimeOutMs']) \
> .option("auto.commit.interval.ms",
> config['MDVariables']['autoCommitIntervalMS']) \
> .option("subscribe", config['MDVariables']['topic']) \
> .option("failOnDataLoss", "false") \
> .option("includeHeaders", "true") \
> .option("startingOffsets", "latest") \
> .load() \
> .select(from_json(col("value").cast("string"),
> schema).alias("parsed_value"))
>
> Ok, one secure way of doing it though shutting down the streaming process
> gracefully without loss of data that impacts consumers. The other method
> implies inflight changes as suggested by the topic with zeio interruptions.
> Interestingly one of our clients requested a similar solution. As solutions
> architect /engineering manager I should come back with few options. I am on
> the case so to speak. There is a considerable interest in Spark Structured
> Streaming across the board, especially in trading systems.
>
> HTH
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 16 Feb 2023 at 04:12, hueiyuan su  wrote:
>
>> *Component*: Spark Structured Streaming
>> *Level*: Advanced
>> *Scenario*: How-to
>>
>> -
>> *Problems Description*
>> I would like to confirm could we directly apply new options of
>> readStream/writeStream without stopping current running spark structured
>> streaming applications? For example, if we just want to adjust throughput
>> properties of readStream with kafka. Do we have method can just adjust it
>> without stopping application? If you have any ideas, please let me know. I
>> will be appreciate it and your answer.
>>
>>
>> --
>> Best Regards,
>>
>> Mars Su
>> *Phone*: 0988-661-013
>> *Email*: hueiyua...@gmail.com
>>
>


Re: Online classes for spark topics

2023-03-09 Thread neeraj bhadani
I am happy to be a part of this discussion as well.

Regards,
Neeraj

On Wed, 8 Mar 2023 at 22:41, Winston Lai  wrote:

> +1, any webinar on Spark related topic is appreciated 
>
> Thank You & Best Regards
> Winston Lai
> --
> *From:* asma zgolli 
> *Sent:* Thursday, March 9, 2023 5:43:06 AM
> *To:* karan alang 
> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com <
> ashok34...@yahoo.com>; User 
> *Subject:* Re: Online classes for spark topics
>
> +1
>
> Le mer. 8 mars 2023 à 21:32, karan alang  a écrit :
>
> +1 .. I'm happy to be part of these discussions as well !
>
>
>
>
> On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh 
> wrote:
>
> Hi,
>
> I guess I can schedule this work over a course of time. I for myself can
> contribute plus learn from others.
>
> So +1 for me.
>
> Let us see if anyone else is interested.
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
> wrote:
>
>
> Hello Mich.
>
> Greetings. Would you be able to arrange for Spark Structured Streaming
> learning webinar.?
>
> This is something I haven been struggling with recently. it will be very
> helpful.
>
> Thanks and Regard
>
> AK
> On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Hi,
>
> This might  be a worthwhile exercise on the assumption that the
> contributors will find the time and bandwidth to chip in so to speak.
>
> I am sure there are many but on top of my head I can think of Holden Karau
> for k8s, and Sean Owen for data science stuff. They are both very
> experienced.
>
> Anyone else 樂
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
>  wrote:
>
> Hello gurus,
>
> Does Spark arranges online webinars for special topics like Spark on K8s,
> data science and Spark Structured Streaming?
>
> I would be most grateful if experts can share their experience with
> learners with intermediate knowledge like myself. Hopefully we will find
> the practical experiences told valuable.
>
> Respectively,
>
> AK
>
>
>
>


eqNullSafe breaks Sorted Merge Bucket Join?

2023-03-09 Thread Thomas Wang
Hi,

I have two tables t1 and t2. Both are bucketed and sorted on user_id into
32 buckets.

When I use a regular equal join, Spark triggers the expected Sorted Merge
Bucket Join. Please see my code and the physical plan below.

from pyspark.sql import SparkSession


def _gen_spark_session(job_name: str) -> SparkSession:
return (
SparkSession
.builder
.enableHiveSupport()
.appName(f'Job: {job_name}')
.config(
key='spark.sql.sources.bucketing.enabled',
value='true',
).config(
key='spark.sql.legacy.bucketedTableScan.outputOrdering',
value='true',
).config(
key='spark.hadoop.mapreduce.fileoutputcommitter'
'.algorithm.version',
value='2',
).config(
key='spark.speculation',
value='false',
).getOrCreate()
)


def run() -> None:
spark = _gen_spark_session(job_name='TEST')
joined_df = spark.sql(f'''
SELECT
COALESCE(t1.user_id, t2.user_id) AS user_id
FROM
t1 FULL OUTER JOIN t2
ON t1.user_id = t2.user_id
''')
joined_df.explain(True)


if __name__ == '__main__':
run()


Physical Plan:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [coalesce(user_id#0L, user_id#6L) AS user_id#12L]
   +- SortMergeJoin [user_id#0L], [user_id#6L], FullOuter
  :- FileScan parquet Batched: true, DataFilters: [], Format:
Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [],
PushedFilters: [], ReadSchema: struct,
SelectedBucketsCount: 32 out of 32
  +- FileScan parquet Batched: true, DataFilters: [], Format:
Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [],
PushedFilters: [], ReadSchema: struct,
SelectedBucketsCount: 32 out of 32

As you can see, there is no exchange and sort before the SorteMergeJoin
step.

However, if I switch to using eqNullSafe as the join condition, Spark
doesn't trigger the Sorted Merge Bucket Join any more.
def run() -> None:
spark = _gen_spark_session(job_name='TEST')
joined_df = spark.sql(f'''
SELECT
COALESCE(t1.user_id, t2.user_id) AS user_id
FROM
t1 FULL OUTER JOIN t2
ON t1.user_id <=> t2.user_id
''')
joined_df.explain(True)


The equal sign is the only thing I changed between the two runs.

Physical Plan:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [coalesce(user_id#0L, user_id#6L) AS user_id#12L]
   +- SortMergeJoin [coalesce(user_id#0L, 0), isnull(user_id#0L)],
[coalesce(user_id#6L, 0), isnull(user_id#6L)], FullOuter
  :- Sort [coalesce(user_id#0L, 0) ASC NULLS FIRST,
isnull(user_id#0L) ASC NULLS FIRST], false, 0
  :  +- Exchange hashpartitioning(coalesce(user_id#0L, 0),
isnull(user_id#0L), 1000), ENSURE_REQUIREMENTS, [id=#23]
  : +- FileScan parquet [user_id#0L] Batched: true,
DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[...,
PartitionFilters: [], PushedFilters: [], ReadSchema:
struct
  +- Sort [coalesce(user_id#6L, 0) ASC NULLS FIRST,
isnull(user_id#6L) ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(coalesce(user_id#6L, 0),
isnull(user_id#6L), 1000), ENSURE_REQUIREMENTS, [id=#26]
+- FileScan parquet [user_id#6L] Batched: true,
DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[...,
PartitionFilters: [], PushedFilters: [], ReadSchema:
struct


If I read this correctly, the eqNullSafe is just a syntactic sugar that
automatically applies a COALESCE to 0? Does Spark consider potential key
collisions in this case (e.g. I have a user_id = 0 in my original dataset)?

I know if we apply a UDF on the join condition, it would break the
bucketing, thus the rebucketing and resorting. However, I'm wondering in
this special case, can we make it work as well? Thanks.

Thomas


Re: [EXTERNAL] Spark Thrift Server - Autoscaling on K8

2023-03-09 Thread Saurabh Gulati
Hey Jayabindu,
We use thriftserver for on K8S. May I ask why you are not going for Trino 
instead? I know it didn't support autoscaling when we tested it in the past but 
not sure if it does now.
Autoscaling also means that users might have to wait for the cluster to 
autoscale but that usually happens not so slow and once its done then other 
queries have the new nodes available.
Also the workload on our thriftserver is not so large so it solves the purpose 
for now.
You can also take a look at Apache Kyuubi.

I can put in some details below and attach the config we use for spark 
thriftserver, you can pick whatever is relevant for you:

  *   We run thriftserver on default(stable) nodes and its executors on 
preemptible(spot) nodes
  *   We use driver and executor templates to make above possible by using node 
selectors
  *   We use fair scheduling to manage workload


Mvg/Regards
Saurabh

From: Jayabindu Singh 
Sent: 09 March 2023 06:31
To: u...@spark.incubator.apache.org 
Subject: [EXTERNAL] Spark Thrift Server - Autoscaling on K8

Caution! This email originated outside of FedEx. Please do not open attachments 
or click links from an unknown or suspicious origin.

Hi All,

We are in the process of moving our workloads to K8 and looking for some 
guidance to run Spark Thrift Server on K8.
We need the executor pods to autoscale based on the workload vs running it with 
a static number of executors.

If any one has done it and can share the details, it will be really appreciated.

Regards
Jayabindu Singh




spark-defaults.conf
Description: spark-defaults.conf

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread asma zgolli
Hello spark community,


Adding a new topic.

   - Spark UI
   - Dynamic allocation
   - Tuning of jobs
   - Collecting spark metrics for monitoring and alerting
   - For those who prefer to use Pandas API on Spark since the release of
   Spark 3.2, What are some important notes for those users? For example, what
   are the additional factors affecting the Spark performance using Pandas API
   on Spark? How to tune them in addition to the conventional Spark tuning
   methods applied to Spark SQL users.
   - Spark internals and/or comparing spark 3 and 2 (I can take care of
   this if the community finds the topic interesting)


Le jeu. 9 mars 2023 à 10:14, Winston Lai  a écrit :

> Hi everyone,
>
> I would like to add one topic to Saurabh's list as well.
>
>- Spark UI
>- Dynamic allocation
>- Tuning of jobs
>- Collecting spark metrics for monitoring and alerting
>- For those who prefer to use Pandas API on Spark since the release of
>Spark 3.2, What are some important notes for those users? For example, what
>are the additional factors affecting the Spark performance using Pandas API
>on Spark? How to tune them in addition to the conventional Spark tuning
>methods applied to Spark SQL users.
>
>
> Thank You & Best Regards
> Winston Lai
> --
> *From:* Saurabh Gulati 
> *Sent:* Thursday, March 9, 2023 5:04:35 PM
> *To:* Mich Talebzadeh ; Deepak Sharma <
> deepakmc...@gmail.com>
> *Cc:* Denny Lee ; Sofia’s World <
> mmistr...@gmail.com>; User ; Winston Lai <
> weiruanl...@gmail.com>; ashok34...@yahoo.com ; asma
> zgolli ; karan alang 
> *Subject:* Re: [EXTERNAL] Re: Online classes for spark topics
>
> Hey guys,
> Its a nice idea and appreciate the effort you guys are taking.
> I can add to the list of topics which might be of interest:
>
>- Spark UI
>- Dynamic allocation
>- Tuning of jobs
>- Collecting spark metrics for monitoring and alerting
>
>
> HTH
> --
> *From:* Mich Talebzadeh 
> *Sent:* 09 March 2023 09:00
> *To:* Deepak Sharma 
> *Cc:* Denny Lee ; Sofia’s World <
> mmistr...@gmail.com>; User ; Winston Lai <
> weiruanl...@gmail.com>; ashok34...@yahoo.com ; asma
> zgolli ; karan alang 
> *Subject:* [EXTERNAL] Re: Online classes for spark topics
>
> *Caution! This email originated outside of FedEx. Please do not open
> attachments or click links from an unknown or suspicious origin*.
> Hi Deepak,
>
> The priority list of topics is a very good point. The theard owner
> mentioned Spark on k8s, Data Science and Spark Structured Streaming. What
> other topics need to be included I guess it depends on demand.. I suggest
> we wait a couple of days to see the demand .
>
> We just need to create a draft list of topics of interest and share them
> in the forum to get the priority order.
>
> Well that is my thoughts.
>
> Cheers
>
>
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 9 Mar 2023 at 06:13, Deepak Sharma  wrote:
>
> I can prepare some topics and present as well , if we have a prioritised
> list of topics already .
>
> On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee  wrote:
>
> We used to run Spark webinars on the Apache Spark LinkedIn group
> 
>  but
> honestly the turnout was pretty low.  We had dove into various features.
> If there are particular topics that. you would like to discuss during a
> live session, please let me know and we can try to restart them.  HTH!
>
> On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World  wrote:
>
> +1
>
> On Wed, Mar 8, 2023 at 10:40 PM Winston Lai  wrote:
>
> +1, any webinar on Spark related topic is appreciated 
>
> Thank You & Best Regards
> Winston Lai
> --
> *From:* asma zgolli 
> *Sent:* Thursday, March 9, 2023 5:43:06 AM
> *To:* karan alang 
> *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com <
> ashok34...@yahoo.com>; User 
> *Subject:* Re: Online classes for spark topics
>
> +1
>
> Le mer. 8 mars 2023 à 21:32, karan alang  a écrit :
>
> +1 .. I'm happy to be part 

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread Winston Lai
Hi everyone,

I would like to add one topic to Saurabh's list as well.

  *   Spark UI
  *   Dynamic allocation
  *   Tuning of jobs
  *   Collecting spark metrics for monitoring and alerting
  *   For those who prefer to use Pandas API on Spark since the release of 
Spark 3.2, What are some important notes for those users? For example, what are 
the additional factors affecting the Spark performance using Pandas API on 
Spark? How to tune them in addition to the conventional Spark tuning methods 
applied to Spark SQL users.

Thank You & Best Regards
Winston Lai

From: Saurabh Gulati 
Sent: Thursday, March 9, 2023 5:04:35 PM
To: Mich Talebzadeh ; Deepak Sharma 

Cc: Denny Lee ; Sofia’s World ; 
User ; Winston Lai ; 
ashok34...@yahoo.com ; asma zgolli 
; karan alang 
Subject: Re: [EXTERNAL] Re: Online classes for spark topics

Hey guys,
Its a nice idea and appreciate the effort you guys are taking.
I can add to the list of topics which might be of interest:

  *   Spark UI
  *   Dynamic allocation
  *   Tuning of jobs
  *   Collecting spark metrics for monitoring and alerting

HTH

From: Mich Talebzadeh 
Sent: 09 March 2023 09:00
To: Deepak Sharma 
Cc: Denny Lee ; Sofia’s World ; 
User ; Winston Lai ; 
ashok34...@yahoo.com ; asma zgolli 
; karan alang 
Subject: [EXTERNAL] Re: Online classes for spark topics

Caution! This email originated outside of FedEx. Please do not open attachments 
or click links from an unknown or suspicious origin.

Hi Deepak,

The priority list of topics is a very good point. The theard owner mentioned 
Spark on k8s, Data Science and Spark Structured Streaming. What other topics 
need to be included I guess it depends on demand.. I suggest we wait a couple 
of days to see the demand .

We just need to create a draft list of topics of interest and share them in the 
forum to get the priority order.

Well that is my thoughts.

Cheers






 
[https://ci3.googleusercontent.com/mail-sig/AIorK4zholKucR2Q9yMrKbHNn-o1TuS4mYXyi2KO6Xmx6ikHPySa9MLaLZ8t2hrA6AUcxSxDgHIwmKE]
   view my Linkedin 
profile


 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 9 Mar 2023 at 06:13, Deepak Sharma 
mailto:deepakmc...@gmail.com>> wrote:
I can prepare some topics and present as well , if we have a prioritised list 
of topics already .

On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee 
mailto:denny.g@gmail.com>> wrote:
We used to run Spark webinars on the Apache Spark LinkedIn 
group
 but honestly the turnout was pretty low.  We had dove into various features.  
If there are particular topics that. you would like to discuss during a live 
session, please let me know and we can try to restart them.  HTH!

On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World 
mailto:mmistr...@gmail.com>> wrote:
+1

On Wed, Mar 8, 2023 at 10:40 PM Winston Lai 
mailto:weiruanl...@gmail.com>> wrote:
+1, any webinar on Spark related topic is appreciated 

Thank You & Best Regards
Winston Lai

From: asma zgolli mailto:zgollia...@gmail.com>>
Sent: Thursday, March 9, 2023 5:43:06 AM
To: karan alang mailto:karan.al...@gmail.com>>
Cc: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>; 
ashok34...@yahoo.com 
mailto:ashok34...@yahoo.com>>; User 
mailto:user@spark.apache.org>>
Subject: Re: Online classes for spark topics

+1

Le mer. 8 mars 2023 à 21:32, karan alang 
mailto:karan.al...@gmail.com>> a écrit :
+1 .. I'm happy to be part of these discussions as well !




On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
Hi,

I guess I can schedule this work over a course of time. I for myself can 
contribute plus learn from others.

So +1 for me.

Let us see if anyone else is interested.

HTH



 
[https://ci3.googleusercontent.com/mail-sig/AIorK4zholKucR2Q9yMrKbHNn-o1TuS4mYXyi2KO6Xmx6ikHPySa9MLaLZ8t2hrA6AUcxSxDgHIwmKE]
   view my Linkedin 

Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread Saurabh Gulati
Hey guys,
Its a nice idea and appreciate the effort you guys are taking.
I can add to the list of topics which might be of interest:

  *   Spark UI
  *   Dynamic allocation
  *   Tuning of jobs
  *   Collecting spark metrics for monitoring and alerting

HTH

From: Mich Talebzadeh 
Sent: 09 March 2023 09:00
To: Deepak Sharma 
Cc: Denny Lee ; Sofia’s World ; 
User ; Winston Lai ; 
ashok34...@yahoo.com ; asma zgolli 
; karan alang 
Subject: [EXTERNAL] Re: Online classes for spark topics

Caution! This email originated outside of FedEx. Please do not open attachments 
or click links from an unknown or suspicious origin.

Hi Deepak,

The priority list of topics is a very good point. The theard owner mentioned 
Spark on k8s, Data Science and Spark Structured Streaming. What other topics 
need to be included I guess it depends on demand.. I suggest we wait a couple 
of days to see the demand .

We just need to create a draft list of topics of interest and share them in the 
forum to get the priority order.

Well that is my thoughts.

Cheers






 
[https://ci3.googleusercontent.com/mail-sig/AIorK4zholKucR2Q9yMrKbHNn-o1TuS4mYXyi2KO6Xmx6ikHPySa9MLaLZ8t2hrA6AUcxSxDgHIwmKE]
   view my Linkedin 
profile


 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Thu, 9 Mar 2023 at 06:13, Deepak Sharma 
mailto:deepakmc...@gmail.com>> wrote:
I can prepare some topics and present as well , if we have a prioritised list 
of topics already .

On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee 
mailto:denny.g@gmail.com>> wrote:
We used to run Spark webinars on the Apache Spark LinkedIn 
group
 but honestly the turnout was pretty low.  We had dove into various features.  
If there are particular topics that. you would like to discuss during a live 
session, please let me know and we can try to restart them.  HTH!

On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World 
mailto:mmistr...@gmail.com>> wrote:
+1

On Wed, Mar 8, 2023 at 10:40 PM Winston Lai 
mailto:weiruanl...@gmail.com>> wrote:
+1, any webinar on Spark related topic is appreciated 

Thank You & Best Regards
Winston Lai

From: asma zgolli mailto:zgollia...@gmail.com>>
Sent: Thursday, March 9, 2023 5:43:06 AM
To: karan alang mailto:karan.al...@gmail.com>>
Cc: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>; 
ashok34...@yahoo.com 
mailto:ashok34...@yahoo.com>>; User 
mailto:user@spark.apache.org>>
Subject: Re: Online classes for spark topics

+1

Le mer. 8 mars 2023 à 21:32, karan alang 
mailto:karan.al...@gmail.com>> a écrit :
+1 .. I'm happy to be part of these discussions as well !




On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
Hi,

I guess I can schedule this work over a course of time. I for myself can 
contribute plus learn from others.

So +1 for me.

Let us see if anyone else is interested.

HTH



 
[https://ci3.googleusercontent.com/mail-sig/AIorK4zholKucR2Q9yMrKbHNn-o1TuS4mYXyi2KO6Xmx6ikHPySa9MLaLZ8t2hrA6AUcxSxDgHIwmKE]
   view my Linkedin 
profile


 
https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.




On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
mailto:ashok34...@yahoo.com>> wrote:

Hello Mich.

Greetings. Would you be able to arrange for Spark Structured Streaming learning 
webinar.?


read a binary file and save in another location

2023-03-09 Thread second_co...@yahoo.com.INVALID
any example on how to read a binary file using pySpark and save it in another 
location . copy feature

Thank you,Teoh


Re: Online classes for spark topics

2023-03-09 Thread Mich Talebzadeh
Hi Deepak,

The priority list of topics is a very good point. The theard owner
mentioned Spark on k8s, Data Science and Spark Structured Streaming. What
other topics need to be included I guess it depends on demand.. I suggest
we wait a couple of days to see the demand .

We just need to create a draft list of topics of interest and share them in
the forum to get the priority order.

Well that is my thoughts.

Cheers





   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Thu, 9 Mar 2023 at 06:13, Deepak Sharma  wrote:

> I can prepare some topics and present as well , if we have a prioritised
> list of topics already .
>
> On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee  wrote:
>
>> We used to run Spark webinars on the Apache Spark LinkedIn group
>>  but
>> honestly the turnout was pretty low.  We had dove into various features.
>> If there are particular topics that. you would like to discuss during a
>> live session, please let me know and we can try to restart them.  HTH!
>>
>> On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World  wrote:
>>
>>> +1
>>>
>>> On Wed, Mar 8, 2023 at 10:40 PM Winston Lai 
>>> wrote:
>>>
 +1, any webinar on Spark related topic is appreciated 

 Thank You & Best Regards
 Winston Lai
 --
 *From:* asma zgolli 
 *Sent:* Thursday, March 9, 2023 5:43:06 AM
 *To:* karan alang 
 *Cc:* Mich Talebzadeh ; ashok34...@yahoo.com
 ; User 
 *Subject:* Re: Online classes for spark topics

 +1

 Le mer. 8 mars 2023 à 21:32, karan alang  a
 écrit :

 +1 .. I'm happy to be part of these discussions as well !




 On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

 Hi,

 I guess I can schedule this work over a course of time. I for myself
 can contribute plus learn from others.

 So +1 for me.

 Let us see if anyone else is interested.

 HTH



view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
 wrote:


 Hello Mich.

 Greetings. Would you be able to arrange for Spark Structured Streaming
 learning webinar.?

 This is something I haven been struggling with recently. it will be
 very helpful.

 Thanks and Regard

 AK
 On Tuesday, 7 March 2023 at 20:24:36 GMT, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:


 Hi,

 This might  be a worthwhile exercise on the assumption that the
 contributors will find the time and bandwidth to chip in so to speak.

 I am sure there are many but on top of my head I can think of Holden
 Karau for k8s, and Sean Owen for data science stuff. They are both very
 experienced.

 Anyone else 樂

 HTH



view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Tue, 7 Mar 2023 at 19:17, ashok34...@yahoo.com.INVALID
  wrote:

 Hello gurus,

 Does Spark arranges online webinars for special topics like Spark on
 K8s, data science and Spark Structured Streaming?

 I would be most grateful if experts can share their experience with
 learners with intermediate knowledge like myself. Hopefully we will find
 the practical experiences told valuable.

 Respectively,

 AK




>>>