Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks,

I wanted to check why spark doesn't create staging dir while doing an
insertInto on partitioned tables. I'm running below example code –
```
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3)))
val df = spark.createDataFrame(rdd)
df.write.insertInto("testing_table") // testing table is partitioned on "_1"
```
In this scenario FileOutputCommitter considers table path as output path
and creates temporary folders like
`/testing_table/_temporary/0` and then moves them to the
partition location when the job commit happens.

But in-case if multiple parallel apps are inserting into the same
partition, this can cause race condition issues while deleting the
`_temporary` dir. Ideally for each app there should be a unique staging dir
where the job should write its output.

Is there any specific reason for this? or am i missing something here?
Thanks for your time and assistance regarding this!

Kind regards
Sanskar


[Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-07 Thread Nandha Kumar
Hi Team,
   We are trying to use *spark structured streaming *for our use case.
We will be joining 2 streaming sources(from kafka topic) with watermarks.
As time progresses, the records that are prior to the watermark timestamp
are removed from the state. For our use case, we want to *store these
dropped records* in some postgres table or s3.

When searching, we found a similar question
in
StackOverflow which is unanswered.
*We would like to know how to store these dropped records due to the
watermark.*


unsubscribe

2024-05-07 Thread Wojciech Bombik
unsubscribe