Re: Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior ?

2020-08-19 Thread golokeshpatra.patra
Adding this simple setting helped me overcome the issue:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
My issue -

In an S3 folder, I previously had data partitioned by ingestiontime.
I then wanted to reprocess this data and partition it by
businessname & ingestiontime.

Whenever I wrote my dataframe in Overwrite mode, all the data that
was present prior to this operation was truncated/deleted.

After setting the above Spark configuration, only the required
partitions are truncated and overwritten, and all others stay intact.
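For reference, a minimal sketch of the end-to-end write; the DataFrame
name and the output path/format here are placeholders, not my actual job:

    import org.apache.spark.sql.SaveMode

    // overwrite only the partitions present in the incoming DataFrame;
    // partitions not touched by this write are left intact
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df.write
      .partitionBy("businessname", "ingestiontime")
      .mode(SaveMode.Overwrite)
      .parquet("s3://my-bucket/reprocessed/")  // placeholder output path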

In addition to this, if you have the Hadoop Trash feature enabled, you might
be able to recover the lost data.
For more -
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#File_Deletes_and_Undeletes






Re: Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior ?

2016-07-06 Thread nirandap
Hi Yash,

Yes, AFAIK, that is the expected behavior of the Overwrite mode.

I think you can use the following approaches if you want to run a job
on each partition (a rough sketch of [1] follows the links below):
[1] for each partition in DF :
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1444
[2] run job in SC:
https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/SparkContext.scala#L1818
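For instance, something along these lines for [1]; the per-row handling is
only an illustration:

    import org.apache.spark.sql.Row

    // run arbitrary logic over each partition of the DataFrame
    // without rewriting the whole dataset on disk
    df.foreachPartition { rows: Iterator[Row] =>
      rows.foreach { row =>
        // hypothetical per-row handling
        println(row)
      }
    }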

Best

On Thu, Jul 7, 2016 at 7:40 AM, Yash Sharma [via Apache Spark Developers
List]  wrote:

> Hi All,
> While writing a partitioned data frame as partitioned text files, I see
> that Spark deletes all existing partitions while writing only a few new
> partitions.
>
> dataDF.write.partitionBy("year", "month", "date")
>   .mode(SaveMode.Overwrite).text("s3://data/test2/events/")
>
>
> Is this the expected behavior?
>
> I have a past-correction job which would overwrite a couple of past
> partitions based on newly arriving data. I would only want to remove those
> partitions.
>
> Is there a neater way to do that other than:
> - Find the partitions
> - Delete them using the Hadoop APIs
> - Write the DF in Append mode
>
>
> Cheers
> Yash
>
>
>
>
>



-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/





Spark deletes all existing partitions in SaveMode.Overwrite - Expected behavior ?

2016-07-06 Thread Yash Sharma
Hi All,
While writing a partitioned data frame as partitioned text files, I see that
Spark deletes all existing partitions while writing only a few new partitions.

dataDF.write.partitionBy("year", "month", "date")
  .mode(SaveMode.Overwrite).text("s3://data/test2/events/")


Is this the expected behavior?

I have a past-correction job which would overwrite a couple of past
partitions based on newly arriving data. I would only want to remove those
partitions.

Is there a neater way to do that other than the following (rough sketch below)?
- Find the partitions
- Delete them using the Hadoop APIs
- Write the DF in Append mode
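
Roughly, that workaround looks like this; the corrected DataFrame and the
partition values are just illustrative:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SaveMode

    // 1. delete only the partitions being corrected (illustrative values)
    val fs = FileSystem.get(new URI("s3://data/test2/events/"), sc.hadoopConfiguration)
    fs.delete(new Path("s3://data/test2/events/year=2016/month=07/date=05"), true)

    // 2. append the corrected data so all other partitions stay untouched
    correctedDF.write  // hypothetical DataFrame holding the corrected records
      .partitionBy("year", "month", "date")
      .mode(SaveMode.Append)
      .text("s3://data/test2/events/")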


Cheers
Yash