Based on Spark's behavior [1], Overwrite mode deletes all of your existing
data when you try to overwrite just one particular partition.

What I did:
- Use the S3 API to delete all partitions
- Use the Spark DataFrame writer to write in Append mode [2]


1.
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-deletes-all-existing-partitions-in-SaveMode-Overwrite-Expected-behavior-td18219.html

2. dataDF.write.partitionBy("year", "month", "date")
     .mode(SaveMode.Append).text("s3://data/test2/events/")
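
The two steps above can be sketched roughly as follows. This is only a sketch
under assumptions, not the exact code I ran: it deletes a single partition
directory rather than all of them, it uses the Hadoop FileSystem API in place
of a direct S3 API call, and the names PartitionRewrite, partitionPath and
rewritePartition are made up for illustration. It also assumes Hive-style
partition directories (year=.../month=.../date=...) under the base path and
that a Hadoop filesystem connector is configured for the URI scheme in use.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SaveMode}

object PartitionRewrite {
  // Builds the Hive-style directory for one partition, for example
  // <base>/year=2016/month=07/date=26. The values must be formatted
  // exactly as Spark wrote them when the partition was first created.
  def partitionPath(base: String, year: String, month: String, date: String): String =
    s"${base.stripSuffix("/")}/year=$year/month=$month/date=$date"

  // Deletes one partition directory, then appends the new rows.
  // Assumption: every row in `df` belongs to the given (year, month, date);
  // otherwise rows for other partitions would be appended as duplicates.
  def rewritePartition(df: DataFrame, base: String,
                       year: String, month: String, date: String): Unit = {
    val target = new Path(partitionPath(base, year, month, date))
    val fs = target.getFileSystem(df.sqlContext.sparkContext.hadoopConfiguration)
    if (fs.exists(target)) fs.delete(target, true) // true = recursive delete

    // text() needs exactly one string column left once the partition
    // columns are moved into the directory names.
    df.write
      .partitionBy("year", "month", "date")
      .mode(SaveMode.Append)
      .text(base)
  }
}
```

A call would look like
PartitionRewrite.rewritePartition(dataDF, "s3://data/test2/events/", "2016", "07", "26").
The delete and the append are two separate steps, so a failure between them
leaves the partition empty until the job is rerun.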

On Tue, Jul 26, 2016 at 9:37 AM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> Probably should have been more specific with the code we are using, which
> is something like
>
> val df = ....
> df.write.mode("append or overwrite here").partitionBy("date").saveAsTable("my_table")
>
> Unless there is something like what I described on the native API, I will
> probably take the approach of having a S3 API call to wipe out that
> partition before the job starts, but it would be nice to not have to
> incorporate another step in the job.
>
> Pedro
>
> On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rkad...@collectivei.com> wrote:
>
>> You can write the data you would like to overwrite to a temporary
>> location, then swap it with the existing partition whose data you want to
>> wipe. The swap can be done with a simple rename of the partition, followed
>> by repairing the table so that it picks up the new partition.
>>
>> I am not sure if that addresses your scenario.
>>
>> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <ski.rodrig...@gmail.com>
>> wrote:
>>
>> What would be the best way to accomplish the following behavior:
>>
>> 1. There is a table which is partitioned by date
>> 2. Spark job runs on a particular date, we would like it to wipe out all
>> data for that date. This is to make the job idempotent and lets us rerun a
>> job if it failed without fear of duplicated data
>> 3. Preserve data for all other dates
>>
>> I am guessing that overwrite would not work here, or if it does, it's not
>> guaranteed to stay that way, but I am not sure. If that's the case, is
>> there a good/robust way to get this behavior?
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
