Based on Spark's behavior [1], Overwrite mode deletes all of your existing data, not just the partition you are trying to overwrite.
What I did:
- Use the S3 API to delete all partitions
- Use a Spark DataFrame to write in Append mode [2]

[1] http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-deletes-all-existing-partitions-in-SaveMode-Overwrite-Expected-behavior-td18219.html
[2] dataDF.write.partitionBy("year", "month", "date").mode(SaveMode.Append).text("s3://data/test2/events/")

On Tue, Jul 26, 2016 at 9:37 AM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

> Probably should have been more specific with the code we are using, which
> is something like
>
> val df = ....
> df.write.mode("append or overwrite here").partitionBy("date").saveAsTable("my_table")
>
> Unless there is something like what I described on the native API, I will
> probably take the approach of having an S3 API call to wipe out that
> partition before the job starts, but it would be nice to not have to
> incorporate another step in the job.
>
> Pedro
>
> On Mon, Jul 25, 2016 at 5:23 PM, RK Aduri <rkad...@collectivei.com> wrote:
>
>> You can write the data that you would like to overwrite to a temporary
>> location, and then swap that with the existing partition whose data you
>> want to wipe away. The swap can be done by a simple rename of the
>> partition, followed by a repair of the table to pick up the new partition.
>>
>> I am not sure if that addresses your scenario.
>>
>> On Jul 25, 2016, at 4:18 PM, Pedro Rodriguez <ski.rodrig...@gmail.com>
>> wrote:
>>
>> What would be the best way to accomplish the following behavior:
>>
>> 1. There is a table which is partitioned by date
>> 2. A Spark job runs on a particular date, and we would like it to wipe out
>> all data for that date. This is to make the job idempotent and lets us
>> rerun a job if it failed without fear of duplicated data
>> 3. Preserve data for all other dates
>>
>> I am guessing that overwrite would not work here, or if it does it's not
>> guaranteed to stay that way, but I am not sure. If that's the case, is
>> there a good/robust way to get this behavior?
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn: https://www.linkedin.com/in/pedrorodriguezscience
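
For reference, here is a minimal sketch of the delete-then-append workaround described at the top of this thread, wiping a single partition before the job writes, as Pedro suggests. It is only an illustration under several assumptions: it uses the Hadoop FileSystem API in place of a direct S3 API call, it writes Parquet rather than text or saveAsTable, and the names IdempotentPartitionWrite, partitionDir and replacePartition are hypothetical helpers, not part of Spark. It also assumes the DataFrame contains rows for the target partition only, and that the base path uses a scheme your cluster has a FileSystem implementation for.

```scala
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

// Hypothetical helper object, for illustration only.
object IdempotentPartitionWrite {

  /** Builds a Hive-style partition directory, e.g. base/year=2016/month=07/date=26. */
  def partitionDir(base: String, spec: Seq[(String, String)]): String =
    spec.foldLeft(base.stripSuffix("/")) { case (dir, (col, value)) =>
      s"$dir/$col=$value"
    }

  /** Deletes one partition directory, then appends `df` under `base`.
    * Assumes `df` holds rows for that partition only; rows belonging to
    * other partitions would be appended there and could create duplicates.
    */
  def replacePartition(spark: SparkSession,
                       df: DataFrame,
                       base: String,
                       spec: Seq[(String, String)]): Unit = {
    val target = new Path(partitionDir(base, spec))
    val fs = FileSystem.get(new URI(base), spark.sparkContext.hadoopConfiguration)

    // Step 1: wipe only the partition being rerun; other dates are untouched.
    if (fs.exists(target)) fs.delete(target, true) // true = recursive

    // Step 2: write in Append mode, which leaves existing sibling partitions alone.
    df.write
      .partitionBy(spec.map(_._1): _*)
      .mode(SaveMode.Append)
      .parquet(base)
  }
}
```

A call might look like IdempotentPartitionWrite.replacePartition(spark, dataDF, "s3://data/test2/events/", Seq("year" -> "2016", "month" -> "07", "date" -> "26")). The delete and the write are two separate steps, so a failure between them leaves the partition empty until the job is rerun.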