As an alternative: checkpoint the dataframe, collect the days it contains, delete the corresponding directories using Hadoop's FileUtil/FileSystem API, and then write the dataframe in append mode.
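Something along these lines should work (untested sketch, Scala; `spark`, `df`, the "dataset.parquet" path and a string-typed `day` column are placeholders, the checkpoint directory is assumed to be configured via spark.sparkContext.setCheckpointDir, and it uses the Hadoop FileSystem API directly rather than FileUtil):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

val outputPath = "dataset.parquet"  // placeholder output path

// Materialize the plan first, so deleting output directories below cannot
// invalidate anything the DataFrame still needs to read.
val checkpointed = df.checkpoint()

// Collect the distinct days contained in this batch.
val days = checkpointed.select("day").distinct().collect().map(_.getString(0))

// Delete the matching partition directories, if they already exist.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
days.foreach { day =>
  val dir = new Path(s"$outputPath/day=$day")
  if (fs.exists(dir)) fs.delete(dir, true) // recursive delete
}

// Append the new data: only the freshly written partitions now exist for
// the collected days, so re-running the job for the same month is idempotent.
checkpointed.write.partitionBy("day").mode(SaveMode.Append).save(outputPath)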
On Fri, Sep 29, 2017 at 10:31 AM, peay <p...@protonmail.com> wrote:
> Hello,
>
> I am trying to use data_frame.write.partitionBy("day").save("dataset.parquet")
> to write a dataset while splitting by day. I would like to run a Spark job
> to process, e.g., a month:
>
> dataset.parquet/day=2017-01-01/...
> ...
>
> and then run another Spark job to add another month using the same folder
> structure, getting me:
>
> dataset.parquet/day=2017-01-01/
> ...
> dataset.parquet/day=2017-02-01/
> ...
>
> However:
> - with save mode "overwrite", when I process the second month, all of
>   dataset.parquet/ gets removed and I lose whatever was already computed
>   for the previous month.
> - with save mode "append", I can't get idempotence: if I run the job to
>   process a given month twice, I'll get duplicate data in all the
>   subfolders for that month.
>
> Is there a way to "append" in terms of the subfolders from partitionBy,
> but overwrite within each such partition? Any help would be appreciated.
>
> Thanks!