If your processing task inherently handles input data one month at a time, you may want to "manually" partition the output data by month as well as by day, i.e. save each month under a path that includes that month, such as "dataset.parquet/month=01". Then you will be able to use the overwrite mode on each month partition independently. Hope this could be of some help.
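For illustration, a minimal PySpark sketch of that approach might look like the following; the month value, input path, and the string format of the "day" column are assumptions for the example:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    month = "2017-01"                    # hypothetical: the month this run processes
    df = spark.read.parquet("staging/")  # hypothetical input location

    # Keep only this month's rows (assuming a string "day" column like
    # "2017-01-15"), then write them under a month-specific subdirectory,
    # still partitioned by day within it.
    (df.filter(df["day"].startswith(month))
       .write
       .mode("overwrite")               # clears only dataset.parquet/month=2017-01
       .partitionBy("day")
       .parquet("dataset.parquet/month={}".format(month)))

Re-running the job for a given month is then idempotent, since overwrite only clears that month's directory, and reading the top-level dataset.parquet/ still works because Spark's partition discovery picks up the month=... directories as an additional partition column.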
-- Pavel Knoblokh

On Fri, Sep 29, 2017 at 5:31 PM, peay <p...@protonmail.com> wrote:
> Hello,
>
> I am trying to use
> data_frame.write.partitionBy("day").save("dataset.parquet") to write a
> dataset while splitting by day.
>
> I would like to run a Spark job to process, e.g., a month:
> dataset.parquet/day=2017-01-01/...
> ...
>
> and then run another Spark job to add another month using the same folder
> structure, getting me
> dataset.parquet/day=2017-01-01/
> ...
> dataset.parquet/day=2017-02-01/
> ...
>
> However:
> - with save mode "overwrite", when I process the second month, all of
>   dataset.parquet/ gets removed and I lose whatever was already computed
>   for the previous month.
> - with save mode "append", I can't get idempotence: if I run the job to
>   process a given month twice, I'll get duplicate data in all the
>   subfolders for that month.
>
> Is there a way to "append" in terms of the subfolders from partitionBy,
> but overwrite within each such partition? Any help would be appreciated.
>
> Thanks!