i am looking into writing a dataframe to parquet using partitioning. so
something like:
  df
    .write
    .mode(saveMode)
    .partitionBy(partitionColumn)
    .format("parquet")
    .save(path)
i imagine i will have thousands of partitions. generally my goal is not to
recreate all partitions every time, but just a few of them. for the
partitions i do write to, i want to replace all the data in them.
i would expect this to be a general and typical use case since a true
append (adding data to partitions) is messy and not idempotent and to be
avoided by design (in fact i am not sure why it exists at all, unless
transactions are supported). redoing all partitions is very inefficient.
what saveMode do i use? in my tests, if i use saveMode=Overwrite then i lose
all existing partitions. saveMode=Append is the dangerous non-idempotent
usage that adds to partitions. i don't think saveMode=Ignore or
saveMode=ErrorIfExists will help me either.
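
to make the desired behavior concrete, here is a minimal sketch of one possible workaround: write each affected partition directly into its own directory with Overwrite, so only that directory is replaced. the names PartitionOverwrite, partitionDir and overwritePartitions are hypothetical and only for illustration. the sketch assumes a single partition column whose values are plain strings needing no escaping in the directory name, and it relies on the usual column=value directory layout that partitionBy produces. it is not meant as the recommended or official way to do this.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

object PartitionOverwrite {

  // hypothetical helper: the directory for one partition value,
  // following the column=value layout that partitionBy uses.
  // assumes the value contains no characters that need escaping.
  def partitionDir(basePath: String, column: String, value: String): String =
    s"${basePath.stripSuffix("/")}/$column=$value"

  // hypothetical helper: replace only the listed partitions.
  // every other partition directory under basePath is left untouched.
  def overwritePartitions(
      df: DataFrame,
      basePath: String,
      column: String,
      values: Seq[String]): Unit =
    values.foreach { v =>
      df.filter(col(column) === v)   // keep only the rows of this partition
        .drop(column)                // the value is encoded in the directory name
        .write
        .mode(SaveMode.Overwrite)    // overwrites just this one directory
        .parquet(partitionDir(basePath, column, v))
    }
}
```

usage would look like PartitionOverwrite.overwritePartitions(df, path, partitionColumn, Seq("2016-01-01", "2016-01-02")), where the listed values are made-up examples. this runs one write job per partition, so it only makes sense when a few partitions are rewritten at a time. on spark 2.3 or later, setting spark.sql.sources.partitionOverwriteMode to dynamic and keeping the original partitionBy write with Overwrite should give the same effect in a single job, but whether that is available depends on the spark version in use.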