Hello, I have a parquet dataset partitioned by a column 'a'. I want to take advantage of Spark SQL's partition pruning, so that a filter on 'a' only reads the matching partition directories. I also want to periodically update individual partitions without disrupting any jobs that are querying the data.
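For concreteness, the read pattern I'm relying on looks roughly like this (the paths and session setup are illustrative, not my actual job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("pruning-example").getOrCreate()

// On-disk layout: /data/mytable/a=1/..., /data/mytable/a=2/..., etc.
val df = spark.read.parquet("/data/mytable")

// A filter on the partition column should prune the scan to the a=2
// directory rather than touching every partition.
df.filter(col("a") === 2).show()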
The obvious solution was to write each new version of the dataset to a separate directory and then atomically repoint a symlink at it. Readers resolve the symlink before constructing the DataFrame, so that when an update occurs, any running jobs continue to read the version of the data they started with. Old versions are cleaned up once no jobs are using them. This strategy works fine when replacing an entire top-level parquet dataset.

However, it seems that Spark SQL (or parquet) cannot handle partition directories that are symlinks (and even if it could, it presumably wouldn't resolve them, so it would still blow up when a symlink changes at runtime). For example, if you create symlinks a=1, a=2, and a=3 in a directory and then try to load that directory in Spark SQL, you get a "Conflicting partition column names detected" error (a minimal reproduction is sketched after my question below).

So my question is: can anyone think of another solution that meets my requirements, i.e. one that takes advantage of partitioning and allows safe updates of existing partitions? Thanks! - Philip
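P.S. In case it helps, here's roughly how I can reproduce the symlink failure on a local filesystem (the paths, session setup, and sample data are illustrative):

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("symlink-repro").master("local[*]").getOrCreate()
import spark.implicits._

// Write one standalone directory per partition. The files themselves
// don't contain 'a'; it's encoded in the directory names.
Seq(1, 2, 3).foreach { i =>
  Seq(s"value-$i").toDF("b").write.parquet(s"/tmp/versions/v1/a=$i")
}

// Assemble a table directory whose partition subdirectories are symlinks
// into the versioned directories.
Files.createDirectories(Paths.get("/tmp/mytable"))
Seq(1, 2, 3).foreach { i =>
  Files.createSymbolicLink(
    Paths.get(s"/tmp/mytable/a=$i"),
    Paths.get(s"/tmp/versions/v1/a=$i"))
}

// For me, this fails with "Conflicting partition column names detected".
val df = spark.read.parquet("/tmp/mytable")

The same read works fine if a=1, a=2, and a=3 are real directories rather than symlinks.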