Hello, I have a parquet dataset partitioned by a column 'a'. I want to take advantage of Spark SQL's partition pruning, so that a filter on 'a' only reads the matching partition directories. I also want to periodically update individual partitions without disrupting any jobs that are querying the data.
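For concreteness, the read pattern I'm relying on looks roughly like this (the paths and session setup are illustrative, not my actual job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("pruning-example").getOrCreate()

// On-disk layout: /data/mytable/a=1/..., /data/mytable/a=2/..., etc.
val df = spark.read.parquet("/data/mytable")

// A filter on the partition column should prune the scan to the a=2
// directory rather than touching every partition.
df.filter(col("a") === 2).show()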
The obvious solution was to write each new version of the dataset to a separate directory and then atomically repoint a symlink at it. Readers resolve the symlink before constructing the DataFrame, so that when an update occurs, any running jobs continue to read the version of the data they started with. Old versions are cleaned up once no jobs are using them. This strategy works fine when replacing an entire top-level parquet dataset.

However, it seems that Spark SQL (or parquet) cannot handle partition directories that are symlinks (and even if it could, it presumably wouldn't resolve them, so it would still blow up when a symlink changes at runtime). For example, if you create symlinks a=1, a=2, and a=3 in a directory and then try to load that directory in Spark SQL, you get a "Conflicting partition column names detected" error (a minimal reproduction is sketched after my question below).

So my question is: can anyone think of another solution that meets my requirements, i.e. one that takes advantage of partitioning and allows safe updates of existing partitions? Thanks! - Philip
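P.S. In case it helps, here's roughly how I can reproduce the symlink failure on a local filesystem (the paths, session setup, and sample data are illustrative):

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("symlink-repro").master("local[*]").getOrCreate()
import spark.implicits._

// Write one standalone directory per partition. The files themselves
// don't contain 'a'; it's encoded in the directory names.
Seq(1, 2, 3).foreach { i =>
  Seq(s"value-$i").toDF("b").write.parquet(s"/tmp/versions/v1/a=$i")
}

// Assemble a table directory whose partition subdirectories are symlinks
// into the versioned directories.
Files.createDirectories(Paths.get("/tmp/mytable"))
Seq(1, 2, 3).foreach { i =>
  Files.createSymbolicLink(
    Paths.get(s"/tmp/mytable/a=$i"),
    Paths.get(s"/tmp/versions/v1/a=$i"))
}

// For me, this fails with "Conflicting partition column names detected".
val df = spark.read.parquet("/tmp/mytable")

The same read works fine if a=1, a=2, and a=3 are real directories rather than symlinks.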