Hello, I think I may have jumped to the wrong conclusion about symlinks;
I was able to get what I wanted working perfectly.
I added these two settings in my importer application (they suppress the
_SUCCESS marker and the parquet summary-metadata files):
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Then when I read the parquet table, I set the "basePath" option to the
parent of each of the partitions, e.g.:
val df = sqlContext.read
  .options(Map("basePath" -> "/path/to/table"))
  .parquet("/path/to/table/a=*")
I also verified that the symlinks were resolved the way I wanted: after
creating the DataFrame I removed one of the symlinks, and I could still
query the DataFrame without error.
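For what it's worth, the atomic-swap half of the symlink scheme can be
sketched in shell (all paths and names below are made up for illustration):

```shell
# Write the new version of partition a=1 into a fresh versioned directory,
# then atomically repoint the a=1 symlink at it. Jobs that already resolved
# the old target keep reading the old files undisturbed.
TABLE=/tmp/demo_table
NEW="$TABLE/.versions/a=1.v2"

mkdir -p "$NEW"
# ... the importer would write the new parquet files into "$NEW" here ...

# Create the symlink under a temporary name, then rename it over the final
# name: rename(2) is atomic, so readers never observe a missing partition.
ln -sfn "$NEW" "$TABLE/a=1.tmp"
mv -T "$TABLE/a=1.tmp" "$TABLE/a=1"
```

Note that `ln -sfn` alone is not atomic (it unlinks and recreates the
link), which is why the rename step matters; `mv -T` is GNU coreutils.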
- Philip
On Fri, Apr 29, 2016 at 9:56 AM, Philip Weaver
wrote:
> Hello,
>
> I have a parquet dataset, partitioned by a column 'a'. I want to take
> advantage of Spark SQL's ability to filter to the partition when you
> filter on 'a'. I also want to periodically update individual partitions
> without disrupting any jobs that are querying the data.
>
> The obvious solution was to write parquet datasets to a separate
> directory and then update a symlink to point to it. Readers resolve the
> symlink to construct the DataFrame, so that when an update occurs any
> jobs continue to read the version of the data that they started with.
> Old data is cleaned up after no jobs are using it.
>
> This strategy works fine when updating an entire top-level parquet
> database. However, it seems like Spark SQL (or parquet) cannot handle
> partition directories being symlinks (and even if it could, it probably
> wouldn't resolve those symlinks, so it would blow up when a symlink
> changes at runtime). For example, if you create symlinks a=1, a=2 and
> a=3 in a directory and then try to load that directory in Spark SQL,
> you get a "Conflicting partition column names detected" error.
>
> So my question is, can anyone think of another solution that meets my
> requirements (i.e. taking advantage of partitioning and performing safe
> updates of existing partitions)?
>
> Thanks!
>
> - Philip