Re: Sanely updating parquet partitions.

2016-04-29 Thread Philip Weaver
Hello, I think I may have jumped to the wrong conclusion about symlinks,
because I was able to get what I wanted working perfectly.

I added these two settings in my importer application:

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs",
"false")

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")


Then, when I read the parquet table, I set the "basePath" option to the
parent directory of the partitions, e.g.:

val df = sqlContext.read
  .options(Map("basePath" -> "/path/to/table"))
  .parquet("/path/to/table/a=*")


I also checked that the symlinks are resolved the way I wanted: after creating
the DataFrame I removed one of the symlinks, and I was still able to query the
DataFrame without error.
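
For completeness, here is a minimal sketch of how a partition symlink can be
swapped atomically (a local/POSIX filesystem is assumed; the paths and the
versioned directory name are just examples):

import java.nio.file.{Files, Paths, StandardCopyOption}

// Sketch only: point a temporary symlink at the new versioned directory, then
// rename it over a=1. On a POSIX filesystem the rename is atomic, so readers
// see either the old target or the new one, never a missing link.
val table = Paths.get("/path/to/table")
val tmpLink = table.resolve(".a=1.tmp")
Files.deleteIfExists(tmpLink)
Files.createSymbolicLink(tmpLink, Paths.get("/path/to/table/versions/a=1.v2"))
Files.move(tmpLink, table.resolve("a=1"),
  StandardCopyOption.REPLACE_EXISTING, StandardCopyOption.ATOMIC_MOVE)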

- Philip


On Fri, Apr 29, 2016 at 9:56 AM, Philip Weaver wrote:

> Hello,
>
> I have a parquet dataset, partitioned by a column 'a'. I want to take
> advantage of Spark SQL's ability to prune reads down to a single partition
> when you filter on 'a'. I also want to periodically update individual
> partitions without disrupting any jobs that are querying the data.
>
> The obvious solution was to write each parquet dataset to a separate
> directory and then update a symlink to point to it. Readers resolve the
> symlink before constructing the DataFrame, so that when an update occurs,
> running jobs continue to read the version of the data they started with.
> Old data is cleaned up once no jobs are using it.
>
> This strategy works fine when updating an entire top-level parquet
> database. However, it seems like Spark SQL (or Parquet) cannot handle
> partition directories being symlinks (and even if it could, it probably
> wouldn't resolve them up front, so it would still break when a symlink
> changes at runtime). For example, if you create symlinks a=1, a=2 and a=3
> in a directory and then try to load that directory in Spark SQL, you get a
> "Conflicting partition column names detected" error.
>
> So my question is: can anyone think of another solution that meets my
> requirements (i.e. to take advantage of partitioning and perform safe
> updates of existing partitions)?
>
> Thanks!
>
> - Philip


Sanely updating parquet partitions.

2016-04-29 Thread Philip Weaver
Hello,

I have a parquet dataset, partitioned by a column 'a'. I want to take
advantage of Spark SQL's ability to prune reads down to a single partition
when you filter on 'a'. I also want to periodically update individual
partitions without disrupting any jobs that are querying the data.

The obvious solution was to write each parquet dataset to a separate directory
and then update a symlink to point to it. Readers resolve the symlink before
constructing the DataFrame, so that when an update occurs, running jobs
continue to read the version of the data they started with. Old data is
cleaned up once no jobs are using it.
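
To make the "readers resolve the symlink" part concrete, a reader for the
whole-table case does roughly the following (the "current" symlink name is
just an example):

import java.nio.file.Paths

// Resolve the symlink up front so the DataFrame is pinned to one concrete
// version of the data, even if the symlink is repointed while the job runs.
val resolved = Paths.get("/path/to/table/current").toRealPath().toString
val df = sqlContext.read.parquet(resolved)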

This strategy works fine when updating an entire top-level parquet database.
However, it seems like Spark SQL (or Parquet) cannot handle partition
directories being symlinks (and even if it could, it probably wouldn't resolve
them up front, so it would still break when a symlink changes at runtime). For
example, if you create symlinks a=1, a=2 and a=3 in a directory and then try
to load that directory in Spark SQL, you get a "Conflicting partition column
names detected" error.

So my question is: can anyone think of another solution that meets my
requirements (i.e. to take advantage of partitioning and perform safe updates
of existing partitions)?

Thanks!

- Philip