Parquet Migrations

2014-10-31 Thread Gary Malouf
Outside of what is discussed here
https://issues.apache.org/jira/browse/SPARK-3851 as a future solution, is
there any path for being able to modify a Parquet schema once some data has
been written?  This seems like the kind of thing that should make people
pause when considering whether or not to use Parquet+Spark...


Re: Parquet Migrations

2014-10-31 Thread Michael Armbrust
You can't change a Parquet schema without re-encoding the data, since the
footer index data has to be recalculated.  However, you can already do
manually today what SPARK-3851
https://issues.apache.org/jira/browse/SPARK-3851 is going to do.

Consider two schemas:

Old Schema: (a: Int, b: String)
New Schema, where I've dropped and added a column: (a: Int, c: Long)

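// "old" and "new" stand in for the paths of the two Parquet datasets
parquetFile("old").registerTempTable("old")
parquetFile("new").registerTempTable("new")

// Pad each side with typed NULLs so both SELECTs expose (a, b, c)
sql("""
  SELECT a, b, CAST(null AS LONG) AS c FROM old UNION ALL
  SELECT a, CAST(null AS STRING) AS b, c FROM new
""").registerTempTable("unifiedData")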

Because of filter/column pushdown past UNIONs, this should execute as
desired even if you write more complicated queries on top of
unifiedData.  It's a little onerous but should work for now.  This can
also support things like column renaming, which would be much harder to do
automatically.
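
If there are more than two schema versions, writing the padded SELECTs by
hand gets tedious.  Below is a minimal sketch of generating the query text
instead.  It is plain Scala with no Spark dependency; UnifySchemas and its
methods are hypothetical helpers made up for illustration, not part of any
Spark API.  It assumes each schema is given as an ordered list of
(column name, SQL type name) pairs, and that a column appearing in several
schemas has the same type in all of them (the first type seen wins).

```scala
// Hypothetical helper, not part of Spark: builds the UNION ALL query text
// that pads each table's missing columns with typed NULLs.
object UnifySchemas {
  // A schema is an ordered list of (column name, SQL type name).
  type Schema = Seq[(String, String)]

  // All columns across the schemas, in first-seen order, without duplicates.
  def unifiedColumns(schemas: Seq[Schema]): Schema =
    schemas.flatten.foldLeft(Vector.empty[(String, String)]) { (acc, col) =>
      if (acc.exists(_._1 == col._1)) acc else acc :+ col
    }

  // SELECT for one table: existing columns pass through unchanged,
  // missing ones become CAST(null AS <type>) AS <name>.
  def selectFor(table: String, schema: Schema, unified: Schema): String = {
    val present = schema.map(_._1).toSet
    val exprs = unified.map { case (name, tpe) =>
      if (present(name)) name else s"CAST(null AS $tpe) AS $name"
    }
    s"SELECT ${exprs.mkString(", ")} FROM $table"
  }

  // One padded SELECT per table, joined with UNION ALL.
  def unionQuery(tables: Seq[(String, Schema)]): String = {
    val unified = unifiedColumns(tables.map(_._2))
    tables.map { case (t, s) => selectFor(t, s, unified) }.mkString(" UNION ALL ")
  }

  def main(args: Array[String]): Unit = {
    val oldSchema = Seq("a" -> "INT", "b" -> "STRING")
    val newSchema = Seq("a" -> "INT", "c" -> "LONG")
    println(unionQuery(Seq("old" -> oldSchema, "new" -> newSchema)))
  }
}
```

For the two schemas above this produces the same query text as the
hand-written one, which could then be passed to sql(...) and registered
as unifiedData.  Column renaming is not handled here; that would still
need a hand-written alias in the relevant SELECT.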

On Fri, Oct 31, 2014 at 1:49 PM, Gary Malouf malouf.g...@gmail.com wrote:

> Outside of what is discussed here
> https://issues.apache.org/jira/browse/SPARK-3851 as a future solution, is
> there any path for being able to modify a Parquet schema once some data has
> been written?  This seems like the kind of thing that should make people
> pause when considering whether or not to use Parquet+Spark...