Re: Parquet schema migrations

2014-10-24 Thread Gary Malouf
Hi Michael,

Does this affect people who use Hive for their metadata store as well?  I'm
wondering if the issue is as bad as I think it is - namely that if you
build up a year's worth of data, adding a field forces you to migrate that
entire year's data.

Gary

On Wed, Oct 8, 2014 at 5:08 PM, Cody Koeninger c...@koeninger.org wrote:

 On Wed, Oct 8, 2014 at 3:19 PM, Michael Armbrust mich...@databricks.com
 wrote:

 
  I was proposing that you manually convert each different format into one
  unified format (by adding literal nulls and such for missing columns) and
  then union these converted datasets.  It would be weird to have union all
  try to do this automatically.
 


 Sure, I was just musing on what an API for doing the merging without manual
 user input should look like / do.  I'll comment on the ticket; thanks for
 making it.
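
One hypothetical shape for such a helper, as a sketch only (none of these
names exist as a Spark API, and the merge rule is simplified to "union of
field names, first type seen wins"; written against the later DataFrame API):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit}
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper: union DataFrames whose schemas differ only by
    // missing columns, padding each input with typed nulls for the fields
    // it lacks.
    def unionByMergedSchema(dfs: Seq[DataFrame]): DataFrame = {
      // Merged schema: every field name seen in any input, first type wins.
      val merged = StructType(
        dfs.flatMap(_.schema.fields).groupBy(_.name).map(_._2.head).toSeq)
      val aligned = dfs.map { df =>
        val present = df.schema.fieldNames.toSet
        df.select(merged.fields.map { f =>
          if (present.contains(f.name)) col(f.name)
          else lit(null).cast(f.dataType).as(f.name)
        }: _*)
      }
      aligned.reduce(_ union _)
    }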



Re: Parquet schema migrations

2014-10-05 Thread Andrew Ash
Hi Cody,

I wasn't aware there were different versions of the parquet format.  What's
the difference between raw parquet and the Hive-written parquet files?

As for your migration question, the approaches I've often seen are
convert-on-read and convert-all-at-once.  Apache Cassandra for example does
both -- when upgrading between Cassandra versions that change the on-disk
sstable format, it will do a convert-on-read as you access the sstables, or
you can run the upgradesstables command to convert them all at once
post-upgrade.
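
In Spark terms the two strategies might look roughly like this; a sketch
only, where the paths and the toMasterSchema converter are hypothetical and
the DataFrame API from later Spark releases is used:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("parquet-migration").getOrCreate()

    // Hypothetical converter from the old layout to the current one,
    // e.g. adding a column that newer files already carry.
    def toMasterSchema(df: DataFrame): DataFrame =
      df.withColumn("user_agent", lit(null).cast("string"))

    // Convert-on-read: apply the converter every time the old directory is read.
    val oldEvents = toMasterSchema(spark.read.parquet("/data/events/v1"))

    // Convert-all-at-once: rewrite the old directory in the new layout once,
    // then read the migrated copy directly from then on.
    toMasterSchema(spark.read.parquet("/data/events/v1"))
      .write.parquet("/data/events/v1-migrated")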

Andrew

On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger c...@koeninger.org wrote:

 Wondering if anyone has thoughts on a path forward for parquet schema
 migrations, especially for people (like us) that are using raw parquet
 files rather than Hive.

 So far we've gotten away with reading old files, converting, and writing to
 new directories, but that obviously becomes problematic above a certain
 data size.



Re: Parquet schema migrations

2014-10-05 Thread Michael Armbrust
Hi Cody,

Assuming you are talking about 'safe' changes to the schema (i.e. existing
column names are never reused with incompatible types), this is something
I'd love to support.  Perhaps you can describe in more detail what sorts of
changes you are making, and whether simple merging of the schemas would be
sufficient.  If so, we can open a JIRA, though I'm not sure when we'll have
resources to dedicate to this.

In the near term, I'd suggest writing converters for each version of the
schema that translate to some desired master schema.  You can then union
all of these together and avoid the cost of a batch conversion.  It seems
like in most cases this should be pretty efficient, at least now that we
have good pushdown past union operators :)
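
Roughly, as a minimal sketch of that converter-plus-union pattern: it assumes
two schema versions where the newer files added a user_agent column, the
paths and column names are made up, and it uses the DataFrame API from later
Spark releases rather than the SchemaRDD API of the time:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.lit

    val spark = SparkSession.builder().appName("parquet-union").getOrCreate()

    // One converter per schema version, each producing the current master schema.
    val v1 = spark.read.parquet("/data/events/v1")
      .withColumn("user_agent", lit(null).cast("string"))  // column missing in v1
    val v2 = spark.read.parquet("/data/events/v2")

    // Union the converted datasets; filters and column pruning can still be
    // pushed down past the union into each Parquet scan.
    val events = v1.select("ts", "user_id", "user_agent")
      .union(v2.select("ts", "user_id", "user_agent"))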

Michael

On Sun, Oct 5, 2014 at 3:58 PM, Andrew Ash and...@andrewash.com wrote:

 Hi Cody,

 I wasn't aware there were different versions of the parquet format.  What's
 the difference between raw parquet and the Hive-written parquet files?

 As for your migration question, the approaches I've often seen are
 convert-on-read and convert-all-at-once.  Apache Cassandra for example does
 both -- when upgrading between Cassandra versions that change the on-disk
 sstable format, it will do a convert-on-read as you access the sstables, or
 you can run the upgradesstables command to convert them all at once
 post-upgrade.

 Andrew

 On Fri, Oct 3, 2014 at 4:33 PM, Cody Koeninger c...@koeninger.org wrote:

  Wondering if anyone has thoughts on a path forward for parquet schema
  migrations, especially for people (like us) that are using raw parquet
  files rather than Hive.
 
  So far we've gotten away with reading old files, converting, and writing to
  new directories, but that obviously becomes problematic above a certain
  data size.