Support for schema evolution of Parquet files?

Dave Birdsall Mon, 28 Jan 2019 11:57:44 -0800

Hi,

In Hive, it is possible to evolve one's schema using ALTER TABLE ADD COLUMNS 
and/or ALTER TABLE REPLACE COLUMNS. These commands change the metadata for the 
Hive table as a whole but do not rewrite existing files that are part of the 
table. So, for example, if I create a Parquet table, insert some rows into it, 
then do ALTER TABLE ADD COLUMNS, then insert some more rows into it, the 
Parquet table will have at least two files. The added column will be absent 
from at least one file and present in at least one other.


Within the Hive shell, if one does a select * from t for such a table, the rows 
that lack the added column get the NULL value on retrieval.
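
To make that behaviour concrete, here is a minimal sketch in plain C++ of 
reading one file against the table-level schema. It is not Hive or Arrow code: 
FileColumns, Row and ReadWithTableSchema are hypothetical names, columns are 
matched by name, all values are ints, and std::nullopt stands in for NULL.

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one data file: column name -> column values.
// Assumption: all columns within a file have the same number of values.
using FileColumns = std::map<std::string, std::vector<int>>;
using Row = std::vector<std::optional<int>>;

// Read one file against the table-level schema. Columns that the file
// does not contain are returned as std::nullopt (standing in for NULL).
std::vector<Row> ReadWithTableSchema(
    const FileColumns& file, const std::vector<std::string>& table_schema) {
  const std::size_t num_rows = file.empty() ? 0 : file.begin()->second.size();
  // Start with every cell null, then fill in the columns the file has.
  std::vector<Row> rows(num_rows, Row(table_schema.size(), std::nullopt));
  for (std::size_t c = 0; c < table_schema.size(); ++c) {
    auto it = file.find(table_schema[c]);
    if (it == file.end()) continue;  // column added after this file was written
    for (std::size_t r = 0; r < num_rows; ++r) rows[r][c] = it->second[r];
  }
  return rows;
}
```

With a table schema of {"a", "b"}, a file written before the ALTER TABLE 
(holding only "a") yields rows whose second cell is empty, while a file 
written afterwards yields both values.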

I have tried doing the equivalent with the Arrow C++ reader, but it crashes 
(dumps core) in FromParquetSchema (in module arrow_schema.cc), because a 
column index is off the end of the leaves_ vector of the ColumnDescriptor 
object.

So, my conclusion is that Arrow at this time does not support such schema 
evolution.

Is this a correct conclusion?

I was thinking about how one might add such support. I'm extremely new to the 
Arrow code base; I've had just a few hours of exposure to it.

Some questions that come to my mind (apologies if these are naïve):


  1.  Does the Arrow reader assume all files in a Parquet table have the same 
metadata? (Put another way, does it read the metadata of the first file it 
encounters, and then reuse that when reading subsequent files?)
  2.  Is there an appropriate layer in the reader where one could simply 
materialize null values if a column isn't in Arrow's copy of the metadata? 
(Just for fun, I'm experimenting with changing the FromParquetSchema function 
to simply ignore column indexes that are off the end of leaves_; I expect it 
will fail somewhere else. And I'll learn a bit more about the Arrow reader in 
the process.)
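
The experiment in question 2 can be sketched in isolation as follows. This is 
plain C++, not the actual FromParquetSchema code: ResolvedColumns and 
ResolveColumnIndices are hypothetical names, and num_leaves stands in for the 
size of the file's leaves_ vector. Instead of indexing past the end, the 
requested indices are split into those the file has and those it lacks, so a 
later layer could materialize nulls for the latter.

```cpp
#include <vector>

// Hypothetical result type: which requested column indices a file can serve.
struct ResolvedColumns {
  std::vector<int> present;  // indices that exist among this file's leaves
  std::vector<int> missing;  // indices that are out of range for this file
};

// Split the requested column indices by whether they fall inside
// [0, num_leaves). Nothing is indexed here, so an out-of-range request
// cannot read past the end of a vector.
ResolvedColumns ResolveColumnIndices(const std::vector<int>& requested,
                                     int num_leaves) {
  ResolvedColumns out;
  for (int idx : requested) {
    if (idx >= 0 && idx < num_leaves) {
      out.present.push_back(idx);
    } else {
      out.missing.push_back(idx);  // candidate for an all-null column
    }
  }
  return out;
}
```

For a file with two leaf columns and a request for columns 0, 1 and 2, this 
reports 0 and 1 as present and 2 as missing. Matching by position is itself 
an assumption; matching by column name may be the safer choice after 
REPLACE COLUMNS.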

Thanks and kind regards,

Dave
