Hi,

In Hive, it is possible to evolve one's schema using ALTER TABLE ADD COLUMNS and/or ALTER TABLE REPLACE COLUMNS. These commands change the metadata for the Hive table as a whole but do not rewrite the existing files that make up the table. So, for example, if I create a Parquet table, insert some rows into it, then do ALTER TABLE ADD COLUMNS, then insert some more rows into it, the Parquet table will have at least two files. The added column will be absent from at least one file and present in at least one other.
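To make the scenario concrete, here is a toy model in plain Python. It does not use Hive, Parquet, or Arrow; the names read_table, file_a, and file_b and the column names are made up for illustration. Each "file" carries its own column list, as the Parquet files would after the ALTER TABLE, and the reader fills in None for any table-level column that a given file lacks.

```python
from typing import Any, Dict, List

# Table-level schema after the ALTER TABLE (hypothetical column names).
table_schema = ["id", "name", "email"]

# Two toy "files": file_a was written before the column was added,
# file_b after. Each carries its own schema, like a Parquet footer would.
file_a = {"schema": ["id", "name"], "rows": [(1, "ann"), (2, "bob")]}
file_b = {"schema": ["id", "name", "email"],
          "rows": [(3, "cat", "cat@example.com")]}


def read_table(schema: List[str],
               files: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Read every file against the table-level schema.

    Columns are resolved by name per file; a column that a file
    does not contain is materialized as None for that file's rows.
    """
    out = []
    for f in files:
        # Map column name -> position within this particular file.
        position = {col: i for i, col in enumerate(f["schema"])}
        for row in f["rows"]:
            out.append({
                col: row[position[col]] if col in position else None
                for col in schema
            })
    return out


rows = read_table(table_schema, [file_a, file_b])
for r in rows:
    print(r)
```

The point of the sketch is only that the column lookup is done per file and by name, so a column missing from one file becomes a null rather than an out-of-range index.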
Within the Hive shell, if one does a SELECT * FROM t on such a table, the rows that lack the added column get the NULL value on retrieval.

I have tried doing the equivalent with the Arrow C++ reader, but it crashes (dumps core) in FromParquetSchema (in module arrow_schema.cc), because a column index is off the end of the leaves_ vector of the ColumnDescriptor object.

So, my conclusion is that Arrow at this time does not support such schema evolution. Is this a correct conclusion?

I was thinking about how one might add such support. I'm extremely new to the Arrow code base; I've had just a few hours of exposure to it. Some questions that come to my mind (apologies if these are naïve):

1. Does the Arrow reader assume all files in a Parquet table have the same metadata? (Put another way, does it read the metadata of the first file it encounters, and then reuse that when reading subsequent files?)

2. Is there an appropriate layer in the reader where one could simply materialize null values if a column isn't in Arrow's copy of the metadata?

(Just for fun, I'm experimenting with changing the FromParquetSchema function to simply ignore column indexes that are off the end of leaves_; I expect it will fail somewhere else, and I'll learn a bit more about the Arrow reader in the process.)

Thanks and kind regards,
Dave