Could anyone suggest a faster way to nullify columns in a Parquet file? My dataset consists of many Parquet files, each with roughly 12 million rows and 350 columns, split into 2 row groups of 10 million and 2 million rows.
For each file I need to nullify 150 columns and rewrite the file. I tried 'nullifyColumn' in 'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java', but I find it slow: for each column it iterates over the number of rows and calls ColumnWriter.writeNull once per value:

https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
'for (int i = 0; i < totalChunkValues; i++) {...'

Would anyone have suggestions on how to avoid all that iteration? Could a single call be made per column + row group that writes just enough information to:
A) keep the column present (in the schema and as a column chunk)
B) set the column chunk's rowCount and num_nulls to totalChunkValues

e.g. perhaps write a single 'empty' page which has:
1) valueCount and rowCount = totalChunkValues
2) Statistics.num_nulls set to totalChunkValues
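For a flat optional column (max definition level 1, no repetition levels), I think the body of such an all-null data page v1 would collapse to a single RLE run of the level value 0, i.e. a few bytes regardless of totalChunkValues. Below is a minimal sketch of just that page body, based on my reading of the Parquet format spec rather than on any existing parquet-mr API; the class and method names are made up for illustration, and the page header plus Statistics (num_nulls) would still have to go through the low-level writer, e.g. something like ParquetFileWriter.writeDataPage(...). Nested or repeated columns are more involved, since the original repetition levels and ancestor definition levels have to be preserved, which is presumably why nullifyColumn re-reads the existing levels.

import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only: builds the body of a DATA_PAGE (v1) holding N nulls for a flat
// optional column (max definition level = 1, max repetition level = 0).
public class AllNullPageSketch {

  // ULEB128 varint, as used by the RLE/bit-packing hybrid level encoding.
  static void writeUnsignedVarInt(int value, ByteArrayOutputStream out) {
    while ((value & ~0x7F) != 0) {
      out.write((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    out.write(value);
  }

  // Definition-level section for 'valueCount' nulls: one RLE run of level 0.
  static byte[] allNullPageBody(int valueCount) {
    ByteArrayOutputStream rle = new ByteArrayOutputStream();
    writeUnsignedVarInt(valueCount << 1, rle); // run header, LSB = 0 => RLE run of length valueCount
    rle.write(0);                              // repeated level value 0 (= null), bit width 1 => 1 byte
    byte[] run = rle.toByteArray();

    // Data page v1 prefixes the RLE-encoded levels with a 4-byte little-endian length.
    // No repetition levels (max rep level 0) and no values follow (everything is null).
    ByteBuffer body = ByteBuffer.allocate(4 + run.length).order(ByteOrder.LITTLE_ENDIAN);
    body.putInt(run.length);
    body.put(run);
    return body.array();
  }

  public static void main(String[] args) {
    // 10 million nulls collapse to a 9-byte page body.
    System.out.println(allNullPageBody(10_000_000).length);
  }
}

If something along those lines is feasible inside the rewriter, the cost per nullified column chunk would be roughly constant instead of proportional to the row count.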
Thanks,
Paul