Could anyone suggest a faster way to nullify columns in a Parquet file?

My dataset consists of a lot of parquet files, each with roughly 12 million
rows and 350 columns, split into 2 row groups of 10 million and 2 million rows.

For each file I need to nullify 150 of the columns and rewrite the file.

I tried using 'nullifyColumn' in
'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java',
but I find it slow: for each masked column it iterates over the number of
values in the chunk and calls ColumnWriter.writeNull for every one of them:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
' for (int i = 0; i < totalChunkValues; i++) {...'

Would anyone have suggestions on how to avoid all that iteration?
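
For reference, here is roughly how I am driving the rewriter today (a minimal
sketch assuming the RewriteOptions / MaskMode builder API; the paths and
column names are placeholders):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.rewrite.MaskMode;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class NullifyColumns {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // ~150 columns to nullify; "col_a"/"col_b" are placeholder names.
    Map<String, MaskMode> mask = new HashMap<>();
    for (String col : Arrays.asList("col_a", "col_b")) {
      mask.put(col, MaskMode.NULLIFY);
    }

    RewriteOptions options =
        new RewriteOptions.Builder(conf,
            new Path("input.parquet"),
            new Path("output.parquet"))
        .mask(mask)
        .build();

    // Internally this ends up in nullifyColumn(), which loops
    // totalChunkValues times per masked column chunk.
    ParquetRewriter rewriter = new ParquetRewriter(options);
    rewriter.processBlocks();
    rewriter.close();
  }
}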

Could a single call be made per column + row group to write just enough
information to:
A) keep the column present (in the schema and as a column chunk)
B) set the column chunk's rowCount and num_nulls to totalChunkValues


e.g. perhaps write a single 'empty' page which has:
1) valueCount and rowCount = totalChunkValues
2) Statistics.num_nulls set to totalChunkValues
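
To make that concrete: for a non-nested optional column, the body of an
all-null page should be tiny. There are no repetition levels (max rep level
0), no values, just the definition levels encoded as one RLE run of 0
repeated totalChunkValues times. A rough, self-contained sketch of those
bytes (assuming the standard RLE/bit-packed hybrid encoding at bit width 1,
with the 4-byte length prefix used by data page v1):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class AllNullLevels {

  // RLE/bit-packed hybrid payload for `count` definition levels, all 0,
  // at bit width 1, as a single RLE run:
  //   run header = ULEB128(count << 1)   (LSB 0 => RLE run)
  //   repeated value = ceil(bitWidth / 8) = 1 byte, here 0x00
  static byte[] rleAllZeros(long count) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    writeUnsignedVarLong(count << 1, out); // run length, RLE flavour
    out.write(0);                          // the repeated level value: 0 (null)
    return out.toByteArray();
  }

  // Definition-level section of a data page v1: 4-byte little-endian
  // length prefix followed by the RLE payload. (Data page v2 stores the
  // length in the page header instead.)
  static byte[] v1DefLevelSection(long count) throws IOException {
    byte[] rle = rleAllZeros(count);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(rle.length & 0xFF);
    out.write((rle.length >>> 8) & 0xFF);
    out.write((rle.length >>> 16) & 0xFF);
    out.write((rle.length >>> 24) & 0xFF);
    out.write(rle);
    return out.toByteArray();
  }

  static void writeUnsignedVarLong(long v, ByteArrayOutputStream out) {
    while ((v & ~0x7FL) != 0) {
      out.write((int) ((v & 0x7F) | 0x80));
      v >>>= 7;
    }
    out.write((int) v);
  }

  public static void main(String[] args) throws IOException {
    // 10 million all-null values encode to just a few bytes.
    System.out.println(v1DefLevelSection(10_000_000L).length + " bytes");
  }
}

The open question for me is whether ParquetRewriter / ParquetFileWriter could
emit such a page directly, with the page header's value count, rowCount and
num_nulls filled in, instead of going through ColumnWriter.writeNull once per
value.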

Thanks, Paul
