Hi Paul,

I agree there are better ways to do this; e.g. we could prepare the encoded definition levels and repetition levels (if they exist) and write the page directly. However, we would then need to take care of the other rewrite configurations ourselves, including the data page version (v1 or v2), compression, page statistics, and the page index. By writing null records, the writer handles all of those details internally.
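For reference, the per-record path in nullifyColumn boils down to something like the sketch below. This is a simplified illustration rather than the exact upstream code; NullifyHelper and nullifyChunk are made-up names, and it assumes the column is optional and that the ColumnReader/ColumnWriter pair for the chunk is already set up.

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.ColumnReader;
import org.apache.parquet.column.ColumnWriter;

class NullifyHelper {
  // Preserve repetition levels so record boundaries stay intact, but cap
  // the definition level below the maximum so every value becomes null.
  // Because each value goes through ColumnWriter.writeNull, the writer
  // still produces encoding, compression, statistics (num_nulls) and
  // page index entries the normal way.
  static void nullifyChunk(ColumnReader reader, ColumnWriter writer,
                           ColumnDescriptor descriptor, long totalChunkValues) {
    int maxDef = descriptor.getMaxDefinitionLevel(); // > 0 for an optional column
    for (long i = 0; i < totalChunkValues; i++) {
      int rlvl = reader.getCurrentRepetitionLevel();
      int dlvl = reader.getCurrentDefinitionLevel();
      if (dlvl == maxDef) {
        reader.skip();                      // drop the materialized value
        writer.writeNull(rlvl, maxDef - 1); // the value slot becomes null
      } else {
        writer.writeNull(rlvl, dlvl);       // already null, keep levels as-is
      }
      reader.consume();                     // advance to the next triplet
    }
  }
}

The per-value loop is indeed the cost, but it is also what lets the column writer take care of the v1/v2 page format, compression, statistics and page index without the rewriter re-implementing any of it.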
BTW, IMO writing `empty` pages may violate the spec and cause readers to fail.

Best,
Gang

On Mon, Dec 4, 2023 at 5:30 PM Paul Rooney <[email protected]> wrote:

> Could anyone suggest a faster way to nullify columns in a parquet file?
>
> My dataset consists of a lot of parquet files. Each of them has roughly
> 12 million rows and 350 columns, split into 2 row groups of 10 million
> and 2 million rows.
>
> For each file I need to nullify 150 columns and rewrite the files.
>
> I tried using 'nullifyColumn' in
> 'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java',
> but I find it slow because, for each column, it iterates over the number
> of rows and calls ColumnWriter.writeNull.
>
> Would anyone have suggestions on how to avoid all the iteration?
>
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
> 'for (int i = 0; i < totalChunkValues; i++) {...'
>
> Could a single call be made per column + row group to write enough
> information to:
> A) keep the column present (in the schema and as a column chunk)
> B) set the column's rowCount and num_nulls = totalChunkValues
>
> e.g. perhaps write a single 'empty' page which has:
> 1) valueCount and rowCount = totalChunkValues
> 2) Statistics.num_nulls set to totalChunkValues
>
> Thanks,
> Paul
