Hi Paul,

I agree there are better ways to do this, e.g. we could prepare the encoded
definition levels and repetition levels (if they exist) and write the page
directly (a rough sketch is below). However, we would then need to take care
of the other rewrite concerns ourselves, including the data page version
(v1 or v2), compression, page statistics and the page index. By writing null
records, the writer handles all of the above details internally.
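
For reference, here is a very rough sketch of what the direct-page approach
could look like for a flat OPTIONAL column (no repetition levels), where every
value is encoded purely as definition level 0. This is written from memory of
the parquet-column internals, so treat the RunLengthBitPackingHybridEncoder
constructor and the PageWriter.writePageV2 signature as assumptions to verify
against the version you are on; it also deliberately ignores v1 pages,
compression, page size limits and the page index:

import java.io.IOException;

import org.apache.parquet.bytes.ByteBufferAllocator;
import org.apache.parquet.bytes.BytesInput;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.Encoding;
import org.apache.parquet.column.page.PageWriter;
import org.apache.parquet.column.statistics.Statistics;
import org.apache.parquet.column.values.rle.RunLengthBitPackingHybridEncoder;

// Sketch only: write one all-null v2 data page for a column with
// maxRepetitionLevel == 0 and maxDefinitionLevel == 1, so every value is
// just definition level 0 (null).
static void writeAllNullPage(PageWriter pageWriter,
                             ColumnDescriptor descriptor,
                             ByteBufferAllocator allocator,
                             int totalChunkValues,
                             int pageSizeHint) throws IOException {
  // Encode totalChunkValues definition levels of 0 as RLE (bit width 1
  // because maxDefinitionLevel is assumed to be 1).
  RunLengthBitPackingHybridEncoder dlEncoder =
      new RunLengthBitPackingHybridEncoder(1, pageSizeHint, pageSizeHint, allocator);
  for (int i = 0; i < totalChunkValues; i++) {
    dlEncoder.writeInt(0);
  }
  BytesInput definitionLevels = dlEncoder.toBytes();

  // Page statistics: every value is null.
  Statistics<?> stats = Statistics.createStats(descriptor.getPrimitiveType());
  stats.incrementNumNulls(totalChunkValues);

  pageWriter.writePageV2(
      totalChunkValues,      // rowCount (flat column: one value per row)
      totalChunkValues,      // nullCount
      totalChunkValues,      // valueCount
      BytesInput.empty(),    // repetition levels (none for a flat column)
      definitionLevels,
      Encoding.PLAIN,        // data encoding (there are no non-null values)
      BytesInput.empty(),    // values
      stats);
}

Note that this sketch still feeds the RLE encoder one value per row, so the
per-row iteration does not fully disappear, it just moves to a much cheaper
call than ColumnWriter.writeNull.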

BTW, IMO writing `empty` pages may violate the spec and cause readers to fail.

Best,
Gang

On Mon, Dec 4, 2023 at 5:30 PM Paul Rooney <[email protected]> wrote:

> Could anyone suggest a faster way to Nullify columns in a parquet file?
>
> My dataset consists of a lot of parquet files.
> Each of them has roughly 12 million rows and 350 columns, split into 2 row
> groups of 10 million and 2 million rows.
>
> For each file I need to nullify 150 columns and rewrite the files.
>
> I tried using 'nullifyColumn' in
>
> 'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java'
> But I find it slow because, for each column, it iterates over all the rows
> and calls ColumnWriter.writeNull.
>
> Would anyone have suggestions on how to avoid all the iteration?
>
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
> ' for (int i = 0; i < totalChunkValues; i++) {...'
>
> Could a single call be made per column + row-group to write enough
> information to:
> A) keep the column present (in the schema and as a column chunk)
> B) set the column chunk's rowCount and num_nulls = totalChunkValues
>
>
> e.g. perhaps write a single 'empty' page which has:
> 1) valueCount and rowCount = totalChunkValues
> 2) Statistics.num_nulls set to totalChunkValues
>
> Thanks, Paul
>
