Thanks Gang,
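
For my own notes: for a non-nested optional column, the "prepare encoded
definition levels" route you describe would boil down to emitting a single
RLE run of definition level 0 covering the whole chunk, per the Parquet
RLE/bit-packed hybrid encoding. An untested sketch of just that encoding
step (class and method names are mine, not parquet-mr API):

```java
import java.io.ByteArrayOutputStream;

public class NullRunEncoder {
    // Encode `count` repetitions of definition level 0 (i.e. null, for a
    // non-nested optional column whose max definition level is 1) as one
    // RLE run in the Parquet RLE/bit-packed hybrid format:
    //   - a ULEB128 varint header of (count << 1); LSB 0 marks an RLE run
    //   - the repeated value in ceil(bitWidth / 8) little-endian bytes
    static byte[] encodeNullDefLevels(long count, int bitWidth) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long header = count << 1;          // LSB 0 => RLE run
        while ((header & ~0x7FL) != 0) {   // ULEB128 varint encoding
            out.write((int) ((header & 0x7F) | 0x80));
            header >>>= 7;
        }
        out.write((int) header);
        int valueBytes = (bitWidth + 7) / 8;
        for (int i = 0; i < valueBytes; i++) {
            out.write(0);                  // repeated value: def level 0
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 12 million null rows collapse to a handful of bytes.
        byte[] encoded = encodeNullDefLevels(12_000_000L, 1);
        System.out.println(encoded.length + " bytes"); // 5 bytes
    }
}
```

The remaining work (page headers for v1 vs v2, compression, page
statistics, page index) is exactly the bookkeeping you point out that the
writer currently handles internally when we call writeNull per row.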

On Wed, 6 Dec 2023 at 05:15, Gang Wu <[email protected]> wrote:

> Hi Paul,
>
> I agree there are better ways to do this, e.g. we can prepare encoded
> definition levels and repetition levels (if they exist) and directly
> write the page. However, we need to take care of other rewrite
> configurations including data page version (v1 or v2), compression, page
> statistics and page index. By writing null records, the writer handles
> all the above details internally.
>
> BTW, IMO writing `empty` pages may violate the spec and cause readers to
> fail.
>
> Best,
> Gang
>
> On Mon, Dec 4, 2023 at 5:30 PM Paul Rooney <[email protected]> wrote:
>
> > Could anyone suggest a faster way to Nullify columns in a parquet file?
> >
> > My dataset consists of many Parquet files, each with roughly 12 million
> > rows and 350 columns, split into 2 row groups of 10 million and 2
> > million rows.
> >
> > For each file I need to nullify 150 columns and rewrite the files.
> >
> > I tried using 'nullifyColumn' in
> > 'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java',
> > but I find it slow: for each column, it iterates over the number of
> > rows and calls ColumnWriter.writeNull.
> >
> > Would anyone have suggestions on how to avoid all the iteration?
> >
> >
> > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
> > 'for (int i = 0; i < totalChunkValues; i++) {...'
> >
> > Could a single call be made per column + row group to write enough
> > information to:
> > A) keep the column present (in the schema and as a column chunk)
> > B) set the column's rowCount and num_nulls = totalChunkValues
> >
> >
> > e.g. perhaps write a single 'empty' page which has:
> > 1) valueCount and rowCount = totalChunkValues
> > 2) Statistics.num_nulls set to totalChunkValues
> >
> > Thanks, Paul
> >
>