Hi Paul,

Sorry for the late reply! How many columns in total do you have in that
file? The rewriter generally works better if you only nullify a small
percentage of the columns while the rest are left unchanged: it can copy
the unchanged column chunks as raw byte buffers instead of rewriting them
field by field.
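
For reference, here is a minimal sketch of driving the rewriter with a
nullify mask (the file paths and column names are placeholders, and the
exact RewriteOptions.Builder signature may differ between parquet-mr
versions, so please double-check against your tree):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.rewrite.MaskMode;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class NullifyColumns {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();

    // Columns to nullify; every column not listed here is copied over
    // as a raw byte buffer without decoding.
    Map<String, MaskMode> maskColumns = new HashMap<>();
    maskColumns.put("some_column", MaskMode.NULLIFY);
    maskColumns.put("another_column", MaskMode.NULLIFY);

    RewriteOptions options = new RewriteOptions.Builder(
            conf, new Path("input.parquet"), new Path("output.parquet"))
        .mask(maskColumns)
        .build();

    ParquetRewriter rewriter = new ParquetRewriter(options);
    rewriter.processBlocks();
    rewriter.close();
  }
}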

The reason we still use ColumnWriter.writeNull is to keep the rewriter in
parity with the original writer. ColumnWriter.writeNull goes through the
existing code path, including statistics generation, which avoids a lot of
shortcuts and keeps the writing safe.
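
To make that concrete, the per-value loop you pointed at looks roughly
like this (simplified from ParquetRewriter.nullifyColumn; the variable
names are illustrative rather than an exact copy of the source):

// Replay the original repetition/definition levels and emit nulls
// through the normal ColumnWriter path, so page building, compression,
// statistics, and the column/offset indexes all stay consistent.
for (int i = 0; i < totalChunkValues; i++) {
  int rlvl = cReader.getCurrentRepetitionLevel();
  int dlvl = cReader.getCurrentDefinitionLevel();
  if (dlvl == dMax) {
    // The value was defined in the original file; writing the null one
    // definition level lower makes it read back as null.
    cWriter.writeNull(rlvl, dlvl - 1);
  } else {
    // Already null at some level; preserve the original levels.
    cWriter.writeNull(rlvl, dlvl);
  }
  cReader.consume();
}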

Xinli



On Thu, Dec 7, 2023 at 6:06 AM Paul Rooney <[email protected]> wrote:

> Thanks Gang,
>
> On Wed, 6 Dec 2023 at 05:15, Gang Wu <[email protected]> wrote:
>
> > Hi Paul,
> >
> > I agree there are better ways to do this, e.g. we can prepare encoded
> > definition levels and repetition levels (if they exist) and directly
> > write the page. However, we need to take care of other rewrite
> > configurations, including data page version (v1 or v2), compression,
> > page statistics, and page index. By writing null records, the writer
> > handles all the above details internally.
> >
> > BTW, IMO writing `empty` pages may break the specs and fail the reader.
> >
> > Best,
> > Gang
> >
> > On Mon, Dec 4, 2023 at 5:30 PM Paul Rooney <[email protected]> wrote:
> >
> > > Could anyone suggest a faster way to nullify columns in a Parquet
> > > file?
> > >
> > > My dataset consists of a lot of Parquet files, each with roughly
> > > 12 million rows and 350 columns, split into two row groups of
> > > 10 million and 2 million rows.
> > >
> > > For each file I need to nullify 150 columns and rewrite the file.
> > >
> > > I tried using 'nullifyColumn' in
> > > 'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java',
> > > but I find it slow: for each column it iterates over the number of
> > > rows and calls ColumnWriter.writeNull.
> > >
> > > Would anyone have suggestions on how to avoid all the iteration?
> > >
> > >
> > > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
> > > 'for (int i = 0; i < totalChunkValues; i++) {...'
> > >
> > > Could a single call be made per column + row group to write enough
> > > information to:
> > > A) keep the column present (in the schema and as a column chunk)
> > > B) set the column rowCount and num_nulls = totalChunkValues
> > >
> > >
> > > e.g. perhaps write a single 'empty' page which has:
> > > 1) valueCount and rowCount = totalChunkValues
> > > 2) Statistics.num_nulls set to totalChunkValues
> > >
> > > Thanks, Paul
> > >
> >
>


-- 
Xinli Shang
