Thanks Gang,

On Wed, 6 Dec 2023 at 05:15, Gang Wu <[email protected]> wrote:
> Hi Paul,
>
> I agree there are better ways to do this, e.g. we can prepare encoded
> definition levels and repetition levels (if they exist) and directly
> write the page. However, we need to take care of other rewrite
> configurations, including data page version (v1 or v2), compression,
> page statistics and page index. By writing null records, the writer
> handles all of the above details internally.
>
> BTW, IMO writing `empty` pages may break the spec and fail the reader.
>
> Best,
> Gang
>
> On Mon, Dec 4, 2023 at 5:30 PM Paul Rooney <[email protected]> wrote:
>
> > Could anyone suggest a faster way to nullify columns in a Parquet file?
> >
> > My dataset consists of a lot of Parquet files. Each of them has roughly
> > 12 million rows and 350 columns, split into 2 row groups of 10 million
> > and 2 million rows.
> >
> > For each file I need to nullify 150 columns and rewrite the file.
> >
> > I tried using 'nullifyColumn' in
> > 'parquet-mr/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java',
> > but I find it slow, as for each column it iterates over the number of
> > rows and calls ColumnWriter.writeNull.
> >
> > Would anyone have suggestions on how to avoid all the iteration?
> >
> > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java#L743C13
> > 'for (int i = 0; i < totalChunkValues; i++) {...'
> >
> > Could a single call be made per column + row group to write enough
> > information to:
> > A) keep the column present (in the schema and as a column chunk)
> > B) set the column's rowCount and num_nulls = totalChunkValues
> >
> > e.g. perhaps write a single 'empty' page which has:
> > 1) valueCount and rowCount = totalChunkValues
> > 2) Statistics.num_nulls set to totalChunkValues
> >
> > Thanks,
> > Paul
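
For anyone skimming the thread, the loop being discussed is roughly this
shape. This is a simplified, hypothetical sketch rather than the actual
ParquetRewriter code; it assumes a flat optional column (no repeated
ancestors), which is why the repetition level is hard-coded to 0, and the
nullifyChunk helper name is made up for illustration:

import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.ColumnWriter;

public final class NullifySketch {

  // One writeNull call per value in the chunk, so the cost grows with
  // rows x masked columns.
  static void nullifyChunk(ColumnDescriptor descriptor,
                           ColumnWriter columnWriter,
                           long totalChunkValues) {
    // For an optional leaf, any definition level below the maximum marks the
    // value as null; with no repeated ancestors the repetition level is 0.
    int nullDefinitionLevel = descriptor.getMaxDefinitionLevel() - 1;
    for (long i = 0; i < totalChunkValues; i++) {
      columnWriter.writeNull(0, nullDefinitionLevel);
    }
  }
}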

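For context, this is roughly how the nullifyColumn path gets exercised from
the outside, i.e. a whole-file rewrite that masks the chosen columns. It is
a sketch assuming the RewriteOptions / MaskMode rewrite API in recent
parquet-mr releases (verify the builder signature against your version);
the file paths and column names are placeholders:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.rewrite.MaskMode;
import org.apache.parquet.hadoop.rewrite.ParquetRewriter;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public final class NullifyColumnsExample {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path("/tmp/input.parquet");    // placeholder path
    Path output = new Path("/tmp/output.parquet");  // placeholder path

    // Map each column to be nullified to MaskMode.NULLIFY; the remaining
    // columns are copied through as-is.
    Map<String, MaskMode> mask = new HashMap<>();
    mask.put("some_column", MaskMode.NULLIFY);      // placeholder column name
    mask.put("another_column", MaskMode.NULLIFY);   // placeholder column name

    RewriteOptions options = new RewriteOptions.Builder(conf, input, output)
        .mask(mask)
        .build();

    ParquetRewriter rewriter = new ParquetRewriter(options);
    try {
      rewriter.processBlocks();
    } finally {
      rewriter.close();
    }
  }
}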