Xianyang Liu created PARQUET-2366:
-------------------------------------
Summary: Optimize random seek during rewriting
Key: PARQUET-2366
URL: https://issues.apache.org/jira/browse/PARQUET-2366
Project: Parquet
Issue Type: Bug
Reporter: Xianyang Liu
The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of the
file, so rewriting a single column chunk requires 4 random seeks. We found that
this heavily degrades rewrite performance for files with a large number of
columns (~1000). In this PR, we read the `ColumnIndex`, `OffsetIndex`, and
`BloomFilter` into a cache to avoid the random seeks. In production environments
this gave about a 60x rewrite speedup for files with about one thousand columns.
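
To illustrate the idea only, here is a minimal Python sketch on a toy in-memory
file. It is not the Parquet rewriter code or API: the layout, `CountingFile`,
`build_toy_file`, `rewrite_naive`, and `rewrite_cached` are all hypothetical
names made up for this example. It also assumes the 4 seeks per column chunk are
one for the chunk data plus one each for the three tail structures.

```python
import io


class CountingFile:
    """Hypothetical wrapper that counts seek() calls on a binary stream."""

    def __init__(self, raw):
        self.raw = raw
        self.seeks = 0

    def seek(self, pos):
        self.seeks += 1
        return self.raw.seek(pos)

    def read(self, n):
        return self.raw.read(n)


KINDS = ("column_index", "offset_index", "bloom_filter")


def build_toy_file(num_columns, chunk_size=8, index_size=4):
    """Toy layout (an assumption): all chunk data first, index blobs at the tail."""
    buf = io.BytesIO()
    chunks = []
    for c in range(num_columns):
        chunks.append((buf.tell(), chunk_size))
        buf.write(bytes([c % 256]) * chunk_size)
    indexes = []
    for c in range(num_columns):
        entry = {}
        for kind in KINDS:
            entry[kind] = (buf.tell(), index_size)
            buf.write(bytes([c % 256]) * index_size)
        indexes.append(entry)
    return buf, chunks, indexes


def rewrite_naive(f, chunks, indexes):
    """Seek to the chunk, then to each tail structure, for every column."""
    out = []
    for (chunk_off, chunk_len), idx in zip(chunks, indexes):
        f.seek(chunk_off)
        data = f.read(chunk_len)
        parts = {}
        for kind in KINDS:
            off, length = idx[kind]
            f.seek(off)
            parts[kind] = f.read(length)
        out.append((data, parts))
    return out


def rewrite_cached(f, chunks, indexes):
    """Read the whole tail region once into a cache, then only seek for chunk data."""
    start = min(idx[k][0] for idx in indexes for k in KINDS)
    end = max(idx[k][0] + idx[k][1] for idx in indexes for k in KINDS)
    f.seek(start)
    region = f.read(end - start)
    cache = [
        {k: region[idx[k][0] - start: idx[k][0] - start + idx[k][1]] for k in KINDS}
        for idx in indexes
    ]
    out = []
    for (chunk_off, chunk_len), parts in zip(chunks, cache):
        f.seek(chunk_off)
        out.append((f.read(chunk_len), parts))
    return out


buf, chunks, indexes = build_toy_file(1000)
naive_file, cached_file = CountingFile(buf), CountingFile(buf)
assert rewrite_naive(naive_file, chunks, indexes) == rewrite_cached(cached_file, chunks, indexes)
print(naive_file.seeks, cached_file.seeks)
```

In this toy setup with 1000 columns, the naive path issues 4000 seeks and the
cached path 1001. The seek counts say nothing about real speedups, which depend
on the storage backend and on how the actual rewriter batches its reads.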