Xianyang Liu created PARQUET-2366:
-------------------------------------

             Summary: Optimize random seek during rewriting
                 Key: PARQUET-2366
                 URL: https://issues.apache.org/jira/browse/PARQUET-2366
             Project: Parquet
          Issue Type: Bug
            Reporter: Xianyang Liu


The `ColumnIndex`, `OffsetIndex`, and `BloomFilter` are stored at the end of the 
file, so rewriting a column chunk requires 4 random seeks. We found that this 
heavily impacts rewrite performance for files with a large number of columns 
(~1000). In this PR, we read the `ColumnIndex`, `OffsetIndex`, and `BloomFilter` 
into a cache to avoid the random seeks. In production environments, this gave 
about a 60x performance improvement for files with about one thousand columns.
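
The effect of the cache on seek count can be illustrated with a small, 
self-contained simulation. This is only a sketch under stated assumptions: the 
byte layout is a toy (column data first, then all index structures at the end), 
not the real Parquet format, and `CountingFile`, `build_toy_file`, 
`rewrite_naive`, and `rewrite_cached` are hypothetical names that do not come 
from the PR or from the Parquet library.

```python
import io

KINDS = ("data", "ci", "oi", "bf")  # chunk data, ColumnIndex, OffsetIndex, BloomFilter


class CountingFile:
    """Wraps a binary stream and counts reads that are not sequential."""

    def __init__(self, raw):
        self.raw = raw
        self.seeks = 0

    def read_at(self, offset, length):
        # Only count a seek when the read does not continue from the current position.
        if self.raw.tell() != offset:
            self.raw.seek(offset)
            self.seeks += 1
        return self.raw.read(length)


def build_toy_file(num_columns):
    """Toy layout (an assumption): all column data, then all indexes at the end."""
    buf = io.BytesIO()
    ranges = [{} for _ in range(num_columns)]
    for c in range(num_columns):
        payload = f"data{c};".encode()
        ranges[c]["data"] = (buf.tell(), len(payload))
        buf.write(payload)
    tail_start = buf.tell()
    for kind in ("ci", "oi", "bf"):
        for c in range(num_columns):
            payload = f"{kind}{c};".encode()
            ranges[c][kind] = (buf.tell(), len(payload))
            buf.write(payload)
    return buf.getvalue(), ranges, tail_start, buf.tell() - tail_start


def rewrite_naive(f, ranges):
    """Read each structure directly from the file, column by column."""
    return [tuple(f.read_at(*r[k]) for k in KINDS) for r in ranges]


def rewrite_cached(f, ranges, tail_start, tail_len):
    """Load the whole index region once, then serve index reads from memory."""
    tail = f.read_at(tail_start, tail_len)
    out = []
    for r in ranges:
        data = f.read_at(*r["data"])  # sequential after the first column
        indexes = tuple(
            tail[off - tail_start: off - tail_start + n]
            for off, n in (r[k] for k in ("ci", "oi", "bf"))
        )
        out.append((data, *indexes))
    return out


if __name__ == "__main__":
    raw, ranges, tail_start, tail_len = build_toy_file(1000)
    naive, cached = CountingFile(io.BytesIO(raw)), CountingFile(io.BytesIO(raw))
    assert rewrite_naive(naive, ranges) == rewrite_cached(cached, ranges, tail_start, tail_len)
    print("naive seeks:", naive.seeks, "cached seeks:", cached.seeks)
```

In this toy layout the naive path performs roughly four seeks per column, 
while the cached path performs a constant number regardless of column count. 
The trade-off is that the index region must fit in memory for the duration of 
the rewrite.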



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
