[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758787#comment-17758787 ]
Lin Liu edited comment on HUDI-6712 at 8/25/23 1:31 AM:
--------------------------------------------------------

While moving the lake_plumber code into Hudi, we found that the parquet version in lake_plumber is 1.13.1, whereas in Hudi it is 1.10.1 for Spark 2 and 1.12.2 for Spark 3. Though we cannot ignore compilation for Spark 2, I have done a few checks:
# ParquetRewriter compiles against parquet 1.12.2, and its benchmark ran on a file written with 1.10.1 without any issues (the benchmark finished successfully).
# ParquetKeyedLookup compiles against parquet 1.12.2; however, its benchmark fails to compile. After commenting out the failing part, the benchmark threw a NullPointerException. On inspection, the error is related to the page index. Will dig deeper.


> Implement optimized keyed lookup on parquet files
> -------------------------------------------------
>
>                 Key: HUDI-6712
>                 URL: https://issues.apache.org/jira/browse/HUDI-6712
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Vinoth Chandar
>            Assignee: Lin Liu
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Parquet performs poorly when performing a lookup of specific records, based
> on a single key lookup column.
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List<Records> lookup(parquetFile, Set<String> keys) (code)
> Let's implement a reader that is optimized for this pattern, by scanning the
> least amount of data.
> Requirements:
> 1. Need to support multiple values for the same key.
> 2. Can assume the file is sorted by the key/lookup field.
> 3. Should handle non-existence of keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ...) to
> minimize data read.
> 5. Must make the minimum amount of RPC calls to cloud storage.


--
This message was sent by Atlassian Jira
(v8.20.10#820010)
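The lookup strategy the issue asks for can be sketched without the real Parquet API: because the file is sorted by the key field, row groups whose [min, max] key range does not overlap any requested key can be pruned from metadata alone, and only the surviving groups are scanned. The `Record` and `RowGroup` types below are hypothetical stand-ins for Parquet's footer statistics and row-group readers, not actual Hudi or Parquet classes; this is a minimal sketch of the pruning idea, assuming string keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of keyed lookup on a key-sorted file: prune row groups
// by min/max key statistics (as Parquet footer metadata would allow), then
// scan only the overlapping groups. Supports duplicate keys and absent keys.
public class KeyedLookupSketch {
    static final class Record {
        final String key, value;
        Record(String key, String value) { this.key = key; this.value = value; }
    }

    // Stand-in for a Parquet row group: records sorted by key, with
    // min/max statistics derived from the first and last record.
    static final class RowGroup {
        final String minKey, maxKey;
        final List<Record> records;
        RowGroup(List<Record> records) {
            this.records = records;
            this.minKey = records.get(0).key;
            this.maxKey = records.get(records.size() - 1).key;
        }
    }

    static List<Record> lookup(List<RowGroup> groups, Set<String> keys) {
        List<Record> out = new ArrayList<>();
        for (RowGroup g : groups) {
            // Metadata-only pruning: skip the group unless some requested
            // key falls inside its [minKey, maxKey] range.
            boolean overlaps = false;
            for (String k : keys) {
                if (k.compareTo(g.minKey) >= 0 && k.compareTo(g.maxKey) <= 0) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps) continue;
            // Only now "read" the group; missing keys simply match nothing.
            for (Record r : g.records) {
                if (keys.contains(r.key)) out.add(r);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<RowGroup> groups = Arrays.asList(
            new RowGroup(Arrays.asList(new Record("a", "1"),
                                       new Record("a", "2"),
                                       new Record("b", "3"))),
            new RowGroup(Arrays.asList(new Record("c", "4"),
                                       new Record("d", "5"))),
            new RowGroup(Arrays.asList(new Record("e", "6"),
                                       new Record("f", "7"))));
        // "z" is absent and the third group is pruned entirely.
        for (Record r : lookup(groups, new HashSet<>(Arrays.asList("a", "c", "z")))) {
            System.out.println(r.key + "=" + r.value);
        }
    }
}
```

In a real implementation the pruning step would consult the column index and bloom filters from the footer (requirement 4), so that a non-existent key costs only the footer RPC and no page reads.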