[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757157#comment-17757157 ]
Lin Liu commented on HUDI-6712:
-------------------------------
Based on these PRs, e.g., [https://github.com/onehouseinc/lake-plumber/pull/7], [https://github.com/onehouseinc/lake-plumber/pull/5], [https://github.com/onehouseinc/lake-plumber/pull/8], [https://github.com/onehouseinc/lake-plumber/pull/9], etc., I will update the corresponding logic and run experiments.

> Implement optimized keyed lookup on parquet files
> -------------------------------------------------
>
>                 Key: HUDI-6712
>                 URL: https://issues.apache.org/jira/browse/HUDI-6712
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Vinoth Chandar
>            Assignee: Lin Liu
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Parquet performs poorly when looking up specific records by a single key column.
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List<Records> lookup(parquetFile, Set<String> keys) (code)
> Let's implement a reader that is optimized for this pattern by scanning the least amount of data.
> Requirements:
> 1. Must support multiple values for the same key.
> 2. Can assume the file is sorted by the key/lookup field.
> 3. Must handle non-existent keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ...) to minimize reads.
> 5. Must make the minimum number of RPC calls to cloud storage.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
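To make the requirements above concrete, here is a minimal sketch of the lookup strategy in Python, using an in-memory model of sorted row groups with min/max key stats in place of the real parquet reader APIs. The `RowGroup` class and `lookup` function are hypothetical illustrations, not Hudi or parquet-mr code: group pruning via min/max stats stands in for column-index/bloom-filter pruning (requirement 4), and binary search within a group handles duplicate keys (requirement 1) and absent keys (requirement 3).

```python
# Hypothetical sketch only; models parquet row groups as in-memory lists.
from bisect import bisect_left, bisect_right

class RowGroup:
    """Stand-in for a parquet row group: sorted rows plus min/max key stats."""
    def __init__(self, rows):
        self.rows = rows              # list of (key, value), sorted by key (req. 2)
        self.min_key = rows[0][0]     # mimics row-group column statistics
        self.max_key = rows[-1][0]

def lookup(row_groups, keys):
    """Return all rows whose key is in `keys`, touching as few groups as possible."""
    results = []
    for key in sorted(keys):
        for group in row_groups:
            # Prune groups whose stats exclude the key (mimics metadata pruning).
            if key < group.min_key or key > group.max_key:
                continue
            group_keys = [k for k, _ in group.rows]
            lo = bisect_left(group_keys, key)
            hi = bisect_right(group_keys, key)
            # lo == hi: key absent (req. 3); hi - lo > 1: duplicate keys (req. 1).
            results.extend(group.rows[lo:hi])
    return results
```

In a real implementation the pruning step would consult the file footer once and batch the surviving page reads, so that each lookup costs a small, bounded number of storage RPCs (requirement 5).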