[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17757157#comment-17757157 ]
Lin Liu commented on HUDI-6712:
-------------------------------
Based on these PRs, e.g., [https://github.com/onehouseinc/lake-plumber/pull/7], [https://github.com/onehouseinc/lake-plumber/pull/5], [https://github.com/onehouseinc/lake-plumber/pull/8], [https://github.com/onehouseinc/lake-plumber/pull/9], etc., I will update the corresponding logic and run experiments.

> Implement optimized keyed lookup on parquet files
> -------------------------------------------------
>
>                 Key: HUDI-6712
>                 URL: https://issues.apache.org/jira/browse/HUDI-6712
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Vinoth Chandar
>            Assignee: Lin Liu
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Parquet performs poorly when looking up specific records by a single key column.
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List<Records> lookup(parquetFile, Set<String> keys) (code)
> Let's implement a reader that is optimized for this pattern by scanning the least amount of data.
> Requirements:
> 1. Must support multiple values for the same key.
> 2. Can assume the file is sorted by the key/lookup field.
> 3. Must handle non-existent keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ...) to minimize reads.
> 5. Must make the minimum number of RPC calls to cloud storage.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
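To make the requirements above concrete, here is a minimal sketch of the lookup strategy in Python, using an in-memory model of sorted row groups with min/max key stats in place of the real parquet reader APIs. The `RowGroup` class and `lookup` function are hypothetical illustrations, not Hudi or parquet-mr code: group pruning via min/max stats stands in for column-index/bloom-filter pruning (requirement 4), and binary search within a group handles duplicate keys (requirement 1) and absent keys (requirement 3).

```python
# Hypothetical sketch only; models parquet row groups as in-memory lists.
from bisect import bisect_left, bisect_right

class RowGroup:
    """Stand-in for a parquet row group: sorted rows plus min/max key stats."""
    def __init__(self, rows):
        self.rows = rows              # list of (key, value), sorted by key (req. 2)
        self.min_key = rows[0][0]     # mimics row-group column statistics
        self.max_key = rows[-1][0]

def lookup(row_groups, keys):
    """Return all rows whose key is in `keys`, touching as few groups as possible."""
    results = []
    for key in sorted(keys):
        for group in row_groups:
            # Prune groups whose stats exclude the key (mimics metadata pruning).
            if key < group.min_key or key > group.max_key:
                continue
            group_keys = [k for k, _ in group.rows]
            lo = bisect_left(group_keys, key)
            hi = bisect_right(group_keys, key)
            # lo == hi: key absent (req. 3); hi - lo > 1: duplicate keys (req. 1).
            results.extend(group.rows[lo:hi])
    return results
```

In a real implementation the pruning step would consult the file footer once and batch the surviving page reads, so that each lookup costs a small, bounded number of storage RPCs (requirement 5).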