[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758787#comment-17758787
 ] 

Lin Liu edited comment on HUDI-6712 at 8/25/23 1:31 AM:
--------------------------------------------------------

During our process of moving lake_plumber code into Hudi, we found that the 
parquet version in lake_plumber is 1.13.1, while Hudi uses 1.10.1 for Spark 2 
and 1.12.2 for Spark 3. Though we cannot ignore compilation against Spark 2, I 
have done a few checks:
 # ParquetRewriter compiles against parquet 1.12.2, and its benchmark ran on a 
file written with 1.10.1 without any issues (the benchmark finished 
successfully).
 # ParquetKeyedLookup compiles against parquet 1.12.2; however, its benchmark 
fails to compile. After commenting out the failing part, the benchmark threw a 
NullPointerException. After checking, the error is related to the page index. 
Will dig deeper.  



> Implement optimized keyed lookup on parquet files
> -------------------------------------------------
>
>                 Key: HUDI-6712
>                 URL: https://issues.apache.org/jira/browse/HUDI-6712
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Vinoth Chandar
>            Assignee: Lin Liu
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Parquet performs poorly when performing a lookup of specific records, based 
> on a single key lookup column. 
> e.g: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g: List<Records> lookup(parquetFile, Set<String> keys) (code) 
> Let's implement a reader that is optimized for this pattern, by scanning the 
> least amount of data. 
> Requirements: 
> 1. Need to support multiple values for the same key. 
> 2. Can assume the file is sorted by the key/lookup field. 
> 3. Should handle non-existence of keys.
> 4. Should leverage parquet metadata (bloom filters, column index, ...) to 
> minimize the data read. 
> 5. Must do the minimum amount of RPC calls to cloud storage.
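
The lookup semantics the requirements above describe can be sketched as follows. This is only an illustration of the contract (sorted key column, multiple values per key, silently skipped missing keys, minimal entries touched via binary search); the class and record names are hypothetical, and real Parquet I/O with row-group pruning, bloom filters, and column/offset indexes is deliberately out of scope:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the lookup contract from the issue description:
// given records sorted by the key column, return every record whose key is
// in the requested set, inspecting as few entries as possible.
public class KeyedLookupSketch {
    public record Record(String key, String value) {}

    public static List<Record> lookup(List<Record> sorted, Set<String> keys) {
        List<Record> out = new ArrayList<>();
        for (String k : new TreeSet<>(keys)) {       // visit lookup keys in sorted order
            int lo = lowerBound(sorted, k);          // first index with key >= k
            // Collect all consecutive matches: supports multiple values per key;
            // a non-existent key matches nothing and is silently skipped.
            for (int i = lo; i < sorted.size() && sorted.get(i).key().equals(k); i++) {
                out.add(sorted.get(i));
            }
        }
        return out;
    }

    // Classic lower-bound binary search over the sorted key column.
    private static int lowerBound(List<Record> sorted, String k) {
        int lo = 0, hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).key().compareTo(k) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

In an actual reader, the binary search would be replaced by pruning row groups and pages through the Parquet footer metadata, bloom filters, and column index, so that only the pages that can contain the requested keys are fetched from cloud storage.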



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
