[jira] [Commented] (HUDI-6712) Implement optimized keyed lookup on parquet files

2023-09-20 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767225#comment-17767225
 ] 

Lin Liu commented on HUDI-6712:
---

Fixed the unit test failures caused by version conflicts in the Parquet 
format. Re-pushed and waiting to see whether the unit tests pass.

> Implement optimized keyed lookup on parquet files
> -
>
> Key: HUDI-6712
> URL: https://issues.apache.org/jira/browse/HUDI-6712
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Vinoth Chandar
> Assignee: Lin Liu
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.0.0
>
>
> Parquet performs poorly when looking up specific records by a single key 
> column. 
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List lookup(parquetFile, Set keys) (code) 
> Let's implement a reader that is optimized for this pattern by scanning the 
> least amount of data. 
> Requirements: 
> 1. Need to support multiple values for the same key. 
> 2. Can assume the file is sorted by the key/lookup field. 
> 3. Should handle non-existence of keys.
> 4. Should leverage Parquet metadata (bloom filters, column index, ...) to 
> minimize the data read. 
> 5. Must make the minimum number of RPC calls to cloud storage.
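
The requirements above can be illustrated with a minimal sketch. It is not the 
reader proposed in this ticket and uses no Parquet library: it models a 
key-sorted file as in-memory pages carrying min/max keys (standing in for the 
Parquet column index) and shows how those statistics let a lookup skip pages. 
The names Page, build_pages, and lookup are hypothetical, and bloom filters 
and RPC batching are left out.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # (key, value); hypothetical record shape


@dataclass
class Page:
    """Hypothetical stand-in for a data page plus its min/max statistics."""
    min_key: str
    max_key: str
    records: List[Record]  # sorted by key


def build_pages(sorted_records: List[Record], page_size: int) -> List[Page]:
    """Split key-sorted records into fixed-size pages with min/max keys."""
    pages = []
    for start in range(0, len(sorted_records), page_size):
        chunk = sorted_records[start:start + page_size]
        pages.append(Page(chunk[0][0], chunk[-1][0], chunk))
    return pages


def lookup(pages: List[Page], keys: Iterable[str]) -> Tuple[Dict[str, List[str]], int]:
    """Return the values found for each key and the number of pages touched."""
    # Because the file is sorted, both statistic lists are sorted as well,
    # so candidate pages can be located by binary search.
    min_keys = [p.min_key for p in pages]
    max_keys = [p.max_key for p in pages]
    pages_read = set()
    found: Dict[str, List[str]] = {}
    for key in sorted(set(keys)):
        first = bisect_left(max_keys, key)   # first page with max_key >= key
        last = bisect_right(min_keys, key)   # one past last page with min_key <= key
        values: List[str] = []
        for i in range(first, last):
            pages_read.add(i)  # only these pages would need to be fetched
            page_keys = [k for k, _ in pages[i].records]
            lo = bisect_left(page_keys, key)
            hi = bisect_right(page_keys, key)
            # A key may repeat within a page and across adjacent pages.
            values.extend(v for _, v in pages[i].records[lo:hi])
        found[key] = values  # empty list when the key does not exist
    return found, len(pages_read)
```

A key that falls in the gap between two pages touches no page at all, while a 
missing key that lies inside a page's min/max range still costs one page read; 
a bloom filter is the kind of metadata that could avoid that read.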



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-6712) Implement optimized keyed lookup on parquet files

2023-09-11 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763985#comment-17763985
 ] 

Vinoth Chandar commented on HUDI-6712:
--

I actually had a bunch of comments on tests. 



[jira] [Commented] (HUDI-6712) Implement optimized keyed lookup on parquet files

2023-08-25 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17759150#comment-17759150
 ] 

Lin Liu commented on HUDI-6712:
---

Compared the metadata of two Parquet files, one written with 1.10.1 and one 
with 1.12.3; both report format version 1.0. I assume "format version" means 
the file layout, so I am now focusing on fixing the problems between 1.13.1 
and 1.12.2; after that, I will migrate the code to Hudi.



[jira] [Commented] (HUDI-6712) Implement optimized keyed lookup on parquet files

2023-08-24 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17758787#comment-17758787
 ] 

Lin Liu commented on HUDI-6712:
---

While moving the lake_plumber code into Hudi, we found that the Parquet 
version in lake_plumber is 1.13.1, whereas Hudi uses 1.10.1 for Spark 2 and 
1.12.2 for Spark 3. Though we cannot ignore compilation for Spark 2, I have 
done a few checks:
 # ParquetRewriter compiles against Parquet 1.12.2, and its benchmark ran on 
a 1.10.1 file without any issues (the benchmark finished successfully).
 # ParquetKeyedLookup compiles against Parquet 1.12.2; however, its benchmark 
fails to compile. After removing the failing part, the benchmark threw a 
NullPointerException. After checking, the error is related to the page index. 
Will dig deeper.



[jira] [Commented] (HUDI-6712) Implement optimized keyed lookup on parquet files

2023-08-21 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757157#comment-17757157
 ] 

Lin Liu commented on HUDI-6712:
---

Based on these PRs (e.g., [https://github.com/onehouseinc/lake-plumber/pull/7], 
[https://github.com/onehouseinc/lake-plumber/pull/5], 
[https://github.com/onehouseinc/lake-plumber/pull/8], 
[https://github.com/onehouseinc/lake-plumber/pull/9], etc.), I will update the 
corresponding logic and run experiments.



[jira] [Commented] (HUDI-6712) Implement optimized keyed lookup on parquet files

2023-08-21 Thread Lin Liu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757003#comment-17757003
 ] 

Lin Liu commented on HUDI-6712:
---

[~rmahindra] sent some PRs to review for context. Will finish reading them 
and start writing the design doc today.
