[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6712:
    Sprint: Sprint 2024-03-25, Sprint 2024-04-26, 2024/06/17-30, 2024/06/03-16 (was: Sprint 2024-03-25, Sprint 2024-04-26, 2024/06/03-16)

> Implement optimized keyed lookup on parquet files
> -------------------------------------------------
>
>                 Key: HUDI-6712
>                 URL: https://issues.apache.org/jira/browse/HUDI-6712
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Vinoth Chandar
>            Assignee: Lin Liu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
> Parquet performs poorly when looking up specific records based on a
> single key lookup column.
> e.g.: select * from parquet where key in ("a", "b", "c") (SQL)
> e.g.: List lookup(parquetFile, Set keys) (code)
> Let's implement a reader that is optimized for this pattern by scanning
> the least amount of data.
> Requirements:
> 1. Need to support multiple values for the same key.
> 2. Can assume the file is sorted by the key/lookup field.
> 3. Should handle non-existence of keys.
> 4. Should leverage Parquet metadata (bloom filters, column index, ...) to
> minimize the amount of data read.
> 5. Must make the minimum number of RPC calls to cloud storage.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
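The requirements in the issue description can be illustrated with a small in-memory sketch. This is not the Hudi implementation and does not use any Parquet library: `RowGroup`, `lookup`, and `example_file` are hypothetical names, and each row group is modeled as a sorted list of (key, value) rows whose first and last keys stand in for the min/max statistics that Parquet metadata would provide. Under the sorted-file assumption (requirement 2), a binary search over the per-group maximum keys picks only the row groups that can contain a key, so duplicates that span groups are still found (requirement 1) and absent keys return nothing (requirement 3).

```python
from bisect import bisect_left, bisect_right


class RowGroup:
    """Hypothetical stand-in for one Parquet row group.

    `rows` is a non-empty list of (key, value) tuples sorted by key.
    min_key/max_key play the role of column statistics in the footer.
    """

    def __init__(self, rows):
        self.rows = rows
        self.keys = [k for k, _ in rows]
        self.min_key = self.keys[0]
        self.max_key = self.keys[-1]


def lookup(row_groups, keys):
    """Return (matching rows, indices of row groups that were touched).

    Assumes the row groups are ordered so that keys are non-decreasing
    across the whole file.
    """
    max_keys = [rg.max_key for rg in row_groups]
    matches = []
    touched = set()
    for key in sorted(keys):
        # First row group whose max key is >= key; earlier groups cannot match.
        i = bisect_left(max_keys, key)
        # A run of duplicate keys may continue into the following row groups.
        while i < len(row_groups) and row_groups[i].min_key <= key:
            rg = row_groups[i]
            touched.add(i)
            lo = bisect_left(rg.keys, key)
            hi = bisect_right(rg.keys, key)
            matches.extend(rg.rows[lo:hi])
            i += 1
    return matches, touched


# Example data: key "b" spans two row groups; "c" and "z" do not exist.
example_file = [
    RowGroup([("a", 1), ("b", 2), ("b", 3)]),
    RowGroup([("b", 4), ("d", 5)]),
    RowGroup([("e", 6), ("g", 7)]),
]
rows, groups_read = lookup(example_file, {"b", "c", "z"})
```

In this example the third row group is never touched, which is the in-memory analogue of skipping a range read against cloud storage. A real reader would presumably also consult bloom filters and the column index (requirement 4) before fetching any pages; that part is not modeled here.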
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6712:
    Sprint: Sprint 2024-03-25, Sprint 2023-04-26, Sprint 2023-04-28 (was: Sprint 2024-03-25, Sprint 2023-04-26)
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6712:
    Sprint: Sprint 2024-03-25, Sprint 2023-04-26 (was: Sprint 2024-03-25, Sprint 2023-04-26, Sprint 2023-04-27)
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-6712:
    Sprint: Sprint 2024-03-25, Sprint 2023-04-26, Sprint 2023-04-27 (was: Sprint 2024-03-25, Sprint 2023-04-26)
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6712:
    Status: Open (was: Patch Available)
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6712:
    Sprint: Sprint 2024-03-25, Sprint 2023-04-26 (was: Sprint 2024-03-25)
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6712:
    Sprint: Sprint 2024-03-25
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6712:
    Epic Link: HUDI-6242
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6712:
    Status: Patch Available (was: In Progress)
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-6712:
    Reviewers: Vinoth Chandar
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-6712:
    Labels: pull-request-available (was: )
[jira] [Updated] (HUDI-6712) Implement optimized keyed lookup on parquet files
[ https://issues.apache.org/jira/browse/HUDI-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lin Liu updated HUDI-6712:
    Status: In Progress (was: Open)