[jira] [Updated] (HUDI-512) Support for Index functions on columns to generate logical or micro partitioning

ASF GitHub Bot (Jira) Tue, 08 Mar 2022 22:49:05 -0800


     [ 
https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated HUDI-512:
--------------------------------
    Labels: features pull-request-available  (was: features)

> Support for Index functions on columns to generate logical or micro 
> partitioning
> --------------------------------------------------------------------------------
>
>                 Key: HUDI-512
>                 URL: https://issues.apache.org/jira/browse/HUDI-512
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Common Core
>    Affects Versions: 0.9.0
>            Reporter: Alexander Filipchik
>            Assignee: Alexey Kudinkin
>            Priority: Blocker
>              Labels: features, pull-request-available
>             Fix For: 0.11.0
>
>
> h3. *--- Updated Description ---*
> As part of this effort we're planning to (at the very least) support a suite 
> of standard Spark functions when evaluating Data Filtering expressions w/in 
> Data Skipping flow, for ex: when user is issuing a following query 
>  
> {code:java}
> SELECT ... WHERE date_format(ts, 'dd-mm-yyyy') > '01-01-2022'
> {code}
> We're able to relate such query to our Column Stats Index appropriately, 
> therefore being able to do Data Skipping not only on the "raw" columns, but 
> also upon simple derivative expressions on top of them (like standard 
> function calls){*}{*}
>  
> h3. *--- Original Description ---*
> This one is more inspirational, but, I believe, will be very useful. 
> Currently hudi is following Hive table format, which means that data is 
> logically and physically partitioned into folder structure like:
> table_name
>   2019
>     01
>     02
>        bla.parquet
>  
> This has several issues:
>  1) Modern object stores (AWS S3, GCP) are more performant when each file 
> name starts with some kind of a random value. By definition Hive layout is 
> not perfect
> 2) Hive Metastore stores partitions in the text field in the single table (2 
> tables with very similar information) and doesn't support proper filtering. 
> Data partitioned by day will be stored like:
> 2019/01/10
> 2019/01/11
> so only regexp queries are suported (at least in Hive 2.X.X)
> 3) Having a single POF which relies on non distributed DB is dangerous and 
> creates bottlenecks. 
>  
> The idea is to get rid of logical partitioning all together (and hive 
> metastore as well). If dataset has a time columns, user should be able to 
> query it without understanding what is the physical layout of the table (by 
> specifying those partitions explicitly or ending up with a full table scan 
> accidentally).
> It will require some kind of mapping of time to file locations (similar to 
> Iceberg). I'm also leaning towards the idea that storing table metadata with 
> the table is a good thing as it can be read by the engine in one shot and 
> will be faster that taxing a standalone metastore. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Updated] (HUDI-512) Support for Index functions on columns to generate logical or micro partitioning

Reply via email to