[ https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-512: -------------------------------- Labels: features pull-request-available (was: features) > Support for Index functions on columns to generate logical or micro > partitioning > -------------------------------------------------------------------------------- > > Key: HUDI-512 > URL: https://issues.apache.org/jira/browse/HUDI-512 > Project: Apache Hudi > Issue Type: Task > Components: Common Core > Affects Versions: 0.9.0 > Reporter: Alexander Filipchik > Assignee: Alexey Kudinkin > Priority: Blocker > Labels: features, pull-request-available > Fix For: 0.11.0 > > > h3. *--- Updated Description ---* > As part of this effort we're planning to (at the very least) support a suite > of standard Spark functions when evaluating Data Filtering expressions w/in > Data Skipping flow, for ex: when user is issuing a following query > > {code:java} > SELECT ... WHERE date_format(ts, 'dd-mm-yyyy') > '01-01-2022' > {code} > We're able to relate such query to our Column Stats Index appropriately, > therefore being able to do Data Skipping not only on the "raw" columns, but > also upon simple derivative expressions on top of them (like standard > function calls){*}{*} > > h3. *--- Original Description ---* > This one is more inspirational, but, I believe, will be very useful. > Currently hudi is following Hive table format, which means that data is > logically and physically partitioned into folder structure like: > table_name > 2019 > 01 > 02 > bla.parquet > > This has several issues: > 1) Modern object stores (AWS S3, GCP) are more performant when each file > name starts with some kind of a random value. By definition Hive layout is > not perfect > 2) Hive Metastore stores partitions in the text field in the single table (2 > tables with very similar information) and doesn't support proper filtering. > Data partitioned by day will be stored like: > 2019/01/10 > 2019/01/11 > so only regexp queries are suported (at least in Hive 2.X.X) > 3) Having a single POF which relies on non distributed DB is dangerous and > creates bottlenecks. > > The idea is to get rid of logical partitioning all together (and hive > metastore as well). If dataset has a time columns, user should be able to > query it without understanding what is the physical layout of the table (by > specifying those partitions explicitly or ending up with a full table scan > accidentally). > It will require some kind of mapping of time to file locations (similar to > Iceberg). I'm also leaning towards the idea that storing table metadata with > the table is a good thing as it can be read by the engine in one shot and > will be faster that taxing a standalone metastore. -- This message was sent by Atlassian Jira (v8.20.1#820001)