[ https://issues.apache.org/jira/browse/HUDI-512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar updated HUDI-512: -------------------------------- Sprint: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7, Hudi-Sprint-Feb-14 (was: Hudi-Sprint-Jan-24, Hudi-Sprint-Jan-31, Hudi-Sprint-Feb-7) > Support for Index functions on columns to generate logical or micro > partitioning > -------------------------------------------------------------------------------- > > Key: HUDI-512 > URL: https://issues.apache.org/jira/browse/HUDI-512 > Project: Apache Hudi > Issue Type: Task > Components: Common Core > Affects Versions: 0.9.0 > Reporter: Alexander Filipchik > Assignee: Sagar Sumit > Priority: Blocker > Labels: features > Fix For: 0.11.0 > > > This one is more inspirational, but, I believe, will be very useful. > Currently hudi is following Hive table format, which means that data is > logically and physically partitioned into folder structure like: > table_name > 2019 > 01 > 02 > bla.parquet > > This has several issues: > 1) Modern object stores (AWS S3, GCP) are more performant when each file > name starts with some kind of a random value. By definition Hive layout is > not perfect > 2) Hive Metastore stores partitions in the text field in the single table (2 > tables with very similar information) and doesn't support proper filtering. > Data partitioned by day will be stored like: > 2019/01/10 > 2019/01/11 > so only regexp queries are suported (at least in Hive 2.X.X) > 3) Having a single POF which relies on non distributed DB is dangerous and > creates bottlenecks. > > The idea is to get rid of logical partitioning all together (and hive > metastore as well). If dataset has a time columns, user should be able to > query it without understanding what is the physical layout of the table (by > specifying those partitions explicitly or ending up with a full table scan > accidentally). > It will require some kind of mapping of time to file locations (similar to > Iceberg). I'm also leaning towards the idea that storing table metadata with > the table is a good thing as it can be read by the engine in one shot and > will be faster that taxing a standalone metastore. -- This message was sent by Atlassian Jira (v8.20.1#820001)