[ https://issues.apache.org/jira/browse/HUDI-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xy updated HUDI-6950:
---------------------
    Fix Version/s: 0.14.1

> query should process listed partitions avoid driver oom due to large number files in table
> ------------------------------------------------------------------------------------------
>
>                 Key: HUDI-6950
>                 URL: https://issues.apache.org/jira/browse/HUDI-6950
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: spark-sql
>    Affects Versions: 0.14.0
>            Reporter: xy
>            Priority: Critical
>             Fix For: 0.14.1
>
>         Attachments: before_fix_dump_filestatus.jpg
>
>
> Currently, querying a table with many partitions can easily cause a driver OOM.
> For example:
> CREATE TABLE hudi_test.tmp_hudi_test_1 (
>   id string,
>   name string,
>   dt bigint,
>   day STRING COMMENT 'date partition',
>   hour INT COMMENT 'hour partition'
> ) using hudi
> OPTIONS ('hoodie.datasource.write.hive_style_partitioning' 'false',
>   'hoodie.datasource.meta.sync.enable' 'false',
>   'hoodie.datasource.hive_sync.enable' 'false')
> tblproperties (
>   'primaryKey' = 'id',
>   'type' = 'mor',
>   'preCombineField' = 'dt',
>   'hoodie.index.type' = 'BUCKET',
>   'hoodie.bucket.index.hash.field' = 'id',
>   'hoodie.bucket.index.num.buckets' = 512
> )
> PARTITIONED BY (day, hour);
>
> select count(1) from hudi_test.tmp_hudi_test_1 where day='2023-10-17'
> would list a large number of file statuses on the driver, and the driver
> would run out of memory (for example, for a table with hundreds of billions
> of records in the partition day='2023-10-17').

--
This message was sent by Atlassian Jira
(v8.20.10#820010)