Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage. Why HoodieGlobalSimpleIndex? [hudi]
zyclove commented on issue #10235: URL: https://github.com/apache/hudi/issues/10235#issuecomment-1838182121

In `SparkMetadataTableRecordIndex`:

`fileGroupSize = hoodieTable.getMetadataTable().getNumFileGroupsForPartition(MetadataPartitionType.RECORD_INDEX);`

Why is `fileGroupSize` not 512? Besides adjusting the number of buckets in the upstream source table, is there any other way to tune this?

![image](https://github.com/apache/hudi/assets/15028279/9e839ace-3e3a-485f-b950-64735cd2bd3f)

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
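For context on the question above: the record index's file group count is decided when the `RECORD_INDEX` partition of the metadata table is first initialized, so setting a larger minimum afterwards on an already-initialized table generally has no effect until the index is rebuilt. A hedged sketch of the relevant write-side options (property names as they appear in the Hudi 0.14 configuration docs; verify against your release):

```sql
-- Sketch, not a definitive fix: these bounds are only consulted when the
-- RECORD_INDEX metadata partition is (re)initialized, not on later writes.
set hoodie.metadata.record.index.min.filegroup.count=512;
set hoodie.metadata.record.index.max.filegroup.count=1024;
```

If the index was already initialized with a smaller file group count, rebuilding it (e.g. by deleting and re-initializing the record index partition) would likely be required for the new bounds to apply.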
zyclove commented on issue #10235: URL: https://github.com/apache/hudi/issues/10235#issuecomment-1838138200

@danny0405 With `set hoodie.metadata.enable=true`, the index is now RECORD_INDEX, but the following stage is still very slow.

![image](https://github.com/apache/hudi/assets/15028279/fa20c388-0dbb-4f31-80b7-6937de7de7f2)
![image](https://github.com/apache/hudi/assets/15028279/2e4255f2-7a2c-451f-b4c5-b3bd4b9d8bbb)
![image](https://github.com/apache/hudi/assets/15028279/e071415f-bcdd-4640-9c73-ee396494462d)
danny0405 commented on issue #10235: URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837959640

The property is `hoodie.metadata.enable`, not `hoodie.metadata.table`.
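Applying this correction to the session settings from the issue would look roughly like the following (a sketch; the issue used `hoodie.metadata.table`, which Hudi does not read, so the metadata table and record index were never actually enabled):

```sql
-- Corrected property name per the comment above:
set hoodie.metadata.enable=true;
set hoodie.metadata.record.index.enable=true;
set hoodie.index.type=RECORD_INDEX;
```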
zyclove commented on issue #10235: URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837949851

@danny0405 Why does it fall back to GLOBAL_SIMPLE?

![image](https://github.com/apache/hudi/assets/15028279/20107e0d-46eb-4e28-9a5a-0fc8750cbc34)

23/12/04 14:39:29 WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records
[I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage. Why HoodieGlobalSimpleIndex? [hudi]
zyclove opened a new issue, #10235: URL: https://github.com/apache/hudi/issues/10235

**Describe the problem you faced**

The Spark job is too slow in the following stage. Adjusting CPU, memory, and concurrency has no effect. Which stage can be optimized or skipped?

![image](https://github.com/apache/hudi/assets/15028279/e4122bc3-e02b-4f01-9010-737300b85bed)

Is this normal? Why is HoodieGlobalSimpleIndex still used?

![image](https://github.com/apache/hudi/assets/15028279/89cb305f-bc23-40a7-ac00-0adab5933b53)

**To Reproduce**

Steps to reproduce the behavior:

1. Create the table:

```
CREATE TABLE IF NOT EXISTS bi_dw_real.smart_datapoint_report_rw_clear_rt (
  id STRING COMMENT 'id',
  uuid STRING COMMENT 'log uuid',
  data_id STRING COMMENT '',
  dev_id STRING COMMENT '',
  gw_id STRING COMMENT '',
  product_id STRING COMMENT '',
  uid STRING COMMENT '',
  dp_code STRING COMMENT '',
  dp_id STRING COMMENT '',
  dp_mode STRING COMMENT '',
  dp_name STRING COMMENT '',
  dp_time STRING COMMENT '',
  dp_type STRING COMMENT '',
  dp_value STRING COMMENT '',
  gmt_modified BIGINT COMMENT 'ct time',
  dt STRING COMMENT 'time partition field'
) USING hudi
PARTITIONED BY (dt, dp_mode)
COMMENT ''
LOCATION '${bi_db_dir}/bi_ods_real/ods_smart_datapoint_report_rw_clear_rt'
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  preCombineField = 'gmt_modified',
  hoodie.combine.before.upsert = 'false',
  hoodie.metadata.record.index.enable = 'true',
  hoodie.datasource.write.operation = 'upsert',
  hoodie.metadata.table = 'true',
  hoodie.datasource.write.hive_style_partitioning = 'true',
  hoodie.metadata.record.index.min.filegroup.count = '512',
  hoodie.index.type = 'RECORD_INDEX',
  hoodie.compact.inline = 'false',
  hoodie.common.spillable.diskmap.type = 'ROCKS_DB',
  hoodie.datasource.write.partitionpath.field = 'dt,dp_mode',
  hoodie.compaction.payload.class = 'org.apache.hudi.common.model.PartialUpdateAvroPayload'
);
```

2. Set the session options:

```
set hoodie.write.lock.zookeeper.lock_key=bi_ods_real.smart_datapoint_report_rw_clear_rt;
set hoodie.storage.layout.type=DEFAULT;
set hoodie.metadata.record.index.enable=true;
set hoodie.metadata.table=true;
set hoodie.populate.meta.fields=false;
set hoodie.parquet.compression.codec=snappy;
set hoodie.memory.merge.max.size=200485760;
set hoodie.write.buffer.limit.bytes=419430400;
set hoodie.index.type=RECORD_INDEX;
```

3. Insert into bi_dw_real.smart_datapoint_report_rw_clear_rt.

**Expected behavior**

A clear and concise description of what you expected to happen.

**Environment Description**

* Hudi version : 0.14.0
* Spark version : 3.2.1
* Hive version : 3.1.3
* Hadoop version : 3.2.2
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no
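One plausible cause of the GLOBAL_SIMPLE fallback seen in this issue is the property name used in the settings above: `hoodie.metadata.table` is not a config Hudi reads, so the metadata table may never be enabled and the record index never initialized. A hedged, corrected version of the key session settings (property names per the Hudi 0.14 configuration docs):

```sql
-- Sketch of corrected settings; was: set hoodie.metadata.table=true;
set hoodie.metadata.enable=true;
set hoodie.metadata.record.index.enable=true;
set hoodie.index.type=RECORD_INDEX;
-- The record index is only built on a write after these are in effect;
-- until then, tagging falls back to GLOBAL_SIMPLE, matching the WARN log
-- "Record index not initialized so falling back to GLOBAL_SIMPLE".
```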