Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage. Why is HoodieGlobalSimpleIndex used? [hudi]

2023-12-04 Thread via GitHub


zyclove commented on issue #10235:
URL: https://github.com/apache/hudi/issues/10235#issuecomment-1838182121

   In SparkMetadataTableRecordIndex:

   fileGroupSize = hoodieTable.getMetadataTable().getNumFileGroupsForPartition(MetadataPartitionType.RECORD_INDEX);

   Why isn't fileGroupSize 512 here? Besides adjusting the number of buckets in
   the upstream source table, is there any other way to tune it?

   ![image](https://github.com/apache/hudi/assets/15028279/dce9392f-1199-46f4-b126-d44a068beba1)
   
![image](https://github.com/apache/hudi/assets/15028279/9e839ace-3e3a-485f-b950-64735cd2bd3f)
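
   A possible explanation, assuming Hudi 0.14 behavior (an assumption, not confirmed in this thread): the file group count of the RECORD_INDEX partition is decided once, when the partition is first initialized, and getNumFileGroupsForPartition() only reads that persisted count back, so a min.filegroup.count set after initialization has no effect. A minimal sketch of the initialization-time settings follows; hoodie.metadata.record.index.max.filegroup.count is assumed here, only the min-count key appears in this issue:

   ```
   -- sketch, assuming Hudi 0.14: these settings only matter while the
   -- RECORD_INDEX partition of the metadata table is first being built;
   -- afterwards the file group count is fixed and is simply read back by
   -- getNumFileGroupsForPartition()
   set hoodie.metadata.enable=true;
   set hoodie.metadata.record.index.enable=true;
   set hoodie.metadata.record.index.min.filegroup.count=512;
   -- assumed key: caps the file group count estimated at initialization
   set hoodie.metadata.record.index.max.filegroup.count=10000;
   ```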
   
   





Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage. Why is HoodieGlobalSimpleIndex used? [hudi]

2023-12-04 Thread via GitHub


zyclove commented on issue #10235:
URL: https://github.com/apache/hudi/issues/10235#issuecomment-1838138200

   @danny0405 With hoodie.metadata.enable=true set, the index type is now RECORD_INDEX.
   But the following stage is still very slow.
   
   
![image](https://github.com/apache/hudi/assets/15028279/fa20c388-0dbb-4f31-80b7-6937de7de7f2)
   
   
![image](https://github.com/apache/hudi/assets/15028279/2e4255f2-7a2c-451f-b4c5-b3bd4b9d8bbb)
   
![image](https://github.com/apache/hudi/assets/15028279/e071415f-bcdd-4640-9c73-ee396494462d)
   
   





Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage. Why is HoodieGlobalSimpleIndex used? [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on issue #10235:
URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837959640

   hoodie.metadata.table -> hoodie.metadata.enable
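
   In other words, the metadata table is switched on with hoodie.metadata.enable; hoodie.metadata.table, used in the issue's settings, is not the controlling key. A minimal corrected sketch:

   ```
   -- per the reply above: replace the ineffective key with the real one
   -- set hoodie.metadata.table=true;   -- not the controlling config
   set hoodie.metadata.enable=true;     -- enables the metadata table
   ```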





Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage. Why is HoodieGlobalSimpleIndex used? [hudi]

2023-12-03 Thread via GitHub


zyclove commented on issue #10235:
URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837949851

   @danny0405 Why does it fall back to GLOBAL_SIMPLE?
   
![image](https://github.com/apache/hudi/assets/15028279/20107e0d-46eb-4e28-9a5a-0fc8750cbc34)
   
   23/12/04 14:39:29 WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records
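
   A plausible reading of this warning, assuming Hudi 0.14 behavior: the RECORD_INDEX partition is built during the first commit that runs with both the metadata table and the record index enabled, and tagging falls back to GLOBAL_SIMPLE until such a commit completes. A hedged sketch of the settings under which the next commit should initialize the partition, after which the warning should stop appearing:

   ```
   -- assumption: one completed commit with all three settings active builds
   -- the RECORD_INDEX partition; later writes can then tag via the record index
   set hoodie.metadata.enable=true;
   set hoodie.metadata.record.index.enable=true;
   set hoodie.index.type=RECORD_INDEX;
   ```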
   





Re: [I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage. Why is HoodieGlobalSimpleIndex used? [hudi]

2023-12-03 Thread via GitHub


zyclove commented on issue #10235:
URL: https://github.com/apache/hudi/issues/10235#issuecomment-1837946117

   @danny0405 
   Why does it fall back to GLOBAL_SIMPLE?
   ![image](https://github.com/apache/hudi/assets/15028279/9cddf011-e25c-4c0f-9b40-c2d7fdd17cf9)
   
   23/12/04 14:39:29 WARN SparkMetadataTableRecordIndex: Record index not initialized so falling back to GLOBAL_SIMPLE for tagging records
   





[I] [SUPPORT] hudi RECORD_INDEX is too slow in "Building workload profile" stage. Why is HoodieGlobalSimpleIndex used? [hudi]

2023-12-03 Thread via GitHub


zyclove opened a new issue, #10235:
URL: https://github.com/apache/hudi/issues/10235

   
   **Describe the problem you faced**
   
   The Spark job is too slow in the following stage. Adjusting CPU, memory, and
   concurrency has no effect.
   Which stage can be optimized or skipped?
   
   
![image](https://github.com/apache/hudi/assets/15028279/e4122bc3-e02b-4f01-9010-737300b85bed)
   
   Is this normal? Why is HoodieGlobalSimpleIndex still being used?
   
![image](https://github.com/apache/hudi/assets/15028279/89cb305f-bc23-40a7-ac00-0adab5933b53)
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. table config
   ```
   CREATE  TABLE if NOT EXISTS bi_dw_real.smart_datapoint_report_rw_clear_rt(
 id STRING COMMENT 'id',
 uuid STRING COMMENT 'log uuid',
 data_id STRING COMMENT '',
 dev_id STRING COMMENT '',
 gw_id STRING COMMENT '',
 product_id STRING COMMENT '',
 uid STRING COMMENT '',
 dp_code STRING COMMENT '',
 dp_id STRING COMMENT '',
 dp_mode STRING COMMENT '',
 dp_name STRING COMMENT '',
 dp_time STRING COMMENT '',
 dp_type STRING COMMENT '',
 dp_value STRING COMMENT '',
 gmt_modified BIGINT COMMENT 'ct time',
 dt STRING COMMENT 'time partition field'
   )
   using hudi 
   PARTITIONED BY (dt,dp_mode)
   COMMENT ''
   location '${bi_db_dir}/bi_ods_real/ods_smart_datapoint_report_rw_clear_rt'
   tblproperties (
 type = 'mor',
 primaryKey = 'id',
 preCombineField = 'gmt_modified',
 hoodie.combine.before.upsert='false',
 hoodie.metadata.record.index.enable='true',
 hoodie.datasource.write.operation='upsert',
 hoodie.metadata.table='true',
 hoodie.datasource.write.hive_style_partitioning='true',
 hoodie.metadata.record.index.min.filegroup.count='512',
 hoodie.index.type='RECORD_INDEX',
 hoodie.compact.inline='false',
 hoodie.common.spillable.diskmap.type='ROCKS_DB',
 hoodie.datasource.write.partitionpath.field='dt,dp_mode',
 hoodie.compaction.payload.class='org.apache.hudi.common.model.PartialUpdateAvroPayload'
   )
   ;
   
   set hoodie.write.lock.zookeeper.lock_key=bi_ods_real.smart_datapoint_report_rw_clear_rt;
   set hoodie.storage.layout.type=DEFAULT;
   set hoodie.metadata.record.index.enable=true;
   set hoodie.metadata.table=true;
   set hoodie.populate.meta.fields=false;
   set hoodie.parquet.compression.codec=snappy;
   set hoodie.memory.merge.max.size=200485760;
   set hoodie.write.buffer.limit.bytes=419430400;
   set hoodie.index.type=RECORD_INDEX;
   ``` 
   2. insert into bi_dw_real.smart_datapoint_report_rw_clear_rt
   
   
   **Environment Description**
   
   * Hudi version : 0.14.0

   * Spark version : 3.2.1

   * Hive version : 3.1.3

   * Hadoop version : 3.2.2

   * Storage (HDFS/S3/GCS..) : s3

   * Running on Docker? (yes/no) : no
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org