[GitHub] [hudi] hudi-bot removed a comment on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot removed a comment on pull request #4710: URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023944845 ## CI report: * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5573) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot commented on pull request #4710: URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023972079 ## CI report: * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5573) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot commented on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023961928 ## CI report: * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572) * 879e966586fe287e710fb2b9db7a2436fef03a92 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5575) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot removed a comment on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023951714 ## CI report: * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572) * 879e966586fe287e710fb2b9db7a2436fef03a92 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483610#comment-17483610 ] Harsha Teja Kanna commented on HUDI-3335:

Log:
22/01/28 01:29:34 INFO Executor: Adding file:/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/hudi-utilities-bundle_2.12-0.10.1.jar to class loader
22/01/28 01:29:34 INFO Executor: Fetching spark://192.168.86.5:49947/jars/org.spark-project.spark_unused-1.0.0.jar with timestamp 1643354959702
22/01/28 01:29:34 INFO Utils: Fetching spark://192.168.86.5:49947/jars/org.spark-project.spark_unused-1.0.0.jar to /private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/fetchFileTemp5819832321479921719.tmp
22/01/28 01:29:34 INFO Executor: Adding file:/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/org.spark-project.spark_unused-1.0.0.jar to class loader
22/01/28 01:29:34 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 49956.
22/01/28 01:29:34 INFO NettyBlockTransferService: Server created on 192.168.86.5:49956
22/01/28 01:29:34 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/01/28 01:29:34 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.86.5:49956 with 2004.6 MiB RAM, BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:35 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/harshakanna/spark-warehouse/').
22/01/28 01:29:35 INFO SharedState: Warehouse path is 'file:/Users/harshakanna/spark-warehouse/'.
22/01/28 01:29:36 INFO DataSourceUtils: Getting table path..
22/01/28 01:29:36 INFO TablePathUtils: Getting table path from path : s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO DefaultSource: Obtained hudi table path: s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/sessions/.hoodie/hoodie.properties
22/01/28 01:29:36 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
22/01/28 01:29:36 INFO DefaultSource: Loading Base File Only View with options :Map(hoodie.datasource.query.type -> snapshot, hoodie.metadata.enable -> true, path -> s3a://datalake-hudi/sessions/)
22/01/28 01:29:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/sessions
22/01/28 01:29:37 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/sessions/.hoodie/hoodie.properties
22/01/28 01:29:37 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://datalake-hudi/sessions
22/01/28 01:29:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://datalake-hudi/sessions/.hoodie/metadata
22/01/28 01:29:37 INFO HoodieTableConfig: Loading table properties from s3a://datalake-hudi/sessions/.hoodie/metadata/.hoodie/hoodie.properties
22/01/28 01:29:37 INFO HoodieTableMetaClient: Finished Loading Table of type MERGE_ON_READ(version=1, baseFileFormat=HFILE) from s3a://datalake-hudi/sessions/.hoodie/metadata
22/01/28 01:29:37 INFO HoodieTableMetadataUtil: Loading latest merged file slices for metadata table partition files
22/01/28 01:29:38 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[20220126024720121__deltacommit__COMPLETED]}
22/01/28 01:29:38 INFO AbstractTableFileSystemView: Took 2 ms to read 0 instants, 0 replaced file groups
22/01/28 01:29:38 INFO ClusteringUtils: Found 0 files in pending clustering operations
22/01/28 01:29:38 INFO AbstractTableFileSystemView: Building file system view for partition (files)
22/01/28 01:29:38 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=9, NumFileGroups=1, FileGroupsCreationTime=11, StoreTimeTaken=0
22/01/28 01:29:38 INFO CacheConfig: Allocating LruBlockCache size=1.42 GB, blockSize=64 KB
22/01/28 01:29:38 INFO CacheConfig: Created c
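The options printed by DefaultSource above are enough to reproduce this read path. Below is a minimal Scala sketch, assuming a Spark session with the Hudi bundle on the classpath and S3A credentials already configured; the table path and option keys are taken verbatim from the log, everything else is boilerplate:

```scala
// Reproduce the snapshot read that produced the log above.
// Assumes hudi-spark-bundle and hadoop-aws are on the classpath.
import org.apache.spark.sql.SparkSession

object ReadHudiTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-3335-repro")
      .getOrCreate()

    // Option keys mirror the ones DefaultSource logs:
    // query type "snapshot", metadata-table listing enabled.
    val df = spark.read
      .format("hudi")
      .option("hoodie.datasource.query.type", "snapshot")
      .option("hoodie.metadata.enable", "true")
      .load("s3a://datalake-hudi/sessions")

    df.printSchema()
  }
}
```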
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603 ] Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 7:30 AM:

Hi,
1) Yes, the partitions are of the form 'yyyy/mm/dd'. No, using wildcards takes a long time; in fact, we switched to using only the base path following the suggestion in the other ticket https://issues.apache.org/jira/browse/HUDI-3066
2) 'metadata list-partitions': I am not able to run it successfully; it fails parsing the SPARK_MASTER url=yarn or something similar. I will share the info once I can run it.
3) No delete_partition operations were performed.
4) hive_sync is disabled intentionally.
5) 'metadata validate-files' has been running for all partitions (389 in total) for a while now, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of log files scanned => 7
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Size of file spilled to disk => 0
1640607 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Opened 7 metadata log files (dataset instant=20220126024720121, metadata instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
1640808 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO org.apache.hudi.metadata.BaseTableMetadata - Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS and metadata files count not matching for date=2022/01/15. FS files count 19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603 ] Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 7:27 AM:

Hi,
1) Yes, the partitions are of the form 'yyyy/mm/dd'. No, using wildcards takes a long time; in fact, we switched to using only the base path following the suggestion in the other ticket https://issues.apache.org/jira/browse/HUDI-3066
2) 'metadata list-partitions': I am not able to run it successfully; it fails parsing the SPARK_MASTER url=yarn or something similar. I will share the info once I can run it.
3) No delete_partition operations were performed.
4) hive_sync is disabled intentionally.
5) 'metadata validate-files' has been running for all partitions (329 in total) for a while now, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of log files scanned => 7
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Size of file spilled to disk => 0
1640607 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Opened 7 metadata log files (dataset instant=20220126024720121, metadata instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
1640808 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO org.apache.hudi.metadata.BaseTableMetadata - Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS and metadata files count not matching for date=2022/01/15. FS files count 19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f
[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp type read by Hive
hudi-bot commented on pull request #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1023954251 ## CI report: * ecb72b89015831cfbfa99ebcb027f660729b3195 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4792) * 223c320447bc9adc8fccaabb9c590bed159b375d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5574) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #3391: [HUDI-83] Fix Timestamp type read by Hive
hudi-bot removed a comment on pull request #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1023952675 ## CI report: * ecb72b89015831cfbfa99ebcb027f660729b3195 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4792) * 223c320447bc9adc8fccaabb9c590bed159b375d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603 ] Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 7:25 AM:

Hi,
1) Yes, the partitions are of the form 'yyyy/mm/dd'. No, using wildcards takes a long time; in fact, we switched to using only the base path following the suggestion in the other ticket https://issues.apache.org/jira/browse/HUDI-3066
2) 'metadata list-partitions': I am not able to run it successfully; it fails parsing the SPARK_MASTER url=yarn or something similar. I will share the info once I can run it.
3) No delete_partition operations were performed.
4) hive_sync is disabled intentionally.
5) 'metadata validate-files' has been running for all partitions (329 in total) for a while now, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of log files scanned => 7
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Size of file spilled to disk => 0
1640607 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Opened 7 metadata log files (dataset instant=20220126024720121, metadata instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
1640808 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO org.apache.hudi.metadata.BaseTableMetadata - Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS and metadata files count not matching for date=2022/01/15. FS files count 19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4
[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp type read by Hive
hudi-bot commented on pull request #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1023952675 ## CI report: * ecb72b89015831cfbfa99ebcb027f660729b3195 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4792) * 223c320447bc9adc8fccaabb9c590bed159b375d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #3391: [HUDI-83] Fix Timestamp type read by Hive
hudi-bot removed a comment on pull request #3391: URL: https://github.com/apache/hudi/pull/3391#issuecomment-1002400317 ## CI report: * ecb72b89015831cfbfa99ebcb027f660729b3195 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4792) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot commented on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023951714 ## CI report: * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 Azure: [CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572) * 879e966586fe287e710fb2b9db7a2436fef03a92 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot removed a comment on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023949877 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572) * 879e966586fe287e710fb2b9db7a2436fef03a92 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot commented on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023949877 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572) * 879e966586fe287e710fb2b9db7a2436fef03a92 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot removed a comment on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023930597 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603 ] Harsha Teja Kanna commented on HUDI-3335:

Hi,
1) Yes, the partitions are of the form 'yyyy/mm/dd'. No, using wildcards takes a long time; in fact, we switched to using only the base path following the suggestion in the other ticket https://issues.apache.org/jira/browse/HUDI-3066
2) 'metadata list-partitions': I am not able to run it successfully; it fails parsing the SPARK_MASTER url=yarn or something similar. I will share the info once I can run it.
3) No delete_partition operations were performed.
4) hive_sync is disabled intentionally.
5) 'metadata validate-files' has been running for all partitions (329 in total) for a while now, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of log files scanned => 7
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Total size in bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Number of entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner - Size of file spilled to disk => 0
1640607 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Opened 7 metadata log files (dataset instant=20220126024720121, metadata instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor [.gz]
1640808 [Spring Shell] INFO org.apache.hudi.metadata.HoodieBackedTableMetadata - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO org.apache.hudi.metadata.BaseTableMetadata - Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS and metadata files count not matching for date=2022/01/15. FS files count 19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata 5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand - FS file not found in metadata dac6d5d0-ca15-48e2-9b13-f37b4d113e64-1_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi
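For reference, the file-system side of the check behind the "FS file not found in metadata" errors above can be approximated standalone. This is a rough sketch, not the actual MetadataCommand code: the partition path is the one from the log, and `metadataFiles` is a hypothetical stand-in for the listing the metadata table returns (here empty, which mirrors the "#files=0" result reported above):

```scala
// Compare the parquet base files that actually exist under one partition
// against a file listing obtained elsewhere (e.g. from the metadata table).
// Assumes hadoop-aws is on the classpath and S3A credentials are configured.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ComparePartitionListing {
  def fsParquetFiles(partition: String, conf: Configuration): Set[String] = {
    val path = new Path(partition)
    val fs = FileSystem.get(path.toUri, conf)
    fs.listStatus(path)
      .filter(s => s.isFile && s.getPath.getName.endsWith(".parquet"))
      .map(_.getPath.getName)
      .toSet
  }

  def main(args: Array[String]): Unit = {
    val fsFiles = fsParquetFiles("s3a://datalake-hudi/sessions/date=2022/01/15", new Configuration())
    val metadataFiles: Set[String] = Set.empty // stand-in for the metadata-table listing
    (fsFiles -- metadataFiles).foreach(f => println(s"FS file not found in metadata $f"))
    println(s"FS files count ${fsFiles.size}, metadata base files count ${metadataFiles.size}")
  }
}
```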
[GitHub] [hudi] hudi-bot commented on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot commented on pull request #4710: URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023944845 ## CI report: * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5573) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot removed a comment on pull request #4710: URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023943191 ## CI report: * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test
[ https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3088:
Labels: pull-request-available (was: )

> Make Spark 3 the default profile for build and test
> ---
>
> Key: HUDI-3088
> URL: https://issues.apache.org/jira/browse/HUDI-3088
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark
> Reporter: Raymond Xu
> Assignee: Raymond Xu
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
> By default, when people check out the code, the Spark 3 profile should be activated for the repo. Also, all tests should run against the latest supported Spark version. Correspondingly, the default Scala version becomes 2.12 and the default Parquet version becomes 1.12.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
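After such a profile switch, a quick runtime probe confirms which Spark and Scala versions a build actually linked against. A minimal sketch; nothing here is Hudi-specific and the app name is arbitrary:

```scala
// Print the Spark and Scala versions the running build linked against.
import org.apache.spark.sql.SparkSession
import scala.util.Properties

object VersionProbe {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("version-probe")
      .getOrCreate()
    // Expect e.g. "Spark 3.2.x, Scala 2.12.x" once Spark 3.2 is the default.
    println(s"Spark ${spark.version}, Scala ${Properties.versionNumberString}")
    spark.stop()
  }
}
```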
[GitHub] [hudi] hudi-bot commented on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version
hudi-bot commented on pull request #4710: URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023943191 ## CI report: * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot commented on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023930597 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot removed a comment on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023910477 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot commented on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023910477 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN * 8f670f3466a15e536605b67edd5586c152d04035 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot removed a comment on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023904815 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023906918 ## CI report: * c13c56e14dad9fad992fdf4a50e24e45c1539817 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5570) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023872466 ## CI report: * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569) * c13c56e14dad9fad992fdf4a50e24e45c1539817 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5570) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot removed a comment on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023903715 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot commented on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023904815 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot commented on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023903715 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot removed a comment on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023902614 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
hudi-bot commented on pull request #4709: URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023902614 ## CI report: * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3338) Use custom relation instead of HadoopFsRelation
[ https://issues.apache.org/jira/browse/HUDI-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3338:
Labels: pull-request-available (was: )

> Use custom relation instead of HadoopFsRelation
> ---
>
> Key: HUDI-3338
> URL: https://issues.apache.org/jira/browse/HUDI-3338
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark, spark-sql
> Reporter: Yann Byron
> Priority: Major
> Labels: pull-request-available
>
> For HUDI-3204, a COW table and a MOR table in read_optimized query mode should return the original `data_date` in 'yyyy-MM-dd' format, not 'yyyy/MM/dd'.
> The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of COW and the read_optimized query mode of MOR. Spark's HadoopFsRelation appends the partition value derived from the physical partition path. However, unlike a normal table, Hudi persists the partition value in the parquet file, so we just need to read the partition value from the parquet file rather than leaving it to Spark.
> Therefore we should no longer use `HadoopFsRelation`, and instead implement Hudi's own `Relation` to deal with this.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] YannByron opened a new pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation
YannByron opened a new pull request #4709:
URL: https://github.com/apache/hudi/pull/4709

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

## What is the purpose of the pull request
*(For example: This pull request adds quick-start document.)*

## Brief change log
*(for example:)*
- *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request
*(Please pick either of the following options)*
This pull request is a trivial rework / code cleanup without any test coverage.
*(or)*
This pull request is already covered by existing tests, such as *(please describe tests)*.
(or)
This change added tests and can be verified as follows:
*(example:)*
- *Added integration tests for end-to-end.*
- *Added HoodieClientWriteTest to verify the change.*
- *Manually verified the change by running a job locally.*

## Committer checklist
- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3338) Use custom relation instead of HadoopFsRelation
[ https://issues.apache.org/jira/browse/HUDI-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yann Byron updated HUDI-3338:
Description:
For HUDI-3204, a COW table and a MOR table in read_optimized query mode should return the original `data_date` in 'yyyy-MM-dd' format, not 'yyyy/MM/dd'.
The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of COW and the read_optimized query mode of MOR. Spark's HadoopFsRelation appends the partition value derived from the physical partition path. However, unlike a normal table, Hudi persists the partition value in the parquet file, so we just need to read the partition value from the parquet file rather than leaving it to Spark.
Therefore we should no longer use `HadoopFsRelation`, and instead implement Hudi's own `Relation` to deal with this.

> Use custom relation instead of HadoopFsRelation
> ---
>
> Key: HUDI-3338
> URL: https://issues.apache.org/jira/browse/HUDI-3338
> Project: Apache Hudi
> Issue Type: Improvement
> Components: spark, spark-sql
> Reporter: Yann Byron
> Priority: Major
>
> For HUDI-3204, a COW table and a MOR table in read_optimized query mode should return the original `data_date` in 'yyyy-MM-dd' format, not 'yyyy/MM/dd'.
> The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of COW and the read_optimized query mode of MOR. Spark's HadoopFsRelation appends the partition value derived from the physical partition path. However, unlike a normal table, Hudi persists the partition value in the parquet file, so we just need to read the partition value from the parquet file rather than leaving it to Spark.
> Therefore we should no longer use `HadoopFsRelation`, and instead implement Hudi's own `Relation` to deal with this.

-- This message was sent by Atlassian Jira (v8.20.1#820001)
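For readers unfamiliar with the Spark API involved: the proposal amounts to implementing `BaseRelation` directly instead of reusing `HadoopFsRelation`, so partition values come from the parquet files rather than being re-derived from the partition path. The sketch below only illustrates the shape of such a relation; it is not the code in PR #4709, and the class name and constructor are hypothetical:

```scala
// Skeleton of a custom Spark relation of the kind the proposal describes.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

class HudiBaseFileRelation(
    val sqlContext: SQLContext,
    tableSchema: StructType,
    basePath: String) extends BaseRelation with PrunedFilteredScan {

  // Full table schema, including the partition columns that Hudi
  // persists inside the parquet base files.
  override def schema: StructType = tableSchema

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real implementation would resolve the latest file slices under
    // basePath and read the parquet files directly, keeping the persisted
    // partition values instead of re-deriving them from the path.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
```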
[jira] [Created] (HUDI-3338) Use custom relation instead of HadoopFsRelation
Yann Byron created HUDI-3338:
Summary: Use custom relation instead of HadoopFsRelation
Key: HUDI-3338
URL: https://issues.apache.org/jira/browse/HUDI-3338
Project: Apache Hudi
Issue Type: Improvement
Components: spark, spark-sql
Reporter: Yann Byron

-- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-2458) Relax compaction in metadata being fenced based on inflight requests in data table
[ https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-2458: Description: Relax compaction in metadata being fenced based on inflight requests in the data table. Compaction in metadata is triggered only if there are no inflight requests in the data table. This might cause a liveness problem, since for very large deployments we could have compaction or clustering always in progress, so we should try to see how we can relax this constraint. Proposal to remove this dependency: with the recent addition of the spurious deletes config, we can actually get away with this. As of now, we have three interlinked nuances. - Compaction in metadata may not kick in if there are any inflight operations in the data table. - Rollback, when applied to the metadata table, has a dependency on the last compaction instant in the metadata table. We might even throw an exception if the instant being rolled back is earlier than the latest metadata compaction instant time. - Archival in the data table is fenced by the latest compaction in the metadata table. So, in case the data timeline has any dangling inflight operation (let's say someone tried clustering, killed it midway, and never attempted it again), metadata compaction will never kick in at all. I need to check what archival does for such inflight operations in the data table when it tries to archive nearby commits. With the spurious deletes support which we added recently, all of this can be much simplified. Whenever we want to apply a rollback commit, we don't need to take different actions based on whether the commit being rolled back is already committed to the metadata table or not: just go ahead and apply the rollback, and merging of metadata payload records will take care of it. If the commit was already synced, the final merged payload will not have spurious deletes; if the commit being rolled back was never committed to metadata, the final merged payload may have some spurious deletes, which we can ignore. With this, compaction in metadata does not need to have any dependency on inflight operations in the data table, and we can loosen up the dependency of archival in the data table on metadata table compaction as well. So, in summary, all three dependencies quoted above will be moot if we go with this approach. Archival in the data table does not have any dependency on metadata table compaction. Rollback, when applied to the metadata table, does not care about the last metadata table compaction. Compaction in the metadata table can proceed even if there are inflight operations in the data table. In particular, our logic to apply rollback metadata to the metadata table will become a lot simpler and easier to reason about. was: Relax compaction in metadata being fenced based on inflight requests in the data table. Compaction is metadata is triggered only if there are no inflight requests in the data table. This might cause a liveness problem, since for very large deployments we could have compaction or clustering always in progress, so we should try to see how we can relax this constraint. Proposal to remove this dependency: with the recent addition of the spurious deletes config, we can actually get away with this. As of now, we have three interlinked nuances. - Compaction in metadata may not kick in if there are any inflight operations in the data table. - Rollback, when applied to the metadata table, has a dependency on the last compaction instant in the metadata table. We might even throw an exception if the instant being rolled back is earlier than the latest metadata compaction instant time.
- Archival in the data table is fenced by the latest compaction in the metadata table. So, in case the data timeline has any dangling inflight operation (let's say someone tried clustering, killed it midway, and never attempted it again), metadata compaction will never kick in at all. I need to check what archival does for such inflight operations in the data table when it tries to archive nearby commits. With the spurious deletes support which we added recently, all of this can be much simplified. Whenever we want to apply a rollback commit, we don't need to take different actions based on whether the commit being rolled back is already committed to the metadata table or not: just go ahead and apply the rollback, and merging of metadata payload records will take care of it. If the commit was already synced, the final merged payload will not have spurious deletes; if the commit being rolled back was never committed to metadata, the final merged payload may have some spurious deletes, which we can ignore. With this, compaction in metadata does not need to have any dependency on inflight operations in the data table, and we can loosen up the dependency of archival in the data table on metadata table compaction as well. So, in summary, all three dependencies quoted above will be moot if we go with this approach.
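The spurious-deletes reasoning above is easy to check with a small merge example. The sketch below is illustrative only, using plain JDK collections rather than Hudi's actual payload classes: a delete record that never matched an add (as produced by rolling back a commit that was never synced to the metadata table) simply has no effect on the merged result.

```java
import java.util.*;

// Minimal, self-contained illustration of why spurious deletes are safe to ignore
// when merging metadata payload records. All names here are illustrative, not Hudi's API.
public class SpuriousDeleteMerge {
  // Merge the files-in-partition view: delete records cancel earlier adds, and a
  // delete with no matching add (a "spurious" delete) simply has no effect.
  static Set<String> merge(Set<String> committedFiles, Set<String> rollbackDeletes) {
    Set<String> merged = new HashSet<>(committedFiles);
    merged.removeAll(rollbackDeletes); // unmatched deletes are ignored, not an error
    return merged;
  }

  public static void main(String[] args) {
    Set<String> committed = new HashSet<>(Arrays.asList("f1.parquet", "f2.parquet"));
    // The rollback deletes f2 (already synced) and f3 (never synced, so a spurious delete).
    Set<String> deletes = new HashSet<>(Arrays.asList("f2.parquet", "f3.parquet"));
    System.out.println(merge(committed, deletes)); // prints [f1.parquet]
  }
}
```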
[jira] [Updated] (HUDI-1370) Scoping work needed to support bootstrapped data table and RFC-15 together
[ https://issues.apache.org/jira/browse/HUDI-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ethan Guo updated HUDI-1370: Summary: Scoping work needed to support bootstrapped data table and RFC-15 together (was: Scoping work needed to support bootstrap and RFC-15 together) > Scoping work needed to support bootstrapped data table and RFC-15 together > -- > > Key: HUDI-1370 > URL: https://issues.apache.org/jira/browse/HUDI-1370 > Project: Apache Hudi > Issue Type: Task > Components: Common Core >Affects Versions: 0.9.0 >Reporter: Vinoth Chandar >Assignee: Ethan Guo >Priority: Blocker > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
hudi-bot removed a comment on pull request #4352: URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023851940 ## CI report: * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562) * 4d38e462c4fc79432b3cef2691cb76229d054cab Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5568) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
hudi-bot commented on pull request #4352: URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023886573 ## CI report: * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN * 4d38e462c4fc79432b3cef2691cb76229d054cab Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5568) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch asf-site updated: [MINOR] Fix build of Hudi website (#4708)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new 5aa8f30 [MINOR] Fix build of Hudi website (#4708) 5aa8f30 is described below commit 5aa8f30f7ea27639c73fbff6612e317097920e09 Author: Y Ethan Guo AuthorDate: Thu Jan 27 20:46:49 2022 -0800 [MINOR] Fix build of Hudi website (#4708) --- website/package.json | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/website/package.json b/website/package.json index 526429a..6b483ac 100644 --- a/website/package.json +++ b/website/package.json @@ -14,11 +14,11 @@ "write-heading-ids": "docusaurus write-heading-ids" }, "dependencies": { -"@docusaurus/core": "^2.0.0-beta.3", -"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3", -"@docusaurus/plugin-sitemap": "^2.0.0-beta.3", -"@docusaurus/preset-classic": "^2.0.0-beta.3", -"@docusaurus/theme-search-algolia": "^2.0.0-beta.3", +"@docusaurus/core": "2.0.0-beta.14", +"@docusaurus/plugin-client-redirects": "2.0.0-beta.14", +"@docusaurus/plugin-sitemap": "2.0.0-beta.14", +"@docusaurus/preset-classic": "2.0.0-beta.14", +"@docusaurus/theme-search-algolia": "2.0.0-beta.14", "@fontsource/comfortaa": "^4.5.0", "@mdx-js/react": "^1.6.21", "@svgr/webpack": "^5.5.0",
[GitHub] [hudi] yihua merged pull request #4708: [MINOR] Fix build of Hudi website
yihua merged pull request #4708: URL: https://github.com/apache/hudi/pull/4708 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a change in pull request #4708: [MINOR] Fix build of Hudi website
yihua commented on a change in pull request #4708: URL: https://github.com/apache/hudi/pull/4708#discussion_r794194072 ## File path: website/package.json ## @@ -14,11 +14,11 @@ "write-heading-ids": "docusaurus write-heading-ids" }, "dependencies": { -"@docusaurus/core": "^2.0.0-beta.3", -"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3", -"@docusaurus/plugin-sitemap": "^2.0.0-beta.3", -"@docusaurus/preset-classic": "^2.0.0-beta.3", -"@docusaurus/theme-search-algolia": "^2.0.0-beta.3", +"@docusaurus/core": "2.0.0-beta.14", +"@docusaurus/plugin-client-redirects": "2.0.0-beta.14", +"@docusaurus/plugin-sitemap": "^2.0.0-beta.14", Review comment: Good call. So I triaged the issue: it is actually due to the latest version of docusaurus, `2.0.0-beta.15`, released yesterday. Freezing it to `2.0.0-beta.14` solves the issue. @vingov, do you know why docusaurus has `beta` in its versions? Are they still experimental? For now, sticking to one version saves us time from debugging such issues again in the near future. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023871149 ## CI report: * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569) * c13c56e14dad9fad992fdf4a50e24e45c1539817 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023872466 ## CI report: * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569) * c13c56e14dad9fad992fdf4a50e24e45c1539817 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5570) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023871149 ## CI report: * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569) * c13c56e14dad9fad992fdf4a50e24e45c1539817 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023868510 ## CI report: * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023868510 ## CI report: * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023858958 ## CI report: * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567) * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vingov commented on a change in pull request #4708: [MINOR] Fix build of Hudi website
vingov commented on a change in pull request #4708: URL: https://github.com/apache/hudi/pull/4708#discussion_r794187342 ## File path: website/package.json ## @@ -14,11 +14,11 @@ "write-heading-ids": "docusaurus write-heading-ids" }, "dependencies": { -"@docusaurus/core": "^2.0.0-beta.3", -"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3", -"@docusaurus/plugin-sitemap": "^2.0.0-beta.3", -"@docusaurus/preset-classic": "^2.0.0-beta.3", -"@docusaurus/theme-search-algolia": "^2.0.0-beta.3", +"@docusaurus/core": "2.0.0-beta.14", +"@docusaurus/plugin-client-redirects": "2.0.0-beta.14", +"@docusaurus/plugin-sitemap": "^2.0.0-beta.14", Review comment: It's not good practice to freeze the versions in package.json; the versions will be frozen in package-lock.json locally. But I see your point: if you want stability, we can freeze the versions. We should also, once in a while, try to upgrade to the latest stable version, which might have security and other critical bug fixes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
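For context on the lockfile point above: with caret ranges in package.json, reproducibility depends on installs being driven strictly by the lockfile. A quick illustration of standard npm behavior (not specific to this repo):

```
npm ci        # installs exactly what package-lock.json records; fails if the lockfile is out of sync
npm install   # resolves ranges from package.json and may update package-lock.json to newer versions
```

So exact pins in package.json keep the resolved tree stable even for contributors who run `npm install` without an up-to-date lockfile.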
[GitHub] [hudi] yihua commented on a change in pull request #4708: [MINOR] Fix build of Hudi website
yihua commented on a change in pull request #4708: URL: https://github.com/apache/hudi/pull/4708#discussion_r794187235 ## File path: website/package.json ## @@ -14,11 +14,11 @@ "write-heading-ids": "docusaurus write-heading-ids" }, "dependencies": { -"@docusaurus/core": "^2.0.0-beta.3", -"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3", -"@docusaurus/plugin-sitemap": "^2.0.0-beta.3", -"@docusaurus/preset-classic": "^2.0.0-beta.3", -"@docusaurus/theme-search-algolia": "^2.0.0-beta.3", +"@docusaurus/core": "2.0.0-beta.14", +"@docusaurus/plugin-client-redirects": "2.0.0-beta.14", +"@docusaurus/plugin-sitemap": "^2.0.0-beta.14", +"@docusaurus/preset-classic": "2.0.0-beta.14", +"@docusaurus/theme-search-algolia": "^2.0.0-beta.14", Review comment: I followed the pattern of the original PR @vingov put up (some have a fixed version and some have a caret range that can float to newer versions). Let me test the latest version and then freeze the versions altogether. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on a change in pull request #4681: [HUDI-2987] [WIP] diff to update all remove deprecated calls to HoodieRecordPayload
nsivabalan commented on a change in pull request #4681: URL: https://github.com/apache/hudi/pull/4681#discussion_r794145199 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java ## @@ -53,10 +55,12 @@ public static HoodieFileSliceReader getFileSliceReader( return new HoodieFileSliceReader(scanner.iterator()); } else { Iterable<HoodieRecord<? extends HoodieRecordPayload>> iterable = () -> scanner.iterator(); + // todo : wire in event time field as well + HoodiePayloadConfig payloadConfig = HoodiePayloadConfig.newBuilder().withPayloadOrderingField(preCombineField).build(); Review comment: need to wire in event time from callers. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
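For reference, "wiring in" the event time field could look roughly like the sketch below. It assumes the HoodiePayloadConfig builder exposes withPayloadEventTimeField alongside withPayloadOrderingField (the exact method and package names should be verified against the Hudi version in use), and eventTimeField is a hypothetical variable that callers would need to pass down:

```java
import org.apache.hudi.config.HoodiePayloadConfig;

// Hedged sketch only: eventTimeField is hypothetical and would have to be plumbed
// through from callers, which is exactly the TODO flagged in the diff above.
HoodiePayloadConfig payloadConfig = HoodiePayloadConfig.newBuilder()
    .withPayloadOrderingField(preCombineField)   // already available at this call site
    .withPayloadEventTimeField(eventTimeField)   // assumed builder method; verify per version
    .build();
```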
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023857805 ## CI report: * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567) * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023858958 ## CI report: * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567) * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023857805 ## CI report: * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567) * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023834106 ## CI report: * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vingov commented on a change in pull request #4708: [MINOR] Fix build of Hudi website
vingov commented on a change in pull request #4708: URL: https://github.com/apache/hudi/pull/4708#discussion_r794181432 ## File path: website/package.json ## @@ -14,11 +14,11 @@ "write-heading-ids": "docusaurus write-heading-ids" }, "dependencies": { -"@docusaurus/core": "^2.0.0-beta.3", -"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3", -"@docusaurus/plugin-sitemap": "^2.0.0-beta.3", -"@docusaurus/preset-classic": "^2.0.0-beta.3", -"@docusaurus/theme-search-algolia": "^2.0.0-beta.3", +"@docusaurus/core": "2.0.0-beta.14", +"@docusaurus/plugin-client-redirects": "2.0.0-beta.14", +"@docusaurus/plugin-sitemap": "^2.0.0-beta.14", Review comment: 2.0.0-beta.15 is the latest released version; did you test beta.15? If that works, can you please freeze it to beta.15? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
hudi-bot removed a comment on pull request #4352: URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023850656 ## CI report: * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562) * 4d38e462c4fc79432b3cef2691cb76229d054cab UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
hudi-bot commented on pull request #4352: URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023851940 ## CI report: * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562) * 4d38e462c4fc79432b3cef2691cb76229d054cab Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5568) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] manojpec commented on a change in pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
manojpec commented on a change in pull request #4352: URL: https://github.com/apache/hudi/pull/4352#discussion_r794176408 ## File path: hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileReader.java ## @@ -35,6 +37,14 @@ public Set<String> filterRowKeys(Set<String> candidateRowKeys); + default Map<String, R> getRecordsByKeys(TreeSet<String> sortedCandidateRowKeys) throws IOException { Review comment: Fixed. ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java ## @@ -101,4 +116,34 @@ public static HoodieRecord getTaggedRecord(HoodieRecord inputRecord, Option<HoodieRecordLocation> location) +public static List<String> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys, +Configuration configuration) throws HoodieIndexException { +ValidationUtils.checkArgument(FSUtils.isBaseFile(filePath)); +List<String> foundRecordKeys = new ArrayList<>(); +try { + // Load all rowKeys from the file, to double-confirm + if (!candidateRecordKeys.isEmpty()) { +HoodieTimer timer = new HoodieTimer().startTimer(); +HoodieFileReader fileReader = HoodieFileReaderFactory.getFileReader(configuration, filePath); +Set<String> fileRowKeys = fileReader.filterKeys(new TreeSet<>(candidateRecordKeys)); Review comment: fixed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
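As a side note on why the candidate keys above are passed as a TreeSet: with both the probe keys and the file's keys in sorted order, the filter can run as a single forward merge-scan instead of repeated random probes. A self-contained illustration in plain Java (not Hudi's actual reader code, which does the equivalent against sorted on-disk keys):

```java
import java.util.*;

// Illustrative merge-scan: both inputs are sorted, so one pass suffices.
public class SortedKeyFilter {
  static List<String> filterKeys(TreeSet<String> candidates, List<String> sortedFileKeys) {
    List<String> found = new ArrayList<>();
    Iterator<String> it = candidates.iterator();
    String candidate = it.hasNext() ? it.next() : null;
    for (String fileKey : sortedFileKeys) {      // one pass over the file's keys
      // Skip candidates that sort before the current file key.
      while (candidate != null && candidate.compareTo(fileKey) < 0) {
        candidate = it.hasNext() ? it.next() : null;
      }
      if (candidate != null && candidate.equals(fileKey)) {
        found.add(fileKey);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    TreeSet<String> candidates = new TreeSet<>(Arrays.asList("k1", "k4", "k7"));
    List<String> fileKeys = Arrays.asList("k1", "k2", "k3", "k4", "k5");
    System.out.println(filterKeys(candidates, fileKeys)); // prints [k1, k4]
  }
}
```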
[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
hudi-bot removed a comment on pull request #4352: URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023725018 ## CI report: * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups
hudi-bot commented on pull request #4352: URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023850656 ## CI report: * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562) * 4d38e462c4fc79432b3cef2691cb76229d054cab UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3316) HoodieColumnRangeMetadata doesn't include all statistics for the column
[ https://issues.apache.org/jira/browse/HUDI-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Govindassamy updated HUDI-3316: - Status: In Progress (was: Open) > HoodieColumnRangeMetadata doesn't include all statistics for the column > --- > > Key: HUDI-3316 > URL: https://issues.apache.org/jira/browse/HUDI-3316 > Project: Apache Hudi > Issue Type: Task > Components: writer-core >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy >Priority: Blocker > Fix For: 0.11.0 > > > HoodieColumnChunkMetadata includes the following stats about a Parquet column > * columnName > * minValue > * maxValue > * numNulls > > Parquet's ColumnChunkMetaData does have more stats, and we need to include them > all in our index > * num values > * total size > * total uncompressed size -- This message was sent by Atlassian Jira (v8.20.1#820001)
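A rough sketch of what carrying the extra statistics could look like is below; the class and field names are illustrative only, not the actual HoodieColumnRangeMetadata API. The three additions mirror Parquet's ColumnChunkMetaData accessors getValueCount(), getTotalSize(), and getTotalUncompressedSize().

```java
// Illustrative holder for per-column statistics; not Hudi's actual class.
public class ColumnRangeStats<T extends Comparable<T>> {
  final String columnName;           // existing stats
  final T minValue;
  final T maxValue;
  final long numNulls;
  final long numValues;              // proposed: ColumnChunkMetaData.getValueCount()
  final long totalSize;              // proposed: ColumnChunkMetaData.getTotalSize()
  final long totalUncompressedSize;  // proposed: ColumnChunkMetaData.getTotalUncompressedSize()

  ColumnRangeStats(String columnName, T minValue, T maxValue, long numNulls,
                   long numValues, long totalSize, long totalUncompressedSize) {
    this.columnName = columnName;
    this.minValue = minValue;
    this.maxValue = maxValue;
    this.numNulls = numNulls;
    this.numValues = numValues;
    this.totalSize = totalSize;
    this.totalUncompressedSize = totalUncompressedSize;
  }
}
```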
[jira] [Updated] (HUDI-3260) Support column stat index for multiple columns
[ https://issues.apache.org/jira/browse/HUDI-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Govindassamy updated HUDI-3260: - Status: In Progress (was: Open) > Support column stat index for multiple columns > -- > > Key: HUDI-3260 > URL: https://issues.apache.org/jira/browse/HUDI-3260 > Project: Apache Hudi > Issue Type: Task > Components: writer-core >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy >Priority: Blocker > Labels: sev:normal > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3260) Support column stat index for multiple columns
[ https://issues.apache.org/jira/browse/HUDI-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Govindassamy updated HUDI-3260: - Sprint: Hudi-Sprint-Jan-24 > Support column stat index for multiple columns > -- > > Key: HUDI-3260 > URL: https://issues.apache.org/jira/browse/HUDI-3260 > Project: Apache Hudi > Issue Type: Task > Components: writer-core >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy >Priority: Blocker > Labels: sev:normal > Fix For: 0.11.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] nsivabalan commented on a change in pull request #4708: [MINOR] Fix build of Hudi website
nsivabalan commented on a change in pull request #4708: URL: https://github.com/apache/hudi/pull/4708#discussion_r794169735 ## File path: website/package.json ## @@ -14,11 +14,11 @@ "write-heading-ids": "docusaurus write-heading-ids" }, "dependencies": { -"@docusaurus/core": "^2.0.0-beta.3", -"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3", -"@docusaurus/plugin-sitemap": "^2.0.0-beta.3", -"@docusaurus/preset-classic": "^2.0.0-beta.3", -"@docusaurus/theme-search-algolia": "^2.0.0-beta.3", +"@docusaurus/core": "2.0.0-beta.14", +"@docusaurus/plugin-client-redirects": "2.0.0-beta.14", +"@docusaurus/plugin-sitemap": "^2.0.0-beta.14", +"@docusaurus/preset-classic": "2.0.0-beta.14", +"@docusaurus/theme-search-algolia": "^2.0.0-beta.14", Review comment: this one also needs a fix: remove the "^" at the beginning. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] xushiyan commented on a change in pull request #4708: [MINOR] Fix build of Hudi website
xushiyan commented on a change in pull request #4708: URL: https://github.com/apache/hudi/pull/4708#discussion_r794168724 ## File path: website/package.json ## @@ -14,11 +14,11 @@ "write-heading-ids": "docusaurus write-heading-ids" }, "dependencies": { -"@docusaurus/core": "^2.0.0-beta.3", -"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3", -"@docusaurus/plugin-sitemap": "^2.0.0-beta.3", -"@docusaurus/preset-classic": "^2.0.0-beta.3", -"@docusaurus/theme-search-algolia": "^2.0.0-beta.3", +"@docusaurus/core": "2.0.0-beta.14", +"@docusaurus/plugin-client-redirects": "2.0.0-beta.14", +"@docusaurus/plugin-sitemap": "^2.0.0-beta.14", Review comment: shall we freeze the versions at `2.0.0-beta.14` ? ```suggestion "@docusaurus/plugin-sitemap": "2.0.0-beta.14", ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] nsivabalan commented on pull request #4708: [MINOR] Fix build of Hudi website
nsivabalan commented on pull request #4708: URL: https://github.com/apache/hudi/pull/4708#issuecomment-1023839234 CC @vingov -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3316) HoodieColumnRangeMetadata doesn't include all statistics for the column
[ https://issues.apache.org/jira/browse/HUDI-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Govindassamy updated HUDI-3316: - Description: HoodieColumnChunkMetadata includes the following stats about a Parquet column * columnName * minValue * maxValue * numNulls Parquet's ColumnChunkMetaData does have more stats, and we need to include them all in our index * num values * total size * total uncompressed size was: HoodieColumnChunkMetadata includes the following stats about a Parquet column * columnName * minValue * maxValue * numNulls Parquet's ColumnChunkMetaData does have more stats, and we need to include them all in our index * distinct * num values > HoodieColumnRangeMetadata doesn't include all statistics for the column > --- > > Key: HUDI-3316 > URL: https://issues.apache.org/jira/browse/HUDI-3316 > Project: Apache Hudi > Issue Type: Task > Components: writer-core >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy >Priority: Blocker > Fix For: 0.11.0 > > > HoodieColumnChunkMetadata includes the following stats about a Parquet column > * columnName > * minValue > * maxValue > * numNulls > > Parquet's ColumnChunkMetaData does have more stats, and we need to include them > all in our index > * num values > * total size > * total uncompressed size -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3316) HoodieColumnRangeMetadata doesn't include all statistics for the column
[ https://issues.apache.org/jira/browse/HUDI-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Govindassamy updated HUDI-3316: - Summary: HoodieColumnRangeMetadata doesn't include all statistics for the column (was: HoodieColumnRangeMetadata doesn't include all Parquet chunk statistics) > HoodieColumnRangeMetadata doesn't include all statistics for the column > --- > > Key: HUDI-3316 > URL: https://issues.apache.org/jira/browse/HUDI-3316 > Project: Apache Hudi > Issue Type: Task > Components: writer-core >Reporter: Manoj Govindassamy >Assignee: Manoj Govindassamy >Priority: Blocker > Fix For: 0.11.0 > > > HoodieColumnChunkMetadata includes the following stats about a Parquet column > * columnName > * minValue > * maxValue > * numNulls > > Parquet's ColumnChunkMetaData does have more stats, and we need to include them > all in our index > * distinct > * num values -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023796380 ## CI report: * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563) * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023834106 ## CI report: * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua opened a new pull request #4708: [MINOR] Fix build of Hudi website
yihua opened a new pull request #4708: URL: https://github.com/apache/hudi/pull/4708 ## What is the purpose of the pull request The build of the Hudi website is broken due to the following error from `npm run build`: ``` (asf-site)> npm run build > hudi@0.0.0 build > docusaurus build [INFO] Website will be built for all these locales: - en - cn [INFO] [en] Creating an optimized production build... ✔ Client ✖ Server Compiled with some errors in 2.64m [ERROR] Docusaurus Node/SSR could not render static page with path / because of following error: Error: Minified React error #130; visit https://reactjs.org/docs/error-decoder.html?invariant=130&args[]=object&args[]= for the full message or use the non-minified dev environment for full errors and additional helpful warnings. at a.b.render (main:115785:32) at a.b.read (main:115781:83) at Object.exports.renderToString (main:115792:138) at doRender (main:25801:356) at async serverEntry_render (main:25797:329) Error: Server-side rendering fails due to the error above. [ERROR] Unable to build website for locale en. [ERROR] Error: Failed to compile with errors. at /Users/ethan/Work/repo/hudi-docs-8/website/node_modules/@docusaurus/core/lib/webpack/utils.js:207:24 at /Users/ethan/Work/repo/hudi-docs-8/website/node_modules/webpack/lib/MultiCompiler.js:554:14 at processQueueWorker (/Users/ethan/Work/repo/hudi-docs-8/website/node_modules/webpack/lib/MultiCompiler.js:491:6) at processTicksAndRejections (node:internal/process/task_queues:78:11) ``` The root cause is that the docusaurus versions specified in `website/package.json` are not honored. Looking at the generated `website/package-lock.json`, `2.0.0-beta.15` is actually used instead of the specified `^2.0.0-beta.3` (a caret range that floats up to newer compatible versions, including later 2.0.0 prereleases). Further evidence that a higher version is already in use is that `2.0.0-beta.14` shows up in generated content: ``` ./content/docs/next/clustering/index.html: ``` The build failure is likely due to recent new versions (`2.0.0-beta.15`, `2.0.0-beta.16`) of docusaurus and related dependencies. The fix is to pin the docusaurus versions properly. Note that the build failure can only be reproduced from a fresh clone of the branch from remote, with `npm install` and `npm run build` under the `website` folder. If there is a previous successful build and the package info is cached, the build failure may not show up. ## Brief change log - Updates `website/package.json` to pin the docusaurus versions properly. ## Verify this pull request The change is verified by a fresh build of the website. The website can be successfully launched after `npm start`. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
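The caret behavior described above is standard npm semver: `^2.0.0-beta.3` is a range that floats up within the 2.x line, including later 2.0.0 prereleases, while a bare `2.0.0-beta.14` matches only itself. Assuming the `semver` CLI from the npm `semver` package is available, this can be checked directly:

```
$ npx semver -r "^2.0.0-beta.3" 2.0.0-beta.3 2.0.0-beta.14 2.0.0-beta.15
2.0.0-beta.3
2.0.0-beta.14
2.0.0-beta.15
$ npx semver -r "2.0.0-beta.14" 2.0.0-beta.14 2.0.0-beta.15
2.0.0-beta.14
```

This is why the lockfile resolved `2.0.0-beta.15` even though `^2.0.0-beta.3` was specified, and why pinning the exact version fixes the build.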
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023796380 ## CI report: * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563) * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023793855 ## CI report: * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563) * f8cb5b06e3940fe5a931bf968f394bd6068b4731 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot commented on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023793855 ## CI report: * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563) * f8cb5b06e3940fe5a931bf968f394bd6068b4731 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests
hudi-bot removed a comment on pull request #4704: URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023728908 ## CI report: * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483505#comment-17483505 ] sivabalan narayanan commented on HUDI-3335: --- Can you also enable debug logs(just for hudi) and rerun your query and give us the logs. > Loading Hudi table fails with NullPointerException > -- > > Key: HUDI-3335 > URL: https://issues.apache.org/jira/browse/HUDI-3335 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.1 >Reporter: Harsha Teja Kanna >Priority: Blocker > Fix For: 0.11.0 > > > Have a COW table with metadata enabled. Loading from Spark query fails with > java.lang.NullPointerException > *Environment* > Spark 3.1.2 > Hudi 0.10.1 > *Query* > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val basePath = "s3a://datalake-hudi/v1" > val df = spark. > read. > format("org.apache.hudi"). > option(HoodieMetadataConfig.ENABLE.key(), "true"). > option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). > load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Passing an individual partition works though* > val df = spark. > read. > format("org.apache.hudi"). > option(HoodieMetadataConfig.ENABLE.key(), "true"). > option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). > load(s"${basePath}/sessions/date=2022/01/25") > df.createOrReplaceTempView(table) > *Also, disabling metadata works, but the query taking very long time* > val df = spark. > read. > format("org.apache.hudi"). > option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). > load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Loading files with stacktrace:* > at > org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191) > at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210) > at > org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629) > at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at > org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629) > at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387) > at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184) > at > org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239) > at $anonfun$res3$1(:46) > at $anonfun$res3$1$adapted(:40) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > *Writer config* > ** > spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --driver-cores 4 \ > --driver-memory 4g \ > --executor-cores 4 \ > --executor-memory 6g \ > --num-executors 8 \ > --jars > s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \ > s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \ > --table-type COPY_ON_WRITE \ > --source-ordering-field timestam
[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour
nsivabalan commented on issue #3478: URL: https://github.com/apache/hudi/issues/3478#issuecomment-1023790632 awesome, thanks for updating! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Closed] (HUDI-2596) Make class names consistent in hudi-client
[ https://issues.apache.org/jira/browse/HUDI-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-2596. Reviewers: Ethan Guo Resolution: Done > Make class names consistent in hudi-client > -- > > Key: HUDI-2596 > URL: https://issues.apache.org/jira/browse/HUDI-2596 > Project: Apache Hudi > Issue Type: Task > Components: Code Cleanup >Reporter: Ethan Guo >Assignee: Raymond Xu >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > Currently we have different naming conventions for the abstract classes, such > as AbstractBulkInsertHelper, BaseCommitActionExecutor, > HoodieTableFileIndexBase, etc. Ideally, we should have the same naming > convention for such common abstractions/interfaces: "Abstract*", "Base*", or > "*Base". I prefer to use "Base*". -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483261#comment-17483261 ] sivabalan narayanan edited comment on HUDI-3335 at 1/28/22, 1:08 AM: - CC [~manojpec] [~guoyihua] [~codope] metadata related bug was (Author: shivnarayan): CC [~manojpec] [~guoyihua] metadata related bug > Loading Hudi table fails with NullPointerException > -- > > Key: HUDI-3335 > URL: https://issues.apache.org/jira/browse/HUDI-3335 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.1 >Reporter: Harsha Teja Kanna >Priority: Blocker > Fix For: 0.11.0 > > > Have a COW table with metadata enabled. Loading from Spark query fails with > java.lang.NullPointerException > *Environment* > Spark 3.1.2 > Hudi 0.10.1 > *Query* > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val basePath = "s3a://datalake-hudi/v1" > val df = spark. > read. > format("org.apache.hudi"). > option(HoodieMetadataConfig.ENABLE.key(), "true"). > option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). > load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Passing an individual partition works though* > val df = spark. > read. > format("org.apache.hudi"). > option(HoodieMetadataConfig.ENABLE.key(), "true"). > option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). > load(s"${basePath}/sessions/date=2022/01/25") > df.createOrReplaceTempView(table) > *Also, disabling metadata works, but the query taking very long time* > val df = spark. > read. > format("org.apache.hudi"). > option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). 
> load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Loading files with stacktrace:* > at > org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191) > at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210) > at > org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629) > at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at > org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629) > at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387) > at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184) > at > org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239) > at $anonfun$res3$1(:46) > at $anonfun$res3$1$adapted(:40) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > *Writer config* > ** > spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --driver-cores 4 \ > --driver-memory 4g \ > --executor-cores 4 \ > --executor-memory 6g \ > --num-executors 8 \ > --jars > s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \ > s3://datalake/jars/hudi-0.10.1/hudi-
[jira] [Reopened] (HUDI-2596) Make class names consistent in hudi-client
[ https://issues.apache.org/jira/browse/HUDI-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reopened HUDI-2596: -- > Make class names consistent in hudi-client > -- > > Key: HUDI-2596 > URL: https://issues.apache.org/jira/browse/HUDI-2596 > Project: Apache Hudi > Issue Type: Task > Components: Code Cleanup >Reporter: Ethan Guo >Assignee: Raymond Xu >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > Currently we have different naming conventions for the abstract classes, such > as AbstractBulkInsertHelper, BaseCommitActionExecutor, > HoodieTableFileIndexBase, etc. Ideally, we should have the same naming > convention for such common abstractions/interfaces: "Abstract*", "Base*", or > "*Base". I prefer to use "Base*". -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test
[ https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3088: - Status: In Progress (was: Open) > Make Spark 3 the default profile for build and test > --- > > Key: HUDI-3088 > URL: https://issues.apache.org/jira/browse/HUDI-3088 > Project: Apache Hudi > Issue Type: Improvement > Components: spark >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Blocker > Fix For: 0.11.0 > > > By default, when people check out the code, the Spark 3 profile should be > activated for the repo. Also, all tests should run against the latest supported > Spark version. Correspondingly, the default Scala version becomes 2.12 and the > default Parquet version becomes 1.12. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HUDI-2596) Make class names consistent in hudi-client
[ https://issues.apache.org/jira/browse/HUDI-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu resolved HUDI-2596. -- > Make class names consistent in hudi-client > -- > > Key: HUDI-2596 > URL: https://issues.apache.org/jira/browse/HUDI-2596 > Project: Apache Hudi > Issue Type: Task > Components: Code Cleanup >Reporter: Ethan Guo >Assignee: Raymond Xu >Priority: Blocker > Labels: pull-request-available > Fix For: 0.11.0 > > > Currently we have different naming conventions for the abstract classes, such > as AbstractBulkInsertHelper, BaseCommitActionExecutor, > HoodieTableFileIndexBase, etc. Ideally, we should have the same naming > convention for such common abstractions/interfaces: "Abstract*", "Base*", or > "*Base". I prefer to use "Base*". -- This message was sent by Atlassian Jira (v8.20.1#820001)
[hudi] branch master updated (4a9f826 -> 0bd38f2)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from 4a9f826 [HUDI-3215] Solve UT for Spark 3.2 (#4565) add 0bd38f2 [HUDI-2596] Make class names consistent in hudi-client (#4680) No new revisions were added by this update. Summary of changes: .../hudi/cli/commands/TestRollbacksCommand.java| 4 +-- .../apache/hudi/async/AsyncClusteringService.java | 14 .../org/apache/hudi/async/AsyncCompactService.java | 14 .../apache/hudi/client/AsyncCleanerService.java| 6 ++-- ...actClusteringClient.java => BaseClusterer.java} | 8 ++--- .../{AbstractCompactor.java => BaseCompactor.java} | 8 ++--- ...ractHoodieClient.java => BaseHoodieClient.java} | 8 ++--- ...WriteClient.java => BaseHoodieWriteClient.java} | 20 ++-- .../apache/hudi/client/CompactionAdminClient.java | 2 +- .../java/org/apache/hudi/keygen/KeyGenUtils.java | 8 ++--- .../keygen/TimestampBasedAvroKeyGenerator.java | 8 ++--- ...meParser.java => BaseHoodieDateTimeParser.java} | 4 +-- ...meParserImpl.java => HoodieDateTimeParser.java} | 4 +-- .../metadata/HoodieBackedTableMetadataWriter.java | 6 ++-- .../hudi/metrics/MetricsReporterFactory.java | 10 +++--- .../CustomizableMetricsReporter.java} | 14 .../AbstractUserDefinedMetricsReporter.java| 37 +- ...InsertHelper.java => BaseBulkInsertHelper.java} | 2 +- .../action/commit/BaseCommitActionExecutor.java| 2 +- ...ractDeleteHelper.java => BaseDeleteHelper.java} | 2 +- ...stractMergeHelper.java => BaseMergeHelper.java} | 2 +- ...stractWriteHelper.java => BaseWriteHelper.java} | 2 +- .../hudi/table/upgrade/DowngradeHandler.java | 4 +-- .../hudi/table/upgrade/OneToTwoUpgradeHandler.java | 2 +- .../table/upgrade/OneToZeroDowngradeHandler.java | 2 +- ...deHelper.java => SupportsUpgradeDowngrade.java} | 2 +- .../table/upgrade/ThreeToTwoDowngradeHandler.java | 2 +- .../table/upgrade/TwoToOneDowngradeHandler.java| 2 +- .../table/upgrade/TwoToThreeUpgradeHandler.java| 2 +- .../hudi/table/upgrade/UpgradeDowngrade.java | 4 +-- .../apache/hudi/table/upgrade/UpgradeHandler.java | 4 +-- .../table/upgrade/ZeroToOneUpgradeHandler.java | 2 +- .../hudi/metrics/TestMetricsReporterFactory.java | 8 ++--- .../providers/HoodieWriteClientProvider.java | 4 +-- .../apache/hudi/client/HoodieFlinkWriteClient.java | 2 +- .../FlinkHoodieBackedTableMetadataWriter.java | 2 +- .../table/action/commit/FlinkDeleteHelper.java | 2 +- .../hudi/table/action/commit/FlinkMergeHelper.java | 2 +- .../hudi/table/action/commit/FlinkWriteHelper.java | 2 +- .../table/upgrade/FlinkUpgradeDowngradeHelper.java | 2 +- .../apache/hudi/client/HoodieJavaWriteClient.java | 2 +- .../table/action/commit/JavaBulkInsertHelper.java | 4 +-- .../hudi/table/action/commit/JavaDeleteHelper.java | 2 +- .../hudi/table/action/commit/JavaMergeHelper.java | 2 +- .../hudi/table/action/commit/JavaWriteHelper.java | 2 +- .../hudi/async/SparkAsyncClusteringService.java| 8 ++--- .../hudi/async/SparkAsyncCompactService.java | 8 ++--- .../hudi/client/HoodieSparkClusteringClient.java | 4 +-- .../apache/hudi/client/HoodieSparkCompactor.java | 4 +-- .../apache/hudi/client/SparkRDDWriteClient.java| 2 +- .../table/action/commit/SparkBulkInsertHelper.java | 4 +-- .../table/action/commit/SparkDeleteHelper.java | 4 +-- .../hudi/table/action/commit/SparkMergeHelper.java | 2 +- .../hudi/table/action/commit/SparkWriteHelper.java | 4 +-- ...ava => BaseSparkDeltaCommitActionExecutor.java} | 8 ++--- .../SparkBulkInsertDeltaCommitActionExecutor.java | 10 +++--- 
...BulkInsertPreppedDeltaCommitActionExecutor.java | 8 ++--- .../SparkDeleteDeltaCommitActionExecutor.java | 4 +-- .../SparkInsertDeltaCommitActionExecutor.java | 4 +-- ...parkInsertPreppedDeltaCommitActionExecutor.java | 3 +- .../SparkUpsertDeltaCommitActionExecutor.java | 4 +-- ...parkUpsertPreppedDeltaCommitActionExecutor.java | 3 +- .../table/upgrade/SparkUpgradeDowngradeHelper.java | 2 +- .../functional/TestHoodieBackedMetadata.java | 2 +- .../TestHoodieClientOnCopyOnWriteStorage.java | 4 +-- .../hudi/table/TestHoodieMergeOnReadTable.java | 4 +-- .../SparkStreamingAsyncClusteringService.java | 8 ++--- .../async/SparkStreamingAsyncCompactService.java | 8 ++--- 68 files changed, 178 insertions(+), 181 deletions(-) rename hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/{AbstractClusteringClient.java => BaseClusterer.java} (80%) rename hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/{AbstractCompactor.java => BaseCompactor.java} (78%) rename hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/{AbstractHoodieClient.java
[GitHub] [hudi] yihua merged pull request #4680: [HUDI-2596] Make class names consistent in hudi-client
yihua merged pull request #4680: URL: https://github.com/apache/hudi/pull/4680 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4680: [HUDI-2596] Make class names consistent in hudi-client
hudi-bot commented on pull request #4680: URL: https://github.com/apache/hudi/pull/4680#issuecomment-1023785526 ## CI report: * ae88c2fc58bf07a435feb971435646258e2b5e87 UNKNOWN * 5d9189c4f457e5877280f00d0dcd9ccdb476135f Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5565) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4680: [HUDI-2596] Make class names consistent in hudi-client
hudi-bot removed a comment on pull request #4680: URL: https://github.com/apache/hudi/pull/4680#issuecomment-1023749306 ## CI report: * ae88c2fc58bf07a435feb971435646258e2b5e87 UNKNOWN * 5593a7e380700e7f89c65b44b20dfa4d31a15ea9 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5481) * 5d9189c4f457e5877280f00d0dcd9ccdb476135f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5565) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-3335:
----------------------------
    Fix Version/s: 0.11.0

> Loading Hudi table fails with NullPointerException
> --------------------------------------------------
>
>                 Key: HUDI-3335
>                 URL: https://issues.apache.org/jira/browse/HUDI-3335
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 0.10.1
>            Reporter: Harsha Teja Kanna
>            Priority: Critical
>             Fix For: 0.11.0
>
> Have a COW table with metadata enabled. Loading it from a Spark query fails with java.lang.NullPointerException.
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
> val df = spark.
>   read.
>   format("org.apache.hudi").
>   option(HoodieMetadataConfig.ENABLE.key(), "true").
>   option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>   load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>   read.
>   format("org.apache.hudi").
>   option(HoodieMetadataConfig.ENABLE.key(), "true").
>   option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>   load(s"${basePath}/sessions/date=2022/01/25")
> df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but the query then takes a very long time*
> val df = spark.
>   read.
>   format("org.apache.hudi").
>   option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>   load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
> at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
> at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
> at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
> at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
> at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
> at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
> at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
> at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
> at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
> at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
> at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
> at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
> at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
> at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
> at scala.Option.getOrElse(Option.scala:189)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
> at $anonfun$res3$1(<console>:46)
> at $anonfun$res3$1$adapted(<console>:40)
> at scala.collection.Iterator.foreach(Iterator.scala:941)
> at scala.collection.Iterator.foreach$(Iterator.scala:941)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> at scala.collection.IterableLike.foreach(IterableLike.scala:74)
> at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
>   --master yarn \
>   --deploy-mode cluster \
>   --driver-cores 4 \
>   --driver-memory 4g \
>   --executor-cores 4 \
>   --executor-memory 6g \
>   --num-executors 8 \
>   --jars s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
>   s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
>   --table-type COPY_ON_WRITE \
>   --source-ordering-field timestamp \
>   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
>   --target-base-path s3a://datalake-hudi/sessions \
>   --target-table se
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483503#comment-17483503 ]

sivabalan narayanan edited comment on HUDI-3335 at 1/28/22, 1:00 AM:
---------------------------------------------------------------------
[~h7kanna]: an orthogonal question. Was hive sync disabled intentionally?
From your logs:
{code:java}
hoodie.datasource.hive_sync.enable=false
{code}

was (Author: shivnarayan):
[~h7kanna]: an orthogonal question. Was hive sync disabled by default?
From your logs:
{code:java}
hoodie.datasource.hive_sync.enable=false
{code}
> load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Loading files with stacktrace:* > at > org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191) > at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210) > at > org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629) > at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at > org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629) > at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387) > at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184) > at > org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239) > at $anonfun$res3$1(:46) > at $anonfun$res3$1$adapted(:40) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > *Writer config* > ** > spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --driver-cores 4 \ > --driver-memory 4g \ > --executor-cores 4 \ > --executor-memory 6g \ > --num-executors 8 \ > --jars > s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStrea
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483503#comment-17483503 ]

sivabalan narayanan commented on HUDI-3335:
-------------------------------------------
[~h7kanna]: an orthogonal question. Was hive sync disabled by default?
From your logs:
{code:java}
hoodie.datasource.hive_sync.enable=false
{code}
> load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Loading files with stacktrace:* > at > org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191) > at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210) > at > org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629) > at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at > org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629) > at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387) > at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184) > at > org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239) > at $anonfun$res3$1(:46) > at $anonfun$res3$1$adapted(:40) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > *Writer config* > ** > spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --driver-cores 4 \ > --driver-memory 4g \ > --executor-cores 4 \ > --executor-memory 6g \ > --num-executors 8 \ > --jars > s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \ > s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \ > --table-type COPY_ON_WRITE \ > --sourc
[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ethan Guo updated HUDI-3335:
----------------------------
    Priority: Blocker  (was: Critical)
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483499#comment-17483499 ]

sivabalan narayanan edited comment on HUDI-3335 at 1/28/22, 12:58 AM:
----------------------------------------------------------------------
[~h7kanna]: I assume your partitions are of the format "yyyy/mm/dd", and so partitions are 3 levels deep. Can you check if giving an explicit glob path works? For example:
{code:java}
val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/*/*/*/*")
{code}
I assume the above will be slower compared to not giving any explicit glob pattern, but wanted to rule things out.
Also, can you try the below commands in hudi-cli and let us know what you see:
{code:java}
connect --path basePath
set conf SPARK_MASTER=local[2]
metadata list-partitions
{code}
Also, can you run the below command and let us know what you see:
{code:java}
metadata validateFiles
{code}

was (Author: shivnarayan):
[~h7kanna]: I assume your partitions are of the format "yyyy/mm/dd", and so partitions are 3 levels deep. Can you check if giving an explicit glob path works? For example:
{code:java}
val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/*/*/*/*")
{code}
I assume the above will be slower compared to not giving any explicit glob pattern, but wanted to rule things out.
Also, can you try the below commands in hudi-cli and let us know what you see:
{code:java}
connect --path basePath
set conf SPARK_MASTER=local[2]
metadata list-partitions
{code}
> load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Loading files with stacktrace:* > at > org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191) > at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210) > at > org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629) > at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at > org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629) > at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387) > at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184) > at > org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69) > at > org.apache.spark.sql.execution.dat
[GitHub] [hudi] vingov opened a new pull request #4707: Stop-gap solution to fix the broken blog link
vingov opened a new pull request #4707:
URL: https://github.com/apache/hudi/pull/4707

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.*

   ## What is the purpose of the pull request

   Fix the broken link with a redirect. Since the regular redirect is not working, I came up with this workaround; the link has already been tweeted, so this reduces the impact.

   ## Brief change log

   *(for example:)*
     - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

   ## Verify this pull request

   *(Please pick either of the following options)*

   This pull request is a trivial rework / code cleanup without any test coverage.

   *(or)*

   This pull request is already covered by existing tests, such as *(please describe tests)*.

   (or)

   This change added tests and can be verified as follows:

   *(example:)*
     - *Added integration tests for end-to-end.*
     - *Added HoodieClientWriteTest to verify the change.*
     - *Manually verified the change by running a job locally.*

   ## Committer checklist
   - [ ] Has a corresponding JIRA in PR title & commit
   - [ ] Commit message is descriptive of the change
   - [ ] CI is green
   - [ ] Necessary doc changes done or have another open PR
   - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483501#comment-17483501 ]

sivabalan narayanan commented on HUDI-3335:
-------------------------------------------
By any chance do any of your partitions have 0 files? I mean, added initially and then a delete_partition operation triggered later, maybe.
[GitHub] [hudi] manojpec commented on a change in pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
manojpec commented on a change in pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#discussion_r794119507

## File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieColumnRangeMetadata.java
## @@ -31,15 +29,13 @@
   private final T minValue;
   private final T maxValue;
   private final long numNulls;
-  private final PrimitiveStringifier stringifier;
-  public HoodieColumnRangeMetadata(final String filePath, final String columnName, final T minValue, final T maxValue, final long numNulls, final PrimitiveStringifier stringifier) {
+  public HoodieColumnRangeMetadata(final String filePath, final String columnName, final T minValue, final T maxValue, final long numNulls) {

Review comment:
   I misread. We are good here.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483499#comment-17483499 ]

sivabalan narayanan commented on HUDI-3335:
-------------------------------------------
[~h7kanna]: I assume your partitions are of the format "yyyy/mm/dd", and so partitions are 3 levels deep. Can you check if giving an explicit glob path works? For example:
{code:java}
val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load(s"${basePath}/sessions/*/*/*/*")
{code}
I assume the above will be slower compared to not giving any explicit glob pattern, but wanted to rule things out.
Also, can you try the below commands in hudi-cli and let us know what you see:
{code:java}
connect --path basePath
set conf SPARK_MASTER=local[2]
metadata list-partitions
{code}
> load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Loading files with stacktrace:* > at > org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191) > at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210) > at > org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629) > at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at > org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629) > at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387) > at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184) > at > org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239) > at $anonfun$res3$1(:46) > at $anonfun$res3$1$adapted(:40) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > *Writer config* > ** > spark-s
[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException
[ https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483261#comment-17483261 ]

sivabalan narayanan edited comment on HUDI-3335 at 1/28/22, 12:51 AM:
----------------------------------------------------------------------
CC [~manojpec] [~guoyihua], metadata related bug.

was (Author: shivnarayan):
Can you furnish more info for us to triage: hoodie write configs used, hive sync configs used, contents of .hoodie, and contents of .hoodie/metadata/.hoodie.
> load(s"${basePath}/sessions/") > df.createOrReplaceTempView(table) > *Loading files with stacktrace:* > at > org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191) > at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210) > at > org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804) > at > org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631) > at > org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629) > at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468) > at > org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629) > at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387) > at org.apache.hudi.HoodieFileIndex.(HoodieFileIndex.scala:184) > at > org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119) > at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239) > at $anonfun$res3$1(:46) > at $anonfun$res3$1$adapted(:40) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > *Writer config* > ** > spark-submit \ > --master yarn \ > --deploy-mode cluster \ > --driver-cores 4 \ > --driver-memory 4g \ > --executor-cores 4 \ > --executor-memory 6g \ > --num-executors 8 \ > --jars > s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar > \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf spark.sql.sources.parallelPar
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction
alexeykudinkin commented on a change in pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#discussion_r794116319

## File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieColumnRangeMetadata.java
## @@ -31,15 +29,13 @@
   private final T minValue;
   private final T maxValue;
   private final long numNulls;
-  private final PrimitiveStringifier stringifier;
-  public HoodieColumnRangeMetadata(final String filePath, final String columnName, final T minValue, final T maxValue, final long numNulls, final PrimitiveStringifier stringifier) {
+  public HoodieColumnRangeMetadata(final String filePath, final String columnName, final T minValue, final T maxValue, final long numNulls) {

Review comment:
   Not sure I follow. I'm actually removing it since it isn't used anywhere.

## File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/columnstats/ColumnStatsIndexHelper.java
## @@ -447,8 +448,8 @@ private static String composeZIndexColName(String col, String statName) {
             new Float(colMetadata.getMaxValue().toString()));
       } else if (colType instanceof BinaryType) {
         return Pair.of(
-            ((Binary) colMetadata.getMinValue()).getBytes(),
-            ((Binary) colMetadata.getMaxValue()).getBytes());
+            ((ByteBuffer) colMetadata.getMinValue()).array(),

Review comment:
   Good catch!

## File path: hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
## @@ -360,24 +361,56 @@ public Boolean apply(String recordKey) {
     return new HoodieColumnRangeMetadata(
         one.getFilePath(),
-        one.getColumnName(), minValue, maxValue, one.getNumNulls() + another.getNumNulls(), one.getStringifier());
+        one.getColumnName(), minValue, maxValue, one.getNumNulls() + another.getNumNulls());
   }

   private static Comparable convertToNativeJavaType(PrimitiveType primitiveType, Comparable val) {
     if (primitiveType.getOriginalType() == OriginalType.DECIMAL) {
-      DecimalMetadata decimalMetadata = primitiveType.getDecimalMetadata();
-      return BigDecimal.valueOf((Integer) val, decimalMetadata.getScale());
+      return extractDecimal(val, primitiveType.getDecimalMetadata());
     } else if (primitiveType.getOriginalType() == OriginalType.DATE) {
       // NOTE: This is a workaround to address race-condition in using
       //       {@code SimpleDataFormat} concurrently (w/in {@code DateStringifier})
       // TODO cleanup after Parquet upgrade to 1.12
       synchronized (primitiveType.stringifier()) {
+        // Date logical type is implemented as a signed INT32

Review comment:
   It's not yet

## File path: hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
## @@ -360,24 +361,56 @@ public Boolean apply(String recordKey) {
     return new HoodieColumnRangeMetadata(
         one.getFilePath(),
-        one.getColumnName(), minValue, maxValue, one.getNumNulls() + another.getNumNulls(), one.getStringifier());
+        one.getColumnName(), minValue, maxValue, one.getNumNulls() + another.getNumNulls());
   }

   private static Comparable convertToNativeJavaType(PrimitiveType primitiveType, Comparable val) {
     if (primitiveType.getOriginalType() == OriginalType.DECIMAL) {
-      DecimalMetadata decimalMetadata = primitiveType.getDecimalMetadata();
-      return BigDecimal.valueOf((Integer) val, decimalMetadata.getScale());
+      return extractDecimal(val, primitiveType.getDecimalMetadata());
     } else if (primitiveType.getOriginalType() == OriginalType.DATE) {
       // NOTE: This is a workaround to address race-condition in using
       //       {@code SimpleDataFormat} concurrently (w/in {@code DateStringifier})
       // TODO cleanup after Parquet upgrade to 1.12
       synchronized (primitiveType.stringifier()) {
+        // Date logical type is implemented as a signed INT32
+        // REF: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
         return java.sql.Date.valueOf(
             primitiveType.stringifier().stringify((Integer) val)
         );
       }
+    } else if (primitiveType.getOriginalType() == OriginalType.UTF8) {
+      // NOTE: UTF8 type designates a byte array that should be interpreted as a
+      //       UTF-8 encoded character string
+      // REF: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
+      return ((Binary) val).toStringUsingUTF8();
+    } else if (primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.BINARY) {
+      // NOTE: `getBytes` access makes a copy of the underlying byte buffer
+      return ((Binary) val).toByteBuffer();
     }
     return val;
   }
+
+  @Nonnull
+  private static BigDecimal extractDecimal(Object val, DecimalMetadata decimalMetadata) {
+    // In Parquet, Decimal could be represented as either of
+    // 1. INT32 (for 1 <= precision <= 9)
+    // 2. IN
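The truncated comment above enumerates Parquet's physical encodings for decimals. For context, here is a minimal, self-contained sketch of the general conversion rule, not the PR's actual extractDecimal: the Parquet format stores a decimal as an unscaled integer in INT32 (precision up to 9), INT64 (precision up to 18), or a big-endian two's-complement byte array (BINARY / FIXED_LEN_BYTE_ARRAY), with the scale carried in the logical-type metadata. The class and method names below are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class DecimalConversionSketch {
  // Converts a raw Parquet decimal value to BigDecimal. `val` is whatever the
  // reader surfaced for the column: Integer, Long, or raw unscaled bytes.
  static BigDecimal toBigDecimal(Object val, int scale) {
    if (val instanceof Integer) {
      // INT32-backed decimal (precision <= 9)
      return BigDecimal.valueOf((Integer) val, scale);
    } else if (val instanceof Long) {
      // INT64-backed decimal (precision <= 18)
      return BigDecimal.valueOf((Long) val, scale);
    } else if (val instanceof byte[]) {
      // BINARY / FIXED_LEN_BYTE_ARRAY-backed decimal:
      // big-endian two's-complement unscaled value
      return new BigDecimal(new BigInteger((byte[]) val), scale);
    }
    throw new IllegalArgumentException("Unexpected decimal representation: " + val);
  }

  public static void main(String[] args) {
    // Unscaled 12345 with scale 2 prints 123.45, regardless of physical encoding.
    System.out.println(toBigDecimal(12345, 2));
    System.out.println(toBigDecimal(12345L, 2));
    System.out.println(toBigDecimal(BigInteger.valueOf(12345).toByteArray(), 2));
  }
}
```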
[GitHub] [hudi] alexeykudinkin commented on a change in pull request #3989: [HUDI-2589] RFC-37: Metadata table based bloom index
alexeykudinkin commented on a change in pull request #3989:
URL: https://github.com/apache/hudi/pull/3989#discussion_r794112161

## File path: rfc/rfc-37/rfc-37.md
## @@ -0,0 +1,286 @@
+
+# RFC-37: Metadata based Bloom Index
+
+## Proposers
+- @nsivabalan
+- @manojpec
+
+## Approvers
+ - @vinothchandar
+ - @satishkotha
+
+## Status
+JIRA: https://issues.apache.org/jira/browse/HUDI-2703
+
+## Abstract
+Hudi maintains several indices to locate/map incoming records to file groups during writes. The most commonly
+used record index is the HoodieBloomIndex. Larger tables and the global index have performance issues, as the
+bloom filters from a large number of data files need to be read and looked up. Reading from many files over
+cloud object storage like S3 also faces request throttling issues. We are proposing to build a new metadata
+index (a metadata table based bloom index) to boost the performance of the existing bloom index.
+
+## Background
+HoodieBloomIndex is used to find the location of incoming records during every write. The bloom index assists
+Hudi in deterministically routing records to a given file group and in distinguishing inserts from updates.
+This aggregate bloom index is built from the several bloom filters stored in the base file footers. Prior to
+the bloom filter lookup, file pruning for the incoming records is also done based on the record key min/max
+stats stored in the base file footers. In this RFC, we plan to build a new index for the bloom filters under
+the metadata table to assist in bloom index based record location tagging.
+
+## Design
+HoodieBloomIndex involves the following steps to find the right location of incoming records:
+1. Find all the interested partitions and list all their data files.
+2. File pruning: load record key min/max details from all the interested data file footers. Filter files and
+   generate a files-to-keys mapping for the incoming records based on the key ranges, using a range interval
+   tree built from the previously loaded min/max details.
+3. Bloom filter lookup: filter files and prune the files-to-keys mapping based on the bloom filter key lookup.
+4. Final lookup in the actual data files to find the right location of every incoming record.
+
+As we can see from steps 2 and 3, we need the min and max values for "_hoodie_record_key" and the bloom filters
+from all interested data files to perform the location tagging. In this design, we will add these key stats and
+bloom filters to the metadata table and will thereby be able to quickly load the interested details and do
+faster lookups.
+
+The metadata table already has one partition, `files`, to help with partition file listing. For the metadata
+table based indices, we are proposing to add the following two new partitions:
+1. `bloom_filter` - for the file level bloom filters
+2. `column_stats` - for the key range stats
+
+Why the metadata table: the metadata table uses HBase HFile, a map file format, to store and retrieve data.
+HFile is an indexed file format and supports fast, map-like lookups by key. Since we will be storing
+stats/bloom filters for every file and the index will do lookups by file, we should be able to benefit from
+the faster lookups in HFile.
+
+The following sections will talk about the different partitions and key formats, and then dive into the data
+and control flows.
+
+### MetaIndex/BloomFilter:
+
+A new partition `bloom_filter` will be added under the metadata table. Bloom filters from all the base files in
+the data table will be added here. The metadata table is already in the HFile format. The existing metadata
+payload schema will be extended and shared for this partition also. The type field will be used to detect the
+bloom filter payload record. Here is the schema for the bloom filter payload record:
+```
+{
+    "doc": "Metadata about base file bloom filters",
+    "name": "BloomFilterMetadata",
+    "type": [
+        "null",
+        {
+            "doc": "Base FileID and its BloomFilter details",
+            "name": "HoodieMetadataBloomFilter",
+            "type": "record",
+            "fields": [
+                {
+                    "doc": "Version/type of the bloom filter metadata",
+                    "name": "version",
+                    "type": "string"
+                },
+                {
+                    "doc": "Instant timestamp when this metadata was created/updated",
+                    "name": "timestamp",
+                    "type": "string"
+                },
+                {
+                    "doc": "Bloom filter binary byte array",
+                    "name": "bloomfilter",
+                    "type": "bytes"
+                },
+                {
+                    "doc": "T
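To make the lookup flow the RFC describes concrete, here is a minimal, self-contained sketch of the two pruning stages: first discard files whose record-key range cannot contain the key, then consult the per-file bloom filter. This is not Hudi's implementation; it uses Guava's BloomFilter purely for illustration, replaces the RFC's interval tree with a linear scan, and all names below are hypothetical. The HFile-backed metadata partitions that would serve these stats are not shown.

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BloomIndexSketch {
  // Per-file stats a writer would persist: record-key range plus a bloom filter.
  static final class FileStats {
    final String fileId;
    final String minKey;
    final String maxKey;
    final BloomFilter<CharSequence> bloom;

    FileStats(String fileId, String minKey, String maxKey, BloomFilter<CharSequence> bloom) {
      this.fileId = fileId;
      this.minKey = minKey;
      this.maxKey = maxKey;
      this.bloom = bloom;
    }
  }

  // Stage 1: range pruning; stage 2: bloom filter lookup. Survivors still need
  // the final check against actual file contents, since bloom filters can give
  // false positives but never false negatives.
  static List<String> candidateFiles(List<FileStats> files, String recordKey) {
    List<String> candidates = new ArrayList<>();
    for (FileStats f : files) {
      boolean inRange =
          f.minKey.compareTo(recordKey) <= 0 && f.maxKey.compareTo(recordKey) >= 0;
      if (inRange && f.bloom.mightContain(recordKey)) {
        candidates.add(f.fileId);
      }
    }
    return candidates;
  }

  public static void main(String[] args) {
    BloomFilter<CharSequence> bloom =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1000, 0.01);
    bloom.put("key-0042");

    List<FileStats> files = new ArrayList<>();
    files.add(new FileStats("file-1", "key-0000", "key-0999", bloom));

    System.out.println(candidateFiles(files, "key-0042")); // [file-1]
    System.out.println(candidateFiles(files, "key-5000")); // [] - pruned by key range
  }
}
```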