[GitHub] [hudi] hudi-bot removed a comment on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4710:
URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023944845


   
   ## CI report:
   
   * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5573)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4710:
URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023972079


   
   ## CI report:
   
   * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5573)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023961928


   
   ## CI report:
   
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572)
 
   * 879e966586fe287e710fb2b9db7a2436fef03a92 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5575)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023951714


   
   ## CI report:
   
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572)
 
   * 879e966586fe287e710fb2b9db7a2436fef03a92 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483610#comment-17483610
 ] 

Harsha Teja Kanna commented on HUDI-3335:
-

Log



22/01/28 01:29:34 INFO Executor: Adding 
file:/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/hudi-utilities-bundle_2.12-0.10.1.jar
 to class loader
22/01/28 01:29:34 INFO Executor: Fetching 
spark://192.168.86.5:49947/jars/org.spark-project.spark_unused-1.0.0.jar with 
timestamp 1643354959702
22/01/28 01:29:34 INFO Utils: Fetching 
spark://192.168.86.5:49947/jars/org.spark-project.spark_unused-1.0.0.jar to 
/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/fetchFileTemp5819832321479921719.tmp
22/01/28 01:29:34 INFO Executor: Adding 
file:/private/var/folders/61/3vd56bjx3cj0hpdq_139d5hmgp/T/spark-823349b0-aeeb-494d-bdc6-c276419a0fe1/userFiles-644e376c-59bb-4837-a421-590697992dc6/org.spark-project.spark_unused-1.0.0.jar
 to class loader
22/01/28 01:29:34 INFO Utils: Successfully started service 
'org.apache.spark.network.netty.NettyBlockTransferService' on port 49956.
22/01/28 01:29:34 INFO NettyBlockTransferService: Server created on 
192.168.86.5:49956
22/01/28 01:29:34 INFO BlockManager: Using 
org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
policy
22/01/28 01:29:34 INFO BlockManagerMaster: Registering BlockManager 
BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManagerMasterEndpoint: Registering block manager 
192.168.86.5:49956 with 2004.6 MiB RAM, BlockManagerId(driver, 192.168.86.5, 
49956, None)
22/01/28 01:29:34 INFO BlockManagerMaster: Registered BlockManager 
BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:34 INFO BlockManager: Initialized BlockManager: 
BlockManagerId(driver, 192.168.86.5, 49956, None)
22/01/28 01:29:35 INFO SharedState: Setting hive.metastore.warehouse.dir 
('null') to the value of spark.sql.warehouse.dir 
('file:/Users/harshakanna/spark-warehouse/').
22/01/28 01:29:35 INFO SharedState: Warehouse path is 
'file:/Users/harshakanna/spark-warehouse/'.
22/01/28 01:29:36 INFO DataSourceUtils: Getting table path..
22/01/28 01:29:36 INFO TablePathUtils: Getting table path from path : 
s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO DefaultSource: Obtained hudi table path: 
s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/sessions/.hoodie/hoodie.properties
22/01/28 01:29:36 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3a://datalake-hudi/sessions
22/01/28 01:29:36 INFO DefaultSource: Is bootstrapped table => false, tableType 
is: COPY_ON_WRITE, queryType is: snapshot
22/01/28 01:29:36 INFO DefaultSource: Loading Base File Only View with options 
:Map(hoodie.datasource.query.type -> snapshot, hoodie.metadata.enable -> true, 
path -> s3a://datalake-hudi/sessions/)
22/01/28 01:29:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/sessions
22/01/28 01:29:37 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/sessions/.hoodie/hoodie.properties
22/01/28 01:29:37 INFO HoodieTableMetaClient: Finished Loading Table of type 
COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from 
s3a://datalake-hudi/sessions
22/01/28 01:29:37 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient 
from s3a://datalake-hudi/sessions/.hoodie/metadata
22/01/28 01:29:37 INFO HoodieTableConfig: Loading table properties from 
s3a://datalake-hudi/sessions/.hoodie/metadata/.hoodie/hoodie.properties
22/01/28 01:29:37 INFO HoodieTableMetaClient: Finished Loading Table of type 
MERGE_ON_READ(version=1, baseFileFormat=HFILE) from 
s3a://datalake-hudi/sessions/.hoodie/metadata
22/01/28 01:29:37 INFO HoodieTableMetadataUtil: Loading latest merged file 
slices for metadata table partition files
22/01/28 01:29:38 INFO HoodieActiveTimeline: Loaded instants upto : 
Option{val=[20220126024720121__deltacommit__COMPLETED]}
22/01/28 01:29:38 INFO AbstractTableFileSystemView: Took 2 ms to read 0 
instants, 0 replaced file groups
22/01/28 01:29:38 INFO ClusteringUtils: Found 0 files in pending clustering 
operations
22/01/28 01:29:38 INFO AbstractTableFileSystemView: Building file system view 
for partition (files)
22/01/28 01:29:38 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=9, 
NumFileGroups=1, FileGroupsCreationTime=11, StoreTimeTaken=0
22/01/28 01:29:38 INFO CacheConfig: Allocating LruBlockCache size=1.42 GB, 
blockSize=64 KB
22/01/28 01:29:38 INFO CacheConfig: Created c
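
For reference, this is the kind of snapshot read that produces the log above; a 
minimal sketch assuming the table is addressed only by its base path, with the 
option values taken from the logged options map (the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-3335-load").getOrCreate()

// Mirrors the logged options: snapshot query type, metadata-table listing
// enabled, and the table loaded by base path (no partition wildcards).
val df = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")
  .option("hoodie.metadata.enable", "true")
  .load("s3a://datalake-hudi/sessions/")
```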

[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603
 ] 

Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 7:30 AM:
---

Hi,

1) Yes, partitions are of the form 'yyyy/mm/dd'. No, using wildcards takes a 
long time; in fact, we switched to using only the base path, following the 
suggestion in the other ticket: https://issues.apache.org/jira/browse/HUDI-3066

2) 'metadata list-partitions': I am not able to run it successfully; I will 
provide the info once I can. It fails while parsing the SPARK_MASTER url=yarn 
or something similar.

3) No delete_partition operations were performed.

4) hive_sync is disabled intentionally.

5) 'metadata validate-files' has been running for all partitions for a while 
now, 389 in total, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of log 
files scanned => 7
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - 
MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Total size in 
bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Size of file 
spilled to disk => 0
1640607 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Opened 7 metadata log files (dataset instant=20220126024720121, metadata 
instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO  org.apache.hadoop.io.compress.CodecPool  - Got 
brand-new decompressor [.gz]
1640808 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO  org.apache.hudi.metadata.BaseTableMetadata  - 
Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  -  
FS and metadata files count not matching for date=2022/01/15. FS files count 
19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f
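
For context, one rough way to re-check a flagged partition outside the CLI is 
to list its base files straight from the filesystem and compare the count with 
the metadata listing above; a sketch using the Hadoop FileSystem API, assuming 
the partition directory reported in the log (date=2022/01/15) sits directly 
under the table base path:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Partition flagged above: FS reports 19 base files, metadata reports 0.
val partition = new Path("s3a://datalake-hudi/sessions/date=2022/01/15")
val fs = FileSystem.get(partition.toUri, new Configuration())
val baseFiles = fs.listStatus(partition)
  .map(_.getPath.getName)
  .filter(_.endsWith(".parquet"))
println(s"FS parquet base files: ${baseFiles.length}")
```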

[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603
 ] 

Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 7:27 AM:
---

Hi,

1) Yes, partitions are of the form 'yyyy/mm/dd'. No, using wildcards takes a 
long time; in fact, we switched to using only the base path, following the 
suggestion in the other ticket: https://issues.apache.org/jira/browse/HUDI-3066

2) 'metadata list-partitions': I am not able to run it successfully; I will 
provide the info once I can. It fails while parsing the SPARK_MASTER url=yarn 
or something similar.

3) No delete_partition operations were performed.

4) hive_sync is disabled intentionally.

5) 'metadata validate-files' has been running for all partitions for a while 
now, 329 in total, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of log 
files scanned => 7
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - 
MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Total size in 
bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Size of file 
spilled to disk => 0
1640607 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Opened 7 metadata log files (dataset instant=20220126024720121, metadata 
instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO  org.apache.hadoop.io.compress.CodecPool  - Got 
brand-new decompressor [.gz]
1640808 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO  org.apache.hudi.metadata.BaseTableMetadata  - 
Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  -  
FS and metadata files count not matching for date=2022/01/15. FS files count 
19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f

[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp type read by Hive

2022-01-27 Thread GitBox


hudi-bot commented on pull request #3391:
URL: https://github.com/apache/hudi/pull/3391#issuecomment-1023954251


   
   ## CI report:
   
   * ecb72b89015831cfbfa99ebcb027f660729b3195 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4792)
 
   * 223c320447bc9adc8fccaabb9c590bed159b375d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5574)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #3391: [HUDI-83] Fix Timestamp type read by Hive

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #3391:
URL: https://github.com/apache/hudi/pull/3391#issuecomment-1023952675


   
   ## CI report:
   
   * ecb72b89015831cfbfa99ebcb027f660729b3195 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4792)
 
   * 223c320447bc9adc8fccaabb9c590bed159b375d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603
 ] 

Harsha Teja Kanna edited comment on HUDI-3335 at 1/28/22, 7:25 AM:
---

Hi,

1) Yes, partitions are of the form 'yyyy/mm/dd'. No, using wildcards takes a 
long time; in fact, we switched to using only the base path, following the 
suggestion in the other ticket: https://issues.apache.org/jira/browse/HUDI-3066

2) 'metadata list-partitions': I am not able to run it successfully; I will 
provide the info once I can. It fails while parsing the SPARK_MASTER url=yarn 
or something similar.

3) No delete_partition operations were performed.

4) hive_sync is disabled intentionally.

5) 'metadata validate-files' has been running for all partitions for a while 
now, 329 in total, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of log 
files scanned => 7
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - 
MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Total size in 
bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Size of file 
spilled to disk => 0
1640607 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Opened 7 metadata log files (dataset instant=20220126024720121, metadata 
instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO  org.apache.hadoop.io.compress.CodecPool  - Got 
brand-new decompressor [.gz]
1640808 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO  org.apache.hudi.metadata.BaseTableMetadata  - 
Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  -  
FS and metadata files count not matching for date=2022/01/15. FS files count 
19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f37b4

[GitHub] [hudi] hudi-bot commented on pull request #3391: [HUDI-83] Fix Timestamp type read by Hive

2022-01-27 Thread GitBox


hudi-bot commented on pull request #3391:
URL: https://github.com/apache/hudi/pull/3391#issuecomment-1023952675


   
   ## CI report:
   
   * ecb72b89015831cfbfa99ebcb027f660729b3195 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4792)
 
   * 223c320447bc9adc8fccaabb9c590bed159b375d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #3391: [HUDI-83] Fix Timestamp type read by Hive

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #3391:
URL: https://github.com/apache/hudi/pull/3391#issuecomment-1002400317


   
   ## CI report:
   
   * ecb72b89015831cfbfa99ebcb027f660729b3195 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4792)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023951714


   
   ## CI report:
   
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 Azure: 
[CANCELED](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572)
 
   * 879e966586fe287e710fb2b9db7a2436fef03a92 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023949877


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572)
 
   * 879e966586fe287e710fb2b9db7a2436fef03a92 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023949877


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572)
 
   * 879e966586fe287e710fb2b9db7a2436fef03a92 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023930597


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483603#comment-17483603
 ] 

Harsha Teja Kanna commented on HUDI-3335:
-

Hi,

1) Yes, partitions are of the form 'yyyy/mm/dd'. No, using wildcards takes a 
long time; in fact, we switched to using only the base path, following the 
suggestion in the other ticket, HUDI-3066.

2) 'metadata list-partitions': I am not able to run it successfully; I will 
provide the info once I can. It fails while parsing the SPARK_MASTER url=yarn 
or something similar.

3) No delete_partition operations were performed.

4) hive_sync is disabled intentionally.

5) 'metadata validate-files' has been running for all partitions for a while 
now, 329 in total, but I see the errors below for many partitions:

1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of log 
files scanned => 7
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - 
MaxMemoryInBytes allowed for compaction => 1073741824
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in MemoryBasedMap in ExternalSpillableMap => 3
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Total size in 
bytes of MemoryBasedMap in ExternalSpillableMap => 1800
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Number of 
entries in BitCaskDiskMap in ExternalSpillableMap => 0
1640606 [Spring Shell] INFO  
org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner  - Size of file 
spilled to disk => 0
1640607 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Opened 7 metadata log files (dataset instant=20220126024720121, metadata 
instant=20220126024720121) in 3577 ms
1640806 [Spring Shell] INFO  org.apache.hadoop.io.compress.CodecPool  - Got 
brand-new decompressor [.gz]
1640808 [Spring Shell] INFO  org.apache.hudi.metadata.HoodieBackedTableMetadata 
 - Metadata read for 1 keys took [baseFileRead, logMerge] [0, 201] ms
1640809 [Spring Shell] INFO  org.apache.hudi.metadata.BaseTableMetadata  - 
Listed file in partition from metadata: partition=date=2022/01/15, #files=0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  -  
FS and metadata files count not matching for date=2022/01/15. FS files count 
19, metadata base files count 0
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-0_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
91eeab55-ee54-4285-a493-01d04a237691-1_775-489-15939_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
96c4d9d1-9de4-4c50-bd9c-f3f6a7bfb40b-0_133-10-6713_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
e266b306-b376-4954-8403-8f858dec34ee-0_134-10-6714_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
399fa917-1198-4c9a-bf63-754f40a5ad09-0_806-501-16124_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f37b4d113e64-0_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dde98b16-326c-4890-b100-4548d3161328-0_138-10-6718_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
80cd989e-296c-4dda-ae2d-4c7577fdc351-0_805-501-15994_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
83409360-e1c2-4bd6-ac22-b59e52eaf6ad-0_776-489-16054_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
5887f02d-fd66-4678-ac34-fe585e9946a4-0_137-10-6717_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
a03a35be-db55-4a96-a584-42b8007d507c-0_132-10-6712_20220121225020686.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
ddd0500e-9434-451b-9066-dfc7c5f6260b-0_777-489-16040_20220122024553151.parquet
1641313 [Spring Shell] ERROR org.apache.hudi.cli.commands.MetadataCommand  - FS 
file not found in metadata 
dac6d5d0-ca15-48e2-9b13-f37b4d113e64-1_804-501-16163_20220122014925413.parquet
1641313 [Spring Shell] ERROR org.apache.hudi

[GitHub] [hudi] hudi-bot commented on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4710:
URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023944845


   
   ## CI report:
   
   * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5573)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4710:
URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023943191


   
   ## CI report:
   
   * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test

2022-01-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-3088:
-
Labels: pull-request-available  (was: )

> Make Spark 3 the default profile for build and test
> ---
>
> Key: HUDI-3088
> URL: https://issues.apache.org/jira/browse/HUDI-3088
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> By default, when people check out the code, the Spark 3 profile should be 
> active for the repo. All tests should also run against the latest supported 
> Spark version. Correspondingly, the default Scala version becomes 2.12 and 
> the default Parquet version becomes 1.12.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot commented on pull request #4710: [HUDI-3088] Use Spark 3.2 as default Spark version

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4710:
URL: https://github.com/apache/hudi/pull/4710#issuecomment-1023943191


   
   ## CI report:
   
   * 2cb81fb4f433cd3b99716d70f3751ea2782bfc0a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023930597


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5572)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023910477


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023910477


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   * 8f670f3466a15e536605b67edd5586c152d04035 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023904815


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023906918


   
   ## CI report:
   
   * c13c56e14dad9fad992fdf4a50e24e45c1539817 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5570)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023872466


   
   ## CI report:
   
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569)
 
   * c13c56e14dad9fad992fdf4a50e24e45c1539817 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5570)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023903715


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023904815


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   * 2f14cbdd761921dc1b29c01b1201f58cc1f98b5a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023903715


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5571)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023902614


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4709:
URL: https://github.com/apache/hudi/pull/4709#issuecomment-1023902614


   
   ## CI report:
   
   * 1ee29bd82b5e0e4f5690f5e7469d064939e8e77a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3338) Use custom relation instead of HadoopFsRelation

2022-01-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-3338:
-
Labels: pull-request-available  (was: )

> Use custom relation instead of HadoopFsRelation
> ---
>
> Key: HUDI-3338
> URL: https://issues.apache.org/jira/browse/HUDI-3338
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Yann Byron
>Priority: Major
>  Labels: pull-request-available
>
> Per HUDI-3204, a COW table (and a MOR table in read_optimized query mode) 
> should return the original `data_date` in 'yyyy-MM-dd' format, not 
> 'yyyy/MM/dd'.
> The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of 
> COW and the read_optimized query mode of MOR.
> Spark's HadoopFsRelation appends the partition value parsed from the actual 
> partition path. However, unlike a normal table, Hudi persists the partition 
> value in the parquet file, so we just need to read the partition value from 
> the parquet file rather than leave it to Spark.
> So we should no longer use `HadoopFsRelation`, and should instead implement 
> Hudi's own `Relation` to handle this.
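
As a rough illustration of the direction described above, here is a minimal 
sketch of a custom relation; the class and parameter names are hypothetical, 
not Hudi's actual implementation, and it only shows the key point that no 
partition-path parsing happens in the scan:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: a relation that returns rows exactly as stored in the
// base parquet files, partition column included, instead of letting
// HadoopFsRelation re-derive partition values from the directory path.
class BaseFileOnlyRelationSketch(
    override val sqlContext: SQLContext,
    tableSchema: StructType,          // full schema, partition column included
    readBaseFiles: () => RDD[Row])    // assumed reader over the parquet base files
  extends BaseRelation with TableScan {

  override def schema: StructType = tableSchema

  // The partition value comes back as persisted in parquet ('yyyy-MM-dd'),
  // not as encoded in the partition path ('yyyy/MM/dd').
  override def buildScan(): RDD[Row] = readBaseFiles()
}
```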



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] YannByron opened a new pull request #4709: [HUDI-3338] custom relation instead of HadoopFsRelation

2022-01-27 Thread GitBox


YannByron opened a new pull request #4709:
URL: https://github.com/apache/hudi/pull/4709


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3338) Use custom relation instead of HadoopFsRelation

2022-01-27 Thread Yann Byron (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yann Byron updated HUDI-3338:
-
Description: 
Per HUDI-3204, a COW table (and a MOR table in read_optimized query mode) 
should return the original `data_date` in 'yyyy-MM-dd' format, not 
'yyyy/MM/dd'.

The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of 
COW and the read_optimized query mode of MOR.

Spark's HadoopFsRelation appends the partition value parsed from the actual 
partition path. However, unlike a normal table, Hudi persists the partition 
value in the parquet file, so we just need to read the partition value from the 
parquet file rather than leave it to Spark.


So we should no longer use `HadoopFsRelation`, and should instead implement 
Hudi's own `Relation` to handle this.

> Use custom relation instead of HadoopFsRelation
> ---
>
> Key: HUDI-3338
> URL: https://issues.apache.org/jira/browse/HUDI-3338
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark, spark-sql
>Reporter: Yann Byron
>Priority: Major
>
> Per HUDI-3204, a COW table (and a MOR table in read_optimized query mode) 
> should return the original `data_date` in 'yyyy-MM-dd' format, not 
> 'yyyy/MM/dd'.
> The reason is that Hudi uses HadoopFsRelation for the snapshot query mode of 
> COW and the read_optimized query mode of MOR.
> Spark's HadoopFsRelation appends the partition value parsed from the actual 
> partition path. However, unlike a normal table, Hudi persists the partition 
> value in the parquet file, so we just need to read the partition value from 
> the parquet file rather than leave it to Spark.
> So we should no longer use `HadoopFsRelation`, and should instead implement 
> Hudi's own `Relation` to handle this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (HUDI-3338) Use custom relation instead of HadoopFsRelation

2022-01-27 Thread Yann Byron (Jira)
Yann Byron created HUDI-3338:


 Summary: Use custom relation instead of HadoopFsRelation
 Key: HUDI-3338
 URL: https://issues.apache.org/jira/browse/HUDI-3338
 Project: Apache Hudi
  Issue Type: Improvement
  Components: spark, spark-sql
Reporter: Yann Byron






--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-2458) Relax compaction in metadata being fenced based on inflight requests in data table

2022-01-27 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-2458:

Description: 
Relax compaction in the metadata table being fenced based on inflight requests 
in the data table.

Compaction in the metadata table is triggered only if there are no inflight 
requests in the data table. This might cause a liveness problem: for very 
large deployments, we could have compaction or clustering always in progress. 
So, we should try to see how we can relax this constraint.

 

Proposal to remove this dependency:

With the recent addition of the spurious-deletes config, we can actually get 
away with this. 

As of now, we have 3 interlinked nuances.
 - Compaction in the metadata table may not kick in if there are any inflight 
operations in the data table. 

 - A rollback, when applied to the metadata table, depends on the last 
compaction instant in the metadata table. We might even throw an exception if 
the instant being rolled back is earlier than the latest metadata compaction 
instant time. 

 - Archival in the data table is fenced by the latest compaction in the 
metadata table. 

 

So, in case the data timeline has any dangling inflight operation (let's say 
someone tried clustering, killed it midway, and never attempted it again), 
metadata compaction will never kick in at all. I need to check what archival 
does for such inflight operations in the data table when it tries to archive 
nearby commits. 

 

So, with the spurious-deletes support we added recently, all of this can be 
much simplified. 

Whenever we want to apply a rollback commit, we don't need to take different 
actions based on whether the commit being rolled back has already been 
committed to the metadata table or not. Just go ahead and apply the rollback; 
merging of metadata payload records will take care of this. If the commit was 
already synced, the final merged payload will have no spurious deletes. If the 
commit being rolled back was never committed to the metadata table, the final 
merged payload may have some spurious deletes, which we can safely ignore. 

With this, compaction in the metadata table does not need to have any 
dependency on inflight operations in the data table. 

And we can loosen the dependency of archival in the data table on metadata 
table compaction as well. 

So, in summary, all 3 dependencies quoted above become moot if we go with this 
approach. Archival in the data table does not have any dependency on metadata 
table compaction. A rollback applied to the metadata table does not care about 
the last metadata table compaction. Compaction in the metadata table can 
proceed even if there are inflight operations in the data table. 

 

In particular, our logic for applying rollback metadata to the metadata table 
becomes a lot simpler and easier to reason about. 
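
As a hedged sketch of the simplified flow described above (all names below are 
illustrative and are not Hudi's actual APIs), merging can simply drop deletes 
for files the metadata table never saw:

```scala
// Illustrative only: 'MetadataFileListing' stands in for the files partition
// of the metadata table.
case class MetadataFileListing(files: Map[String, Long]) {

  // Apply a rollback unconditionally. If a deleted file was never synced to
  // the metadata table, the delete is "spurious" and is silently dropped by
  // the merge instead of failing the sync (the old flow threw when the
  // instant being rolled back preceded the latest metadata compaction).
  def applyRollback(deletedFiles: Seq[String]): MetadataFileListing =
    MetadataFileListing(files -- deletedFiles)
}

// Rolling back a commit that was never synced leaves the listing unchanged.
val listing = MetadataFileListing(Map("f1.parquet" -> 1024L))
val after   = listing.applyRollback(Seq("f2.parquet")) // spurious delete, ignored
```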

 

 

 

 


[jira] [Updated] (HUDI-1370) Scoping work needed to support bootstrapped data table and RFC-15 together

2022-01-27 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-1370:

Summary: Scoping work needed to support bootstrapped data table and RFC-15 
together  (was: Scoping work needed to support bootstrap and RFC-15 together)

> Scoping work needed to support bootstrapped data table and RFC-15 together
> --
>
> Key: HUDI-1370
> URL: https://issues.apache.org/jira/browse/HUDI-1370
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023851940


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562)
 
   * 4d38e462c4fc79432b3cef2691cb76229d054cab Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5568)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023886573


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * 4d38e462c4fc79432b3cef2691cb76229d054cab Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5568)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: [MINOR] Fix build of Hudi website (#4708)

2022-01-27 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 5aa8f30  [MINOR] Fix build of Hudi website (#4708)
5aa8f30 is described below

commit 5aa8f30f7ea27639c73fbff6612e317097920e09
Author: Y Ethan Guo 
AuthorDate: Thu Jan 27 20:46:49 2022 -0800

[MINOR] Fix build of Hudi website (#4708)
---
 website/package.json | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/website/package.json b/website/package.json
index 526429a..6b483ac 100644
--- a/website/package.json
+++ b/website/package.json
@@ -14,11 +14,11 @@
 "write-heading-ids": "docusaurus write-heading-ids"
   },
   "dependencies": {
-"@docusaurus/core": "^2.0.0-beta.3",
-"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3",
-"@docusaurus/plugin-sitemap": "^2.0.0-beta.3",
-"@docusaurus/preset-classic": "^2.0.0-beta.3",
-"@docusaurus/theme-search-algolia": "^2.0.0-beta.3",
+"@docusaurus/core": "2.0.0-beta.14",
+"@docusaurus/plugin-client-redirects": "2.0.0-beta.14",
+"@docusaurus/plugin-sitemap": "2.0.0-beta.14",
+"@docusaurus/preset-classic": "2.0.0-beta.14",
+"@docusaurus/theme-search-algolia": "2.0.0-beta.14",
 "@fontsource/comfortaa": "^4.5.0",
 "@mdx-js/react": "^1.6.21",
 "@svgr/webpack": "^5.5.0",


[GitHub] [hudi] yihua merged pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


yihua merged pull request #4708:
URL: https://github.com/apache/hudi/pull/4708


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on a change in pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


yihua commented on a change in pull request #4708:
URL: https://github.com/apache/hudi/pull/4708#discussion_r794194072



##
File path: website/package.json
##
@@ -14,11 +14,11 @@
 "write-heading-ids": "docusaurus write-heading-ids"
   },
   "dependencies": {
-"@docusaurus/core": "^2.0.0-beta.3",
-"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3",
-"@docusaurus/plugin-sitemap": "^2.0.0-beta.3",
-"@docusaurus/preset-classic": "^2.0.0-beta.3",
-"@docusaurus/theme-search-algolia": "^2.0.0-beta.3",
+"@docusaurus/core": "2.0.0-beta.14",
+"@docusaurus/plugin-client-redirects": "2.0.0-beta.14",
+"@docusaurus/plugin-sitemap": "^2.0.0-beta.14",

Review comment:
   Good call.  I triaged it: the issue is actually due to the latest version of 
docusaurus, `2.0.0-beta.15`, released yesterday.  Because the caret range 
`^2.0.0-beta.3` admits any newer `2.0.0` prerelease under npm's semver rules, 
the new release was picked up automatically; freezing the version to 
`2.0.0-beta.14` solves the issue.
   
   @vingov do you know why docusaurus has `beta` in its versions?  Are the 
releases still experimental?  For now, sticking to one version saves us time 
debugging such issues again in the near future.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023871149


   
   ## CI report:
   
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569)
 
   * c13c56e14dad9fad992fdf4a50e24e45c1539817 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023872466


   
   ## CI report:
   
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569)
 
   * c13c56e14dad9fad992fdf4a50e24e45c1539817 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5570)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023871149


   
   ## CI report:
   
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569)
 
   * c13c56e14dad9fad992fdf4a50e24e45c1539817 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023868510


   
   ## CI report:
   
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023868510


   
   ## CI report:
   
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023858958


   
   ## CI report:
   
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567)
 
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vingov commented on a change in pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


vingov commented on a change in pull request #4708:
URL: https://github.com/apache/hudi/pull/4708#discussion_r794187342



##
File path: website/package.json
##
@@ -14,11 +14,11 @@
 "write-heading-ids": "docusaurus write-heading-ids"
   },
   "dependencies": {
-"@docusaurus/core": "^2.0.0-beta.3",
-"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3",
-"@docusaurus/plugin-sitemap": "^2.0.0-beta.3",
-"@docusaurus/preset-classic": "^2.0.0-beta.3",
-"@docusaurus/theme-search-algolia": "^2.0.0-beta.3",
+"@docusaurus/core": "2.0.0-beta.14",
+"@docusaurus/plugin-client-redirects": "2.0.0-beta.14",
+"@docusaurus/plugin-sitemap": "^2.0.0-beta.14",

Review comment:
   It's not good practice to freeze the versions in package.json; the versions 
are already frozen locally in package-lock.json.  But I see your point: if you 
want stability, we can freeze the versions.  We should still try, once in a 
while, to upgrade to the latest stable version, which might have security and 
other critical bug fixes.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on a change in pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


yihua commented on a change in pull request #4708:
URL: https://github.com/apache/hudi/pull/4708#discussion_r794187235



##
File path: website/package.json
##
@@ -14,11 +14,11 @@
 "write-heading-ids": "docusaurus write-heading-ids"
   },
   "dependencies": {
-"@docusaurus/core": "^2.0.0-beta.3",
-"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3",
-"@docusaurus/plugin-sitemap": "^2.0.0-beta.3",
-"@docusaurus/preset-classic": "^2.0.0-beta.3",
-"@docusaurus/theme-search-algolia": "^2.0.0-beta.3",
+"@docusaurus/core": "2.0.0-beta.14",
+"@docusaurus/plugin-client-redirects": "2.0.0-beta.14",
+"@docusaurus/plugin-sitemap": "^2.0.0-beta.14",
+"@docusaurus/preset-classic": "2.0.0-beta.14",
+"@docusaurus/theme-search-algolia": "^2.0.0-beta.14",

Review comment:
   I followed the pattern of the original PR @vingov put up (some dependencies 
have fixed versions and some have caret ranges).  Let me test the latest and 
then freeze all the versions together.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #4681: [HUDI-2987] [WIP] diff to update all remove deprecated calls to HoodieRecordPayload

2022-01-27 Thread GitBox


nsivabalan commented on a change in pull request #4681:
URL: https://github.com/apache/hudi/pull/4681#discussion_r794145199



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/common/table/log/HoodieFileSliceReader.java
##
@@ -53,10 +55,12 @@ public static HoodieFileSliceReader getFileSliceReader(
   return new HoodieFileSliceReader(scanner.iterator());
 } else {
   Iterable<HoodieRecord<? extends HoodieRecordPayload>> iterable = () -> scanner.iterator();
+  // todo : wire in event time field as well
+  HoodiePayloadConfig payloadConfig =
+      HoodiePayloadConfig.newBuilder().withPayloadOrderingField(preCombineField).build();

Review comment:
   Need to wire in the event time from callers. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023857805


   
   ## CI report:
   
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567)
 
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023858958


   
   ## CI report:
   
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567)
 
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5569)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023857805


   
   ## CI report:
   
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567)
 
   * 0fcf479ebd1b4806f04b221dcdb59ebc44cc079e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023834106


   
   ## CI report:
   
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vingov commented on a change in pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


vingov commented on a change in pull request #4708:
URL: https://github.com/apache/hudi/pull/4708#discussion_r794181432



##
File path: website/package.json
##
@@ -14,11 +14,11 @@
 "write-heading-ids": "docusaurus write-heading-ids"
   },
   "dependencies": {
-"@docusaurus/core": "^2.0.0-beta.3",
-"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3",
-"@docusaurus/plugin-sitemap": "^2.0.0-beta.3",
-"@docusaurus/preset-classic": "^2.0.0-beta.3",
-"@docusaurus/theme-search-algolia": "^2.0.0-beta.3",
+"@docusaurus/core": "2.0.0-beta.14",
+"@docusaurus/plugin-client-redirects": "2.0.0-beta.14",
+"@docusaurus/plugin-sitemap": "^2.0.0-beta.14",

Review comment:
   2.0.0-beta.15 is the latest released version; did you test beta-15?  If that 
works, can you please freeze it to beta-15?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023850656


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562)
 
   * 4d38e462c4fc79432b3cef2691cb76229d054cab UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023851940


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562)
 
   * 4d38e462c4fc79432b3cef2691cb76229d054cab Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5568)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] manojpec commented on a change in pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-27 Thread GitBox


manojpec commented on a change in pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#discussion_r794176408



##
File path: 
hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieFileReader.java
##
@@ -35,6 +37,14 @@
 
   public Set<String> filterRowKeys(Set<String> candidateRowKeys);
 
+  default Map<String, R> getRecordsByKeys(TreeSet<String> sortedCandidateRowKeys) throws IOException {

Review comment:
   Fixed.

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java
##
@@ -101,4 +116,34 @@ public static HoodieRecord getTaggedRecord(HoodieRecord inputRecord, Option<HoodieRecordLocation> location) {
+  public static List<String> filterKeysFromFile(Path filePath, List<String> candidateRecordKeys,
+                                                Configuration configuration) throws HoodieIndexException {
+    ValidationUtils.checkArgument(FSUtils.isBaseFile(filePath));
+    List<String> foundRecordKeys = new ArrayList<>();
+    try {
+      // Load all rowKeys from the file, to double-confirm
+      if (!candidateRecordKeys.isEmpty()) {
+        HoodieTimer timer = new HoodieTimer().startTimer();
+        HoodieFileReader fileReader = HoodieFileReaderFactory.getFileReader(configuration, filePath);
+        Set<String> fileRowKeys = fileReader.filterKeys(new TreeSet<>(candidateRecordKeys));

Review comment:
   fixed. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023725018


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4352: [HUDI-1295] Metadata Index - Bloom filter and Column stats index to speed up index lookups

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4352:
URL: https://github.com/apache/hudi/pull/4352#issuecomment-1023850656


   
   ## CI report:
   
   * 235981abd20a498a3e29e98ce0eda9de35018f99 UNKNOWN
   * 176c05c8e5da623acdf8d333050b5f394a36aee9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5562)
 
   * 4d38e462c4fc79432b3cef2691cb76229d054cab UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3316) HoodieColumnRangeMetadata doesn't include all statistics for the column

2022-01-27 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3316:
-
Status: In Progress  (was: Open)

> HoodieColumnRangeMetadata doesn't include all statistics for the column
> ---
>
> Key: HUDI-3316
> URL: https://issues.apache.org/jira/browse/HUDI-3316
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> HoodieColumnRangeMetadata includes the following stats about a parquet column:
>  * columnName
>  * minValue
>  * maxValue
>  * numNulls
>  
> Parquet's ColumnChunkMetaData does have more stats, and we need to include them 
> all in our index: 
>  * num values 
>  * total size
>  * total uncompressed size



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3260) Support column stat index for multiple columns

2022-01-27 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3260:
-
Status: In Progress  (was: Open)

> Support column stat index for multiple columns
> --
>
> Key: HUDI-3260
> URL: https://issues.apache.org/jira/browse/HUDI-3260
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: sev:normal
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3260) Support column stat index for multiple columns

2022-01-27 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3260:
-
Sprint: Hudi-Sprint-Jan-24

> Support column stat index for multiple columns
> --
>
> Key: HUDI-3260
> URL: https://issues.apache.org/jira/browse/HUDI-3260
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
>  Labels: sev:normal
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] nsivabalan commented on a change in pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


nsivabalan commented on a change in pull request #4708:
URL: https://github.com/apache/hudi/pull/4708#discussion_r794169735



##
File path: website/package.json
##
@@ -14,11 +14,11 @@
 "write-heading-ids": "docusaurus write-heading-ids"
   },
   "dependencies": {
-"@docusaurus/core": "^2.0.0-beta.3",
-"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3",
-"@docusaurus/plugin-sitemap": "^2.0.0-beta.3",
-"@docusaurus/preset-classic": "^2.0.0-beta.3",
-"@docusaurus/theme-search-algolia": "^2.0.0-beta.3",
+"@docusaurus/core": "2.0.0-beta.14",
+"@docusaurus/plugin-client-redirects": "2.0.0-beta.14",
+"@docusaurus/plugin-sitemap": "^2.0.0-beta.14",
+"@docusaurus/preset-classic": "2.0.0-beta.14",
+"@docusaurus/theme-search-algolia": "^2.0.0-beta.14",

Review comment:
   This one also needs the fix: remove the "^" at the beginning.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on a change in pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


xushiyan commented on a change in pull request #4708:
URL: https://github.com/apache/hudi/pull/4708#discussion_r794168724



##
File path: website/package.json
##
@@ -14,11 +14,11 @@
 "write-heading-ids": "docusaurus write-heading-ids"
   },
   "dependencies": {
-"@docusaurus/core": "^2.0.0-beta.3",
-"@docusaurus/plugin-client-redirects": "^2.0.0-beta.3",
-"@docusaurus/plugin-sitemap": "^2.0.0-beta.3",
-"@docusaurus/preset-classic": "^2.0.0-beta.3",
-"@docusaurus/theme-search-algolia": "^2.0.0-beta.3",
+"@docusaurus/core": "2.0.0-beta.14",
+"@docusaurus/plugin-client-redirects": "2.0.0-beta.14",
+"@docusaurus/plugin-sitemap": "^2.0.0-beta.14",

Review comment:
   Shall we freeze the versions at `2.0.0-beta.14`?
   ```suggestion
   "@docusaurus/plugin-sitemap": "2.0.0-beta.14",
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


nsivabalan commented on pull request #4708:
URL: https://github.com/apache/hudi/pull/4708#issuecomment-1023839234


   CC @vingov 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3316) HoodieColumnRangeMetadata doesn't include all statistics for the column

2022-01-27 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3316:
-
Description: 
HoodieColumnRangeMetadata includes the following stats about a parquet column:
 * columnName
 * minValue
 * maxValue
 * numNulls

 

Parquet's ColumnChunkMetaData does have more stats, and we need to include them 
all in our index: 
 * num values 
 * total size
 * total uncompressed size

  was:
HoodieColumnRangeMetadata includes the following stats about a parquet column:
 * columnName
 * minValue
 * maxValue
 * numNulls

 

Parquet's ColumnChunkMetaData does have more stats, and we need to include them 
all in our index: 
 * distinct
 * num values 


> HoodieColumnRangeMetadata doesn't include all statistics for the column
> ---
>
> Key: HUDI-3316
> URL: https://issues.apache.org/jira/browse/HUDI-3316
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> HoodieColumnRangeMetadata includes the following stats about a parquet column:
>  * columnName
>  * minValue
>  * maxValue
>  * numNulls
>  
> Parquet's ColumnChunkMetaData does have more stats, and we need to include them 
> all in our index: 
>  * num values 
>  * total size
>  * total uncompressed size
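
As a hedged sketch of where these extra statistics live (standard 
parquet-hadoop footer APIs, not Hudi code; `printColumnStats` is a made-up 
helper name):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader

import scala.collection.JavaConverters._

def printColumnStats(conf: Configuration, file: Path): Unit = {
  val footer = ParquetFileReader.readFooter(conf, file, ParquetMetadataConverter.NO_FILTER)
  for (block <- footer.getBlocks.asScala; column <- block.getColumns.asScala) {
    val stats = column.getStatistics
    // Stats already captured by HoodieColumnRangeMetadata:
    println(s"${column.getPath}: min=${stats.genericGetMin}, max=${stats.genericGetMax}, " +
      s"nulls=${stats.getNumNulls}")
    // The additional stats this issue proposes to index:
    println(s"  numValues=${column.getValueCount}, totalSize=${column.getTotalSize}, " +
      s"totalUncompressedSize=${column.getTotalUncompressedSize}")
  }
}
```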



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3316) HoodieColumnRangeMetadata doesn't include all statistics for the column

2022-01-27 Thread Manoj Govindassamy (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manoj Govindassamy updated HUDI-3316:
-
Summary: HoodieColumnRangeMetadata doesn't include all statistics for the 
column  (was: HoodieColumnRangeMetadata doesn't include all Parquet chunk 
statistics)

> HoodieColumnRangeMetadata doesn't include all statistics for the column
> ---
>
> Key: HUDI-3316
> URL: https://issues.apache.org/jira/browse/HUDI-3316
> Project: Apache Hudi
>  Issue Type: Task
>  Components: writer-core
>Reporter: Manoj Govindassamy
>Assignee: Manoj Govindassamy
>Priority: Blocker
> Fix For: 0.11.0
>
>
> HoodieColumnRangeMetadata includes the following stats about a parquet column:
>  * columnName
>  * minValue
>  * maxValue
>  * numNulls
>  
> Parquet's ColumnChunkMetaData does have more stats, and we need to include them 
> all in our index: 
>  * distinct
>  * num values 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023796380


   
   ## CI report:
   
   * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563)
 
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023834106


   
   ## CI report:
   
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua opened a new pull request #4708: [MINOR] Fix build of Hudi website

2022-01-27 Thread GitBox


yihua opened a new pull request #4708:
URL: https://github.com/apache/hudi/pull/4708


   ## What is the purpose of the pull request
   
   The build of Hudi website is broken due to the following error from `npm run 
build`:
   ```
   (asf-site)> npm run build
   
   > hudi@0.0.0 build
   > docusaurus build
   
   [INFO] Website will be built for all these locales: 
   - en
   - cn
   [INFO] [en] Creating an optimized production build...
   
   ✔ Client
 
   ✖ Server
 Compiled with some errors in 2.64m
   
   [ERROR] Docusaurus Node/SSR could not render static page with path / because 
of following error:
   Error: Minified React error #130; visit 
https://reactjs.org/docs/error-decoder.html?invariant=130&args[]=object&args[]= 
for the full message or use the non-minified dev environment for full errors 
and additional helpful warnings.
   at a.b.render (main:115785:32)
   at a.b.read (main:115781:83)
   at Object.exports.renderToString (main:115792:138)
   at doRender (main:25801:356)
   at async serverEntry_render (main:25797:329)
   
   Error: Server-side rendering fails due to the error above.
   [ERROR] Unable to build website for locale en.
   [ERROR] Error: Failed to compile with errors.
   at 
/Users/ethan/Work/repo/hudi-docs-8/website/node_modules/@docusaurus/core/lib/webpack/utils.js:207:24
   at 
/Users/ethan/Work/repo/hudi-docs-8/website/node_modules/webpack/lib/MultiCompiler.js:554:14
   at processQueueWorker 
(/Users/ethan/Work/repo/hudi-docs-8/website/node_modules/webpack/lib/MultiCompiler.js:491:6)
   at processTicksAndRejections (node:internal/process/task_queues:78:11)
   ```
   
   The root cause is that the docusaurus version constraints in 
`website/package.json` admit more than intended: the caret range 
`^2.0.0-beta.3` allows any newer `2.0.0` prerelease under npm's semver rules, 
so the generated `website/package-lock.json` actually resolves to 
`2.0.0-beta.15`.  Further evidence that a higher version was already in use is 
that `2.0.0-beta.14` shows up in generated content:
   
   ```
   ./content/docs/next/clustering/index.html:
   ```
   
   The build failure is likely due to recent new versions (`2.0.0-beta.15`, 
`2.0.0-beta.16`) of docusaurus and related dependencies.
   
   The fix is to bound the docusaurus version properly.
   
   Note that the build failure can only be reproduced from a fresh clone of the 
branch from remote, with `npm install` and `npm run build` under the `website` 
folder.  If there was a previous successful build and the package info is 
cached, the build failure may not show up.
   
   ## Brief change log
   
 - Updates `website/package.json` to bound the docusaurus version properly.
   
   ## Verify this pull request
   
   The change is verified by a fresh build of the website.  The website can be 
successfully launched after `npm start`.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023796380


   
   ## CI report:
   
   * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563)
 
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5567)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023793855


   
   ## CI report:
   
   * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563)
 
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023793855


   
   ## CI report:
   
   * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563)
 
   * f8cb5b06e3940fe5a931bf968f394bd6068b4731 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4704: [HUDI-3330] Remove fixture test tables for multi writer tests

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4704:
URL: https://github.com/apache/hudi/pull/4704#issuecomment-1023728908


   
   ## CI report:
   
   * 5625be38641c68789265d95bd0b7ed51a83105b4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5563)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483505#comment-17483505
 ] 

sivabalan narayanan commented on HUDI-3335:
---

Can you also enable debug logs (just for Hudi), rerun your query, and give us 
the logs. 
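
For example, one way to raise only Hudi's log level from the Spark shell (a 
sketch assuming the log4j 1.x bundled with Spark 3.1):

```scala
import org.apache.log4j.{Level, Logger}

// Turn on DEBUG just for Hudi classes, leaving everything else at the default.
Logger.getLogger("org.apache.hudi").setLevel(Level.DEBUG)
```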

 

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Have a COW table with metadata enabled. Loading from Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
> df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
> df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestam

[GitHub] [hudi] nsivabalan commented on issue #3478: [SUPPORT] Unexpected Hive behaviour

2022-01-27 Thread GitBox


nsivabalan commented on issue #3478:
URL: https://github.com/apache/hudi/issues/3478#issuecomment-1023790632


   awesome, thanks for updating! 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-2596) Make class names consistent in hudi-client

2022-01-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-2596.

 Reviewers: Ethan Guo
Resolution: Done

> Make class names consistent in hudi-client
> --
>
> Key: HUDI-2596
> URL: https://issues.apache.org/jira/browse/HUDI-2596
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Currently we have different naming conventions for the abstract classes, such 
> as AbstractBulkInsertHelper, BaseCommitActionExecutor, 
> HoodieTableFileIndexBase, etc.  Ideally, we should have a single naming 
> convention for such common abstractions/interfaces: "Abstract*", "Base*", or 
> "*Base". I prefer "Base*".



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
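
The rename summary later in this digest (e.g. AbstractCompactor -> BaseCompactor, AbstractHoodieClient -> BaseHoodieClient) follows the "Base*" choice. A hypothetical Java sketch of the convention; the class bodies below are illustrative only, not the actual Hudi code:

{code:java}
// "Base*" marks the shared abstraction that every engine extends.
public abstract class BaseCompactor<T> {
  protected final T writeClient;

  protected BaseCompactor(T writeClient) {
    this.writeClient = writeClient;
  }

  // Engine-specific subclasses implement the actual compaction step.
  public abstract void compact(String instantTime) throws Exception;
}

// Engine-specific implementations keep the engine name as a prefix.
class SparkCompactor extends BaseCompactor<Object> {
  SparkCompactor(Object writeClient) {
    super(writeClient);
  }

  @Override
  public void compact(String instantTime) {
    // Spark-specific compaction logic would go here.
  }
}
{code}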


[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483261#comment-17483261
 ] 

sivabalan narayanan edited comment on HUDI-3335 at 1/28/22, 1:08 AM:
-

CC [~manojpec] [~guoyihua] [~codope]  metadata related bug

 


was (Author: shivnarayan):
CC [~manojpec] [~guoyihua]  metadata related bug

 

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but then the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-
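
The NullPointerException in the stack trace above originates in Guava's LocalCache.put(), which null-checks both the key and the value via Preconditions.checkNotNull(); this suggests a null leaf-file listing was handed to Spark's FileStatusCache for some partition. A minimal, self-contained Java sketch of that failure mode and a defensive guard; the partition key and the null listing are hypothetical stand-ins, assuming the metadata-based listing can come back null for an empty partition:

{code:java}
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

public class NullCachePutDemo {
  public static void main(String[] args) {
    Cache<String, Object> cache = CacheBuilder.newBuilder().build();

    // Stand-in for a partition whose file listing came back null.
    Object leafFiles = null;

    // Guava rejects null values: this throws NullPointerException from
    // Preconditions.checkNotNull(), matching the top of the stack trace.
    try {
      cache.put("date=2022/01/25", leafFiles);
    } catch (NullPointerException expected) {
      System.out.println("put(null) rejected: " + expected);
    }

    // A defensive caller skips (or logs) null listings before caching.
    if (leafFiles != null) {
      cache.put("date=2022/01/25", leafFiles);
    }
  }
}
{code}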

[jira] [Reopened] (HUDI-2596) Make class names consistent in hudi-client

2022-01-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reopened HUDI-2596:
--

> Make class names consistent in hudi-client
> --
>
> Key: HUDI-2596
> URL: https://issues.apache.org/jira/browse/HUDI-2596
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Currently we have different naming conventions for the abstract classes, such 
> as AbstractBulkInsertHelper, BaseCommitActionExecutor, 
> HoodieTableFileIndexBase, etc.  Ideally, we should have a single naming 
> convention for such common abstractions/interfaces: "Abstract*", "Base*", or 
> "*Base". I prefer "Base*".



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3088) Make Spark 3 the default profile for build and test

2022-01-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3088:
-
Status: In Progress  (was: Open)

> Make Spark 3 the default profile for build and test
> ---
>
> Key: HUDI-3088
> URL: https://issues.apache.org/jira/browse/HUDI-3088
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.11.0
>
>
> By default, when people check out the code, the Spark 3 profile should be 
> active for the repo. All tests should also run against the latest supported 
> Spark version. Correspondingly, the default Scala version becomes 2.12 and 
> the default Parquet version becomes 1.12.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HUDI-2596) Make class names consistent in hudi-client

2022-01-27 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu resolved HUDI-2596.
--

> Make class names consistent in hudi-client
> --
>
> Key: HUDI-2596
> URL: https://issues.apache.org/jira/browse/HUDI-2596
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>
> Currently we have different naming conventions for the abstract classes, such 
> as AbstractBulkInsertHelper, BaseCommitActionExecutor, 
> HoodieTableFileIndexBase, etc.  Ideally, we should have a single naming 
> convention for such common abstractions/interfaces: "Abstract*", "Base*", or 
> "*Base". I prefer "Base*".



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[hudi] branch master updated (4a9f826 -> 0bd38f2)

2022-01-27 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 4a9f826  [HUDI-3215] Solve UT for Spark 3.2 (#4565)
 add 0bd38f2  [HUDI-2596] Make class names consistent in hudi-client (#4680)

No new revisions were added by this update.

Summary of changes:
 .../hudi/cli/commands/TestRollbacksCommand.java|  4 +--
 .../apache/hudi/async/AsyncClusteringService.java  | 14 
 .../org/apache/hudi/async/AsyncCompactService.java | 14 
 .../apache/hudi/client/AsyncCleanerService.java|  6 ++--
 ...actClusteringClient.java => BaseClusterer.java} |  8 ++---
 .../{AbstractCompactor.java => BaseCompactor.java} |  8 ++---
 ...ractHoodieClient.java => BaseHoodieClient.java} |  8 ++---
 ...WriteClient.java => BaseHoodieWriteClient.java} | 20 ++--
 .../apache/hudi/client/CompactionAdminClient.java  |  2 +-
 .../java/org/apache/hudi/keygen/KeyGenUtils.java   |  8 ++---
 .../keygen/TimestampBasedAvroKeyGenerator.java |  8 ++---
 ...meParser.java => BaseHoodieDateTimeParser.java} |  4 +--
 ...meParserImpl.java => HoodieDateTimeParser.java} |  4 +--
 .../metadata/HoodieBackedTableMetadataWriter.java  |  6 ++--
 .../hudi/metrics/MetricsReporterFactory.java   | 10 +++---
 .../CustomizableMetricsReporter.java}  | 14 
 .../AbstractUserDefinedMetricsReporter.java| 37 +-
 ...InsertHelper.java => BaseBulkInsertHelper.java} |  2 +-
 .../action/commit/BaseCommitActionExecutor.java|  2 +-
 ...ractDeleteHelper.java => BaseDeleteHelper.java} |  2 +-
 ...stractMergeHelper.java => BaseMergeHelper.java} |  2 +-
 ...stractWriteHelper.java => BaseWriteHelper.java} |  2 +-
 .../hudi/table/upgrade/DowngradeHandler.java   |  4 +--
 .../hudi/table/upgrade/OneToTwoUpgradeHandler.java |  2 +-
 .../table/upgrade/OneToZeroDowngradeHandler.java   |  2 +-
 ...deHelper.java => SupportsUpgradeDowngrade.java} |  2 +-
 .../table/upgrade/ThreeToTwoDowngradeHandler.java  |  2 +-
 .../table/upgrade/TwoToOneDowngradeHandler.java|  2 +-
 .../table/upgrade/TwoToThreeUpgradeHandler.java|  2 +-
 .../hudi/table/upgrade/UpgradeDowngrade.java   |  4 +--
 .../apache/hudi/table/upgrade/UpgradeHandler.java  |  4 +--
 .../table/upgrade/ZeroToOneUpgradeHandler.java |  2 +-
 .../hudi/metrics/TestMetricsReporterFactory.java   |  8 ++---
 .../providers/HoodieWriteClientProvider.java   |  4 +--
 .../apache/hudi/client/HoodieFlinkWriteClient.java |  2 +-
 .../FlinkHoodieBackedTableMetadataWriter.java  |  2 +-
 .../table/action/commit/FlinkDeleteHelper.java |  2 +-
 .../hudi/table/action/commit/FlinkMergeHelper.java |  2 +-
 .../hudi/table/action/commit/FlinkWriteHelper.java |  2 +-
 .../table/upgrade/FlinkUpgradeDowngradeHelper.java |  2 +-
 .../apache/hudi/client/HoodieJavaWriteClient.java  |  2 +-
 .../table/action/commit/JavaBulkInsertHelper.java  |  4 +--
 .../hudi/table/action/commit/JavaDeleteHelper.java |  2 +-
 .../hudi/table/action/commit/JavaMergeHelper.java  |  2 +-
 .../hudi/table/action/commit/JavaWriteHelper.java  |  2 +-
 .../hudi/async/SparkAsyncClusteringService.java|  8 ++---
 .../hudi/async/SparkAsyncCompactService.java   |  8 ++---
 .../hudi/client/HoodieSparkClusteringClient.java   |  4 +--
 .../apache/hudi/client/HoodieSparkCompactor.java   |  4 +--
 .../apache/hudi/client/SparkRDDWriteClient.java|  2 +-
 .../table/action/commit/SparkBulkInsertHelper.java |  4 +--
 .../table/action/commit/SparkDeleteHelper.java |  4 +--
 .../hudi/table/action/commit/SparkMergeHelper.java |  2 +-
 .../hudi/table/action/commit/SparkWriteHelper.java |  4 +--
 ...ava => BaseSparkDeltaCommitActionExecutor.java} |  8 ++---
 .../SparkBulkInsertDeltaCommitActionExecutor.java  | 10 +++---
 ...BulkInsertPreppedDeltaCommitActionExecutor.java |  8 ++---
 .../SparkDeleteDeltaCommitActionExecutor.java  |  4 +--
 .../SparkInsertDeltaCommitActionExecutor.java  |  4 +--
 ...parkInsertPreppedDeltaCommitActionExecutor.java |  3 +-
 .../SparkUpsertDeltaCommitActionExecutor.java  |  4 +--
 ...parkUpsertPreppedDeltaCommitActionExecutor.java |  3 +-
 .../table/upgrade/SparkUpgradeDowngradeHelper.java |  2 +-
 .../functional/TestHoodieBackedMetadata.java   |  2 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java  |  4 +--
 .../hudi/table/TestHoodieMergeOnReadTable.java |  4 +--
 .../SparkStreamingAsyncClusteringService.java  |  8 ++---
 .../async/SparkStreamingAsyncCompactService.java   |  8 ++---
 68 files changed, 178 insertions(+), 181 deletions(-)
 rename 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/{AbstractClusteringClient.java
 => BaseClusterer.java} (80%)
 rename 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/{AbstractCompactor.java
 => BaseCompactor.java} (78%)
 rename 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/{AbstractHoodieClient.java

[GitHub] [hudi] yihua merged pull request #4680: [HUDI-2596] Make class names consistent in hudi-client

2022-01-27 Thread GitBox


yihua merged pull request #4680:
URL: https://github.com/apache/hudi/pull/4680


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4680: [HUDI-2596] Make class names consistent in hudi-client

2022-01-27 Thread GitBox


hudi-bot commented on pull request #4680:
URL: https://github.com/apache/hudi/pull/4680#issuecomment-1023785526


   
   ## CI report:
   
   * ae88c2fc58bf07a435feb971435646258e2b5e87 UNKNOWN
   * 5d9189c4f457e5877280f00d0dcd9ccdb476135f Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5565)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4680: [HUDI-2596] Make class names consistent in hudi-client

2022-01-27 Thread GitBox


hudi-bot removed a comment on pull request #4680:
URL: https://github.com/apache/hudi/pull/4680#issuecomment-1023749306


   
   ## CI report:
   
   * ae88c2fc58bf07a435feb971435646258e2b5e87 UNKNOWN
   * 5593a7e380700e7f89c65b44b20dfa4d31a15ea9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5481)
 
   * 5d9189c4f457e5877280f00d0dcd9ccdb476135f Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5565)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3335:

Fix Version/s: 0.11.0

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
> Fix For: 0.11.0
>
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but then the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/sessions \
> --target-table se

[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483503#comment-17483503
 ] 

sivabalan narayanan edited comment on HUDI-3335 at 1/28/22, 1:00 AM:
-

[~h7kanna]: an orthogonal question: was hive sync disabled intentionally?

From your logs:
{code:java}
hoodie.datasource.hive_sync.enable=false  {code}


was (Author: shivnarayan):
[~h7kanna]: an orthogonal question: was hive sync disabled by default?

From your logs:
{code:java}
hoodie.datasource.hive_sync.enable=false  {code}

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but then the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStrea

[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483503#comment-17483503
 ] 

sivabalan narayanan commented on HUDI-3335:
---

[~h7kanna]: an orthogonal question: was hive sync disabled by default?

From your logs:
{code:java}
hoodie.datasource.hive_sync.enable=false  {code}

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but then the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
> --table-type COPY_ON_WRITE \
> --sourc

[jira] [Updated] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-3335:

Priority: Blocker  (was: Critical)

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Blocker
> Fix For: 0.11.0
>
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but then the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
> --table-type COPY_ON_WRITE \
> --source-ordering-field timestamp \
> --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
> --target-base-path s3a://datalake-hudi/sessions \
> --tar

[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483499#comment-17483499
 ] 

sivabalan narayanan edited comment on HUDI-3335 at 1/28/22, 12:58 AM:
--

[~h7kanna]: I assume your partitions are of the format "yyyy/mm/dd", and so the 
partitions are three levels deep. 

Can you check if giving an explicit glob path works? 

for eg:
{code:java}
 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/*/*/*/*") {code}
I assume the above will be slower than not giving an explicit glob 
pattern, but I wanted to rule things out. 

Also, can you try the below command in hudi-cli and let us know what you see. 
{code:java}
connect --path basePath
set conf SPARK_MASTER=local[2]
metadata list-partitions {code}
Also, can you run below command and let us know what you see 
{code:java}
metadata validateFiles {code}
 


was (Author: shivnarayan):
[~h7kanna]: I assume your partitions are of the format "yyyy/mm/dd", and so the 
partitions are three levels deep. 

Can you check if giving an explicit glob path works? 

for eg:
{code:java}
 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/*/*/*/*") {code}
I assume the above will be slower than not giving an explicit glob 
pattern, but I wanted to rule things out. 

 

Also, can you try the below command in hudi-cli and let us know what you see. 
{code:java}
connect --path basePath
set conf SPARK_MASTER=local[2]
metadata list-partitions {code}

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but then the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.dat

[GitHub] [hudi] vingov opened a new pull request #4707: Stop-gap solution to fix the broken blog link

2022-01-27 Thread GitBox


vingov opened a new pull request #4707:
URL: https://github.com/apache/hudi/pull/4707


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Fix the broken link with a workaround redirect. The regular redirect is not 
working, and the link has already been tweeted, so this stop-gap reduces the 
impact.
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483501#comment-17483501
 ] 

sivabalan narayanan commented on HUDI-3335:
---

By any chance do any of your partitions have 0 files? For example, a partition 
added initially and then removed later via a delete_partition operation. 

 

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but then the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars 
> s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar
>  \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPartitionDiscovery.parallelism=25000 \
> s3://datalake/jars/hudi-0.10.1/hudi-utilities-bundle_2.12-0.10.1.jar \
> --table-type COPY_ON_WRITE \
> --source-ord

[GitHub] [hudi] manojpec commented on a change in pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-01-27 Thread GitBox


manojpec commented on a change in pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#discussion_r794119507



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieColumnRangeMetadata.java
##
@@ -31,15 +29,13 @@
   private final T minValue;
   private final T maxValue;
   private final long numNulls;
-  private final PrimitiveStringifier stringifier;
 
-  public HoodieColumnRangeMetadata(final String filePath, final String 
columnName, final T minValue, final T maxValue, final long numNulls, final 
PrimitiveStringifier stringifier) {
+  public HoodieColumnRangeMetadata(final String filePath, final String 
columnName, final T minValue, final T maxValue, final long numNulls) {

Review comment:
   I misread. We are good here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483499#comment-17483499
 ] 

sivabalan narayanan commented on HUDI-3335:
---

[~h7kanna]: I assume your partitions are of the format "yyyy/mm/dd", and so the 
partitions are three levels deep. 

Can you check if giving an explicit glob path works? 

for eg:
{code:java}
 val df = spark.
    read.
    format("org.apache.hudi").
    option(HoodieMetadataConfig.ENABLE.key(), "true").
    option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
    load(s"${basePath}/sessions/*/*/*/*") {code}
I assume the above will be slower than not giving an explicit glob 
pattern, but I wanted to rule things out. 

 

Also, can you try the below command in hudi-cli and let us know what you see. 
{code:java}
connect --path basePath
set conf SPARK_MASTER=local[2]
metadata list-partitions {code}

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but then the query takes a very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at 
> org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at 
> org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at 
> org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at 
> org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at 
> org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at 
> org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-s

[jira] [Comment Edited] (HUDI-3335) Loading Hudi table fails with NullPointerException

2022-01-27 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483261#comment-17483261
 ] 

sivabalan narayanan edited comment on HUDI-3335 at 1/28/22, 12:51 AM:
--

CC [~manojpec] [~guoyihua]  metadata related bug

 


was (Author: shivnarayan):
Can you furnish more info for us to triage. 

hoodie write configs used.

hive sync configs used. 

contents of .hoodie

and contents of .hoodie/metadata/.hoodie

 

> Loading Hudi table fails with NullPointerException
> --
>
> Key: HUDI-3335
> URL: https://issues.apache.org/jira/browse/HUDI-3335
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
>Reporter: Harsha Teja Kanna
>Priority: Critical
>
> Have a COW table with metadata enabled. Loading from a Spark query fails with 
> java.lang.NullPointerException
> *Environment*
> Spark 3.1.2
> Hudi 0.10.1
> *Query*
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val basePath = "s3a://datalake-hudi/v1"
>  val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Passing an individual partition works though*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(HoodieMetadataConfig.ENABLE.key(), "true").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/date=2022/01/25")
>  df.createOrReplaceTempView(table)
> *Also, disabling metadata works, but the query taking very long time*
> val df = spark.
>     read.
>     format("org.apache.hudi").
>     option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
>     load(s"${basePath}/sessions/")
>  df.createOrReplaceTempView(table)
> *Loading files with stacktrace:*
>   at org.sparkproject.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>   at org.sparkproject.guava.cache.LocalCache.put(LocalCache.java:4210)
>   at org.sparkproject.guava.cache.LocalCache$LocalManualCache.put(LocalCache.java:4804)
>   at org.apache.spark.sql.execution.datasources.SharedInMemoryCache$$anon$3.putLeafFiles(FileStatusCache.scala:161)
>   at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4(HoodieFileIndex.scala:631)
>   at org.apache.hudi.HoodieFileIndex.$anonfun$loadPartitionPathFiles$4$adapted(HoodieFileIndex.scala:629)
>   at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
>   at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:629)
>   at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
>   at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
>   at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:355)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:325)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:307)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:307)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
>   at $anonfun$res3$1(<console>:46)
>   at $anonfun$res3$1$adapted(<console>:40)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> *Writer config*
> spark-submit \
> --master yarn \
> --deploy-mode cluster \
> --driver-cores 4 \
> --driver-memory 4g \
> --executor-cores 4 \
> --executor-memory 6g \
> --num-executors 8 \
> --jars s3://datalake/jars/unused-1.0.0.jar,s3://datalake/jars/spark-avro_2.12-3.1.2.jar \
> --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.sql.sources.parallelPar

[GitHub] [hudi] alexeykudinkin commented on a change in pull request #4705: [HUDI-3337] Fixing Parquet Column Range metadata extraction

2022-01-27 Thread GitBox


alexeykudinkin commented on a change in pull request #4705:
URL: https://github.com/apache/hudi/pull/4705#discussion_r794116319



##
File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieColumnRangeMetadata.java
##
@@ -31,15 +29,13 @@
   private final T minValue;
   private final T maxValue;
   private final long numNulls;
-  private final PrimitiveStringifier stringifier;
 
-  public HoodieColumnRangeMetadata(final String filePath, final String columnName, final T minValue, final T maxValue, final long numNulls, final PrimitiveStringifier stringifier) {
+  public HoodieColumnRangeMetadata(final String filePath, final String columnName, final T minValue, final T maxValue, final long numNulls) {

Review comment:
   Not sure I follow. I'm actually removing it since it isn't used anywhere.
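   For illustration, constructing the slimmed-down range metadata per the signature in the diff above (all values hypothetical; `T` instantiated as `Integer` here):
   
   ```
   import org.apache.hudi.common.model.HoodieColumnRangeMetadata
   
   // Hypothetical values, matching only the constructor shown in the diff
   val meta = new HoodieColumnRangeMetadata[Integer](
     "s3://bucket/table/file.parquet",  // filePath (hypothetical)
     "session_id",                      // columnName (hypothetical)
     Integer.valueOf(1),                // minValue
     Integer.valueOf(100),              // maxValue
     0L                                 // numNulls
   )
   ```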

##
File path: hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/columnstats/ColumnStatsIndexHelper.java
##
@@ -447,8 +448,8 @@ private static String composeZIndexColName(String col, String statName) {
   new Float(colMetadata.getMaxValue().toString()));
 } else if (colType instanceof BinaryType) {
   return Pair.of(
-  ((Binary) colMetadata.getMinValue()).getBytes(),
-  ((Binary) colMetadata.getMaxValue()).getBytes());
+  ((ByteBuffer) colMetadata.getMinValue()).array(),

Review comment:
   Good catch!
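   A small sketch of the copy semantics behind this change (hedged illustration; `Binary` is Parquet's `org.apache.parquet.io.api.Binary`):
   
   ```
   import org.apache.parquet.io.api.Binary
   
   val bin = Binary.fromString("example")
   val buf = bin.toByteBuffer  // typically wraps the existing bytes (no copy)
   val arr = bin.getBytes      // materializes a byte array (may copy)
   // The diff above then reads the bytes via ((ByteBuffer) ...).array()
   ```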

##
File path: hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
##
@@ -360,24 +361,56 @@ public Boolean apply(String recordKey) {
 
 return new HoodieColumnRangeMetadata(
 one.getFilePath(),
-one.getColumnName(), minValue, maxValue, one.getNumNulls() + another.getNumNulls(), one.getStringifier());
+one.getColumnName(), minValue, maxValue, one.getNumNulls() + another.getNumNulls());
   }
 
   private static Comparable convertToNativeJavaType(PrimitiveType primitiveType, Comparable val) {
 if (primitiveType.getOriginalType() == OriginalType.DECIMAL) {
-  DecimalMetadata decimalMetadata = primitiveType.getDecimalMetadata();
-  return BigDecimal.valueOf((Integer) val, decimalMetadata.getScale());
+  return extractDecimal(val, primitiveType.getDecimalMetadata());
 } else if (primitiveType.getOriginalType() == OriginalType.DATE) {
   // NOTE: This is a workaround to address race-condition in using
   //   {@code SimpleDataFormat} concurrently (w/in {@code DateStringifier})
   // TODO cleanup after Parquet upgrade to 1.12
   synchronized (primitiveType.stringifier()) {
+// Date logical type is implemented as a signed INT32

Review comment:
   It's not yet
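   For reference, DATE in Parquet is the day count since the Unix epoch stored as a signed INT32 (per the LogicalTypes spec referenced in the diff), so a stringifier-free conversion could look like this sketch (an illustration, not the PR's code):
   
   ```
   import java.time.LocalDate
   
   // Convert Parquet's epoch-day INT32 into java.sql.Date
   def dateFromParquetInt(days: Int): java.sql.Date =
     java.sql.Date.valueOf(LocalDate.ofEpochDay(days.toLong))
   
   // e.g. dateFromParquetInt(0) == java.sql.Date.valueOf("1970-01-01")
   ```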

##
File path: hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java
##
@@ -360,24 +361,56 @@ public Boolean apply(String recordKey) {
 
 return new HoodieColumnRangeMetadata(
 one.getFilePath(),
-one.getColumnName(), minValue, maxValue, one.getNumNulls() + another.getNumNulls(), one.getStringifier());
+one.getColumnName(), minValue, maxValue, one.getNumNulls() + another.getNumNulls());
   }
 
   private static Comparable convertToNativeJavaType(PrimitiveType primitiveType, Comparable val) {
 if (primitiveType.getOriginalType() == OriginalType.DECIMAL) {
-  DecimalMetadata decimalMetadata = primitiveType.getDecimalMetadata();
-  return BigDecimal.valueOf((Integer) val, decimalMetadata.getScale());
+  return extractDecimal(val, primitiveType.getDecimalMetadata());
 } else if (primitiveType.getOriginalType() == OriginalType.DATE) {
   // NOTE: This is a workaround to address race-condition in using
   //   {@code SimpleDataFormat} concurrently (w/in {@code DateStringifier})
   // TODO cleanup after Parquet upgrade to 1.12
   synchronized (primitiveType.stringifier()) {
+// Date logical type is implemented as a signed INT32
+// REF: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
 return java.sql.Date.valueOf(
 primitiveType.stringifier().stringify((Integer) val)
 );
   }
+} else if (primitiveType.getOriginalType() == OriginalType.UTF8) {
+  // NOTE: UTF8 type designates a byte array that should be interpreted as a
+  // UTF-8 encoded character string
+  // REF: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md
+  return ((Binary) val).toStringUsingUTF8();
+} else if (primitiveType.getPrimitiveTypeName() == PrimitiveType.PrimitiveTypeName.BINARY) {
+  // NOTE: `getBytes` access makes a copy of the underlying byte buffer
+  return ((Binary) val).toByteBuffer();
 }
 
 return val;
   }
+
+  @Nonnull
+  private static BigDecimal extractDecimal(Object val, DecimalMetadata decimalMetadata) {
+// In Parquet, Decimal could be represented as either of
+//1. INT32 (for 1 <= precision <= 9)
+//2. IN
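The list of representations is cut off above; per the Parquet LogicalTypes spec, DECIMAL can also be backed by INT64 and (fixed-length) byte arrays. A hedged sketch of the underlying idea (illustrative, not the PR's exact code): the stored value is the unscaled integer, and the scale comes from the column's `DecimalMetadata`.

```
import java.math.{BigDecimal, BigInteger}
import org.apache.parquet.io.api.Binary

def toBigDecimal(raw: Any, scale: Int): BigDecimal = raw match {
  case i: java.lang.Integer => BigDecimal.valueOf(i.longValue(), scale) // INT32-backed decimal
  case l: java.lang.Long    => BigDecimal.valueOf(l, scale)             // INT64-backed decimal
  case b: Binary            => // (FIXED_LEN_)BYTE_ARRAY: big-endian two's-complement bytes
    new BigDecimal(new BigInteger(b.getBytes), scale)
  case other                => throw new IllegalArgumentException(s"Unexpected raw decimal value: $other")
}

// e.g. toBigDecimal(12345, 2) yields 123.45
```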

[GitHub] [hudi] alexeykudinkin commented on a change in pull request #3989: [HUDI-2589] RFC-37: Metadata table based bloom index

2022-01-27 Thread GitBox


alexeykudinkin commented on a change in pull request #3989:
URL: https://github.com/apache/hudi/pull/3989#discussion_r794112161



##
File path: rfc/rfc-37/rfc-37.md
##
@@ -0,0 +1,286 @@
+
+# RFC-37: Metadata based Bloom Index
+
+## Proposers
+- @nsivabalan
+- @manojpec
+
+## Approvers
+ - @vinothchandar
+ - @satishkotha
+
+## Status
+JIRA: https://issues.apache.org/jira/browse/HUDI-2703
+
+## Abstract
+Hudi maintains several indices to locate/map incoming records to file groups during writes. The most commonly
+used record index is the HoodieBloomIndex. Large tables and the global index suffer performance issues here,
+as bloom filters from a large number of data files need to be read and looked up. Reading many files over
+cloud object storage like S3 also runs into request throttling. We are proposing to build a new metadata
+index (a metadata table based bloom index) to boost the performance of the existing bloom index.
+
+## Background
+HoodieBloomIndex is used to find the location of incoming records during every write. The bloom index assists
+Hudi in deterministically routing records to a given file group and in distinguishing inserts from updates.
+This aggregate bloom index is built from several bloom filters stored in the base file footers. Prior to the
+bloom filter lookup, file pruning for the incoming records is also done based on the record key min/max stats
+stored in the base file footers. In this RFC, we plan to build a new index for the bloom filters under the
+metadata table to assist in bloom index based record location tagging.
+
+## Design
+HoodieBloomIndex involves the following steps to find the right location of incoming records:
+1. Find all the interested partitions and list all of their data files.
+2. File pruning: Load record key min/max details from all the interested data file footers. Filter files and
+   generate a files-to-keys mapping for the incoming records based on the key ranges, using a range interval
+   tree built from the previously loaded min/max details.
+3. Bloom filter lookup: Filter files and prune the files-to-keys mapping for the incoming keys based on the
+   bloom filter key lookup.
+4. Final lookup in the actual data files to find the right location of every incoming record.
+
+As we can see from steps 2 and 3, we need the min and max values of "_hoodie_record_key" and the bloom filters
+from all interested data files to perform the location tagging. In this design, we will add these key stats and
+bloom filters to the metadata table, and thereby be able to quickly load the interested details and do faster
+lookups.
+
+The metadata table already has one partition, `files`, to help with partition file listing. For the metadata
+table based indices, we are proposing to add the following two new partitions:
+1. `bloom_filter` - for the file level bloom filters
+2. `column_stats` - for the key range stats
+
+Why the metadata table:
+The metadata table uses the HBase HFile format - a map file format - to store and retrieve data. HFile is an
+indexed file format and supports fast, map-like lookups by key. Since we will be storing stats/bloom filters
+for every file and the index will do lookups by file, we should be able to benefit from the faster lookups in
+HFile.
+
+
+
+The following sections describe the different partitions and key formats, and then dive into the data and
+control flows.
+
+### MetaIndex/BloomFilter:
+
+A new partition `bloom_filter` will be added under the metadata table. Bloom filters from all the base files in
+the data table will be added here. The metadata table is already in the HFile format. The existing metadata
+payload schema will be extended and shared for this partition as well. The type field will be used to detect
+bloom filter payload records. Here is the schema for the bloom filter payload record:
+```
+{
+  "doc": "Metadata about base file bloom filters",
+  "name": "BloomFilterMetadata",
+  "type": [
+    "null",
+    {
+      "doc": "Base FileID and its BloomFilter details",
+      "name": "HoodieMetadataBloomFilter",
+      "type": "record",
+      "fields": [
+        {
+          "doc": "Version/type of the bloom filter metadata",
+          "name": "version",
+          "type": "string"
+        },
+        {
+          "doc": "Instant timestamp when this metadata was created/updated",
+          "name": "timestamp",
+          "type": "string"
+        },
+        {
+          "doc": "Bloom filter binary byte array",
+          "name": "bloomfilter",
+          "type": "bytes"
+        },
+        {
+          "doc": "T
