[GitHub] [hudi] SteNicholas removed a comment on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation
SteNicholas removed a comment on pull request #2111: URL: https://github.com/apache/hudi/pull/2111#issuecomment-700524082 > According to this test (https://github.com/apache/hudi/issues/2051), I can't get the results I want. When we set different values for hoodie.parquet.small.file.limit, the results are still different. @linshan-ma Could you please provide this test again? I couldn't visit the test you mentioned. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation
SteNicholas commented on pull request #2111: URL: https://github.com/apache/hudi/pull/2111#issuecomment-702550693 @linshan-ma You could use the current commit to check your test case again. IMO, the current commit has already resolved your problem. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] SteNicholas removed a comment on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation
SteNicholas removed a comment on pull request #2111: URL: https://github.com/apache/hudi/pull/2111#issuecomment-701890658 > > > According to this test (https://github.com/apache/hudi/issues/2051), I can't get the results I want. When we set different values for hoodie.parquet.small.file.limit, the results are still different. > > > > > > @linshan-ma Could you please provide this test again? I couldn't visit the test you mentioned. > > @SteNicholas hi, this issue: #2051 @linshan-ma You could use the latest commit to check your test case. IMO, the latest commit has already solved your problem. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
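For anyone trying to reproduce the test discussed in this thread, the following is a minimal sketch of a Spark datasource write that exercises the insert operation with hoodie.parquet.small.file.limit set explicitly. The table name, field names and paths are placeholders and are not taken from issue #2051.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SmallFileLimitTest {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hudi-small-file-limit-test")
        .master("local[2]")
        .getOrCreate();

    // Hypothetical input frame; replace with the dataset used in the original test.
    Dataset<Row> df = spark.read().json("/tmp/input.json");

    df.write().format("org.apache.hudi")
        // "insert" rather than "upsert": HUDI-1234 is about whether insert
        // should pack new records into existing small files or ignore them.
        .option("hoodie.datasource.write.operation", "insert")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.partitionpath.field", "partition")
        .option("hoodie.datasource.write.precombine.field", "ts")
        .option("hoodie.table.name", "small_file_test")
        // The config under test; files below this size (bytes) are treated as
        // small and considered for bin-packing. Setting 0 disables small-file handling.
        .option("hoodie.parquet.small.file.limit", "104857600")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/small_file_test");
  }
}
```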
[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason edited a comment on pull request #2064: URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968 Remaining work items:
- [x] 1. Support for rollbacks in MOR Table
- [x] 2. Rollback of metadata if commit eventually fails on dataset
- [x] 3. HUDI-CLI extensions for metadata debugging
- [x] 4. Ensure partial rollbacks do not use metadata table as it does not contain partial info
- [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have older timestamp than INIT timestamp on metadata table
- [ ] 6. Check if MergedBlockReader will neglect log blocks based on uncommitted commits.
- [x] 7. Unit test for rollback of partial commits
- [x] 8. Schema evolution strategy for metadata table
- [x] 9. Unit test for marker based rollback
- [x] 10. Can all compaction strategies work off of metadata table itself? Does it have all the data
- [ ] 11. Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer
- [ ] 12. Fix the case when the table is non-partitioned
- [ ] 13. Test for Async cases
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-1312) Query side use of Metadata Table
Prashant Wason created HUDI-1312: Summary: Query side use of Metadata Table Key: HUDI-1312 URL: https://issues.apache.org/jira/browse/HUDI-1312 Project: Apache Hudi Issue Type: New Feature Reporter: Prashant Wason Add support for opening Metadata Table on the query side and using it for eliminating file listings. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-970) HoodieTableFileSystem implementation to back API's using consolidated metadata
[ https://issues.apache.org/jira/browse/HUDI-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-970: Status: Open (was: New) > HoodieTableFileSystem implementation to back API's using consolidated metadata > -- > > Key: HUDI-970 > URL: https://issues.apache.org/jira/browse/HUDI-970 > Project: Apache Hudi > Issue Type: Sub-task > Components: Common Core >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-970) HoodieTableFileSystem implementation to back API's using consolidated metadata
[ https://issues.apache.org/jira/browse/HUDI-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason closed HUDI-970. --- Resolution: Fixed > HoodieTableFileSystem implementation to back API's using consolidated metadata > -- > > Key: HUDI-970 > URL: https://issues.apache.org/jira/browse/HUDI-970 > Project: Apache Hudi > Issue Type: Sub-task > Components: Common Core >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-969) Implement compaction strategies for consolidated metadata table
[ https://issues.apache.org/jira/browse/HUDI-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason closed HUDI-969. --- Resolution: Invalid > Implement compaction strategies for consolidated metadata table > --- > > Key: HUDI-969 > URL: https://issues.apache.org/jira/browse/HUDI-969 > Project: Apache Hudi > Issue Type: Sub-task > Components: Compaction >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-968) Creation of first base/snapshot metadata (similar to onboarding/bootstrapping)
[ https://issues.apache.org/jira/browse/HUDI-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-968: Status: Open (was: New) > Creation of first base/snapshot metadata (similar to onboarding/bootstrapping) > -- > > Key: HUDI-968 > URL: https://issues.apache.org/jira/browse/HUDI-968 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-969) Implement compaction strategies for consolidated metadata table
[ https://issues.apache.org/jira/browse/HUDI-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-969: Status: Open (was: New) > Implement compaction strategies for consolidated metadata table > --- > > Key: HUDI-969 > URL: https://issues.apache.org/jira/browse/HUDI-969 > Project: Apache Hudi > Issue Type: Sub-task > Components: Compaction >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (HUDI-968) Creation of first base/snapshot metadata (similar to onboarding/bootstrapping)
[ https://issues.apache.org/jira/browse/HUDI-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason closed HUDI-968. --- Resolution: Fixed > Creation of first base/snapshot metadata (similar to onboarding/bootstrapping) > -- > > Key: HUDI-968 > URL: https://issues.apache.org/jira/browse/HUDI-968 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason edited a comment on pull request #2064: URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968 Remaining work items:
- [x] 1. Support for rollbacks in MOR Table
- [x] 2. Rollback of metadata if commit eventually fails on dataset
- [x] 3. HUDI-CLI extensions for metadata debugging
- [x] 4. Ensure partial rollbacks do not use metadata table as it does not contain partial info
- [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have older timestamp than INIT timestamp on metadata table
- [ ] 6. Check if MergedBlockReader will neglect log blocks based on uncommitted commits.
- [x] 7. Unit test for rollback of partial commits
- [x] 8. Schema evolution strategy for metadata table
- [x] 9. Unit test for marker based rollback
- [x] 10. Can all compaction strategies work off of metadata table itself? Does it have all the data
- [ ] 11. Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer
- [ ] 12. Query-side use of metadata table
- [ ] 13. How we are going to add new metadata partitions in the background, as writers/cleaner/compactors keep running.
- [ ] 14. Fix the case when the table is non-partitioned
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason edited a comment on pull request #2064: URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968 Remaining work items:
- [x] 1. Support for rollbacks in MOR Table
- [x] 2. Rollback of metadata if commit eventually fails on dataset
- [x] 3. HUDI-CLI extensions for metadata debugging
- [ ] 4. Ensure partial rollbacks do not use metadata table as it does not contain partial info
- [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have older timestamp than INIT timestamp on metadata table
- [ ] 6. Check if MergedBlockReader will neglect log blocks based on uncommitted commits.
- [x] 7. Unit test for rollback of partial commits
- [x] 8. Schema evolution strategy for metadata table
- [x] 9. Unit test for marker based rollback
- [x] 10. Can all compaction strategies work off of metadata table itself? Does it have all the data
- [ ] 11. Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer
- [ ] 12. Query-side use of metadata table
- [ ] 13. How we are going to add new metadata partitions in the background, as writers/cleaner/compactors keep running.
- [ ] 14. Fix the case when the table is non-partitioned
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on pull request #1929: [HUDI-1160] Support update partial fields for CoW table
leesf commented on pull request #1929: URL: https://github.com/apache/hudi/pull/1929#issuecomment-702500795 > @leesf Any update? Let me know if you need any help here ack, will update the PR ASAP This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on pull request #2082: [WIP] hudi cluster write path poc
leesf commented on pull request #2082: URL: https://github.com/apache/hudi/pull/2082#issuecomment-702494333 > @leesf #2048 is landed. is it possible to merge this and address Balaji's comments? (I can help if needed) Sure, considering I am a little busy these days, it is wonderful if you would take over the PR and land it. Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf edited a comment on pull request #2082: [WIP] hudi cluster write path poc
leesf edited a comment on pull request #2082: URL: https://github.com/apache/hudi/pull/2082#issuecomment-702494333 > @leesf #2048 is landed. is it possible to merge this and address Balaji's comments? (I can help if needed) Sure, considering I am a little busy these days, it is wonderful if you @satishkotha would take over the PR and land it. Thanks This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason commented on a change in pull request #2064: URL: https://github.com/apache/hudi/pull/2064#discussion_r498587425 ## File path: hudi-client/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java ## @@ -0,0 +1,227 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.metadata; + +import org.apache.hudi.common.model.HoodieKey; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.exception.HoodieMetadataException; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; +import org.apache.avro.Schema; +import org.apache.avro.generic.GenericData; +import org.apache.avro.generic.GenericRecord; +import org.apache.avro.generic.IndexedRecord; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.Path; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +/** + * This is a payload which saves information about a single entry in the Metadata Table. The type of the entry is + * determined by the "type" saved within the record. The following types of entries are saved: + * + * 1. List of partitions: There is a single such record + * key="__all_partitions__" + * filenameToSizeMap={"2020/01/01": 0, "2020/01/02": 0, ...} + * + * 2. List of files in a Partition: There is one such record for each partition + * key=Partition name + * filenameToSizeMap={"file1.parquet": 12345, "file2.parquet": 56789, "file1.log": 9876, + *"file0.parquet": -1, ...} + * + * For deleted files, -1 is used as the size. + * + * During compaction on the table, the deletions are merged with additions and hence pruned. + */ +public class HoodieMetadataPayload implements HoodieRecordPayload { + private static final Logger LOG = LogManager.getLogger(HoodieMetadataPayload.class); + + // Represents the size stored for a deleted file + private static final long DELETED_FILE_SIZE = -1; + + // Key and type for the metadata record + private final String metadataKey; + private final PayloadType type; + + // Filenames which are part of this record + // key=filename, value=file size (or DELETED_FILE_SIZE to represent a deleted file) + private final Map filenameMap = new HashMap<>(); + + // Type of the metadata record + public enum PayloadType { +PARTITION_LIST(1),// list of partitions +PARTITION_FILES(2); // list of files in a partition + +private final int value; Review comment: I have changed the schema. PTAL. This is an automated message from the Apache Git Service. 
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
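The payload documented in the diff above keys file listings by partition and marks deletions with a size of -1 so that compaction can prune them against later additions. The following is a small, self-contained sketch of that merge rule only; it is not the HoodieMetadataPayload class from the PR, and the class and method names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative stand-in for the merge rule described above: each record maps
 * file names to sizes, a size of -1 marks a deleted file, and merging a newer
 * record over an older one lets deletions cancel earlier additions.
 */
public class FileListingMergeSketch {
  private static final long DELETED_FILE_SIZE = -1L;

  /** Merge "newer" on top of "older", dropping files whose latest entry is a delete. */
  public static Map<String, Long> merge(Map<String, Long> older, Map<String, Long> newer) {
    Map<String, Long> combined = new HashMap<>(older);
    newer.forEach((file, size) -> {
      if (size == DELETED_FILE_SIZE) {
        combined.remove(file);    // deletion prunes an earlier addition
      } else {
        combined.put(file, size); // addition or size update wins
      }
    });
    return combined;
  }

  public static void main(String[] args) {
    Map<String, Long> commit1 = new HashMap<>();
    commit1.put("file1.parquet", 12345L);
    commit1.put("file1.log", 9876L);

    Map<String, Long> commit2 = new HashMap<>();
    commit2.put("file2.parquet", 56789L);
    commit2.put("file1.log", DELETED_FILE_SIZE); // file1.log was removed by a later commit

    // Prints {file1.parquet=12345, file2.parquet=56789}
    System.out.println(merge(commit1, commit2));
  }
}
```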
[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason commented on a change in pull request #2064: URL: https://github.com/apache/hudi/pull/2064#discussion_r498587284 ## File path: hudi-client/src/main/resources/metadataSchema.txt ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +{ +"namespace": "hudi.metadata", +"type": "record", +"name": "metadata", +"fields": [ +{ +"name": "key", +"type": "string" +}, +{ +"name": "type", +"type": "int", +"doc": "Type of the metadata record (refer to HoodieMetadataPayload)" +}, +{ "name": "filenameToSizeMap", +"type": { +"type": "map", +"doc": "Filenames mapped to their sizes", +"values": { +"type": "long", +"doc": "Size of this file in bytes or -1 for deleted files" Review comment: I have changed the schema. ## File path: hudi-client/src/main/resources/metadataSchema.txt ## @@ -0,0 +1,43 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +{ +"namespace": "hudi.metadata", +"type": "record", +"name": "metadata", Review comment: I have changed the schema. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
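For reference, the schema shown in this diff (the review thread notes it was later revised) can be exercised directly with plain Avro. The key and type values below follow the PARTITION_FILES convention described in HoodieMetadataPayload and are illustrative only.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import java.util.HashMap;
import java.util.Map;

public class MetadataSchemaExample {
  // The schema as shown in the diff above (pre-revision), doc strings trimmed for brevity.
  private static final String SCHEMA_JSON =
      "{\"namespace\": \"hudi.metadata\", \"type\": \"record\", \"name\": \"metadata\","
      + " \"fields\": ["
      + "   {\"name\": \"key\", \"type\": \"string\"},"
      + "   {\"name\": \"type\", \"type\": \"int\"},"
      + "   {\"name\": \"filenameToSizeMap\", \"type\": {\"type\": \"map\", \"values\": \"long\"}}"
      + " ]}";

  public static void main(String[] args) {
    Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

    Map<String, Long> files = new HashMap<>();
    files.put("file1.parquet", 12345L);
    files.put("file1.log", -1L); // -1 marks a deleted file, per the payload doc string

    GenericRecord record = new GenericData.Record(schema);
    record.put("key", "2020/01/01"); // partition name for a files-in-partition entry
    record.put("type", 2);           // PARTITION_FILES, per HoodieMetadataPayload
    record.put("filenameToSizeMap", files);
    System.out.println(record);
  }
}
```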
[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason edited a comment on pull request #2064: URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968 Remaining work items:
- [x] 1. Support for rollbacks in MOR Table
- [ ] 2. Rollback of metadata if commit eventually fails on dataset
- [x] 3. HUDI-CLI extensions for metadata debugging
- [ ] 4. Ensure partial rollbacks do not use metadata table as it does not contain partial info
- [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have older timestamp than INIT timestamp on metadata table
- [ ] 6. Check if MergedBlockReader will neglect log blocks based on uncommitted commits.
- [x] 7. Unit test for rollback of partial commits
- [x] 8. Schema evolution strategy for metadata table
- [x] 9. Unit test for marker based rollback
- [x] 10. Can all compaction strategies work off of metadata table itself? Does it have all the data
- [ ] 11. Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer
- [ ] 12. Query-side use of metadata table
- [ ] 13. How we are going to add new metadata partitions in the background, as writers/cleaner/compactors keep running.
- [ ] 14. Fix the case when the table is non-partitioned
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-1289) Using hbase index in spark hangs in Hudi 0.6.0
[ https://issues.apache.org/jira/browse/HUDI-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205913#comment-17205913 ] Vinoth Chandar commented on HUDI-1289: -- Great! Given how h base and guava are notorious for class mismatch hell,I'd prefer that we shade these if its doable at the cost of having to set the listener hard coded). if shading does not work, then we can go with the working combination that you have tested without shading. By shading, I mean relocating the package. > Using hbase index in spark hangs in Hudi 0.6.0 > -- > > Key: HUDI-1289 > URL: https://issues.apache.org/jira/browse/HUDI-1289 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ryan Pifer >Priority: Major > Fix For: 0.6.1 > > > In Hudi 0.6.0 I can see that there was a change to shade the hbase > dependencies in hudi-spark-bundle jar. When using HBASE index with only > hudi-spark-bundle jar specified in spark session there are several issues: > > # Dependencies are not being correctly resolved: > Hbase default status listener class value is defined by the class name before > relocation > {code:java} > Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class > org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not > org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2427) at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.(ConnectionManager.java:656) > ... 39 moreCaused by: java.lang.RuntimeException: class > org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not > org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2421) ... > 40 more{code} > > [https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClusterStatusListener.java#L72-L73] > > This can be fixed by overriding the status listener class in the hbase > configuration used in hudi > {code:java} > hbaseConfig.set("hbase.status.listener.class", > "org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener"){code} > [https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java#L134] > > 2. 
After modifying the above, executors hang when trying to connect to hbase > and fail after about 45 minutes > {code:java} > Caused by: > org.apache.hudi.org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed after attempts=36, exceptions:Thu Sep 17 23:59:42 UTC 2020, null, > java.net.SocketTimeoutException: callTimeout=6, callDuration=68536: row > 'hudiindex,12345678,99' on table 'hbase:meta' at > region=hbase:meta,,1.1588230740, > hostname=ip-10-81-236-56.ec2.internal,16020,1600130997457, seqNum=0 > at > org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60) > at > org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:212) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:186) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1275) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1181) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1165) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1122) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:957) > at > org.apache.hudi.org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:83) > at > org.apache.hudi.org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:75) > at > org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR
[GitHub] [hudi] bschell commented on a change in pull request #2129: [HUDI-1302] Add support for timestamp field in HiveSync
bschell commented on a change in pull request #2129: URL: https://github.com/apache/hudi/pull/2129#discussion_r498571592 ## File path: hudi-sync/hudi-dla-sync/src/main/java/org/apache/hudi/dla/DLASyncConfig.java ## @@ -68,6 +68,9 @@ @Parameter(names = {"--help", "-h"}, help = true) public Boolean help = false; + @Parameter(names = {"--support-timestamp"}, description = "If true, converts int64(timestamp_micros) to timestamp type") + public Boolean supportTimestamp = false; Review comment: I think we need to add this option into DataSourceOptions, DataSourceUtils, and HoodieSparkSqlWriter something like? "hoodie.datasource.hive_sync.support_timestamp" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
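Assuming the option is wired through the Spark datasource layer as suggested, the following is a hedged sketch of how a writer might pass it. Note that `hoodie.datasource.hive_sync.support_timestamp` is only the key proposed in this review comment, not an established config at this point; the other hive-sync options shown are existing keys.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class HiveSyncTimestampExample {
  // Sketch only: assumes the proposed option key has been plumbed through
  // DataSourceOptions / HoodieSparkSqlWriter as discussed above.
  public static void write(Dataset<Row> df, String basePath) {
    df.write().format("org.apache.hudi")
        .option("hoodie.table.name", "ts_table")
        .option("hoodie.datasource.write.recordkey.field", "uuid")
        .option("hoodie.datasource.write.precombine.field", "ts")
        // Existing hive-sync options
        .option("hoodie.datasource.hive_sync.enable", "true")
        .option("hoodie.datasource.hive_sync.database", "default")
        .option("hoodie.datasource.hive_sync.table", "ts_table")
        // Proposed flag: sync int64(timestamp_micros) columns as Hive TIMESTAMP
        .option("hoodie.datasource.hive_sync.support_timestamp", "true")
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
```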
[jira] [Commented] (HUDI-1289) Using hbase index in spark hangs in Hudi 0.6.0
[ https://issues.apache.org/jira/browse/HUDI-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205855#comment-17205855 ] Ryan Pifer commented on HUDI-1289: -- [~vinoth] I was able to surface the issue. Seems codec package is shaded in bundle but not included as part of bundle. Because of this, hbase references shaded pattern of codec but codec dependency is brought in by spark so class names are unchanged. By including these in bundle but not shading I am able to successfully use hbase index with hudi-spark-bundle jar. I will create a PR for this. Question is do we want to shade hbase dependencies still? We can include codec as part of bundle and continue to shade all. However, this would require still setting status listener class > Using hbase index in spark hangs in Hudi 0.6.0 > -- > > Key: HUDI-1289 > URL: https://issues.apache.org/jira/browse/HUDI-1289 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ryan Pifer >Priority: Major > Fix For: 0.6.1 > > > In Hudi 0.6.0 I can see that there was a change to shade the hbase > dependencies in hudi-spark-bundle jar. When using HBASE index with only > hudi-spark-bundle jar specified in spark session there are several issues: > > # Dependencies are not being correctly resolved: > Hbase default status listener class value is defined by the class name before > relocation > {code:java} > Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class > org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not > org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2427) at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.(ConnectionManager.java:656) > ... 39 moreCaused by: java.lang.RuntimeException: class > org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not > org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener > at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2421) ... > 40 more{code} > > [https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClusterStatusListener.java#L72-L73] > > This can be fixed by overriding the status listener class in the hbase > configuration used in hudi > {code:java} > hbaseConfig.set("hbase.status.listener.class", > "org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener"){code} > [https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java#L134] > > 2. 
After modifying the above, executors hang when trying to connect to hbase > and fail after about 45 minutes > {code:java} > Caused by: > org.apache.hudi.org.apache.hadoop.hbase.client.RetriesExhaustedException: > Failed after attempts=36, exceptions:Thu Sep 17 23:59:42 UTC 2020, null, > java.net.SocketTimeoutException: callTimeout=6, callDuration=68536: row > 'hudiindex,12345678,99' on table 'hbase:meta' at > region=hbase:meta,,1.1588230740, > hostname=ip-10-81-236-56.ec2.internal,16020,1600130997457, seqNum=0 > at > org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60) > at > org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:212) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:186) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1275) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1181) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1165) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1122) > at > org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:957) > at > org.apache.hudi.org.apache.hadoop.hbase.client.HRe
[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702420836 > @wangxianghu Just merged! Thanks again for the herculean effort. > > May be some followups could pop up. Would you be interested in taking them up? if so, I ll mention you along the way sure, just ping me when needed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
vinothchandar commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702407275 @wangxianghu Just merged! Thanks again for the herculean effort. May be some followups could pop up. Would you be interested in taking them up? if so, I ll mention you along the way This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar merged pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
vinothchandar merged pull request #1827: URL: https://github.com/apache/hudi/pull/1827 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] satishkotha commented on pull request #2082: [WIP] hudi cluster write path poc
satishkotha commented on pull request #2082: URL: https://github.com/apache/hudi/pull/2082#issuecomment-702405034 @leesf #2048 is landed. is it possible to merge this and address Balaji's comments? (I can help if needed) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] satishkotha commented on pull request #1929: [HUDI-1160] Support update partial fields for CoW table
satishkotha commented on pull request #1929: URL: https://github.com/apache/hudi/pull/1929#issuecomment-702404415 @leesf Any update? Let me know if you need any help here This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
vinothchandar commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702335492 @wangxianghu duh ofc. I understand now. Thanks for jumping in @wangxianghu ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (HUDI-89) Clean up placement, naming, defaults of HoodieWriteConfig
[ https://issues.apache.org/jira/browse/HUDI-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205751#comment-17205751 ] sivabalan narayanan edited comment on HUDI-89 at 10/1/20, 6:41 PM: --- Sorry, I was busy for the last few weeks. Here is my understanding. I don't have full context around moving configs to the right classes. I need some time to look into CompactionConfig, StorageConfig, HoodieCleanConfig. But wrt revamping configs in general, here is the idea. As of now, config management is naive. Let's say we want to add a new config: we add a key string to HoodieWriteConfig, add a default, expose a getter and setter with the builder pattern, call into setting up defaults for properties not set, and then build the HoodieWriteConfig. We wish to introduce a class called ConfigOption (source: [1] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOption.java, [2] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/Configuration.java and [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOptions.java). We are not looking for a full-fledged ConfigOption (which includes fallback keys and such), just key, value, defaultValue and description for now. We can iteratively add more features. For example, there was some prep work done earlier in this regard: https://github.com/apache/hudi/pull/1094/files. By this, we can bind a key, default value and description for every config value together. With this, the default value is maintained along with the actual config in ConfigOption, so get() returns the actual value if set and the default value otherwise. Also, the description will come in handy when we want to generate release docs. And btw, with this change, we also want to rename HoodieWriteConfig to HoodieClientConfig. [~vinoth]: I understand we don't want to do a complete overhaul which involves changes to the way users set the properties. So, may I know how we go about populating ConfigOptions from a map of properties or from a property file? In other words, how do we infer the value type from the property? Or am I missing something on what changes we need to make. was (Author: shivnarayan): [previous revision of the same comment, truncated in the original archive]
[jira] [Commented] (HUDI-89) Clean up placement, naming, defaults of HoodieWriteConfig
[ https://issues.apache.org/jira/browse/HUDI-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205751#comment-17205751 ] sivabalan narayanan commented on HUDI-89: - Sorry, I was busy for the last few weeks. Here is my understanding. I don't have full context around moving configs to the right classes. I need some time to look into CompactionConfig, StorageConfig, HoodieCleanConfig. But wrt revamping configs in general, here is the idea. As of now, config management is naive. Let's say we want to add a new config: we add a key string to HoodieWriteConfig, add a default, expose a getter and setter with the builder pattern, call into setting up defaults for properties not set, and then build the HoodieWriteConfig. We wish to introduce a class called ConfigOption (source: [1] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOption.java, [2] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/Configuration.java and [3] https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOptions.java). We are not looking for a full-fledged ConfigOption (which includes fallback keys and such), just key, value, defaultValue and description for now. We can iteratively add more features. For example, there was some prep work done earlier in this regard: https://github.com/apache/hudi/pull/1094/files. By this, we can bind a key, default value and description for every config value together. With this, the default value is maintained along with the actual config in ConfigOption, so get() returns the actual value if set and the default value otherwise. Also, the description will come in handy when we want to generate release docs. And btw, with this change, we also want to rename HoodieWriteConfig to HoodieClientConfig. [~vinoth]: I understand we don't want to do a complete overhaul which involves changes to the way users set the properties. So, may I know how we go about populating ConfigOptions from a map of properties or from a property file? In other words, how do we infer the value type from the property? Or am I missing something on what changes we need to make. > Clean up placement, naming, defaults of HoodieWriteConfig > - > > Key: HUDI-89 > URL: https://issues.apache.org/jira/browse/HUDI-89 > Project: Apache Hudi > Issue Type: Improvement > Components: Code Cleanup, Usability, Writer Core > Reporter: Vinoth Chandar > Assignee: Vinoth Chandar > Priority: Major > > # Rename HoodieWriteConfig to HoodieClientConfig > # Move bunch of configs from CompactionConfig to StorageConfig > # Introduce new HoodieCleanConfig > # Should we consider lombok or something to automate the defaults/getters/setters > # Consistent name of properties/defaults > # Enforce bounds more strictly -- This message was sent by Atlassian Jira (v8.3.4#803005)
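A minimal sketch of the ConfigOption idea described above: a key, a default value and a description bound together, with lookup falling back to the default when the property is unset, resolved against a plain java.util.Properties bag. The class shape and names are illustrative and not Hudi's eventual implementation.

{code:java}
import java.util.Properties;

public class ConfigOption<T> {
  private final String key;
  private final T defaultValue;
  private final String description;

  private ConfigOption(String key, T defaultValue, String description) {
    this.key = key;
    this.defaultValue = defaultValue;
    this.description = description;
  }

  public static <T> ConfigOption<T> key(String key, T defaultValue, String description) {
    return new ConfigOption<>(key, defaultValue, description);
  }

  public String key() { return key; }
  public T defaultValue() { return defaultValue; }
  public String description() { return description; }

  /** Resolve against a Properties bag, falling back to the default when unset. */
  @SuppressWarnings("unchecked")
  public T from(Properties props) {
    String raw = props.getProperty(key);
    if (raw == null) {
      return defaultValue;
    }
    // Interpret the string using the default value's type; one possible answer to
    // the question above about recovering the value type from a property file.
    if (defaultValue instanceof Integer) return (T) Integer.valueOf(raw);
    if (defaultValue instanceof Long) return (T) Long.valueOf(raw);
    if (defaultValue instanceof Boolean) return (T) Boolean.valueOf(raw);
    return (T) raw;
  }

  public static void main(String[] args) {
    ConfigOption<Integer> smallFileLimit =
        key("hoodie.parquet.small.file.limit", 104857600, "Small file size threshold in bytes");
    Properties props = new Properties(); // nothing set, so the default is returned
    System.out.println(smallFileLimit.from(props)); // 104857600
  }
}
{code}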
[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702318528 @vinothchandar The warn log issue is fixed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702308385 > > @wangxianghu can you please test the latest commit. To be clear, you are saying you don't get the warning on master, but get it on this branch. right? > > if this round of tests pass, and you confirm, we can land from my perspective > > Hi @vinothchandar The warn log is still there in HUDI-1089 branch (master is ok, no warn log) > I think we should check `embeddedTimelineServiceHostAddr` instead of `hostAddr`. > > ``` > private void setHostAddr(String embeddedTimelineServiceHostAddr) { >// here we should check embeddedTimelineServiceHostAddr instead of hostAddr > if (hostAddr != null) { > LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr); > this.hostAddr = embeddedTimelineServiceHostAddr; > } else { > LOG.warn("Unable to find driver bind address from spark config"); > this.hostAddr = NetworkUtils.getHostname(); > } > } > ``` I have tested the latest commit with the check condition changed to `if (embeddedTimelineServiceHostAddr != null)`. It runs well in my local, and the warn log disappeared. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu edited a comment on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702302083 > @wangxianghu can you please test the latest commit. To be clear, you are saying you don't get the warning on master, but get it on this branch. right? > > if this round of tests pass, and you confirm, we can land from my perspective Hi @vinothchandar The warn log is still there in HUDI-1089 branch.(master is ok, no warn log) I think we should check `embeddedTimelineServiceHostAddr` instead of `hostAddr`. ``` private void setHostAddr(String embeddedTimelineServiceHostAddr) { // here we should check embeddedTimelineServiceHostAddr instead of hostAddr if (hostAddr != null) { LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr); this.hostAddr = embeddedTimelineServiceHostAddr; } else { LOG.warn("Unable to find driver bind address from spark config"); this.hostAddr = NetworkUtils.getHostname(); } } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702302083 > @wangxianghu can you please test the latest commit. To be clear, you are saying you don't get the warning on master, but get it on this branch. right? > > if this round of tests pass, and you confirm, we can land from my perspective Hi @vinothchandar. > @wangxianghu can you please test the latest commit. To be clear, you are saying you don't get the warning on master, but get it on this branch. right? > > if this round of tests pass, and you confirm, we can land from my perspective Hi @vinothchandar The warn log is still there in HUDI-1089 branch.(master is ok, no warn log) I think we should check `embeddedTimelineServiceHostAddr` instead of `hostAddr`. ``` private void setHostAddr(String embeddedTimelineServiceHostAddr) { // here we should check embeddedTimelineServiceHostAddr instead of hostAddr, hostAddr is always null if (hostAddr != null) { LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr); this.hostAddr = embeddedTimelineServiceHostAddr; } else { LOG.warn("Unable to find driver bind address from spark config"); this.hostAddr = NetworkUtils.getHostname(); } } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu edited a comment on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702302083 > @wangxianghu can you please test the latest commit. To be clear, you are saying you don't get the warning on master, but get it on this branch. right? > > if this round of tests pass, and you confirm, we can land from my perspective Hi @vinothchandar The warn log is still there in HUDI-1089 branch.(master is ok, no warn log) I think we should check `embeddedTimelineServiceHostAddr` instead of `hostAddr`. ``` private void setHostAddr(String embeddedTimelineServiceHostAddr) { // here we should check embeddedTimelineServiceHostAddr instead of hostAddr, hostAddr is always null if (hostAddr != null) { LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr); this.hostAddr = embeddedTimelineServiceHostAddr; } else { LOG.warn("Unable to find driver bind address from spark config"); this.hostAddr = NetworkUtils.getHostname(); } } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
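Putting the change agreed in this thread together, the corrected method would look roughly as follows. This is a sketch, not the merged code, and it reuses the LOG, hostAddr and NetworkUtils members of the EmbeddedTimelineService class quoted above.

```java
private void setHostAddr(String embeddedTimelineServiceHostAddr) {
  // Check the incoming value from spark-conf, not this.hostAddr (which is still null here).
  if (embeddedTimelineServiceHostAddr != null) {
    LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr
        + ") found in spark-conf. It was " + this.hostAddr);
    this.hostAddr = embeddedTimelineServiceHostAddr;
  } else {
    LOG.warn("Unable to find driver bind address from spark config");
    this.hostAddr = NetworkUtils.getHostname();
  }
}
```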
[GitHub] [hudi] satishkotha commented on pull request #2129: [HUDI-1302] Add support for timestamp field in HiveSync
satishkotha commented on pull request #2129: URL: https://github.com/apache/hudi/pull/2129#issuecomment-702275562 @pratyakshsharma will you be able to review this week? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
vinothchandar commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702248781 @wangxianghu can you please test the latest commit. To be clear, you are saying you don't get the warning on master, but get it on this branch. right? if this round of tests pass, and you confirm, we can land from my perspective This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] tandonraghavs edited a comment on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)
tandonraghavs edited a comment on issue #2131: URL: https://github.com/apache/hudi/issues/2131#issuecomment-702082811 @bvaradar I am using my custom class as the `PAYLOAD_CLASS_OPT_KEY` key -> But the problem is **preCombine** doesn't have a reference to the **Schema** and it gives me bytes, so how do I get the GenericRecord out of it? Which is the reason I am not able to implement any custom logic in _preCombine_ as I did in _combineAndGetUpdateValue_. I am using Hudi via the Spark datasource (0.5.3). And due to the scale of data I don't want to run compaction after every commit, so I am using _INLINE_COMPACT_NUM_DELTA_COMMITS_PROP_ - How do I get hold of the Schema in preCombine? Sample code of my Spark job. **jsonDf** -> This is a simple JSON string which contains the records.
Dataset data = jsonDf.map((MapFunction) record -> generateHoodieRecord(record, schemaStr), Encoders.bean(GenericRecord.class));
Dataset ds = AvroConversionUtils.createDataFrame(data.rdd(), schemaStr, sparkSession);
ds.write().format("org.apache.hudi")
  .options()...
  .mode(SaveMode.Append)
  .save(tablePath);
This ticket also talks about the same - https://issues.apache.org/jira/browse/HUDI-898 - Also, I think we cannot add/remove any field values in preCombine, as doing it manually causes an _EOFException in reading log file._ - Am I missing something here? So, I don't think we can use Hudi if we have partial records coming in for oplogs and we have to apply all the oplogs to the existing dataset. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] tandonraghavs edited a comment on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)
tandonraghavs edited a comment on issue #2131: URL: https://github.com/apache/hudi/issues/2131#issuecomment-702082811 @bvaradar I am using my custom class as the `PAYLOAD_CLASS_OPT_KEY` key -> But the problem is **preCombine** doesn't have a reference to the **Schema** and it gives me bytes, so how do I get the GenericRecord out of it? Which is the reason I am not able to implement any custom logic in _preCombine_ as I did in _combineAndGetUpdateValue_. I am using Hudi via the Spark datasource (0.5.3). And due to the scale of data I don't want to run compaction after every commit, so I am using _INLINE_COMPACT_NUM_DELTA_COMMITS_PROP_ - How do I get hold of the Schema in preCombine? Sample code of my Spark job. **jsonDf** -> This is a simple JSON string which contains the records.
Dataset data = jsonDf.map((MapFunction) record -> generateHoodieRecord(record, schemaStr), Encoders.bean(GenericRecord.class));
Dataset ds = AvroConversionUtils.createDataFrame(data.rdd(), schemaStr, sparkSession);
ds.write().format("org.apache.hudi")
  .options()...
  .mode(SaveMode.Append)
  .save(tablePath);
This ticket also talks about the same - https://issues.apache.org/jira/browse/HUDI-898 Also, I think we cannot add/remove any field values in preCombine, as doing it manually causes an _EOFException in reading log file._ Am I missing something here? So, I don't think we can use Hudi if we have partial records coming in for oplogs and we have to apply all the oplogs to the existing dataset. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
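For context on the preCombine limitation discussed here, the following is a sketch of a custom payload written against the 0.5.x three-method HoodieRecordPayload interface (preCombine, combineAndGetUpdateValue, getInsertValue). The class name and the oplog-handling details are hypothetical, the merge logic is omitted, and the Avro round-trip uses plain Avro APIs rather than Hudi utilities; it only illustrates that the Schema is available in combineAndGetUpdateValue/getInsertValue but not in preCombine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.hudi.common.model.HoodieRecordPayload;
import org.apache.hudi.common.util.Option;

public class OplogMergePayload implements HoodieRecordPayload<OplogMergePayload> {

  private final byte[] recordBytes;     // serialized Avro record
  private final Comparable orderingVal; // e.g. the oplog timestamp

  public OplogMergePayload(GenericRecord record, Comparable orderingVal, Schema schema) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
    encoder.flush();
    this.recordBytes = out.toByteArray();
    this.orderingVal = orderingVal;
  }

  @Override
  public OplogMergePayload preCombine(OplogMergePayload another) {
    // No Schema parameter here: unless the schema (or the already-decoded fields)
    // was stashed in the payload at construction time, the bytes cannot be turned
    // back into a GenericRecord. This sketch just keeps the later oplog entry.
    return another.orderingVal.compareTo(this.orderingVal) > 0 ? another : this;
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema)
      throws IOException {
    // The schema is available here, so partial-update logic can decode the bytes
    // and merge field-by-field with currentValue (the merge itself is omitted).
    GenericRecord incoming = new GenericDatumReader<GenericRecord>(schema)
        .read(null, DecoderFactory.get().binaryDecoder(recordBytes, null));
    return Option.of(incoming);
  }

  @Override
  public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
    return combineAndGetUpdateValue(null, schema);
  }
}
```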
[GitHub] [hudi] n3nash commented on issue #2110: [SUPPORT] Executor memory recommendation
n3nash commented on issue #2110: URL: https://github.com/apache/hudi/issues/2110#issuecomment-702239474 @tooptoop4 No, the size of the existing table should not matter. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu removed a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu removed a comment on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702217863 > @leesf do you see the following exception? could not understand how you ll get the other one even. > > ``` > LOG.info("Starting Timeline service !!"); > Option hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST); > if (!hostAddr.isPresent()) { > throw new HoodieException("Unable to find host address to bind timeline server to."); > } > timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.get(), > config.getClientSpecifiedViewStorageConfig())); > ``` > > Either way, good pointer. the behavior has changed around this a bit actually. So will try and tweak and push a fix I got this warning too. The code here seems not the same. ``` // Run Embedded Timeline Server LOG.info("Starting Timeline service !!"); Option hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST); timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.orElse(null), config.getClientSpecifiedViewStorageConfig())); ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-70008 @vinothchandar @yanghua @leesf The demo runs well locally for me, except for the warning `WARN embedded.EmbeddedTimelineService: Unable to find driver bind address from spark config`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702217863

> @leesf do you see the following exception? Could not understand how you'll get the other one even.
>
> ```
> LOG.info("Starting Timeline service !!");
> Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
> if (!hostAddr.isPresent()) {
>   throw new HoodieException("Unable to find host address to bind timeline server to.");
> }
> timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.get(),
>     config.getClientSpecifiedViewStorageConfig()));
> ```
>
> Either way, good pointer. The behavior has changed around this a bit actually, so I will try to tweak and push a fix.

I got this warning too, but the code here does not seem to be the same:

```
// Run Embedded Timeline Server
LOG.info("Starting Timeline service !!");
Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.orElse(null),
    config.getClientSpecifiedViewStorageConfig()));
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
vinothchandar commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702174407 @leesf do you see the following exception? Could not understand how you'll get the other one even.

```
LOG.info("Starting Timeline service !!");
Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
if (!hostAddr.isPresent()) {
  throw new HoodieException("Unable to find host address to bind timeline server to.");
}
timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.get(),
    config.getClientSpecifiedViewStorageConfig()));
```

Either way, good pointer. The behavior has changed around this a bit actually, so I will try to tweak and push a fix. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
leesf edited a comment on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702125051
1. Ran the quickstart demo and found the warning log `20/10/01 21:11:18 WARN embedded.EmbeddedTimelineService: Unable to find driver bind address from spark config`; everything works fine, but this warning does not appear in 0.6.0. @vinothchandar @wangxianghu
2. Ran my own unit tests; they work fine.
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
leesf commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-702125051 Ran the quickstart demo and found the warning log `20/10/01 21:11:18 WARN embedded.EmbeddedTimelineService: Unable to find driver bind address from spark config`; this warning is not seen in 0.6.0. @vinothchandar @wangxianghu This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
yanghua commented on a change in pull request #1827: URL: https://github.com/apache/hudi/pull/1827#discussion_r498188986

## File path: hudi-cli/pom.xml

```
@@ -148,7 +148,14 @@
     org.apache.hudi
-    hudi-client
+    hudi-client-common
+    ${project.version}
+    test
+    test-jar
+
+
+    org.apache.hudi
+    hudi-spark-client
```

Review comment: So, will we keep `hudi-spark-client`? We already have a `hudi-spark` module. IMHO, this naming may not seem so clear. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
yanghua commented on a change in pull request #1827: URL: https://github.com/apache/hudi/pull/1827#discussion_r498188986

## File path: hudi-cli/pom.xml

```
@@ -148,7 +148,14 @@
     org.apache.hudi
-    hudi-client
+    hudi-client-common
+    ${project.version}
+    test
+    test-jar
+
+
+    org.apache.hudi
+    hudi-spark-client
```

Review comment: So, will we keep `hudi-spark-client`? We already have a `hudi-spark` module. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] tandonraghavs edited a comment on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)
tandonraghavs edited a comment on issue #2131: URL: https://github.com/apache/hudi/issues/2131#issuecomment-702082811 @bvaradar I am using my custom class as the `PAYLOAD_CLASS_OPT_KEY` key, but the problem is that **preCombine** does not have a reference to the **Schema** and only gives me bytes, so how do I get the GenericRecord out of it? That is why I am not able to implement any custom logic in _preCombine_ the way I did in _combineAndGetUpdateValue_. I am using Hudi via the Spark datasource (0.5.3). Due to the scale of the data I don't want to run compaction after every commit, so I am using _INLINE_COMPACT_NUM_DELTA_COMMITS_PROP_. How do I get hold of the Schema in preCombine? Sample code from my Spark job (**jsonDf** is a simple Dataset of JSON strings which contains the records):

```
Dataset<GenericRecord> data = jsonDf.map((MapFunction) record -> generateHoodieRecord(record, schemaStr), Encoders.bean(GenericRecord.class));
Dataset<Row> ds = AvroConversionUtils.createDataFrame(data.rdd(), schemaStr, sparkSession);
ds.write().format("org.apache.hudi")
  .options(...)
  .mode(SaveMode.Append)
  .save(tablePath);
```

This ticket also talks about the same problem - https://issues.apache.org/jira/browse/HUDI-898 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] tandonraghavs commented on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)
tandonraghavs commented on issue #2131: URL: https://github.com/apache/hudi/issues/2131#issuecomment-702082811 @bvaradar I am using my custom class as the `PAYLOAD_CLASS_OPT_KEY` key, but the problem is that **preCombine** does not have a reference to the **Schema** and only gives me bytes, so how do I get the GenericRecord out of it? That is why I am not able to implement any custom logic in _preCombine_ the way I did in _combineAndGetUpdateValue_. I am using Hudi via the Spark datasource (0.5.3). Due to the scale of the data I don't want to run compaction after every commit, so I am using _INLINE_COMPACT_NUM_DELTA_COMMITS_PROP_. How do I get hold of the Schema in preCombine? Sample code from my Spark job (**jsonDf** is a simple Dataset of JSON strings which contains the records):

```
Dataset<GenericRecord> data = jsonDf.map((MapFunction) record -> generateHoodieRecord(record, schemaStr), Encoders.bean(GenericRecord.class));
Dataset<Row> ds = AvroConversionUtils.createDataFrame(data.rdd(), schemaStr, sparkSession);
ds.write().format("org.apache.hudi")
  .options(...)
  .mode(SaveMode.Append)
  .save(tablePath);
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205435#comment-17205435 ] Balaji Varadarajan commented on HUDI-1308: -- cc [~vinoth] > Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > > THis is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1311) Writes creating/updating large number of files seeing errors when deleting marker files in S3
Balaji Varadarajan created HUDI-1311: Summary: Writes creating/updating large number of files seeing errors when deleting marker files in S3 Key: HUDI-1311 URL: https://issues.apache.org/jira/browse/HUDI-1311 Project: Apache Hudi Issue Type: Sub-task Components: Writer Core Reporter: Balaji Varadarajan Don't have the exception trace handy. Will add it when I run into this next time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1310) Corruption Block Handling too slow in S3
Balaji Varadarajan created HUDI-1310: Summary: Corruption Block Handling too slow in S3 Key: HUDI-1310 URL: https://issues.apache.org/jira/browse/HUDI-1310 Project: Apache Hudi Issue Type: Sub-task Components: Writer Core Reporter: Balaji Varadarajan The logic to figure out the next valid starting block offset is too slow when run against S3. I have bolded the log message that takes a long time to appear.

36589 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0}
36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block in file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} with block size(3723305) running past EOF
36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Log HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} has a corrupted block at 14
*44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block in* HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} starts at 3723319
44566 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a corrupt block in s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045

-- This message was sent by Atlassian Jira (v8.3.4#803005)
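HUDI-1310 concerns the forward scan that locates the start of the next readable block after a corrupt one. The sketch below is not Hudi's reader; it only illustrates, assuming a hypothetical MAGIC marker and a generic seekable-input interface, why advancing a byte at a time over an object store like S3 (where small reads are comparatively expensive) can make the "Next available block" log line slow to appear.

```
import java.io.IOException;
import java.util.Arrays;

// Illustrative sketch only (not Hudi's actual reader): scan forward for a hypothetical
// block-start marker through a seekable input abstraction. With small, unbuffered reads
// against S3, each probe is expensive, so scanning a few megabytes takes many seconds.
public final class CorruptBlockScanSketch {
  private static final byte[] MAGIC = {'#', 'B', 'L', 'K'}; // hypothetical marker, not Hudi's

  interface SeekableInput {
    void seek(long pos) throws IOException;
    int read(byte[] buf) throws IOException;
    long length() throws IOException;
  }

  static long nextBlockOffset(SeekableInput in, long from) throws IOException {
    byte[] probe = new byte[MAGIC.length];
    for (long pos = from; pos < in.length() - MAGIC.length; pos++) {
      in.seek(pos);                 // one-byte-step scan: cheap locally, slow over S3
      if (in.read(probe) == MAGIC.length && Arrays.equals(probe, MAGIC)) {
        return pos;                 // found the start of the next readable block
      }
    }
    return in.length();             // no further block; treat the rest of the file as corrupt
  }
}
```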
[jira] [Assigned] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1308: Assignee: Balaji Varadarajan (was: Prashant Wason) > Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > > THis is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted
[ https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1309: Assignee: Prashant Wason > Listing Metadata unreadable in S3 as the log block is deemed corrupted > -- > > Key: HUDI-1309 > URL: https://issues.apache.org/jira/browse/HUDI-1309 > Project: Apache Hudi > Issue Type: Sub-task > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: Prashant Wason >Priority: Major > > When running metadata list-partitions CLI command, I am seeing the below > messages and the partition list is empty. Was expecting 10K partitions. > > 36589 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning > log file > HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} > 36590 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block > in file > HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} with block size(3723305) running past EOF > 36684 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Log > HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} has a corrupted block at 14 > 44515 [Spring Shell] INFO > org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block > in > HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', > fileLen=0} starts at 3723319 > 44566 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a > corrupt block in > s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045 > 44567 [Spring Shell] INFO > org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-1308: Assignee: Prashant Wason > Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Balaji Varadarajan >Assignee: Prashant Wason >Priority: Major > > THis is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted
Balaji Varadarajan created HUDI-1309: Summary: Listing Metadata unreadable in S3 as the log block is deemed corrupted Key: HUDI-1309 URL: https://issues.apache.org/jira/browse/HUDI-1309 Project: Apache Hudi Issue Type: Sub-task Components: Writer Core Reporter: Balaji Varadarajan When running metadata list-partitions CLI command, I am seeing the below messages and the partition list is empty. Was expecting 10K partitions. 36589 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} 36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block in file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} with block size(3723305) running past EOF 36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Log HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} has a corrupted block at 14 44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block in HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} starts at 3723319 44566 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a corrupt block in s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045 44567 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1308) Issues found during testing RFC-15
Balaji Varadarajan created HUDI-1308: Summary: Issues found during testing RFC-15 Key: HUDI-1308 URL: https://issues.apache.org/jira/browse/HUDI-1308 Project: Apache Hudi Issue Type: Improvement Components: Writer Core Reporter: Balaji Varadarajan This is an umbrella ticket containing all the issues found during testing RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1308) Issues found during testing RFC-15
[ https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan updated HUDI-1308: - Status: Open (was: New) > Issues found during testing RFC-15 > -- > > Key: HUDI-1308 > URL: https://issues.apache.org/jira/browse/HUDI-1308 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Balaji Varadarajan >Priority: Major > > THis is an umbrella ticket containing all the issues found during testing > RFC-15 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] bvaradar commented on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)
bvaradar commented on issue #2131: URL: https://github.com/apache/hudi/issues/2131#issuecomment-702072024 @tandonraghavs: For compaction, the payload class defined in hoodie.properties is used for pre-combining. Can you check which payload class is configured in hoodie.properties? As you may have noticed, compaction first pre-combines all delta records before merging them using combineAndGetUpdateValue, so you would need to implement custom merging in the preCombine method of your payload class. IIUC, oplogs are partial row images (updates); wouldn't it be possible to employ the same merging logic that you were using as part of combineAndGetUpdateValue? Also, which version of Hudi are you using? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
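Putting that suggestion into code, below is a rough, illustrative sketch of a payload class whose preCombine can merge partial oplog images. It is not a drop-in Hudi class: the class name and the mergePartial helper are hypothetical, the sketch assumes the payload constructor receives the full GenericRecord (so it can capture the writer schema as a string for later use), and the HoodieRecordPayload method signatures should be checked against the Hudi version in use (0.5.3 in this thread). Plain Avro serialization is used instead of Hudi's internal utilities to keep the example self-contained. The schema string is carried inside the payload precisely because, as noted above, preCombine itself is not handed a Schema.

```
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.hudi.common.model.HoodieRecordPayload;
import org.apache.hudi.common.util.Option;

// Illustrative sketch, not Hudi's API: carries the writer schema captured at construction so
// that preCombine can rebuild GenericRecords from the serialized bytes and apply a custom
// partial-update merge. mergePartial() is a placeholder for the user's oplog-apply logic.
public class OplogMergePayloadSketch implements HoodieRecordPayload<OplogMergePayloadSketch> {

  private final byte[] recordBytes;
  private final Comparable orderingVal;
  private final String schemaStr; // captured so preCombine has a schema to work with

  public OplogMergePayloadSketch(GenericRecord record, Comparable orderingVal) {
    try {
      this.recordBytes = toBytes(record);
    } catch (IOException e) {
      throw new IllegalStateException("Failed to serialize record", e);
    }
    this.orderingVal = orderingVal;
    this.schemaStr = record.getSchema().toString();
  }

  @Override
  public OplogMergePayloadSketch preCombine(OplogMergePayloadSketch another) {
    try {
      Schema schema = new Schema.Parser().parse(schemaStr);
      GenericRecord newer = fromBytes(recordBytes, schema);
      GenericRecord older = fromBytes(another.recordBytes, schema);
      // apply this payload's partial fields on top of the older image
      return new OplogMergePayloadSketch(mergePartial(older, newer), orderingVal);
    } catch (IOException e) {
      return this; // fall back to keeping the newer payload
    }
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException {
    Schema writerSchema = new Schema.Parser().parse(schemaStr);
    IndexedRecord merged = mergePartial((GenericRecord) currentValue, fromBytes(recordBytes, writerSchema));
    return Option.of(merged);
  }

  @Override
  public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
    IndexedRecord record = fromBytes(recordBytes, new Schema.Parser().parse(schemaStr));
    return Option.of(record);
  }

  // placeholder merge: copy non-null fields of 'partial' onto 'base'
  private GenericRecord mergePartial(GenericRecord base, GenericRecord partial) {
    partial.getSchema().getFields().forEach(f -> {
      Object v = partial.get(f.name());
      if (v != null) {
        base.put(f.name(), v);
      }
    });
    return base;
  }

  private static byte[] toBytes(GenericRecord record) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(record.getSchema()).write(record, encoder);
    encoder.flush();
    return out.toByteArray();
  }

  private static GenericRecord fromBytes(byte[] bytes, Schema schema) throws IOException {
    BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
    return new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
  }
}
```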
[GitHub] [hudi] bvaradar commented on issue #2130: [SUPPORT] Use hive jdbc to excute hudi query failed
bvaradar commented on issue #2130: URL: https://github.com/apache/hudi/issues/2130#issuecomment-702062546 @Trevor-zhang: IIUC, this is not specific to Hudi. Did you get the full exception stack trace from the server logs? Are you sure this is due to the hive.input.format setting? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
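Since the question involves running Hudi queries through HiveServer2, a small, hedged sketch of applying the hive.input.format session setting over JDBC may help. The JDBC URL, credentials, and table name below are placeholders, and the InputFormat choice follows the guidance commonly given for querying Hudi MOR realtime (_rt) tables via Hive rather than anything confirmed in this issue.

```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch only: set the input format on the Hive session before querying a Hudi table via JDBC.
// URL, credentials, and table name are placeholders, not values from the original issue.
public class HudiHiveJdbcSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default", "user", "");
         Statement stmt = conn.createStatement()) {
      // commonly recommended for Hudi realtime (_rt) tables when Hive runs queries on MapReduce
      stmt.execute("set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat");
      try (ResultSet rs = stmt.executeQuery("select count(*) from hudi_table_rt")) {
        while (rs.next()) {
          System.out.println(rs.getLong(1));
        }
      }
    }
  }
}
```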
[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
vinothchandar commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-701989092 @wangxianghu Please help test this out if possible. Once the tests pass again, I am planning to merge this in the morning PST. cc @yanghua This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] Karl-WangSK commented on a change in pull request #2106: [HUDI-1284] preCombine all HoodieRecords and update all fields according to orderingVal
Karl-WangSK commented on a change in pull request #2106: URL: https://github.com/apache/hudi/pull/2106#discussion_r498081682

## File path: hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java

```
@@ -186,11 +186,15 @@ protected void rollBackInflightBootstrap() {
    * @return JavaRDD[WriteStatus] - RDD of WriteStatus to inspect errors and counts
    */
   public JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String instantTime) {
+    return upsert(records, instantTime, null);
+  }
+
+  public JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final String instantTime, String schema) {
```

Review comment: ok This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
vinothchandar commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-701963298 I actually figured out that we can remove `P` altogether, since `HoodieIndex#fetchRecordLocation` is not used much outside of internal APIs. So I will push a final change for that. Tests are passing now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
wangxianghu commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-701951988

> @wangxianghu @yanghua I have rebased this against master. Please take a look at my changes.
>
> High level, we could re-use more code, but it needs an abstraction that can wrap `RDD` or `DataSet` or `DataStream` adequately and support basic operations like `.map()`, `reduceByKey()` etc. We can do this in a second pass once we have a working Flink impl. For now this will do.
>
> I am trying to get the tests to pass. If they do, we could go ahead and merge.

Thanks, @vinothchandar, this is really great work! Yes, we can add more abstractions for the basic `map` and `reduceByKey` operations in `HoodieEngineContext`, or in some util classes, next. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
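As a rough illustration of the abstraction being discussed (not the actual HoodieEngineContext API), the sketch below shows a hypothetical engine-agnostic context exposing map and reduce-by-key primitives, with a plain-Java implementation. A Spark-backed implementation of the same interface would delegate to JavaRDD's map, mapToPair, and reduceByKey; a Flink-backed one would wrap its own operators.

```
import java.util.AbstractMap.SimpleEntry;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration of an engine-agnostic context: the names below are not Hudi APIs.
interface EngineContextSketch {
  <I, O> List<O> map(List<I> data, Function<I, O> fn, int parallelism);

  <I, K, V> Map<K, V> mapToPairAndReduceByKey(List<I> data, Function<I, SimpleEntry<K, V>> pairFn,
                                              BiFunction<V, V, V> reduceFn, int parallelism);
}

// A simple local (non-Spark) implementation backed by Java streams, e.g. what a Flink- or
// Java-client module could start from before wiring in a real engine.
class LocalEngineContextSketch implements EngineContextSketch {
  @Override
  public <I, O> List<O> map(List<I> data, Function<I, O> fn, int parallelism) {
    // parallelism is ignored locally; an engine-backed implementation would honor it
    return data.parallelStream().map(fn).collect(Collectors.toList());
  }

  @Override
  public <I, K, V> Map<K, V> mapToPairAndReduceByKey(List<I> data, Function<I, SimpleEntry<K, V>> pairFn,
                                                     BiFunction<V, V, V> reduceFn, int parallelism) {
    // map each element to a (key, value) pair, then merge values per key with reduceFn
    return data.stream()
        .map(pairFn)
        .collect(Collectors.toMap(SimpleEntry::getKey, SimpleEntry::getValue, reduceFn::apply));
  }
}
```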
[GitHub] [hudi] codecov-commenter commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine
codecov-commenter commented on pull request #1827: URL: https://github.com/apache/hudi/pull/1827#issuecomment-701949671 # [Codecov](https://codecov.io/gh/apache/hudi/pull/1827?src=pr&el=h1) Report > Merging [#1827](https://codecov.io/gh/apache/hudi/pull/1827?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/a99e93bed542c8ae30a641d1df616cc2cd5798e1?el=desc) will **decrease** coverage by `3.75%`. > The diff coverage is `30.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/1827/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1827?src=pr&el=tree) ```diff @@ Coverage Diff @@ ## master#1827 +/- ## - Coverage 59.89% 56.14% -3.76% + Complexity 4454 2658-1796 Files 558 324 -234 Lines 2337814775-8603 Branches 2348 1539 -809 - Hits 14003 8295-5708 + Misses 8355 5783-2572 + Partials 1020 697 -323 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | #hudicli | `38.37% <30.00%> (-27.83%)` | `193.00 <0.00> (-1615.00)` | | | #hudiclient | `100.00% <ø> (+25.46%)` | `0.00 <ø> (-1615.00)` | :arrow_up: | | #hudicommon | `54.74% <ø> (ø)` | `1793.00 <ø> (ø)` | | | #hudihadoopmr | `?` | `?` | | | #hudispark | `67.18% <ø> (-0.02%)` | `311.00 <ø> (ø)` | | | #huditimelineservice | `64.43% <ø> (ø)` | `49.00 <ø> (ø)` | | | #hudiutilities | `69.43% <ø> (+0.05%)` | `312.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/1827?src=pr&el=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...rg/apache/hudi/cli/commands/SavepointsCommand.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NhdmVwb2ludHNDb21tYW5kLmphdmE=) | `14.28% <0.00%> (ø)` | `3.00 <0.00> (ø)` | | | [...main/java/org/apache/hudi/cli/utils/SparkUtil.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL3V0aWxzL1NwYXJrVXRpbC5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | | | [...n/java/org/apache/hudi/cli/commands/SparkMain.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NwYXJrTWFpbi5qYXZh) | `6.43% <37.50%> (+0.40%)` | `4.00 <0.00> (ø)` | | | [...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==) | `45.36% <0.00%> (ø)` | `21.00% <0.00%> (ø%)` | | | [...in/scala/org/apache/hudi/HoodieStreamingSink.scala](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3RyZWFtaW5nU2luay5zY2FsYQ==) | `24.00% <0.00%> (ø)` | `10.00% <0.00%> (ø%)` | | | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `56.20% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | | | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | `64.59% <0.00%> (ø)` | 
`30.00% <0.00%> (ø%)` | | | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `68.16% <0.00%> (ø)` | `39.00% <0.00%> (ø%)` | | | [.../hudi/async/SparkStreamingAsyncCompactService.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9hc3luYy9TcGFya1N0cmVhbWluZ0FzeW5jQ29tcGFjdFNlcnZpY2UuamF2YQ==) | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | | | [.../hudi/internal/HoodieDataSourceInternalWriter.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9pbnRlcm5hbC9Ib29kaWVEYXRhU291cmNlSW50ZXJuYWxXcml0ZXIuamF2YQ==) | `87.50% <0.00%> (ø)` | `8.00% <0.00%> (ø%)` | | | ... and [4