[GitHub] [hudi] SteNicholas removed a comment on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-10-01 Thread GitBox


SteNicholas removed a comment on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-700524082


   > According to this (https://github.com/apache/hudi/issues/2051) test, I can't get the results I want. When we set different values for hoodie.parquet.small.file.limit, the results are still different.
   
   @linshan-ma Could you please provide this test again? I couldn't access the test you mentioned.
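   
   For context, a minimal sketch of how such a test might vary this limit through the Spark datasource (the DataFrame `df`, the path, and the other options are placeholders, not from the original test):
   
   ```java
   // Hypothetical re-run of the test: write with the "insert" operation and
   // vary the small-file limit between runs, then compare the produced files.
   df.write().format("org.apache.hudi")
       .option("hoodie.datasource.write.operation", "insert")
       .option("hoodie.parquet.small.file.limit", "104857600") // default 100 MB; try e.g. "0"
       .mode(SaveMode.Append)
       .save("/tmp/hudi_small_file_test"); // placeholder path
   ```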



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] SteNicholas commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-10-01 Thread GitBox


SteNicholas commented on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-702550693


   @linshan-ma You could use the current commit to check your test case again. 
IMO, the current commit has already resolved your problem.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] SteNicholas removed a comment on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-10-01 Thread GitBox


SteNicholas removed a comment on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-701890658


   > > > According to this (https://github.com/apache/hudi/issues/2051) test, I can't get the results I want. When we set different values for hoodie.parquet.small.file.limit, the results are still different.
   > > 
   > > 
   > > @linshan-ma Could you please provide this test again? I couldn't access the test you mentioned.
   > 
   > @SteNicholas Hi, this issue: #2051
   
   @linshan-ma You could use the latest commit to check your test case. IMO, the latest commit has already solved your problem.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-01 Thread GitBox


prashantwason edited a comment on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968


   Remaining work items:
   
   - [x] 1. Support for rollbacks in MOR Table
   - [x] 2. Rollback of metadata if commit eventually fails on dataset 
   - [x] 3. HUDI-CLI extensions for metadata debugging
   - [x] 4. Ensure partial rollbacks do not use metadata table as it does not 
contain partial info 
   - [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have 
older timestamp than INIT timestamp on metadata table
   - [ ] 6. Check if MergedBlockReader will neglect log blocks based on 
uncommitted commits.
   - [x] 7. Unit test for rollback of partial commits
   - [x] 8. Schema evolution strategy for metadata table
   - [x] 9. Unit test for marker based rollback
   - [x] 10. Can all compaction strategies work off of metadata table itself? 
Does it have all the data
   - [ ] 11. Async Clean and Async Compaction - how will they work with 
metadata table updates - check multi writer
   - [ ] 12. Fix the case when the table is non-partitioned
   - [ ] 13. Test for Async cases
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1312) Query side use of Metadata Table

2020-10-01 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1312:


 Summary: Query side use of Metadata Table
 Key: HUDI-1312
 URL: https://issues.apache.org/jira/browse/HUDI-1312
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Prashant Wason


Add support for opening Metadata Table on the query side and using it for 
eliminating file listings.
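Conceptually, this replaces direct file-system listings with metadata-table lookups. A sketch of the contrast (only the Hadoop listing call is a real API; the metadata-table class and method are hypothetical):

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class ListingSketch {
  // Current behavior: hit the file system directly (slow on cloud stores).
  FileStatus[] listFromStorage(FileSystem fs, String basePath, String partition) throws IOException {
    return fs.listStatus(new Path(basePath, partition));
  }

  // With HUDI-1312 the same listing would be served from the metadata table,
  // e.g. (hypothetical API):
  // FileStatus[] listFromMetadata(HoodieMetadataTable metadata, String partition) {
  //   return metadata.getAllFilesInPartition(partition);
  // }
}
{code}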



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-970) HoodieTableFileSystem implementation to back API's using consolidated metadata

2020-10-01 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-970:

Status: Open  (was: New)

> HoodieTableFileSystem implementation to back API's using consolidated metadata
> --
>
> Key: HUDI-970
> URL: https://issues.apache.org/jira/browse/HUDI-970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-970) HoodieTableFileSystem implementation to back API's using consolidated metadata

2020-10-01 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason closed HUDI-970.
---
Resolution: Fixed

> HoodieTableFileSystem implementation to back API's using consolidated metadata
> --
>
> Key: HUDI-970
> URL: https://issues.apache.org/jira/browse/HUDI-970
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-969) Implement compaction strategies for consolidated metadata table

2020-10-01 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason closed HUDI-969.
---
Resolution: Invalid

> Implement compaction strategies for consolidated metadata table
> ---
>
> Key: HUDI-969
> URL: https://issues.apache.org/jira/browse/HUDI-969
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Compaction
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-968) Creation of first base/snapshot metadata (similar to onboarding/bootstrapping)

2020-10-01 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-968:

Status: Open  (was: New)

> Creation of first base/snapshot metadata (similar to onboarding/bootstrapping)
> --
>
> Key: HUDI-968
> URL: https://issues.apache.org/jira/browse/HUDI-968
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-969) Implement compaction strategies for consolidated metadata table

2020-10-01 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-969:

Status: Open  (was: New)

> Implement compaction strategies for consolidated metadata table
> ---
>
> Key: HUDI-969
> URL: https://issues.apache.org/jira/browse/HUDI-969
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Compaction
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-968) Creation of first base/snapshot metadata (similar to onboarding/bootstrapping)

2020-10-01 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason closed HUDI-968.
---
Resolution: Fixed

> Creation of first base/snapshot metadata (similar to onboarding/bootstrapping)
> --
>
> Key: HUDI-968
> URL: https://issues.apache.org/jira/browse/HUDI-968
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-01 Thread GitBox


prashantwason edited a comment on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968


   Remaining work items:
   
   - [x] 1. Support for rollbacks in MOR Table
   - [x] 2. Rollback of metadata if commit eventually fails on dataset 
   - [x] 3. HUDI-CLI extensions for metadata debugging
   - [x] 4. Ensure partial rollbacks do not use metadata table as it does not 
contain partial info 
   - [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have 
older timestamp than INIT timestamp on metadata table
   - [ ] 6. Check if MergedBlockReader will neglect log blocks based on 
uncommitted commits.
   - [x] 7. Unit test for rollback of partial commits
   - [x] 8. Schema evolution strategy for metadata table
   - [x] 9. Unit test for marker based rollback
   - [x] 10. Can all compaction strategies work off of metadata table itself? 
Does it have all the data
   - [ ] 11. Async Clean and Async Compaction - how will they work with 
metadata table updates - check multi writer
   - [ ] 12. Query-side use of metadata table
   - [ ] 13. How we are going to add new metadata partitions in the background, 
as writers/cleaner/compactors keep running.
   - [ ] 14. Fix the case when the table is non-partitioned
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-01 Thread GitBox


prashantwason edited a comment on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968


   Remaining work items:
   
   - [x] 1. Support for rollbacks in MOR Table
   - [x] 2. Rollback of metadata if commit eventually fails on dataset 
   - [x] 3. HUDI-CLI extensions for metadata debugging
   - [ ] 4. Ensure partial rollbacks do not use metadata table as it does not 
contain partial info 
   - [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have 
older timestamp than INIT timestamp on metadata table
   - [ ] 6. Check if MergedBlockReader will neglect log blocks based on 
uncommitted commits.
   - [x] 7. Unit test for rollback of partial commits
   - [x] 8. Schema evolution strategy for metadata table
   - [x] 9. Unit test for marker based rollback
   - [x] 10. Can all compaction strategies work off of metadata table itself? 
Does it have all the data
   - [ ] 11. Async Clean and Async Compaction - how will they work with 
metadata table updates - check multi writer
   - [ ] 12. Query-side use of metadata table
   - [ ] 13. How we are going to add new metadata partitions in the background, 
as writers/cleaner/compactors keep running.
   - [ ] 14. Fix the case when the table is non-partitioned
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #1929: [HUDI-1160] Support update partial fields for CoW table

2020-10-01 Thread GitBox


leesf commented on pull request #1929:
URL: https://github.com/apache/hudi/pull/1929#issuecomment-702500795


   > @leesf Any update? Let me know if you need any help here
   
   ack, will update the PR ASAP



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #2082: [WIP] hudi cluster write path poc

2020-10-01 Thread GitBox


leesf commented on pull request #2082:
URL: https://github.com/apache/hudi/pull/2082#issuecomment-702494333


   > @leesf #2048 is landed. is it possible to merge this and address Balaji's 
comments? (I can help if needed)
   
   Sure, considering I am a little busy these days, it would be great if you would take over the PR and land it. Thanks



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf edited a comment on pull request #2082: [WIP] hudi cluster write path poc

2020-10-01 Thread GitBox


leesf edited a comment on pull request #2082:
URL: https://github.com/apache/hudi/pull/2082#issuecomment-702494333


   > @leesf #2048 is landed. is it possible to merge this and address Balaji's 
comments? (I can help if needed)
   
   Sure, considering I am a little busy these days, it would be great if you @satishkotha would take over the PR and land it. Thanks



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-01 Thread GitBox


prashantwason commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r498587425



##
File path: 
hudi-client/src/main/java/org/apache/hudi/metadata/HoodieMetadataPayload.java
##
@@ -0,0 +1,227 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieMetadataException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericData;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ * This is a payload which saves information about a single entry in the 
Metadata Table. The type of the entry is
+ * determined by the "type" saved within the record. The following types of 
entries are saved:
+ *
+ *   1. List of partitions: There is a single such record
+ * key="__all_partitions__"
+ * filenameToSizeMap={"2020/01/01": 0, "2020/01/02": 0, ...}
+ *
+ *   2. List of files in a Partition: There is one such record for each 
partition
+ * key=Partition name
+ * filenameToSizeMap={"file1.parquet": 12345, "file2.parquet": 56789, 
"file1.log": 9876,
+ *"file0.parquet": -1, ...}
+ *
+ *  For deleted files, -1 is used as the size.
+ *
+ *  During compaction on the table, the deletions are merged with additions 
and hence pruned.
+ */
+public class HoodieMetadataPayload implements HoodieRecordPayload<HoodieMetadataPayload> {
+  private static final Logger LOG = 
LogManager.getLogger(HoodieMetadataPayload.class);
+
+  // Represents the size stored for a deleted file
+  private static final long DELETED_FILE_SIZE = -1;
+
+  // Key and type for the metadata record
+  private final String metadataKey;
+  private final PayloadType type;
+
+  // Filenames which are part of this record
+  // key=filename, value=file size (or DELETED_FILE_SIZE to represent a 
deleted file)
+  private final Map<String, Long> filenameMap = new HashMap<>();
+
+  // Type of the metadata record
+  public enum PayloadType {
+PARTITION_LIST(1),// list of partitions
+PARTITION_FILES(2);   // list of files in a partition
+
+private final int value;

Review comment:
   I have changed the schema. PTAL.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-01 Thread GitBox


prashantwason commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r498587284



##
File path: hudi-client/src/main/resources/metadataSchema.txt
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+{
+"namespace": "hudi.metadata",
+"type": "record",
+"name": "metadata",
+"fields": [
+{
+"name": "key",
+"type": "string"
+},
+{
+"name": "type",
+"type": "int",
+"doc": "Type of the metadata record (refer to 
HoodieMetadataPayload)"
+},
+{   "name": "filenameToSizeMap",
+"type": {
+"type": "map",
+"doc": "Filenames mapped to their sizes",
+"values": {
+"type": "long",
+"doc": "Size of this file in bytes or -1 for deleted files"

Review comment:
   I have changed the schema.

##
File path: hudi-client/src/main/resources/metadataSchema.txt
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+{
+"namespace": "hudi.metadata",
+"type": "record",
+"name": "metadata",

Review comment:
   I have changed the schema.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason edited a comment on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-01 Thread GitBox


prashantwason edited a comment on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-686688968


   Remaining work items:
   
   - [x] 1. Support for rollbacks in MOR Table
   - [ ] 2. Rollback of metadata if commit eventually fails on dataset 
   - [x] 3. HUDI-CLI extensions for metadata debugging
   - [ ] 4. Ensure partial rollbacks do not use metadata table as it does not 
contain partial info 
   - [ ] 5. Fix initialization when Async jobs are scheduled - these jobs have 
older timestamp than INIT timestamp on metadata table
   - [ ] 6. Check if MergedBlockReader will neglect log blocks based on 
uncommitted commits.
   - [x] 7. Unit test for rollback of partial commits
   - [x] 8. Schema evolution strategy for metadata table
   - [x] 9. Unit test for marker based rollback
   - [x] 10. Can all compaction strategies work off of metadata table itself? 
Does it have all the data
   - [ ] 11. Async Clean and Async Compaction - how will they work with 
metadata table updates - check multi writer
   - [ ] 12. Query-side use of metadata table
   - [ ] 13. How we are going to add new metadata partitions in the background, 
as writers/cleaner/compactors keep running.
   - [ ] 14. Fix the case when the table is non-partitioned
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1289) Using hbase index in spark hangs in Hudi 0.6.0

2020-10-01 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205913#comment-17205913
 ] 

Vinoth Chandar commented on HUDI-1289:
--

Great! Given how HBase and Guava are notorious for class-mismatch hell, I'd prefer that we shade these if it's doable (at the cost of having to hard-code the listener).



If shading does not work, then we can go with the working combination that you have tested without shading. By shading, I mean relocating the package.

> Using hbase index in spark hangs in Hudi 0.6.0
> --
>
> Key: HUDI-1289
> URL: https://issues.apache.org/jira/browse/HUDI-1289
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ryan Pifer
>Priority: Major
> Fix For: 0.6.1
>
>
> In Hudi 0.6.0 I can see that there was a change to shade the hbase 
> dependencies in hudi-spark-bundle jar. When using HBASE index with only 
> hudi-spark-bundle jar specified in spark session there are several issues:
>  
>  # Dependencies are not being correctly resolved:
> Hbase default status listener class value is defined by the class name before 
> relocation
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class 
> org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not 
> org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener 
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2427) at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.(ConnectionManager.java:656)
>  ... 39 moreCaused by: java.lang.RuntimeException: class 
> org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not 
> org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener 
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2421) ... 
> 40 more{code}
>  
> [https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClusterStatusListener.java#L72-L73]
>  
> This can be fixed by overriding the status listener class in the hbase 
> configuration used in hudi 
> {code:java}
> hbaseConfig.set("hbase.status.listener.class", 
> "org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener"){code}
> [https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java#L134]
>  
> 2. After modifying the above, executors hang when trying to connect to hbase 
> and fail after about 45 minutes
> {code:java}
> Caused by: 
> org.apache.hudi.org.apache.hadoop.hbase.client.RetriesExhaustedException: 
> Failed after attempts=36, exceptions:Thu Sep 17 23:59:42 UTC 2020, null, 
> java.net.SocketTimeoutException: callTimeout=6, callDuration=68536: row 
> 'hudiindex,12345678,99' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, 
> hostname=ip-10-81-236-56.ec2.internal,16020,1600130997457, seqNum=0
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:212)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:186)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1275)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1181)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1165)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1122)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:957)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:83)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:75)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR

[GitHub] [hudi] bschell commented on a change in pull request #2129: [HUDI-1302] Add support for timestamp field in HiveSync

2020-10-01 Thread GitBox


bschell commented on a change in pull request #2129:
URL: https://github.com/apache/hudi/pull/2129#discussion_r498571592



##
File path: 
hudi-sync/hudi-dla-sync/src/main/java/org/apache/hudi/dla/DLASyncConfig.java
##
@@ -68,6 +68,9 @@
   @Parameter(names = {"--help", "-h"}, help = true)
   public Boolean help = false;
 
+  @Parameter(names = {"--support-timestamp"}, description = "If true, converts 
int64(timestamp_micros) to timestamp type")
+  public Boolean supportTimestamp = false;

Review comment:
   I think we need to add this option into DataSourceOptions, 
DataSourceUtils, and HoodieSparkSqlWriter
   
   something like?
   "hoodie.datasource.hive_sync.support_timestamp"
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1289) Using hbase index in spark hangs in Hudi 0.6.0

2020-10-01 Thread Ryan Pifer (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205855#comment-17205855
 ] 

Ryan Pifer commented on HUDI-1289:
--

[~vinoth] I was able to surface the issue. It seems the codec package is shaded in the bundle but not actually included in it. Because of this, HBase references the shaded pattern of the codec classes, while the codec dependency itself is brought in by Spark, so the class names are unchanged. By including these classes in the bundle without shading them, I am able to successfully use the HBase index with the hudi-spark-bundle jar. I will create a PR for this.

The question is whether we still want to shade the HBase dependencies. We could include the codec as part of the bundle and continue to shade everything; however, this would still require setting the status listener class.

 

 

> Using hbase index in spark hangs in Hudi 0.6.0
> --
>
> Key: HUDI-1289
> URL: https://issues.apache.org/jira/browse/HUDI-1289
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ryan Pifer
>Priority: Major
> Fix For: 0.6.1
>
>
> In Hudi 0.6.0 I can see that there was a change to shade the hbase 
> dependencies in hudi-spark-bundle jar. When using HBASE index with only 
> hudi-spark-bundle jar specified in spark session there are several issues:
>  
>  # Dependencies are not being correctly resolved:
> Hbase default status listener class value is defined by the class name before 
> relocation
> {code:java}
> Caused by: java.lang.RuntimeException: java.lang.RuntimeException: class 
> org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not 
> org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener 
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2427) at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.(ConnectionManager.java:656)
>  ... 39 moreCaused by: java.lang.RuntimeException: class 
> org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener not 
> org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$Listener 
> at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2421) ... 
> 40 more{code}
>  
> [https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ClusterStatusListener.java#L72-L73]
>  
> This can be fixed by overriding the status listener class in the hbase 
> configuration used in hudi 
> {code:java}
> hbaseConfig.set("hbase.status.listener.class", 
> "org.apache.hudi.org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener"){code}
> [https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java#L134]
>  
> 2. After modifying the above, executors hang when trying to connect to hbase 
> and fail after about 45 minutes
> {code:java}
> Caused by: 
> org.apache.hudi.org.apache.hadoop.hbase.client.RetriesExhaustedException: 
> Failed after attempts=36, exceptions:Thu Sep 17 23:59:42 UTC 2020, null, 
> java.net.SocketTimeoutException: callTimeout=6, callDuration=68536: row 
> 'hudiindex,12345678,99' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, 
> hostname=ip-10-81-236-56.ec2.internal,16020,1600130997457, seqNum=0
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.throwEnrichedException(RpcRetryingCallerWithReadReplicas.java:276)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:210)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:212)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:186)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1275)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1181)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1165)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1122)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:957)
>  at 
> org.apache.hudi.org.apache.hadoop.hbase.client.HRe

[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702420836


   > @wangxianghu Just merged! Thanks again for the herculean effort.
   > 
   > Maybe some followups could pop up. Would you be interested in taking them up? If so, I'll mention you along the way.
   
   sure, just ping me when needed



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702407275


   @wangxianghu Just merged! Thanks again for the herculean effort. 
   
   Maybe some followups could pop up. Would you be interested in taking them up? If so, I'll mention you along the way.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


vinothchandar merged pull request #1827:
URL: https://github.com/apache/hudi/pull/1827


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on pull request #2082: [WIP] hudi cluster write path poc

2020-10-01 Thread GitBox


satishkotha commented on pull request #2082:
URL: https://github.com/apache/hudi/pull/2082#issuecomment-702405034


   @leesf #2048 is landed. is it possible to merge this and address Balaji's 
comments? (I can help if needed)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on pull request #1929: [HUDI-1160] Support update partial fields for CoW table

2020-10-01 Thread GitBox


satishkotha commented on pull request #1929:
URL: https://github.com/apache/hudi/pull/1929#issuecomment-702404415


   @leesf Any update? Let me know if you need any help here



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702335492


   @wangxianghu duh ofc. I understand now. Thanks for jumping in @wangxianghu ! 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-89) Clean up placement, naming, defaults of HoodieWriteConfig

2020-10-01 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205751#comment-17205751
 ] 

sivabalan narayanan edited comment on HUDI-89 at 10/1/20, 6:41 PM:
---

Sorry, I was busy for the last few weeks. Here is my understanding. I don't have full context around moving configs to the right classes; I need some time to look into CompactionConfig, StorageConfig, and HoodieCleanConfig. But w.r.t. revamping configs in general, here is the idea.

As of now, config management is naive. Say we want to add a new config: we add a key string to HoodieWriteConfig, add a default, expose a getter and setter with the builder pattern, call into setting up defaults for properties not set, and then build the HoodieWriteConfig. We wish to introduce a class called ConfigOption (sources: [1|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOption.java], [2|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/Configuration.java] and [3|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOptions.java]). We are not looking for a full-fledged ConfigOption (which includes fallback keys and such), but just key, value, defaultValue and description for now. We can iteratively add more features. For example, there was some prep work done earlier in this regard [here|https://github.com/apache/hudi/pull/1094/files].

By this, we can bind the key, default value, and description for every config value together. The default value is maintained along with the actual config in ConfigOption, so get() returns the actual value if set, and otherwise the default value. Also, the description will come in handy when we want to generate release docs.

And btw, with this change, we also want to rename HoodieWriteConfig to HoodieClientConfig.

[~vinoth]: I understand we don't want to do a complete overhaul that changes the way users set the properties. So how do we go about populating ConfigOptions from a map of properties or from a property file? In other words, how do we infer the value type from the property? Or am I missing something about what changes we need to make?

 


was (Author: shivnarayan):
sorry, I was busy for the last few weeks. Here is my understanding. I don't 
have full context around moving configs to right classes. I need sometime to 
look into CompactionConfig, StorageConfig, HoodieCleanConfig. But wrt revamping 
Configs in general here is the idea. 

As of now, config management is naive. Let's say we want to add a new config, 
we add a key string to HoodieWriteConfig, and then add a default, expose getter 
and setter with builder pattern. Call into setting up defaults for properties 
not set. and then build the HoodieWriteConfig. We wish to introduce a class 
called ConfigOption (sources: [1|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOption.java], [2|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/Configuration.java] and [3|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOptions.java]).
 We are not looking for a full fledged ConfigOption(which include fallback keys 
and and stuff), but just key, value, defaultValue and description for now. We 
can iteratively add more features. For eg: there was some prep work done 
earlier on this regards [here|https://github.com/apache/hudi/pull/1094/files].

By this, we can bind a key, default value, description for every config value 
together. With this, the default value is maintained along w/ the actual config 
in ConfigOption and so get() should return the actual value if 

[jira] [Commented] (HUDI-89) Clean up placement, naming, defaults of HoodieWriteConfig

2020-10-01 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205751#comment-17205751
 ] 

sivabalan narayanan commented on HUDI-89:
-

Sorry, I was busy for the last few weeks. Here is my understanding. I don't have full context around moving configs to the right classes; I need some time to look into CompactionConfig, StorageConfig, and HoodieCleanConfig. But w.r.t. revamping configs in general, here is the idea.

As of now, config management is naive. Say we want to add a new config: we add a key string to HoodieWriteConfig, add a default, expose a getter and setter with the builder pattern, call into setting up defaults for properties not set, and then build the HoodieWriteConfig. We wish to introduce a class called ConfigOption (sources: [1|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOption.java], [2|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/Configuration.java] and [3|https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/ConfigOptions.java]). We are not looking for a full-fledged ConfigOption (which includes fallback keys and such), but just key, value, defaultValue and description for now. We can iteratively add more features. For example, there was some prep work done earlier in this regard [here|https://github.com/apache/hudi/pull/1094/files].

By this, we can bind the key, default value, and description for every config value together. The default value is maintained along with the actual config in ConfigOption, so get() returns the actual value if set, and otherwise the default value. Also, the description will come in handy when we want to generate release docs.

And btw, with this change, we also want to rename HoodieWriteConfig to HoodieClientConfig.

[~vinoth]: I understand we don't want to do a complete overhaul that changes the way users set the properties. So how do we go about populating ConfigOptions from a map of properties or from a property file? In other words, how do we infer the value type from the property? Or am I missing something about what changes we need to make?
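
A minimal sketch of the ConfigOption idea described above (the shape and names are illustrative assumptions, not a settled Hudi API; the parser function is one possible answer to the value-typing question):

{code:java}
import java.util.Properties;
import java.util.function.Function;

// Binds a key, default value and description together; the parser turns the
// raw string property into the typed value.
public final class ConfigOption<T> {
  private final String key;
  private final T defaultValue;
  private final String description;
  private final Function<String, T> parser;

  public ConfigOption(String key, T defaultValue, String description, Function<String, T> parser) {
    this.key = key;
    this.defaultValue = defaultValue;
    this.description = description;
    this.parser = parser;
  }

  // Return the parsed value if the property is set, else the default.
  public T getOrDefault(Properties props) {
    String raw = props.getProperty(key);
    return raw == null ? defaultValue : parser.apply(raw);
  }
}

// Example definition (key name is hypothetical):
// ConfigOption<Integer> RETRIES =
//     new ConfigOption<>("hoodie.client.retries", 3, "Number of retries", Integer::parseInt);
{code}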

 

> Clean up placement, naming, defaults of HoodieWriteConfig
> -
>
> Key: HUDI-89
> URL: https://issues.apache.org/jira/browse/HUDI-89
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Usability, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>
> # Rename HoodieWriteConfig to HoodieClientConfig 
>  # Move bunch of configs from  CompactionConfig to StorageConfig 
>  # Introduce new HoodieCleanConfig
>  # Should we consider lombok or something to automate the 
> defaults/getters/setters
>  # Consistent name of properties/defaults 
>  # Enforce bounds more strictly 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702318528


   @vinothchandar The warn log issue is fixed



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702308385


   > > @wangxianghu can you please test the latest commit. To be clear, you are 
saying you don't get the warning on master, but get it on this branch. right?
   > > if this round of tests pass, and you confirm, we can land from my 
perspective
   > 
   > Hi @vinothchandar The warn log is still there in HUDI-1089 branch.(master 
is ok, no warn log)
   > I think we should check `embeddedTimelineServiceHostAddr` instead of 
`hostAddr`.
   > 
   > ```
   > private void setHostAddr(String embeddedTimelineServiceHostAddr) {
   >   // here we should check embeddedTimelineServiceHostAddr instead of hostAddr
   >   if (hostAddr != null) {
   >     LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr);
   >     this.hostAddr = embeddedTimelineServiceHostAddr;
   >   } else {
   >     LOG.warn("Unable to find driver bind address from spark config");
   >     this.hostAddr = NetworkUtils.getHostname();
   >   }
   > }
   > ```
   
   I have tested the latest commit with the check condition changed to
   ```
   if (embeddedTimelineServiceHostAddr != null) {
   ```
   It runs well locally, and the warn log has disappeared.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu edited a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702302083


   > @wangxianghu can you please test the latest commit. To be clear, you are 
saying you don't get the warning on master, but get it on this branch. right?
   > 
   > if this round of tests pass, and you confirm, we can land from my 
perspective
   
   Hi @vinothchandar The warn log is still there in HUDI-1089 branch.(master is 
ok, no warn log)
   I think we should check `embeddedTimelineServiceHostAddr` instead of 
`hostAddr`.
   
   ```
   private void setHostAddr(String embeddedTimelineServiceHostAddr) {
     // here we should check embeddedTimelineServiceHostAddr instead of hostAddr
     if (hostAddr != null) {
       LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr);
       this.hostAddr = embeddedTimelineServiceHostAddr;
     } else {
       LOG.warn("Unable to find driver bind address from spark config");
       this.hostAddr = NetworkUtils.getHostname();
     }
   }
   ```
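   
   Applying that suggestion, the corrected method would presumably look like the sketch below (it assumes the surrounding class's `hostAddr` field, `LOG`, and `NetworkUtils` helper, exactly as in the snippet above):
   
   ```
   private void setHostAddr(String embeddedTimelineServiceHostAddr) {
     // Check the incoming address, not this.hostAddr (which is still null at this point).
     if (embeddedTimelineServiceHostAddr != null) {
       LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr);
       this.hostAddr = embeddedTimelineServiceHostAddr;
     } else {
       LOG.warn("Unable to find driver bind address from spark config");
       this.hostAddr = NetworkUtils.getHostname();
     }
   }
   ```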



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702302083


   > @wangxianghu can you please test the latest commit. To be clear, you are 
saying you don't get the warning on master, but get it on this branch. right?
   > 
   > if this round of tests pass, and you confirm, we can land from my 
perspective
   
   Hi @vinothchandar The warn log is still there in HUDI-1089 branch.(master is 
ok, no warn log)
   I think we should check `embeddedTimelineServiceHostAddr` instead of 
`hostAddr`.
   
   ```
   private void setHostAddr(String embeddedTimelineServiceHostAddr) {
     // here we should check embeddedTimelineServiceHostAddr instead of hostAddr, hostAddr is always null
     if (hostAddr != null) {
       LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr);
       this.hostAddr = embeddedTimelineServiceHostAddr;
     } else {
       LOG.warn("Unable to find driver bind address from spark config");
       this.hostAddr = NetworkUtils.getHostname();
     }
   }
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu edited a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702302083


   > @wangxianghu can you please test the latest commit. To be clear, you are 
saying you don't get the warning on master, but get it on this branch. right?
   > 
   > if this round of tests pass, and you confirm, we can land from my 
perspective
   
   Hi @vinothchandar The warn log is still there in HUDI-1089 branch.(master is 
ok, no warn log)
   I think we should check `embeddedTimelineServiceHostAddr` instead of 
`hostAddr`.
   
   ```
   private void setHostAddr(String embeddedTimelineServiceHostAddr) {
     // here we should check embeddedTimelineServiceHostAddr instead of hostAddr, hostAddr is always null
     if (hostAddr != null) {
       LOG.info("Overriding hostIp to (" + embeddedTimelineServiceHostAddr + ") found in spark-conf. It was " + this.hostAddr);
       this.hostAddr = embeddedTimelineServiceHostAddr;
     } else {
       LOG.warn("Unable to find driver bind address from spark config");
       this.hostAddr = NetworkUtils.getHostname();
     }
   }
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on pull request #2129: [HUDI-1302] Add support for timestamp field in HiveSync

2020-10-01 Thread GitBox


satishkotha commented on pull request #2129:
URL: https://github.com/apache/hudi/pull/2129#issuecomment-702275562


   @pratyakshsharma will you be able to review this week?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702248781


   @wangxianghu can you please test the latest commit.  To be clear, you are 
saying you don't get the warning on master, but get it on this branch. right?
   
   if this round of tests pass, and you confirm, we can land from my perspective



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tandonraghavs edited a comment on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)

2020-10-01 Thread GitBox


tandonraghavs edited a comment on issue #2131:
URL: https://github.com/apache/hudi/issues/2131#issuecomment-702082811


   @bvaradar I am using my custom class as the `PAYLOAD_CLASS_OPT_KEY` key, but the problem is that **preCombine** doesn't have a reference to the **Schema** and only gives me bytes, so how do I get the GenericRecord out of it? This is the reason I am not able to implement any custom logic in _preCombine_ as I did in _combineAndGetUpdateValue_.
   
   I am using Hudi via the Spark datasource (0.5.3).
   
   Due to the scale of the data I don't want to run compaction after every commit, so I am using _INLINE_COMPACT_NUM_DELTA_COMMITS_PROP_.
   
   - How do I get hold of the Schema in preCombine?
   
   Sample code of my Spark job (**jsonDf** is a simple Dataset of JSON strings containing the records):
   
      Dataset<GenericRecord> data = jsonDf.map(
          (MapFunction<String, GenericRecord>) record -> generateHoodieRecord(record, schemaStr),
          Encoders.bean(GenericRecord.class));

      Dataset<Row> ds = AvroConversionUtils.createDataFrame(data.rdd(), schemaStr, sparkSession);

      ds.write().format("org.apache.hudi")
        .options(...)
        .mode(SaveMode.Append)
        .save(tablePath);
   
   This ticket also talks about the same issue: https://issues.apache.org/jira/browse/HUDI-898
   
   - Also, I think we cannot add/remove any field values in preCombine, as doing it manually causes an _EOFException when reading the log file_.
   
   - Am I missing something here? I don't think we can use Hudi if we have partial records coming in from oplogs and we have to apply all the oplogs to the existing dataset.
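   
   For context on why the Schema is unavailable, here is a rough sketch of the payload contract being discussed, approximately as it stands in Hudi 0.5.x (paraphrased, so treat the exact signatures as an assumption):
   
   ```java
   import java.io.IOException;
   import java.io.Serializable;
   import org.apache.avro.Schema;
   import org.apache.avro.generic.IndexedRecord;
   import org.apache.hudi.common.util.Option;

   public interface HoodieRecordPayload<T extends HoodieRecordPayload> extends Serializable {

     // Merges two incoming records with the same key BEFORE writing. No Schema
     // parameter: the payload only has its serialized bytes here, which is
     // exactly the limitation described above.
     T preCombine(T another);

     // Merges an incoming record with the stored record. A Schema IS available
     // here, so the Avro record can be deserialized.
     Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException;

     // Materializes the insert value; also receives the Schema.
     Option<IndexedRecord> getInsertValue(Schema schema) throws IOException;
   }
   ```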
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tandonraghavs edited a comment on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)

2020-10-01 Thread GitBox


tandonraghavs edited a comment on issue #2131:
URL: https://github.com/apache/hudi/issues/2131#issuecomment-702082811


   @bvaradar I am using my custom class as the `PAYLOAD_CLASS_OPT_KEY` key, but the problem is that **preCombine** doesn't have a reference to the **Schema** and only gives me bytes, so how do I get the GenericRecord out of it? This is the reason I am not able to implement any custom logic in _preCombine_ as I did in _combineAndGetUpdateValue_.
   
   I am using Hudi via the Spark datasource (0.5.3).
   
   Due to the scale of the data I don't want to run compaction after every commit, so I am using _INLINE_COMPACT_NUM_DELTA_COMMITS_PROP_.
   
   - How do I get hold of the Schema in preCombine?
   
   Sample code of my Spark job (**jsonDf** is a simple Dataset of JSON strings containing the records):
   
      Dataset<GenericRecord> data = jsonDf.map(
          (MapFunction<String, GenericRecord>) record -> generateHoodieRecord(record, schemaStr),
          Encoders.bean(GenericRecord.class));

      Dataset<Row> ds = AvroConversionUtils.createDataFrame(data.rdd(), schemaStr, sparkSession);

      ds.write().format("org.apache.hudi")
        .options(...)
        .mode(SaveMode.Append)
        .save(tablePath);
   
   This ticket also talks about the same issue: https://issues.apache.org/jira/browse/HUDI-898
   Also, I think we cannot add/remove any field values in preCombine, as doing it manually causes an _EOFException when reading the log file_.
   
   Am I missing something here? I don't think we can use Hudi if we have partial records coming in from oplogs and we have to apply all the oplogs to the existing dataset.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on issue #2110: [SUPPORT] Executor memory recommendation

2020-10-01 Thread GitBox


n3nash commented on issue #2110:
URL: https://github.com/apache/hudi/issues/2110#issuecomment-702239474


   @tooptoop4 No, the size of the existing table should not matter.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu removed a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu removed a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702217863


   > @leesf do you see the following exception? I could not understand how 
you'd even get the other one.
   > 
   > ```
   > LOG.info("Starting Timeline service !!");
   > Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
   > if (!hostAddr.isPresent()) {
   >   throw new HoodieException("Unable to find host address to bind timeline server to.");
   > }
   > timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.get(),
   >     config.getClientSpecifiedViewStorageConfig()));
   > ```
   > 
   > Either way, good pointer. The behavior has actually changed around this a 
bit, so I will try to tweak it and push a fix.
   
   I got this warning too. The code here does not seem to be the same:
   ```
   // Run Embedded Timeline Server
   LOG.info("Starting Timeline service !!");
   Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
   timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.orElse(null),
       config.getClientSpecifiedViewStorageConfig()));
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-70008


   @vinothchandar @yanghua @leesf The demo runs well locally, except for the 
warning `WARN embedded.EmbeddedTimelineService: Unable to find driver bind 
address from spark config`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702217863


   > @leesf do you see the following exception? I could not understand how 
you'd even get the other one.
   > 
   > ```
   > LOG.info("Starting Timeline service !!");
   > Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
   > if (!hostAddr.isPresent()) {
   >   throw new HoodieException("Unable to find host address to bind timeline server to.");
   > }
   > timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.get(),
   >     config.getClientSpecifiedViewStorageConfig()));
   > ```
   > 
   > Either way, good pointer. The behavior has actually changed around this a 
bit, so I will try to tweak it and push a fix.
   
   I got this warning too. The code here does not seem to be the same:
   ```
   // Run Embedded Timeline Server
   LOG.info("Starting Timeline service !!");
   Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
   timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.orElse(null),
       config.getClientSpecifiedViewStorageConfig()));
   ```
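   
   For context, that warn line comes from the fallback path taken when no bind 
address is found. A minimal sketch of that kind of resolution (illustrative 
only; the actual `EmbeddedTimelineService` internals may differ):
   
   ```java
   import java.net.InetAddress;
   import java.net.UnknownHostException;
   import org.apache.log4j.Logger;
   
   // Sketch: resolve the timeline server bind address, warning and falling
   // back to the local host when the Spark config does not supply one.
   class TimelineHostResolver {
     private static final Logger LOG = Logger.getLogger(TimelineHostResolver.class);
   
     static String resolveHostAddr(String configuredHostAddr) {
       if (configuredHostAddr != null) {
         return configuredHostAddr;
       }
       // The warn line observed in the quickstart demo runs above.
       LOG.warn("Unable to find driver bind address from spark config");
       try {
         return InetAddress.getLocalHost().getHostAddress();
       } catch (UnknownHostException e) {
         return "localhost"; // last-resort fallback
       }
     }
   }
   ```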



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702174407


   @leesf do you see the following exception? I could not understand how you'd 
even get the other one. 
   
   ```
   LOG.info("Starting Timeline service !!");
   Option<String> hostAddr = context.getProperty(EngineProperty.EMBEDDED_SERVER_HOST);
   if (!hostAddr.isPresent()) {
     throw new HoodieException("Unable to find host address to bind timeline server to.");
   }
   timelineServer = Option.of(new EmbeddedTimelineService(context, hostAddr.get(),
       config.getClientSpecifiedViewStorageConfig()));
   ```
   
   Either way, good pointer. The behavior has actually changed around this a 
bit, so I will try to tweak it and push a fix. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


leesf edited a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702125051


   1. Ran the quickstart demo and found the warn log: 
   `20/10/01 21:11:18 WARN embedded.EmbeddedTimelineService: Unable to find 
driver bind address from spark config`. Everything works fine, but this warn 
log does not appear in 0.6.0. @vinothchandar @wangxianghu 
   2. Ran my own unit tests; they work fine.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


leesf commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-702125051


   Ran the quickstart demo and found the warn log: 
   `20/10/01 21:11:18 WARN embedded.EmbeddedTimelineService: Unable to find 
driver bind address from spark config`. This warn log is not seen in 0.6.0. 
@vinothchandar @wangxianghu 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


yanghua commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r498188986



##
File path: hudi-cli/pom.xml
##
 @@ -148,7 +148,14 @@
     </dependency>
     <dependency>
       <groupId>org.apache.hudi</groupId>
 -      <artifactId>hudi-client</artifactId>
 +      <artifactId>hudi-client-common</artifactId>
 +      <version>${project.version}</version>
 +      <scope>test</scope>
 +      <type>test-jar</type>
 +    </dependency>
 +    <dependency>
 +      <groupId>org.apache.hudi</groupId>
 +      <artifactId>hudi-spark-client</artifactId>

Review comment:
   So, will we keep `hudi-spark-client`? We already have a `hudi-spark` module. 
IMHO, this naming may not be so clear.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


yanghua commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r498188986



##
File path: hudi-cli/pom.xml
##
@@ -148,7 +148,14 @@
    </dependency>
    <dependency>
      <groupId>org.apache.hudi</groupId>
-      <artifactId>hudi-client</artifactId>
+      <artifactId>hudi-client-common</artifactId>
+      <version>${project.version}</version>
+      <scope>test</scope>
+      <type>test-jar</type>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hudi</groupId>
+      <artifactId>hudi-spark-client</artifactId>

Review comment:
   So, will we keep `hudi-spark-client`? We already have a `hudi-spark` module.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tandonraghavs edited a comment on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)

2020-10-01 Thread GitBox


tandonraghavs edited a comment on issue #2131:
URL: https://github.com/apache/hudi/issues/2131#issuecomment-702082811


   @bvaradar I am using my custom class as the `PAYLOAD_CLASS_OPT_KEY` key. But 
the problem is that **preCombine** doesn't have a reference to the **Schema** 
and it hands me bytes, so how do I get the GenericRecord out of it? 
   This is the reason I am not able to implement any custom logic in 
_preCombine_ the way I did in _combineAndGetUpdateValue_.
   
   I am using Hudi via the Spark datasource (0.5.3).
   
   And due to the scale of the data I don't want to run compaction after every 
commit, so I am using _INLINE_COMPACT_NUM_DELTA_COMMITS_PROP_.
   
   - How do I get hold of the Schema in preCombine?
   
   Sample code of my Spark job. 
   **jsonDf** -> This is a simple Dataset of JSON strings which contains the 
records.
   
   
  Dataset<GenericRecord> data = jsonDf.map((MapFunction<String, GenericRecord>) record ->
   generateHoodieRecord(record, schemaStr), Encoders.bean(GenericRecord.class));
   
  Dataset<Row> ds = AvroConversionUtils.createDataFrame(data.rdd(),
   schemaStr, sparkSession);
  
  ds
   .write().format("org.apache.hudi")
   .options(...)
   .mode(SaveMode.Append)
   .save(tablePath);
   
   
   
   This ticket also talks about the same issue - 
https://issues.apache.org/jira/browse/HUDI-898



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tandonraghavs commented on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)

2020-10-01 Thread GitBox


tandonraghavs commented on issue #2131:
URL: https://github.com/apache/hudi/issues/2131#issuecomment-702082811


   @bvaradar I am using my custom class as the `PAYLOAD_CLASS_OPT_KEY` key. But 
the problem is that **preCombine** doesn't have a reference to the **Schema** 
and it hands me bytes, so how do I get the GenericRecord out of it? 
   This is the reason I am not able to implement any custom logic in 
_preCombine_ the way I did in _combineAndGetUpdateValue_.
   
   I am using Hudi via the Spark datasource (0.5.3).
   
   And due to the scale of the data I don't want to run compaction after every 
commit, so I am using _INLINE_COMPACT_NUM_DELTA_COMMITS_PROP_.
   
   - How do I get hold of the Schema in preCombine?
   
   Sample code of my Spark job. 
   **jsonDf** -> This is a simple Dataset of JSON strings which contains the 
records.
   
   
  Dataset<GenericRecord> data = jsonDf.map((MapFunction<String, GenericRecord>) record ->
   generateHoodieRecord(record, schemaStr), Encoders.bean(GenericRecord.class));
   
  Dataset<Row> ds = AvroConversionUtils.createDataFrame(data.rdd(),
   schemaStr, sparkSession);
  
  ds
   .write().format("org.apache.hudi")
   .options(...)
   .mode(SaveMode.Append)
   .save(tablePath);
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205435#comment-17205435
 ] 

Balaji Varadarajan commented on HUDI-1308:
--

cc [~vinoth]

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> This is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1311) Writes creating/updating large number of files seeing errors when deleting marker files in S3

2020-10-01 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1311:


 Summary: Writes creating/updating large number of files seeing 
errors when deleting marker files in S3
 Key: HUDI-1311
 URL: https://issues.apache.org/jira/browse/HUDI-1311
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan


Don't have the exception trace handy. Will add it when I run into this next 
time. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1310) Corruption Block Handling too slow in S3

2020-10-01 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1310:


 Summary: Corruption Block Handling too slow in S3
 Key: HUDI-1310
 URL: https://issues.apache.org/jira/browse/HUDI-1310
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan


The logic to figure out the next valid starting block offset is too slow when 
run against S3. 

I have bolded the log message that takes a long time to appear. 

 

 

36589 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log 
file 
HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0}
36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Found corrupted block in file 
HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} with block size(3723305) running past EOF
36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Log 
HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} has a corrupted block at 14
*44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Next available block in* 
HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} starts at 3723319
44566 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
corrupt block in 
s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
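
For reference, the slow step is the forward scan for the next magic header 
after a corrupt block is detected. A simplified sketch of that kind of scan 
(not the exact HoodieLogFileReader code; the `#HUDI#` magic and the method 
shape are assumptions) shows why it is expensive on S3, where every 
seek-plus-small-read can become a ranged GET:

{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FSDataInputStream;

// Simplified sketch: scan forward from `start` looking for the next magic
// marker that begins a valid log block. Each iteration performs a seek plus
// a tiny read, which is cheap on HDFS but very slow against S3.
class LogBlockScanner {
  private static final byte[] MAGIC = "#HUDI#".getBytes();

  static long scanForNextAvailableBlockOffset(FSDataInputStream in, long start, long fileLen)
      throws IOException {
    byte[] buf = new byte[MAGIC.length];
    for (long pos = start; pos + MAGIC.length <= fileLen; pos++) {
      in.seek(pos);
      in.readFully(buf);
      if (Arrays.equals(buf, MAGIC)) {
        return pos; // found the start of the next readable block
      }
    }
    return fileLen; // no further valid block in this file
  }
}
{code}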



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1308:


Assignee: Balaji Varadarajan  (was: Prashant Wason)

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>
> This is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted

2020-10-01 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1309:


Assignee: Prashant Wason

> Listing Metadata unreadable in S3 as the log block is deemed corrupted
> --
>
> Key: HUDI-1309
> URL: https://issues.apache.org/jira/browse/HUDI-1309
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Prashant Wason
>Priority: Major
>
> When running the metadata list-partitions CLI command, I am seeing the below 
> messages and the partition list is empty. I was expecting 10K partitions.
>  
> 36589 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning 
> log file 
> HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0}
> 36590 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block 
> in file 
> HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} with block size(3723305) running past EOF
> 36684 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Log 
> HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} has a corrupted block at 14
> 44515 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block 
> in 
> HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} starts at 3723319
> 44566 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
> corrupt block in 
> s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
> 44567 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-1308:


Assignee: Prashant Wason

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Prashant Wason
>Priority: Major
>
> This is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1309) Listing Metadata unreadable in S3 as the log block is deemed corrupted

2020-10-01 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1309:


 Summary: Listing Metadata unreadable in S3 as the log block is 
deemed corrupted
 Key: HUDI-1309
 URL: https://issues.apache.org/jira/browse/HUDI-1309
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan


When running the metadata list-partitions CLI command, I am seeing the below 
messages and the partition list is empty. I was expecting 10K partitions.

 

36589 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log 
file 
HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0}
36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Found corrupted block in file 
HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} with block size(3723305) running past EOF
36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Log 
HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} has a corrupted block at 14
44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader 
- Next available block in 
HoodieLogFile\{pathStr='s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
 fileLen=0} starts at 3723319
44566 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
corrupt block in 
s3a:///dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
44567 [Spring Shell] INFO 
org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - M



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1308:


 Summary: Issues found during testing RFC-15
 Key: HUDI-1308
 URL: https://issues.apache.org/jira/browse/HUDI-1308
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Balaji Varadarajan


This is an umbrella ticket containing all the issues found during testing RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1308) Issues found during testing RFC-15

2020-10-01 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1308:
-
Status: Open  (was: New)

> Issues found during testing RFC-15
> --
>
> Key: HUDI-1308
> URL: https://issues.apache.org/jira/browse/HUDI-1308
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>
> This is an umbrella ticket containing all the issues found during testing 
> RFC-15



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar commented on issue #2131: [SUPPORT] HUDI with Mongo Oplogs (Debezium)

2020-10-01 Thread GitBox


bvaradar commented on issue #2131:
URL: https://github.com/apache/hudi/issues/2131#issuecomment-702072024


   @tandonraghavs : For compaction, the payload class defined in 
hoodie.properties is used for pre-combining. Can you check which payload class 
is configured in hoodie.properties?
   
   As you would have noticed, compaction first pre-combines all delta records 
before merging them using combineAndGetUpdateValue. So, you would need to 
implement custom merging in the preCombine method of your payload class. 
   
   IIUC, oplogs are partial row images (updates). Wouldn't it be possible to 
employ the same merging logic that you were applying as part of 
combineAndGetUpdateValue?
   
   Also, which version of Hudi are you using? 
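   
   A sketch of what that shared merging logic could look like, factored into a 
helper callable from both `preCombine` and `combineAndGetUpdateValue` (the 
null-means-absent semantics are an assumption about the Debezium partial 
images, not Hudi API):
   
   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericRecord;
   
   // Sketch: overlay a partial row image onto a base record. A null field in
   // the partial image is treated as "not present in this oplog entry", so the
   // base value is kept; adjust if your CDC format encodes deletes differently.
   class PartialImageMerger {
     static GenericRecord merge(GenericRecord base, GenericRecord partial, Schema schema) {
       GenericRecord merged = new GenericData.Record(schema);
       for (Schema.Field field : schema.getFields()) {
         Object updated = partial.get(field.name());
         merged.put(field.name(), updated != null ? updated : base.get(field.name()));
       }
       return merged;
     }
   }
   ```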



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2130: [SUPPORT] Use hive jdbc to excute hudi query failed

2020-10-01 Thread GitBox


bvaradar commented on issue #2130:
URL: https://github.com/apache/hudi/issues/2130#issuecomment-702062546


   @Trevor-zhang : IIUC, this is not specific to Hudi. Did you get the full 
exception stack trace from the server logs? Are you sure this is due to the 
hive.input.format setting?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-701989092


   @wangxianghu Please help test this out if possible. Once the tests pass 
again, I am planning to merge this in the morning PST. 
   
   cc @yanghua 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Karl-WangSK commented on a change in pull request #2106: [HUDI-1284] preCombine all HoodieRecords and update all fields according to orderingVal

2020-10-01 Thread GitBox


Karl-WangSK commented on a change in pull request #2106:
URL: https://github.com/apache/hudi/pull/2106#discussion_r498081682



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##
@@ -186,11 +186,15 @@ protected void rollBackInflightBootstrap() {
* @return JavaRDD[WriteStatus] - RDD of WriteStatus to inspect errors and 
counts
*/
   public JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final 
String instantTime) {
+return upsert(records, instantTime, null);
+  }
+
+  public JavaRDD<WriteStatus> upsert(JavaRDD<HoodieRecord<T>> records, final 
String instantTime, String schema) {

Review comment:
   ok
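   
   A usage sketch of the proposed overload (the `writeClient`, `recordsRdd`, 
and `schemaStr` names are illustrative, not from the PR):
   
   ```java
   // The existing two-arg call now delegates to the new overload with a null
   // schema; callers that know the schema can pass it explicitly.
   JavaRDD<WriteStatus> statuses = writeClient.upsert(recordsRdd, instantTime, schemaStr);
   ```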





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-701963298


   I actually figured out that we can remove `P` altogether, since 
`HoodieIndex#fetchRecordLocation` is not used much outside of internal APIs. So 
I will push a final change for that. Tests are passing now. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-701951988


   > @wangxianghu @yanghua I have rebased this against master. Please take a 
look at my changes.
   > 
   > High level, we could re-use more code, but it needs an abstraction that 
can wrap `RDD` or `DataSet` or `DataStream` adequately and support basic 
operations like `.map()`, `reduceByKey()` etc. We can do this in a second pass 
once we have a working Flink impl. For now this will do.
   > 
   > I am trying to get the tests to pass. if they do, we could go ahead and 
merge
   
   Thanks, @vinothchandar, this is really great work! 
   Yes, as a next step we can add more abstractions for the basic `map`, 
`reduceByKey` methods in `HoodieEngineContext`, or in some util classes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-10-01 Thread GitBox


codecov-commenter commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-701949671


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1827?src=pr&el=h1) Report
   > Merging 
[#1827](https://codecov.io/gh/apache/hudi/pull/1827?src=pr&el=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/a99e93bed542c8ae30a641d1df616cc2cd5798e1?el=desc)
 will **decrease** coverage by `3.75%`.
   > The diff coverage is `30.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1827/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1827?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1827      +/-   ##
   ============================================
   - Coverage     59.89%   56.14%   -3.76%     
   + Complexity     4454     2658    -1796     
   ============================================
     Files           558      324     -234     
     Lines         23378    14775    -8603     
     Branches       2348     1539     -809     
   ============================================
   - Hits          14003     8295    -5708     
   + Misses         8355     5783    -2572     
   + Partials       1020      697     -323     
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | #hudicli | `38.37% <30.00%> (-27.83%)` | `193.00 <0.00> (-1615.00)` | |
   | #hudiclient | `100.00% <ø> (+25.46%)` | `0.00 <ø> (-1615.00)` | :arrow_up: 
|
   | #hudicommon | `54.74% <ø> (ø)` | `1793.00 <ø> (ø)` | |
   | #hudihadoopmr | `?` | `?` | |
   | #hudispark | `67.18% <ø> (-0.02%)` | `311.00 <ø> (ø)` | |
   | #huditimelineservice | `64.43% <ø> (ø)` | `49.00 <ø> (ø)` | |
   | #hudiutilities | `69.43% <ø> (+0.05%)` | `312.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1827?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...rg/apache/hudi/cli/commands/SavepointsCommand.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NhdmVwb2ludHNDb21tYW5kLmphdmE=)
 | `14.28% <0.00%> (ø)` | `3.00 <0.00> (ø)` | |
   | 
[...main/java/org/apache/hudi/cli/utils/SparkUtil.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL3V0aWxzL1NwYXJrVXRpbC5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...n/java/org/apache/hudi/cli/commands/SparkMain.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL1NwYXJrTWFpbi5qYXZh)
 | `6.43% <37.50%> (+0.40%)` | `4.00 <0.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==)
 | `45.36% <0.00%> (ø)` | `21.00% <0.00%> (ø%)` | |
   | 
[...in/scala/org/apache/hudi/HoodieStreamingSink.scala](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3RyZWFtaW5nU2luay5zY2FsYQ==)
 | `24.00% <0.00%> (ø)` | `10.00% <0.00%> (ø%)` | |
   | 
[...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=)
 | `56.20% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | `64.59% <0.00%> (ø)` | `30.00% <0.00%> (ø%)` | |
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `68.16% <0.00%> (ø)` | `39.00% <0.00%> (ø%)` | |
   | 
[.../hudi/async/SparkStreamingAsyncCompactService.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9hc3luYy9TcGFya1N0cmVhbWluZ0FzeW5jQ29tcGFjdFNlcnZpY2UuamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../hudi/internal/HoodieDataSourceInternalWriter.java](https://codecov.io/gh/apache/hudi/pull/1827/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9pbnRlcm5hbC9Ib29kaWVEYXRhU291cmNlSW50ZXJuYWxXcml0ZXIuamF2YQ==)
 | `87.50% <0.00%> (ø)` | `8.00% <0.00%> (ø%)` | |
   | ... and [4