[GitHub] [hudi] prashanthvg89 opened a new issue #2153: [SUPPORT] Failed to delete key: /.hoodie/.temp/20201006182950

2020-10-07 Thread GitBox


prashanthvg89 opened a new issue #2153:
URL: https://github.com/apache/hudi/issues/2153


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   Random "Failed to delete key" error during UPSERT operation in a Spark 
Structured Streaming job even with "hoodie.consistency.check.enabled" set to 
true
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   UNKNOWN - Occurs randomly after the streaming job has been running for a while - sometimes after 15 
hours and other times after about a couple of days
   
   **Expected behavior**
   
   UPSERT should be a straightforward operation; if there were an application bug, we would have hit it 
as soon as the application was launched, but the error appears intermittently. The only resolution so 
far is to restart the job
   
   **Environment Description**
   
   * Hudi version : 0.5.2-incubating
   
   * Spark version : 2.4.4
   
   * Hive version : 2.3.6
   
   * Hadoop version : EMR 5.29.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   This is a simple streaming application which listens to Kinesis stream with 
a batch interval of 15 minutes and updates the Hudi table using MERGE_ON_READ
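
   For reference, a minimal sketch (not the actual job) of how such a micro-batch upsert into a MERGE_ON_READ table on S3 is typically configured with the consistency check enabled; the table name, record key, precombine field and S3 path are illustrative placeholders:

   ```java
   import org.apache.hudi.DataSourceWriteOptions;
   import org.apache.hudi.config.HoodieWriteConfig;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   public class HudiStreamingUpsertSketch {
     // Upserts one 15-minute micro-batch of Kinesis records into a MOR table on S3.
     static void upsertBatch(Dataset<Row> batch) {
       batch.write().format("org.apache.hudi")
           .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL())
           .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL())
           .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "id")   // placeholder record key field
           .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "ts")  // placeholder precombine field
           .option("hoodie.consistency.check.enabled", "true")               // the setting mentioned above
           .option(HoodieWriteConfig.TABLE_NAME, "my_table")                 // placeholder table name
           .mode(SaveMode.Append)
           .save("s3://my-bucket/hudi/my_table");                            // placeholder base path
     }
   }
   ```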
   
   **Stacktrace**
   
   ```Caused by: org.apache.hudi.exception.HoodieIOException: Failed to delete 
key: /.hoodie/.temp/20201006182950
at 
org.apache.hudi.table.HoodieTable.deleteMarkerDir(HoodieTable.java:333)
at 
org.apache.hudi.table.HoodieTable.cleanFailedWrites(HoodieTable.java:409)
at org.apache.hudi.table.HoodieTable.finalizeWrite(HoodieTable.java:315)
at 
org.apache.hudi.table.HoodieMergeOnReadTable.finalizeWrite(HoodieMergeOnReadTable.java:317)
at 
org.apache.hudi.client.AbstractHoodieWriteClient.finalizeWrite(AbstractHoodieWriteClient.java:195)
... 66 more
   Caused by: java.io.IOException: Failed to delete key: 
/.hoodie/.temp/20201006182950
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:767)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:337)
at 
org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.delete(HoodieWrapperFileSystem.java:261)
at 
org.apache.hudi.table.HoodieTable.deleteMarkerDir(HoodieTable.java:330)
... 70 more
   Caused by: java.io.IOException: 1 exceptions thrown from 5 batch deletes
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:390)
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1494)
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:765)
... 73 more
   Caused by: java.io.IOException: MultiObjectDeleteException thrown with 2 
keys in error: 
/.hoodie/.temp/20201006182950/195/2ecca5ce-ba13-4d5a-a2e3-79713261dc49-0_2061-53-37527_20201006182950.marker,
 /.hoodie/.temp/20201006182950/290_$folder$
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:375)
... 75 more
   Caused by: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.MultiObjectDeleteException:
 One or more objects could not be deleted (Service: null; Status Code: 200; 
Error Code: null; Request ID: 457C53995454141D; S3 Extended Request ID: 
NKQEApW06BHPRG5oQBP4RffTd6OZQXOCNl6jurU690Ee+iE3cgbRbbtNPugjqa3qyADj6x5zqBk=), 
S3 Extended Request ID: 
NKQEApW06BHPRG5oQBP4RffTd6OZQXOCNl6jurU690Ee+iE3cgbRbbtNPugjqa3qyADj6x5zqBk=
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2267)
at 
com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:24)
at 
com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:10)
at 
com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:110)
at 
com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:189)
at 
com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
at 
com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:128)
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:370)```
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Updated] (HUDI-1328) Introduce HoodieFlinkEngineContext to hudi-flink-client

2020-10-07 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1328:
--
Summary: Introduce HoodieFlinkEngineContext to hudi-flink-client  (was: 
IntroduceHoodieFlinkEngineContext to hudi-flink-client)

> Introduce HoodieFlinkEngineContext to hudi-flink-client
> ---
>
> Key: HUDI-1328
> URL: https://issues.apache.org/jira/browse/HUDI-1328
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1328) IntroduceHoodieFlinkEngineContext to hudi-flink-client

2020-10-07 Thread wangxianghu (Jira)
wangxianghu created HUDI-1328:
-

 Summary: IntroduceHoodieFlinkEngineContext to hudi-flink-client
 Key: HUDI-1328
 URL: https://issues.apache.org/jira/browse/HUDI-1328
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: wangxianghu
Assignee: wangxianghu
 Fix For: 0.6.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1327) Add Flink implementation of BootstrapSchemaProvider to hudi-flink-client

2020-10-07 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1327:
--
Parent: HUDI-909
Issue Type: Sub-task  (was: Task)

> Add Flink implementation of BootstrapSchemaProvider to hudi-flink-client
> ---
>
> Key: HUDI-1327
> URL: https://issues.apache.org/jira/browse/HUDI-1327
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-io edited a comment on pull request #2152: [HUDI-1326] Added an API to force publish metrics and flush them.

2020-10-07 Thread GitBox


codecov-io edited a comment on pull request #2152:
URL: https://github.com/apache/hudi/pull/2152#issuecomment-705276668


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=h1) Report
   > Merging 
[#2152](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/788d236c443eb4ced819f9305ed8e0460b5984b7?el=desc)
 will **not change** coverage.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2152/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2152      +/-   ##
   =============================================
     Coverage     53.61%    53.61%
     Complexity     2845      2845
   =============================================
     Files           359       359
     Lines         16535     16535
     Branches       1777      1777
   =============================================
     Hits           8866      8866
     Misses         6912      6912
     Partials        757       757
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | #hudicli | `38.37% <ø> (ø)` | `193.00 <ø> (ø)` | |
   | #hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | #hudicommon | `54.74% <ø> (ø)` | `1793.00 <ø> (ø)` | |
   | #hudihadoopmr | `33.05% <ø> (ø)` | `181.00 <ø> (ø)` | |
   | #hudispark | `65.51% <ø> (ø)` | `303.00 <ø> (ø)` | |
   | #huditimelineservice | `62.29% <ø> (ø)` | `50.00 <ø> (ø)` | |
   | #hudiutilities | `69.98% <ø> (ø)` | `325.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io commented on pull request #2152: [HUDI-1326] Added an API to force publish metrics and flush them.

2020-10-07 Thread GitBox


codecov-io commented on pull request #2152:
URL: https://github.com/apache/hudi/pull/2152#issuecomment-705276668


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=h1) Report
   > Merging 
[#2152](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/788d236c443eb4ced819f9305ed8e0460b5984b7?el=desc)
 will **not change** coverage.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2152/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2152      +/-   ##
   =============================================
     Coverage     53.61%    53.61%
     Complexity     2845      2845
   =============================================
     Files           359       359
     Lines         16535     16535
     Branches       1777      1777
   =============================================
     Hits           8866      8866
     Misses         6912      6912
     Partials        757       757
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | #hudicli | `38.37% <ø> (ø)` | `193.00 <ø> (ø)` | |
   | #hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | #hudicommon | `54.74% <ø> (ø)` | `1793.00 <ø> (ø)` | |
   | #hudihadoopmr | `33.05% <ø> (ø)` | `181.00 <ø> (ø)` | |
   | #hudispark | `65.51% <ø> (ø)` | `303.00 <ø> (ø)` | |
   | #huditimelineservice | `62.29% <ø> (ø)` | `50.00 <ø> (ø)` | |
   | #hudiutilities | `69.98% <ø> (ø)` | `325.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1327) Add Flink implementation of BootstrapSchemaProvider to hudi-flink-client

2020-10-07 Thread wangxianghu (Jira)
wangxianghu created HUDI-1327:
-

 Summary: Add Flink implementation of BootstrapSchemaProvider to hudi-flink-client
 Key: HUDI-1327
 URL: https://issues.apache.org/jira/browse/HUDI-1327
 Project: Apache Hudi
  Issue Type: Task
Reporter: wangxianghu
Assignee: wangxianghu
 Fix For: 0.6.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason commented on pull request #2152: [HUDI-1326] Added an API to force publish metrics and flush them.

2020-10-07 Thread GitBox


prashantwason commented on pull request #2152:
URL: https://github.com/apache/hudi/pull/2152#issuecomment-705266279


   @n3nash Can you please check if this satisfies the requirements for 
publishing metrics from Spark streaming too?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1326) Publishing metrics from hudi-test-suite after each job

2020-10-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1326:
-
Labels: pull-request-available  (was: )

> Publishing metrics from hudi-test-suite after each job
> --
>
> Key: HUDI-1326
> URL: https://issues.apache.org/jira/browse/HUDI-1326
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>
> HUDI metrics are published at two stages:
> 1. When an action completed (e.g. HoodieMetrics.updateCommitMetrics)
> 2. When the JVM is shutdown (Metrics.java addShutdownHook)
> hudi-test-suite includes multiple jobs each of which is an action on the 
> table. All these jobs are run in a single JVM. Hence, currently some metrics 
> are not published till the entire hudi-test-suite run is over.
> Enhancement:
> 1. Allow metrics to be published after each job
> 2. Flush metrics so redundant metrics (from the last job) are not published 
> again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason opened a new pull request #2152: [HUDI-1326] Added an API to force publish metrics and flush them.

2020-10-07 Thread GitBox


prashantwason opened a new pull request #2152:
URL: https://github.com/apache/hudi/pull/2152


   Using the added API, metrics are published after each level of the DAG completes 
in hudi-test-suite.
   
   
   ## What is the purpose of the pull request
   
   HUDI metrics are published at two stages:
   1. When an action completed (e.g. HoodieMetrics.updateCommitMetrics)
   2. When the JVM is shutdown (Metrics.java addShutdownHook)
   
   hudi-test-suite includes multiple jobs each of which is an action on the 
table. All these jobs are run in a single JVM. Hence, currently some metrics 
are not published till the entire hudi-test-suite run is over.
   
   Enhancements:
   1. Allow metrics to be published after each job
   2. Flush metrics so redundant metrics (from the last job) are not published 
again.
   
   ## Brief change log
   
   Added Metrics.flush()
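
   For illustration, a rough sketch of how a harness could force-publish after each job using the new API; the Job interface and driver loop are stand-ins, not hudi-test-suite classes, and only Metrics.flush() comes from this change (its package is assumed to be hudi's metrics module):

   ```java
   import org.apache.hudi.metrics.Metrics;

   import java.util.List;

   public class PublishMetricsPerJobSketch {
     interface Job { void run(); }

     static void runAll(List<Job> jobs) {
       for (Job job : jobs) {
         job.run();        // one action on the table (insert, upsert, clean, compact, ...)
         Metrics.flush();  // publish this job's metrics now instead of waiting for the
                           // JVM shutdown hook, and clear them so the next job does not
                           // re-report them
       }
     }
   }
   ```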
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1326) Publishing metrics from hudi-test-suite after each job

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1326:


 Summary: Publishing metrics from hudi-test-suite after each job
 Key: HUDI-1326
 URL: https://issues.apache.org/jira/browse/HUDI-1326
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Prashant Wason
Assignee: Prashant Wason


HUDI metrics are published at two stages:
1. When an action completed (e.g. HoodieMetrics.updateCommitMetrics)
2. When the JVM is shutdown (Metrics.java addShutdownHook)

hudi-test-suite includes multiple jobs each of which is an action on the table. 
All these jobs are run in a single JVM. Hence, currently some metrics are not 
published till the entire hudi-test-suite run is over.

Enhancement:
1. Allow metrics to be published after each job
2. Flush metrics so redundant metrics (from the last job) are not published 
again.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-07 Thread GitBox


prashantwason commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r501383913



##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/MetadataCommand.java
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.utils.SparkUtil;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieMetadata;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * CLI commands to operate on the Metadata Table.
+ */
+@Component
+public class MetadataCommand implements CommandMarker {
+  private JavaSparkContext jsc;
+
+  @CliCommand(value = "metadata set", help = "Set options for Metadata Table")
+  public String set(@CliOption(key = {"metadataDir"},
+  help = "Directory to read/write metadata table (can be different from 
dataset)", unspecifiedDefaultValue = "")
+  final String metadataDir) {
+if (!metadataDir.isEmpty()) {
+  HoodieMetadata.setMetadataBaseDirectory(metadataDir);
+}
+
+return String.format("Ok");
+  }
+
+  @CliCommand(value = "metadata create", help = "Create the Metadata Table if 
it does not exist")
+  public String create() throws IOException {
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+Path metadataPath = new 
Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+try {
+  FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+  if (statuses.length > 0) {
+throw new RuntimeException("Metadata directory (" + 
metadataPath.toString() + ") not empty.");
+  }
+} catch (FileNotFoundException e) {
+  // Metadata directory does not exist yet
+  HoodieCLI.fs.mkdirs(metadataPath);
+}
+
+long t1 = System.currentTimeMillis();
+HoodieWriteConfig writeConfig = 
HoodieWriteConfig.newBuilder().withPath(HoodieCLI.basePath)
+.withUseFileListingMetadata(true).build();
+initJavaSparkContext();
+HoodieMetadata.init(jsc, writeConfig);
+long t2 = System.currentTimeMillis();
+
+return String.format("Created Metadata Table in %s (duration=%.2fsec)", 
metadataPath, (t2 - t1) / 1000.0);
+  }
+
+  @CliCommand(value = "metadata delete", help = "Remove the Metadata Table")
+  public String delete() throws Exception {
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+Path metadataPath = new 
Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+try {
+  FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+  if (statuses.length > 0) {
+HoodieCLI.fs.delete(metadataPath, true);
+  }
+} catch (FileNotFoundException e) {
+  // Metadata directory does not exist
+}
+
+HoodieMetadata.remove(HoodieCLI.basePath);
+
+return String.format("Removed Metdata Table from %s", metadataPath);
+  }
+
+  @CliCommand(value = "metadata init", help = "Update the metadata table from 
commits since the creation")
+  public String init(@CliOption(key = {"readonly"}, unspecifiedDefaultValue = 
"false",
+  help = "Open in read-only mode") final boolean readOnly) throws 
Exception {
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+Path metadataPath = new 
Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+try {
+  FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+} catch (FileNotFoundException e) {
+  // Metadata directory does not exist
+  throw new RuntimeException("Metadata directory (" + 
metadataPath.toString() + ") does not exist.");
+

[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-07 Thread GitBox


prashantwason commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r501383798



##
File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/MetadataCommand.java
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.utils.SparkUtil;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieMetadata;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * CLI commands to operate on the Metadata Table.
+ */
+@Component
+public class MetadataCommand implements CommandMarker {
+  private JavaSparkContext jsc;
+
+  @CliCommand(value = "metadata set", help = "Set options for Metadata Table")
+  public String set(@CliOption(key = {"metadataDir"},
+  help = "Directory to read/write metadata table (can be different from 
dataset)", unspecifiedDefaultValue = "")
+  final String metadataDir) {
+if (!metadataDir.isEmpty()) {
+  HoodieMetadata.setMetadataBaseDirectory(metadataDir);
+}
+
+return String.format("Ok");
+  }
+
+  @CliCommand(value = "metadata create", help = "Create the Metadata Table if 
it does not exist")
+  public String create() throws IOException {
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+Path metadataPath = new 
Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+try {
+  FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+  if (statuses.length > 0) {
+throw new RuntimeException("Metadata directory (" + 
metadataPath.toString() + ") not empty.");
+  }
+} catch (FileNotFoundException e) {
+  // Metadata directory does not exist yet
+  HoodieCLI.fs.mkdirs(metadataPath);
+}
+
+long t1 = System.currentTimeMillis();
+HoodieWriteConfig writeConfig = 
HoodieWriteConfig.newBuilder().withPath(HoodieCLI.basePath)
+.withUseFileListingMetadata(true).build();
+initJavaSparkContext();
+HoodieMetadata.init(jsc, writeConfig);
+long t2 = System.currentTimeMillis();
+
+return String.format("Created Metadata Table in %s (duration=%.2fsec)", 
metadataPath, (t2 - t1) / 1000.0);
+  }
+
+  @CliCommand(value = "metadata delete", help = "Remove the Metadata Table")
+  public String delete() throws Exception {
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+Path metadataPath = new 
Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+try {
+  FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+  if (statuses.length > 0) {
+HoodieCLI.fs.delete(metadataPath, true);
+  }
+} catch (FileNotFoundException e) {
+  // Metadata directory does not exist
+}
+
+HoodieMetadata.remove(HoodieCLI.basePath);
+
+return String.format("Removed Metdata Table from %s", metadataPath);
+  }
+
+  @CliCommand(value = "metadata init", help = "Update the metadata table from 
commits since the creation")
+  public String init(@CliOption(key = {"readonly"}, unspecifiedDefaultValue = 
"false",
+  help = "Open in read-only mode") final boolean readOnly) throws 
Exception {
+HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+Path metadataPath = new 
Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+try {
+  FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+} catch (FileNotFoundException e) {
+  // Metadata directory does not exist
+  throw new RuntimeException("Metadata directory (" + 
metadataPath.toString() + ") does not exist.");
+

[jira] [Commented] (HUDI-1325) Handle the corner case with syncing completed compaction from data timeline to metadata timeline.

2020-10-07 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209938#comment-17209938
 ] 

Prashant Wason commented on HUDI-1325:
--

[~vinoth] One way to handle this can be to merge (in-memory) the changes from 
the compaction instant. 

So on the reader side, we read:
1. From the HFile base file
2. From the log file blocks up to the last-sync instant
3. The in-memory changes

Since this issue only arises for async compaction and async clean operations, it 
should not be too much overhead. Also, this is probably not required for async 
clean, as queries should not care about files which may be async-cleaned anyway.
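
A minimal sketch of the precedence order proposed above, with plain string maps 
standing in for the actual metadata payload records (class and method names here 
are illustrative, not Hudi APIs):

    import java.util.HashMap;
    import java.util.Map;

    class MergedMetadataViewSketch {
      // Start from the HFile base file, overlay log-block records up to the
      // last synced instant, then overlay the not-yet-synced in-memory changes.
      static Map<String, String> mergedView(Map<String, String> baseFileRecords,
                                            Map<String, String> logBlockRecords,
                                            Map<String, String> inMemoryChanges) {
        Map<String, String> view = new HashMap<>(baseFileRecords);
        view.putAll(logBlockRecords);  // later log blocks win over the base file
        view.putAll(inMemoryChanges);  // un-synced compaction changes win over both
        return view;
      }
    }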

> Handle the corner case with syncing completed compaction from data timeline 
> to metadata timeline. 
> --
>
> Key: HUDI-1325
> URL: https://issues.apache.org/jira/browse/HUDI-1325
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Priority: Major
>
> Here is a corner case with syncing completed compaction from data timeline to 
> metadata timeline. Consider the following sequence of events
> t0: writer schedules compaction at time instant c
> t1: Compactor starts processing c's plan
> t2: compaction finishes with c.commit published on the data timeline (not yet 
> synced to metadata timeline)
> t3: Next round of writing, writer opens metadata table, which adds the base 
> file produced in c.commit to metadata table.
> Any queries running between t2 and t3, cannot rely on metadata since the new 
> base file will not be present in metadata table. The timeline will indicate 
> that the compaction completed, and the latest file slice will be computed as 
> simply the logs written to the file groups since compaction. This will lead 
> to incorrect results.
> If we consider just writer alone, we may be okay since we first sync the 
> metadata table before we do anything for the delta commit at t3. But in 
> general for queries, we should advise enabling metadata table based listings 
> only, after all writers/cleaner/compactor have been enabled to use metadata 
> and been successfully using it to publish new/deleted files directly to the 
> metadata table. In short, queries cannot rely on metadata table, with the 
> syncing mechanism as the main thing that keeps data and metadata timelines 
> together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1325) Handle the corner case with syncing completed compaction from data timeline to metadata timeline.

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1325:


 Summary: Handle the corner case with syncing completed compaction 
from data timeline to metadata timeline. 
 Key: HUDI-1325
 URL: https://issues.apache.org/jira/browse/HUDI-1325
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason


Here is a corner case with syncing completed compaction from data timeline to 
metadata timeline. Consider the following sequence of events

t0: writer schedules compaction at time instant c
t1: Compactor starts processing c's plan
t2: compaction finishes with c.commit published on the data timeline (not yet 
synced to metadata timeline)
t3: Next round of writing, writer opens metadata table, which adds the base 
file produced in c.commit to metadata table.

Any queries running between t2 and t3, cannot rely on metadata since the new 
base file will not be present in metadata table. The timeline will indicate 
that the compaction completed, and the latest file slice will be computed as 
simply the logs written to the file groups since compaction. This will lead to 
incorrect results.

If we consider just writer alone, we may be okay since we first sync the 
metadata table before we do anything for the delta commit at t3. But in general 
for queries, we should advise enabling metadata table based listings only, 
after all writers/cleaner/compactor have been enabled to use metadata and been 
successfully using it to publish new/deleted files directly to the metadata 
table. In short, queries cannot rely on metadata table, with the syncing 
mechanism as the main thing that keeps data and metadata timelines together.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-07 Thread GitBox


prashantwason commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r501375567



##
File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
##
@@ -174,8 +174,17 @@ public static String getRelativePartitionPath(Path 
basePath, Path fullPartitionP
   /**
* Obtain all the partition paths, that are present in this table, denoted 
by presence of
* {@link HoodiePartitionMetadata#HOODIE_PARTITION_METAFILE}.
+   *
+   * If the basePathStr is a subdirectory of the .hoodie folder then we assume that the partitions of an internal
+   * table (a hoodie table within the .hoodie directory) are to be obtained.
+   *
+   * @param fs FileSystem instance
+   * @param basePathStr base directory
    */
   public static List<String> getAllFoldersWithPartitionMetaFile(FileSystem fs, String basePathStr) throws IOException {
+    // If the basePathStr is a folder within the .hoodie directory then we are listing partitions within an
+    // internal table.
+    final boolean isInternalTable = basePathStr.contains(HoodieTableMetaClient.METAFOLDER_NAME);

Review comment:
   Done





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1324) FSUtils: rename:isMetadataTable

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1324:


 Summary: FSUtils: rename:isMetadataTable
 Key: HUDI-1324
 URL: https://issues.apache.org/jira/browse/HUDI-1324
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason
Assignee: Prashant Wason






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1323) Fence metadata reads using latest data timeline commit times!

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1323:


 Summary: Fence metadata reads using latest data timeline commit 
times!
 Key: HUDI-1323
 URL: https://issues.apache.org/jira/browse/HUDI-1323
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason


Problem D: We need to fence metadata reads using the latest data timeline commit 
times, and limit the metadata table to only handing out files that belong to a 
committed instant on the data timeline. Otherwise, the metadata table can hand 
uncommitted files to the cleaner etc. and cause us to delete legitimate latest 
file slices, i.e. data loss.
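
A minimal sketch of the fencing described above, with a simplified FileInfo type 
standing in for Hudi's actual file/timeline classes (names here are illustrative, 
not Hudi APIs):

    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    class FencedMetadataReadSketch {
      static class FileInfo {
        final String path;
        final String commitTime;  // instant time that produced this file
        FileInfo(String path, String commitTime) { this.path = path; this.commitTime = commitTime; }
      }

      // Only hand out files whose commit time is a completed instant on the
      // data timeline, so uncommitted files never reach the cleaner or queries.
      static List<FileInfo> fence(List<FileInfo> filesFromMetadataTable, Set<String> completedDataInstants) {
        return filesFromMetadataTable.stream()
            .filter(f -> completedDataInstants.contains(f.commitTime))
            .collect(Collectors.toList());
      }
    }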



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1319) Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer

2020-10-07 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1319:
-
Description: 
Two cases:
1. Async process / pipeline (Handle out-of-process Clean and Compaction)
2. AsyncClean service


NOTE: We need to decide where to log update for async operations. If we do it 
at Async operation completion then we may end up in multi writer scenario.

  was:
Two cases:
1. Async process / pipeline (Handle out-of-process Clean and Compaction)
2. AsyncClean service


> Async Clean and Async Compaction - how will they work with metadata table 
> updates - check multi writer
> --
>
> Key: HUDI-1319
> URL: https://issues.apache.org/jira/browse/HUDI-1319
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Priority: Major
>
> Two cases:
> 1. Async process / pipeline (Handle out-of-process Clean and Compaction)
> 2. AsyncClean service
> NOTE: We need to decide where to log update for async operations. If we do it 
> at Async operation completion then we may end up in multi writer scenario.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-07 Thread GitBox


prashantwason commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r501374106



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java
##
@@ -244,6 +245,10 @@ private HoodieCleanMetadata runClean(HoodieTable table, 
HoodieInstant cleanIn
 inflightInstant = cleanInstant;
   }
 
+  // Update Metadata Table before even finishing clean. This ensures that an async clean operation in the
+  // background does not lead to stale metadata being returned from Metadata Table.
+  HoodieMetadata.update(config, cleanerPlan, cleanInstant.getTimestamp());

Review comment:
   Created HUDI-1319





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-10-07 Thread GitBox


prashantwason commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r501373801



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -121,6 +122,9 @@ public boolean commitStats(String instantTime, 
List stats, Opti
 metadata.setOperationType(operationType);
 
 try {
+  // Update Metadata Table
+  HoodieMetadata.update(config, metadata, instantTime);

Review comment:
   Created HUDI-1320. Will handle there.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1319) Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer

2020-10-07 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-1319:
-
Description: 
Two cases:
1. Async process / pipeline (Handle out-of-process Clean and Compaction)
2. AsyncClean service

  was:
Two cases:
1. Async process / pipeline 
2. AsyncClean service


> Async Clean and Async Compaction - how will they work with metadata table 
> updates - check multi writer
> --
>
> Key: HUDI-1319
> URL: https://issues.apache.org/jira/browse/HUDI-1319
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Priority: Major
>
> Two cases:
> 1. Async process / pipeline (Handle out-of-process Clean and Compaction)
> 2. AsyncClean service



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1322) Refactor into Reader & Writer side for Metadata

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1322:


 Summary: Refactor into Reader & Writer side for Metadata
 Key: HUDI-1322
 URL: https://issues.apache.org/jira/browse/HUDI-1322
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1321) Support properties for metadata table via a properties.file

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1321:


 Summary: Support properties for metadata table via a 
properties.file
 Key: HUDI-1321
 URL: https://issues.apache.org/jira/browse/HUDI-1321
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason


Metadata properties should be in their own namespace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1320) Move static invocations of HoodieMetadata.xxx to HoodieTable

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1320:


 Summary: Move static invocations of HoodieMetadata.xxx to 
HoodieTable
 Key: HUDI-1320
 URL: https://issues.apache.org/jira/browse/HUDI-1320
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason


Also take care to guard against multiple invocations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1319) Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1319:


 Summary: Async Clean and Async Compaction - how will they work 
with metadata table updates - check multi writer
 Key: HUDI-1319
 URL: https://issues.apache.org/jira/browse/HUDI-1319
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason


Two cases:
1. Async process / pipeline 
2. AsyncClean service



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1318) Check if MergedBlockReader will neglect log blocks based on uncommitted commits.

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1318:


 Summary: Check if MergedBlockReader will neglect log blocks based 
on uncommitted commits.
 Key: HUDI-1318
 URL: https://issues.apache.org/jira/browse/HUDI-1318
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason


This is required to ensure that metadata readers always get a consistent view of 
the metadata table.

Case: Metadata table commit completed but dataset commit failed. Before the 
rollover, the readers should get a view which does not include the log block 
written by the last update.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1317) Fix initialization when Async jobs are scheduled - these jobs have older timestamp than INIT timestamp on metadata table

2020-10-07 Thread Prashant Wason (Jira)
Prashant Wason created HUDI-1317:


 Summary: Fix initialization when Async jobs are scheduled - these 
jobs have older timestamp than INIT timestamp on metadata table
 Key: HUDI-1317
 URL: https://issues.apache.org/jira/browse/HUDI-1317
 Project: Apache Hudi
  Issue Type: Sub-task
Reporter: Prashant Wason
Assignee: Prashant Wason






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1316) Enhance logging of configs such as timeline layout, upgrade/downgrade, backwards compatible configs

2020-10-07 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1316:
-

 Summary: Enhance logging of configs such as timeline layout, 
upgrade/downgrade, backwards compatible configs
 Key: HUDI-1316
 URL: https://issues.apache.org/jira/browse/HUDI-1316
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Code Cleanup
Reporter: Nishith Agarwal


Recently we ran into an issue where the code has a bug around backwards 
compatibility for the timeline. We should log these kinds of configs wherever the 
config is exercised in the code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-07 Thread GitBox


ashishmgofficial edited a comment on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-70517


   I found this Error message in the logs : 
   
   ```
INFO DAGScheduler: Job 11 finished: sum at DeltaSync.java:406, took 
0.108824 s
   20/10/07 20:27:07 ERROR DeltaSync: Delta Sync found errors when writing. 
Errors/Total=16/16
   20/10/07 20:27:07 ERROR DeltaSync: Printing out the top 100 errors
   ```
   
   I'm able to see the transformed dataset properly when I sysout the dataset 
from the transformer:
   
   ```
   
+---+-+--++---+--+--+++
   |_op|   _ts_ms|inc_id|year| 
violation_desc|violation_code|case_individual_id|flag|last_modified_ts|
   
+---+-+--++---+--+--+++
   |  r|1601531771650| 1|2016|  DRIVING WHILE INTOXICATED| 
11923|  17475366|   I|160094538000|
   |  r|1601531771650| 3|2016|AGGRAVATED UNLIC OPER 2ND/PREV CONV|
5112A1|  17475367|   U|16009596|
   |  r|1601531771650| 4|2019| AGGRAVATED UNLIC OPER 2ND/PREV|
5112A2|  17475368|   I|15693372|
   |  r|1601531771651| 5|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475569|   I|16013742|
   |  r|1601531771651| 7|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475571|   I|16013754|
   |  r|1601531771651| 2|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
2180F|  17475569|   U|16013772|
   |  r|1601531771651| 6|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
2180F|  17475570|   I|16013748|
   |  r|1601531771651| 8|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475572|   I|16013766|
   |  r|1601531771652| 9|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475573|   I|16013772|
   |  r|1601531771652|10|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180D|  17475574|   I|16013778|
   |  r|1601531771652|11|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180D|  17475574|   I|16013814|
   |  r|1601531771652|12|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475574|   I|16013850|
   |  r|1601531771652|13|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475574|   I|16013886|
   |  r|1601531771653|34|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475574|   I|16013922|
   |  c|1601531913459|35|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475574|   I|16013958|
   |  r|1601532941011| 1|2016|  DRIVING WHILE INTOXICATED| 
11923|  17475366|   I|160094538000|
   |  r|1601532941011| 3|2016|AGGRAVATED UNLIC OPER 2ND/PREV CONV|
5112A1|  17475367|   U|16009596|
   |  r|1601532941012| 4|2019| AGGRAVATED UNLIC OPER 2ND/PREV|
5112A2|  17475368|   I|15693372|
   |  r|1601532941012| 5|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475569|   I|16013742|
   |  r|1601532941012| 7|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475571|   I|16013754|
   
+---+-+--++---+--+--+++
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-07 Thread GitBox


ashishmgofficial commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-70517


   I found this Error message in the logs : 
   
   ```
INFO DAGScheduler: Job 11 finished: sum at DeltaSync.java:406, took 
0.108824 s
   20/10/07 20:27:07 ERROR DeltaSync: Delta Sync found errors when writing. 
Errors/Total=16/16
   20/10/07 20:27:07 ERROR DeltaSync: Printing out the top 100 errors
   ```
   
   I'm able to see the transformed dataset properly when I sysout the dataset:
   
   ```
   
+---+-+--++---+--+--+++
   |_op|   _ts_ms|inc_id|year| 
violation_desc|violation_code|case_individual_id|flag|last_modified_ts|
   
+---+-+--++---+--+--+++
   |  r|1601531771650| 1|2016|  DRIVING WHILE INTOXICATED| 
11923|  17475366|   I|160094538000|
   |  r|1601531771650| 3|2016|AGGRAVATED UNLIC OPER 2ND/PREV CONV|
5112A1|  17475367|   U|16009596|
   |  r|1601531771650| 4|2019| AGGRAVATED UNLIC OPER 2ND/PREV|
5112A2|  17475368|   I|15693372|
   |  r|1601531771651| 5|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475569|   I|16013742|
   |  r|1601531771651| 7|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475571|   I|16013754|
   |  r|1601531771651| 2|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
2180F|  17475569|   U|16013772|
   |  r|1601531771651| 6|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
2180F|  17475570|   I|16013748|
   |  r|1601531771651| 8|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475572|   I|16013766|
   |  r|1601531771652| 9|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475573|   I|16013772|
   |  r|1601531771652|10|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180D|  17475574|   I|16013778|
   |  r|1601531771652|11|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180D|  17475574|   I|16013814|
   |  r|1601531771652|12|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475574|   I|16013850|
   |  r|1601531771652|13|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475574|   I|16013886|
   |  r|1601531771653|34|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475574|   I|16013922|
   |  c|1601531913459|35|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475574|   I|16013958|
   |  r|1601532941011| 1|2016|  DRIVING WHILE INTOXICATED| 
11923|  17475366|   I|160094538000|
   |  r|1601532941011| 3|2016|AGGRAVATED UNLIC OPER 2ND/PREV CONV|
5112A1|  17475367|   U|16009596|
   |  r|1601532941012| 4|2019| AGGRAVATED UNLIC OPER 2ND/PREV|
5112A2|  17475368|   I|15693372|
   |  r|1601532941012| 5|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475569|   I|16013742|
   |  r|1601532941012| 7|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 
1180E|  17475571|   I|16013754|
   
+---+-+--++---+--+--+++
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-07 Thread GitBox


ashishmgofficial commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-705148134


   
![hudi-kafka](https://user-images.githubusercontent.com/40498599/95378569-b880d400-0901-11eb-8f5c-9268b4b3a92f.JPG)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-07 Thread GitBox


ashishmgofficial commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-705121899


   @bvaradar I changed the code back to the previous version and ran the DeltaStreamer, 
but something is causing an error and the data is getting rolled back:
   ``` 
   20/10/07 18:34:18 ERROR Client: Application diagnostics message: User class 
threw exception: org.apache.hudi.exception.HoodieException: Commit 
20201007183359 failed and rolled-back !
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:449)
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:249)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:163)
   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:161)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:466)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)
   
   Exception in thread "main" org.apache.spark.SparkException: Application 
application_1601158208025_9743 finished with failed status
   at org.apache.spark.deploy.yarn.Client.run(Client.scala:1149)
   at 
org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
   at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
   at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
   at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] tandonraghav opened a new issue #2151: [SUPPORT] How to run Periodic Compaction? Multiple Tables - When no Upserts

2020-10-07 Thread GitBox


tandonraghav opened a new issue #2151:
URL: https://github.com/apache/hudi/issues/2151


   I have a use case where Mongo Oplogs are ingested into Kafka Topic via 
Debezium.
   
   These oplogs are from N collections, since Hudi doesn't support multiple 
table insertions in a DataSource.
   
   I am taking these oplogs (Spark + Kafka DStreams), bucketing them based on 
collection names, and then inserting them into Hudi tables (N Hudi tables 
corresponding to N collections).
   
   - AFAIU Hudi does not support async compaction with Spark (not using 
Structured Streaming). So, is there any other way to run async compaction or 
periodic compaction? Is there any tool which can be run periodically to do 
compactions?
   
   - If I am using inline compaction, then how do I compact every 15 minutes or 
periodically if there are no upserts on the table?
   - There can be around 100 tables for oplogs. 

   Code Snippet (Spark Streaming)
   
   
   // Read From Kafka
   List<Row> db_names = df.select("source.ns").distinct().collectAsList();
   
   for (int i = 0; i < db_names.size(); i++) {
       // ...
       Dataset<Row> ds = kafkaDf.select("*").where(kafkaDf.col("source.ns").equalTo(dbName));
       // Few other Transformations
       persistDFInHudi(ds, sanitizedDBName, tablePath);
   }
   
   private void persistDFInHudi(Dataset<Row> ds, String dbName, String tablePath) {
       ds.write().format("org.apache.hudi").
           options(QuickstartUtils.getQuickstartWriteConfigs()).
           option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL()).
           option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL()).
           option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "ts_ms").
           option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_id").
           option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(), MergeHudiPayload.class.getName()).
           option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "db_name").
           option(HoodieWriteConfig.TABLE_NAME, dbName).
           option(HoodieCompactionConfig.INLINE_COMPACT_PROP, true).
           option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), "true").
           option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), dbName).
           option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), "db_name").
           option(DataSourceWriteOptions.DEFAULT_HIVE_ASSUME_DATE_PARTITION_OPT_VAL(), "false").
           option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 1).
           option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE_OPT_KEY(), false).
           option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY(), MultiPartKeysValueExtractor.class.getName()).
           mode(SaveMode.Append).
           save(tablePath);
   }
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-07 Thread GitBox


bvaradar commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-705086014


   @ashishmgofficial: You don't need _hoodie_is_deleted if you are using the 
custom transformer. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-07 Thread GitBox


ashishmgofficial edited a comment on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-704918685


   @bvaradar: Thanks for the code. I followed your instructions but tried to 
add the _is_hoodie_deleted column to the dataset using the following code for testing.
   
   I'm getting the following error with the code mentioned in the post:
   
   ```
   Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
   at scala.Option.foreach(Option.scala:257)
   at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:967)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2264)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2213)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2202)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:778)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1409)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
   at org.apache.spark.rdd.RDD.take(RDD.scala:1382)
   at 
org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1517)
   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1517)
   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1517)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1516)
   at 
org.apache.spark.api.java.JavaRDDLike$class.isEmpty(JavaRDDLike.scala:544)
   at 
org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:349)
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:238)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:163)
   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:161)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:466)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
   at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
   at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: java.lang.RuntimeException: Error while decoding: 
java.lang.NegativeArraySizeException
   c

[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer

2020-10-07 Thread GitBox


ashishmgofficial commented on issue #2149:
URL: https://github.com/apache/hudi/issues/2149#issuecomment-704918685


   @bvaradar : I followed your instructions, but also tried to add the _is_hoodie_deleted column to the dataset using the following code for testing.
   
   I'm getting the following error with the code mentioned in the post:
   
   ```
   Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
   at scala.Option.foreach(Option.scala:257)
   at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:967)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2264)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2213)
   at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2202)
   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
   at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:778)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
   at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1409)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
   at org.apache.spark.rdd.RDD.take(RDD.scala:1382)
   at 
org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1517)
   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1517)
   at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1517)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
   at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
   at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1516)
   at 
org.apache.spark.api.java.JavaRDDLike$class.isEmpty(JavaRDDLike.scala:544)
   at 
org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:349)
   at 
org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:238)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:163)
   at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:161)
   at 
org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:466)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
   at 
org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
   at 
org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
   at 
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: java.lang.RuntimeException: Error while decoding: 
java.lang.NegativeArraySizeException
   createexternalrow(input[0, big

[GitHub] [hudi] nsivabalan merged pull request #2128: [HUDI-1303] Some improvements for the HUDI Test Suite.

2020-10-07 Thread GitBox


nsivabalan merged pull request #2128:
URL: https://github.com/apache/hudi/pull/2128


   







[hudi] branch master updated: [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)

2020-10-07 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 788d236  [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)
788d236 is described below

commit 788d236c443eb4ced819f9305ed8e0460b5984b7
Author: Prashant Wason 
AuthorDate: Wed Oct 7 05:33:51 2020 -0700

[HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)

1. Use the DAG Node's label from the yaml as its name instead of UUID names 
which are not descriptive when debugging issues from logs.
2. Fix CleanNode constructor which is not correctly implemented
3. When generating upserts, allow more granular control over the number of
inserts and upserts - zero or more inserts and upserts can be specified instead
of always requiring both.
4. Fixed generation of records of a specific size (a generic sketch of this
pattern follows the list).
   - The current code was using a class variable "shouldAddMore" which was
reset to false after the first record generation, causing subsequent records to
be of minimum size.
   - In this change, we pre-calculate the extra size of the complex fields.
When generating records, for complex fields we read the field size from this
map.
5. Refresh the timeline of the DeltaSync service before calling
readFromSource. This ensures that only the newest generated data is read and
data generated in the older DAG Nodes is ignored (as their AVRO files will have
an older timestamp).
6. Make --workload-generator-classname an optional parameter, as the default
will most probably be used.
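
A generic sketch of the pre-computed size map described in item 4 (class and
field names are hypothetical, not the actual test-suite code):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: compute the extra padding each complex field needs once,
// then look it up per record instead of relying on a mutable flag.
public class ComplexFieldSizeMap {

  private final Map<String, Integer> extraBytesPerField = new HashMap<>();

  public ComplexFieldSizeMap(Map<String, Integer> baseFieldSizes, int targetRecordSize) {
    // Pre-calculate once so every generated record is padded consistently.
    baseFieldSizes.forEach((field, baseSize) ->
        extraBytesPerField.put(field, Math.max(0, targetRecordSize - baseSize)));
  }

  public int extraBytesFor(String field) {
    // Read the pre-computed size; no per-record state is mutated.
    return extraBytesPerField.getOrDefault(field, 0);
  }
}
```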
---
 docker/compose/hadoop.env  |   3 +-
 .../testsuite/HoodieDeltaStreamerWrapper.java  |   1 +
 .../hudi/integ/testsuite/HoodieTestSuiteJob.java   |   6 +-
 .../integ/testsuite/configuration/DeltaConfig.java |   5 +
 .../apache/hudi/integ/testsuite/dag/DagUtils.java  |  12 +-
 .../hudi/integ/testsuite/dag/nodes/CleanNode.java  |   4 +-
 .../integ/testsuite/generator/DeltaGenerator.java  |  65 +
 .../GenericRecordFullPayloadGenerator.java | 152 ++---
 .../GenericRecordPartialPayloadGenerator.java  |   2 +-
 .../reader/DFSHoodieDatasetInputReader.java|   4 +-
 .../hudi/utilities/deltastreamer/DeltaSync.java|   2 +-
 .../utilities/sources/helpers/DFSPathSelector.java |   1 +
 12 files changed, 134 insertions(+), 123 deletions(-)

diff --git a/docker/compose/hadoop.env b/docker/compose/hadoop.env
index 474c3db..4e8a942 100644
--- a/docker/compose/hadoop.env
+++ b/docker/compose/hadoop.env
@@ -21,12 +21,13 @@ HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive
 HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive
 HIVE_SITE_CONF_datanucleus_autoCreateSchema=false
 HIVE_SITE_CONF_hive_metastore_uris=thrift://hivemetastore:9083
-HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
 
+HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
 HDFS_CONF_dfs_webhdfs_enabled=true
 HDFS_CONF_dfs_permissions_enabled=false
 #HDFS_CONF_dfs_client_use_datanode_hostname=true
 #HDFS_CONF_dfs_namenode_use_datanode_hostname=true
+HDFS_CONF_dfs_replication=1
 
 CORE_CONF_fs_defaultFS=hdfs://namenode:8020
 CORE_CONF_hadoop_http_staticuser_user=root
diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
index 5179e89..6e5027b 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
@@ -66,6 +66,7 @@ public class HoodieDeltaStreamerWrapper extends 
HoodieDeltaStreamer {
 
   public Pair>> 
fetchSource() throws Exception {
 DeltaSync service = deltaSyncService.get().getDeltaSync();
+service.refreshTimeline();
 return service.readFromSource(service.getCommitTimelineOpt());
   }
 
diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
index 2c4b73a..c2c242a 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
@@ -156,8 +156,7 @@ public class HoodieTestSuiteJob {
 public String inputBasePath;
 
 @Parameter(names = {
-"--workload-generator-classname"}, description = "WorkflowDag of 
operations to generate the workload",
-required = true)
+"--workload-generator-classname"}, description = "WorkflowDag of 
operations to generate the workload")
 public String workloadDagGenerator = WorkflowDagGenerator.class.getName();
 
 @Par

[jira] [Commented] (HUDI-1310) Corruption Block Handling too slow in S3

2020-10-07 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209457#comment-17209457
 ] 

Steve Loughran commented on HUDI-1310:
--

how are you enumerating the blocks? listStatus calls? treewalks? 

> Corruption Block Handling too slow in S3
> 
>
> Key: HUDI-1310
> URL: https://issues.apache.org/jira/browse/HUDI-1310
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>
> The logic to figure out the next valid starting block offset is too slow when 
> run in S3. 
> I have bolded the log message that takes a long time to appear. 
>  
>  
> 36589 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning 
> log file 
> HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0}
> 36590 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block 
> in file 
> HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} with block size(3723305) running past EOF
> 36684 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Log 
> HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} has a corrupted block at 14
> *44515 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block 
> in* 
> HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045',
>  fileLen=0} starts at 3723319
> 44566 [Spring Shell] INFO 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a 
> corrupt block in 
> s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045
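
The slow step above is the re-synchronisation scan after a corrupt block is
detected: the reader walks forward looking for the next block magic, and doing
that with tiny reads over an S3 input stream pays a round-trip cost that HDFS
does not. A generic illustration of that access pattern follows (not the actual
HoodieLogFileReader code; the magic value and stream handling are assumptions):

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Illustrative only: find the next occurrence of a block magic by sliding a
// window one byte at a time. Each readByte() is trivial on HDFS but can turn
// into an expensive small read against S3, which is why the scan crawls there.
public class MagicScanSketch {

  private static final byte[] MAGIC = {'#', 'H', 'U', 'D', 'I', '#'}; // assumed marker

  /** Returns the offset of the next magic at or after 'start' (the stream is
   *  assumed to be positioned at 'start'), or -1 if none is found. */
  public static long findNextMagic(InputStream in, long start) throws IOException {
    DataInputStream din = new DataInputStream(in);
    byte[] window = new byte[MAGIC.length];
    try {
      din.readFully(window);                       // initial window
      long pos = start;
      while (!Arrays.equals(window, MAGIC)) {
        System.arraycopy(window, 1, window, 0, MAGIC.length - 1);
        window[MAGIC.length - 1] = din.readByte(); // one byte per iteration
        pos++;
      }
      return pos;
    } catch (EOFException e) {
      return -1;                                   // ran off the end of the file
    }
  }
}
```

Buffering the scan (reading a larger chunk per round trip and searching it in
memory) is the usual way to make this tolerable on object stores.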





[GitHub] [hudi] SteNicholas commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-10-07 Thread GitBox


SteNicholas commented on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-704795298


   @bvaradar @vinothchandar Could you please help to review this pull request?







[GitHub] [hudi] satishkotha commented on pull request #2129: [HUDI-1302] Add support for timestamp field in HiveSync

2020-10-07 Thread GitBox


satishkotha commented on pull request #2129:
URL: https://github.com/apache/hudi/pull/2129#issuecomment-704748964


   @pratyakshsharma @n3nash Please take a look.


