[GitHub] [hudi] prashanthvg89 opened a new issue #2153: [SUPPORT] Failed to delete key: /.hoodie/.temp/20201006182950
prashanthvg89 opened a new issue #2153: URL: https://github.com/apache/hudi/issues/2153

**_Tips before filing an issue_** - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

**Describe the problem you faced**

A random "Failed to delete key" error occurs during the UPSERT operation in a Spark Structured Streaming job, even with "hoodie.consistency.check.enabled" set to true.

**To Reproduce**

Steps to reproduce the behavior: UNKNOWN - it occurs randomly after the streaming job has been running, sometimes for 15 hours and other times for about a couple of days.

**Expected behavior**

UPSERT should be a simple operation; if there were an application bug, we would have hit it as soon as the application was launched, but the error appears intermittently. The only resolution so far is to restart the job.

**Environment Description**

* Hudi version : 0.5.2-incubating
* Spark version : 2.4.4
* Hive version : 2.3.6
* Hadoop version : EMR 5.29.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : no

**Additional context**

This is a simple streaming application which listens to a Kinesis stream with a batch interval of 15 minutes and updates the Hudi table using MERGE_ON_READ.

**Stacktrace**

```
Caused by: org.apache.hudi.exception.HoodieIOException: Failed to delete key: /.hoodie/.temp/20201006182950
  at org.apache.hudi.table.HoodieTable.deleteMarkerDir(HoodieTable.java:333)
  at org.apache.hudi.table.HoodieTable.cleanFailedWrites(HoodieTable.java:409)
  at org.apache.hudi.table.HoodieTable.finalizeWrite(HoodieTable.java:315)
  at org.apache.hudi.table.HoodieMergeOnReadTable.finalizeWrite(HoodieMergeOnReadTable.java:317)
  at org.apache.hudi.client.AbstractHoodieWriteClient.finalizeWrite(AbstractHoodieWriteClient.java:195)
  ... 66 more
Caused by: java.io.IOException: Failed to delete key: /.hoodie/.temp/20201006182950
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:767)
  at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:337)
  at org.apache.hudi.common.io.storage.HoodieWrapperFileSystem.delete(HoodieWrapperFileSystem.java:261)
  at org.apache.hudi.table.HoodieTable.deleteMarkerDir(HoodieTable.java:330)
  ... 70 more
Caused by: java.io.IOException: 1 exceptions thrown from 5 batch deletes
  at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:390)
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1494)
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:765)
  ... 73 more
Caused by: java.io.IOException: MultiObjectDeleteException thrown with 2 keys in error: /.hoodie/.temp/20201006182950/195/2ecca5ce-ba13-4d5a-a2e3-79713261dc49-0_2061-53-37527_20201006182950.marker, /.hoodie/.temp/20201006182950/290_$folder$
  at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:375)
  ... 75 more
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.MultiObjectDeleteException: One or more objects could not be deleted (Service: null; Status Code: 200; Error Code: null; Request ID: 457C53995454141D; S3 Extended Request ID: NKQEApW06BHPRG5oQBP4RffTd6OZQXOCNl6jurU690Ee+iE3cgbRbbtNPugjqa3qyADj6x5zqBk=), S3 Extended Request ID: NKQEApW06BHPRG5oQBP4RffTd6OZQXOCNl6jurU690Ee+iE3cgbRbbtNPugjqa3qyADj6x5zqBk=
  at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.deleteObjects(AmazonS3Client.java:2267)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:24)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.call.DeleteObjectsCall.perform(DeleteObjectsCall.java:10)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:110)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:189)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:184)
  at com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.deleteObjects(AmazonS3LiteClient.java:128)
  at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:370)
```

This is an automated message from the Apache Git Service. To respond to the message,
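For context, the consistency check the reporter refers to is a write-side option. Below is a minimal, illustrative sketch of enabling it on a Spark datasource upsert; the table name, key fields, and S3 path are placeholders and are not taken from the report.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class HudiUpsertExample {
  // Minimal sketch: upsert one micro-batch into a MERGE_ON_READ table on S3 with the
  // consistency check enabled. All values below are placeholders.
  public static void upsert(Dataset<Row> batch) {
    batch.write().format("org.apache.hudi")
        .option("hoodie.table.name", "my_table")                     // placeholder table name
        .option("hoodie.datasource.write.operation", "upsert")
        .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.recordkey.field", "id")     // placeholder record key field
        .option("hoodie.datasource.write.precombine.field", "ts")    // placeholder precombine field
        .option("hoodie.consistency.check.enabled", "true")          // the flag mentioned in the report
        .mode(SaveMode.Append)
        .save("s3://bucket/path/my_table");                          // placeholder base path
  }
}
```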
[jira] [Updated] (HUDI-1328) Introduce HoodieFlinkEngineContext to hudi-flink-client
[ https://issues.apache.org/jira/browse/HUDI-1328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu updated HUDI-1328: -- Summary: Introduce HoodieFlinkEngineContext to hudi-flink-client (was: IntroduceHoodieFlinkEngineContext to hudi-flink-client) > Introduce HoodieFlinkEngineContext to hudi-flink-client > --- > > Key: HUDI-1328 > URL: https://issues.apache.org/jira/browse/HUDI-1328 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: wangxianghu >Assignee: wangxianghu >Priority: Major > Fix For: 0.6.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1328) IntroduceHoodieFlinkEngineContext to hudi-flink-client
wangxianghu created HUDI-1328: - Summary: IntroduceHoodieFlinkEngineContext to hudi-flink-client Key: HUDI-1328 URL: https://issues.apache.org/jira/browse/HUDI-1328 Project: Apache Hudi Issue Type: Sub-task Reporter: wangxianghu Assignee: wangxianghu Fix For: 0.6.1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1327) Add Flink implementation of BootstrapSchemaProvider to hudi-flink-client
[ https://issues.apache.org/jira/browse/HUDI-1327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangxianghu updated HUDI-1327: -- Parent: HUDI-909 Issue Type: Sub-task (was: Task) > Add Flink implementation of BootstrapSchemaProvider to hudi-flink-client > --- > > Key: HUDI-1327 > URL: https://issues.apache.org/jira/browse/HUDI-1327 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: wangxianghu >Assignee: wangxianghu >Priority: Major > Fix For: 0.6.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] codecov-io edited a comment on pull request #2152: [HUDI-1326] Added an API to force publish metrics and flush them.
codecov-io edited a comment on pull request #2152: URL: https://github.com/apache/hudi/pull/2152#issuecomment-705276668

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=h1) Report
> Merging [#2152](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/788d236c443eb4ced819f9305ed8e0460b5984b7?el=desc) will **not change** coverage.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2152/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2152?src=pr&el=tree)

```diff
@@            Coverage Diff            @@
##             master    #2152   +/-   ##
=========================================
  Coverage     53.61%   53.61%
  Complexity     2845     2845
=========================================
  Files           359      359
  Lines         16535    16535
  Branches       1777     1777
=========================================
  Hits           8866     8866
  Misses         6912     6912
  Partials        757      757
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| #hudicli | `38.37% <ø> (ø)` | `193.00 <ø> (ø)` | |
| #hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
| #hudicommon | `54.74% <ø> (ø)` | `1793.00 <ø> (ø)` | |
| #hudihadoopmr | `33.05% <ø> (ø)` | `181.00 <ø> (ø)` | |
| #hudispark | `65.51% <ø> (ø)` | `303.00 <ø> (ø)` | |
| #huditimelineservice | `62.29% <ø> (ø)` | `50.00 <ø> (ø)` | |
| #hudiutilities | `69.98% <ø> (ø)` | `325.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-1327) Add Flink implementation of BootstrapSchemaProvider to hudi-flink-client
wangxianghu created HUDI-1327: - Summary: Add Flink implementation of BootstrapSchemaProvider to hudi-flink-client Key: HUDI-1327 URL: https://issues.apache.org/jira/browse/HUDI-1327 Project: Apache Hudi Issue Type: Task Reporter: wangxianghu Assignee: wangxianghu Fix For: 0.6.1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] prashantwason commented on pull request #2152: [HUDI-1326] Added an API to force publish metrics and flush them.
prashantwason commented on pull request #2152: URL: https://github.com/apache/hudi/pull/2152#issuecomment-705266279 @n3nash Can you please check if this satisfies the requirement of publishing metrics from Spark Streaming too? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1326) Publishing metrics from hudi-test-suite after each job
[ https://issues.apache.org/jira/browse/HUDI-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1326: - Labels: pull-request-available (was: ) > Publishing metrics from hudi-test-suite after each job > -- > > Key: HUDI-1326 > URL: https://issues.apache.org/jira/browse/HUDI-1326 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Prashant Wason >Assignee: Prashant Wason >Priority: Minor > Labels: pull-request-available > > HUDI metrics are published at two stages: > 1. When an action completed (e.g. HoodieMetrics.updateCommitMetrics) > 2. When the JVM is shutdown (Metrics.java addShutdownHook) > hudi-test-suite includes multiple jobs each of which is an action on the > table. All these jobs are run in a single JVM. Hence, currently some metrics > are not published till the entire hudi-test-suite run is over. > Enhancement: > 1. Allow metrics to be published after each job > 2. Flush metrics so redundant metrics (from the last job) are not published > again. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] prashantwason opened a new pull request #2152: [HUDI-1326] Added an API to force publish metrics and flush them.
prashantwason opened a new pull request #2152: URL: https://github.com/apache/hudi/pull/2152 Using the added API, publish metrics after each level of the DAG completed in hudi-test-suite. ## What is the purpose of the pull request HUDI metrics are published at two stages: 1. When an action completed (e.g. HoodieMetrics.updateCommitMetrics) 2. When the JVM is shutdown (Metrics.java addShutdownHook) hudi-test-suite includes multiple jobs each of which is an action on the table. All these jobs are run in a single JVM. Hence, currently some metrics are not published till the entire hudi-test-suite run is over. Enhancements: 1. Allow metrics to be published after each job 2. Flush metrics so redundant metrics (from the last job) are not published again. ## Brief change log Added Metrics.flush() ## Verify this pull request This pull request is a trivial rework / code cleanup without any test coverage. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
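For readers of the archive, here is a rough sketch of how a flush-style API like the one this PR describes might be invoked between jobs that share a single JVM. Apart from `Metrics.flush()`, which the PR says it adds, the surrounding names are hypothetical and are not part of the patch.

```java
public class TestSuiteRunner {
  // Illustrative only: run each DAG level (one Hudi action per level) and then force the
  // metrics collected so far to be published and cleared, so metrics from job N are not
  // re-reported (or lost) when job N+1 runs in the same JVM. The exact location/signature
  // of the flush API may differ from this sketch.
  public static void runAllLevels(java.util.List<Runnable> dagLevels) {
    for (Runnable level : dagLevels) {
      level.run();                                // execute one level of the DAG
      org.apache.hudi.metrics.Metrics.flush();    // publish now, then clear already-reported metrics
    }
  }
}
```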
[jira] [Created] (HUDI-1326) Publishing metrics from hudi-test-suite after each job
Prashant Wason created HUDI-1326: Summary: Publishing metrics from hudi-test-suite after each job Key: HUDI-1326 URL: https://issues.apache.org/jira/browse/HUDI-1326 Project: Apache Hudi Issue Type: Improvement Reporter: Prashant Wason Assignee: Prashant Wason HUDI metrics are published at two stages: 1. When an action completed (e.g. HoodieMetrics.updateCommitMetrics) 2. When the JVM is shutdown (Metrics.java addShutdownHook) hudi-test-suite includes multiple jobs each of which is an action on the table. All these jobs are run in a single JVM. Hence, currently some metrics are not published till the entire hudi-test-suite run is over. Enhancement: 1. Allow metrics to be published after each job 2. Flush metrics so redundant metrics (from the last job) are not published again. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason commented on a change in pull request #2064: URL: https://github.com/apache/hudi/pull/2064#discussion_r501383913 ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/MetadataCommand.java ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.cli.commands; + +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.Path; +import org.apache.hudi.cli.HoodieCLI; +import org.apache.hudi.cli.utils.SparkUtil; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.metadata.HoodieMetadata; +import org.apache.spark.api.java.JavaSparkContext; +import org.springframework.shell.core.CommandMarker; +import org.springframework.shell.core.annotation.CliCommand; +import org.springframework.shell.core.annotation.CliOption; +import org.springframework.stereotype.Component; + +import java.io.FileNotFoundException; +import java.io.IOException; +import java.util.Arrays; +import java.util.List; +import java.util.Map; + +/** + * CLI commands to operate on the Metadata Table. 
+ */ +@Component +public class MetadataCommand implements CommandMarker { + private JavaSparkContext jsc; + + @CliCommand(value = "metadata set", help = "Set options for Metadata Table") + public String set(@CliOption(key = {"metadataDir"}, + help = "Directory to read/write metadata table (can be different from dataset)", unspecifiedDefaultValue = "") + final String metadataDir) { +if (!metadataDir.isEmpty()) { + HoodieMetadata.setMetadataBaseDirectory(metadataDir); +} + +return String.format("Ok"); + } + + @CliCommand(value = "metadata create", help = "Create the Metadata Table if it does not exist") + public String create() throws IOException { +HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient(); +Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath)); +try { + FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath); + if (statuses.length > 0) { +throw new RuntimeException("Metadata directory (" + metadataPath.toString() + ") not empty."); + } +} catch (FileNotFoundException e) { + // Metadata directory does not exist yet + HoodieCLI.fs.mkdirs(metadataPath); +} + +long t1 = System.currentTimeMillis(); +HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().withPath(HoodieCLI.basePath) +.withUseFileListingMetadata(true).build(); +initJavaSparkContext(); +HoodieMetadata.init(jsc, writeConfig); +long t2 = System.currentTimeMillis(); + +return String.format("Created Metadata Table in %s (duration=%.2fsec)", metadataPath, (t2 - t1) / 1000.0); + } + + @CliCommand(value = "metadata delete", help = "Remove the Metadata Table") + public String delete() throws Exception { +HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient(); +Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath)); +try { + FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath); + if (statuses.length > 0) { +HoodieCLI.fs.delete(metadataPath, true); + } +} catch (FileNotFoundException e) { + // Metadata directory does not exist +} + +HoodieMetadata.remove(HoodieCLI.basePath); + +return String.format("Removed Metdata Table from %s", metadataPath); + } + + @CliCommand(value = "metadata init", help = "Update the metadata table from commits since the creation") + public String init(@CliOption(key = {"readonly"}, unspecifiedDefaultValue = "false", + help = "Open in read-only mode") final boolean readOnly) throws Exception { +HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient(); +Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath)); +try { + FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath); +} catch (FileNotFoundException e) { + // Metadata directory does not exist + throw new RuntimeException("Metadata directory (" + metadataPath.toString() + ") does not exist."); +
[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason commented on a change in pull request #2064: URL: https://github.com/apache/hudi/pull/2064#discussion_r501383798 ## File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/MetadataCommand.java ## @@ -0,0 +1,202 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.cli.commands; + +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.Path; +import org.apache.hudi.cli.HoodieCLI; +import org.apache.hudi.cli.utils.SparkUtil; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.metadata.HoodieMetadata; +import org.apache.spark.api.java.JavaSparkContext; +import org.springframework.shell.core.CommandMarker; +import org.springframework.shell.core.annotation.CliCommand; +import org.springframework.shell.core.annotation.CliOption; +import org.springframework.stereotype.Component; + +import java.io.FileNotFoundException; +import java.io.IOException; +import java.util.Arrays; +import java.util.List; +import java.util.Map; + +/** + * CLI commands to operate on the Metadata Table. 
+ */ +@Component +public class MetadataCommand implements CommandMarker { + private JavaSparkContext jsc; + + @CliCommand(value = "metadata set", help = "Set options for Metadata Table") + public String set(@CliOption(key = {"metadataDir"}, + help = "Directory to read/write metadata table (can be different from dataset)", unspecifiedDefaultValue = "") + final String metadataDir) { +if (!metadataDir.isEmpty()) { + HoodieMetadata.setMetadataBaseDirectory(metadataDir); +} + +return String.format("Ok"); + } + + @CliCommand(value = "metadata create", help = "Create the Metadata Table if it does not exist") + public String create() throws IOException { +HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient(); +Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath)); +try { + FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath); + if (statuses.length > 0) { +throw new RuntimeException("Metadata directory (" + metadataPath.toString() + ") not empty."); + } +} catch (FileNotFoundException e) { + // Metadata directory does not exist yet + HoodieCLI.fs.mkdirs(metadataPath); +} + +long t1 = System.currentTimeMillis(); +HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().withPath(HoodieCLI.basePath) +.withUseFileListingMetadata(true).build(); +initJavaSparkContext(); +HoodieMetadata.init(jsc, writeConfig); +long t2 = System.currentTimeMillis(); + +return String.format("Created Metadata Table in %s (duration=%.2fsec)", metadataPath, (t2 - t1) / 1000.0); + } + + @CliCommand(value = "metadata delete", help = "Remove the Metadata Table") + public String delete() throws Exception { +HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient(); +Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath)); +try { + FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath); + if (statuses.length > 0) { +HoodieCLI.fs.delete(metadataPath, true); + } +} catch (FileNotFoundException e) { + // Metadata directory does not exist +} + +HoodieMetadata.remove(HoodieCLI.basePath); + +return String.format("Removed Metdata Table from %s", metadataPath); + } + + @CliCommand(value = "metadata init", help = "Update the metadata table from commits since the creation") + public String init(@CliOption(key = {"readonly"}, unspecifiedDefaultValue = "false", + help = "Open in read-only mode") final boolean readOnly) throws Exception { +HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient(); +Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath)); +try { + FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath); +} catch (FileNotFoundException e) { + // Metadata directory does not exist + throw new RuntimeException("Metadata directory (" + metadataPath.toString() + ") does not exist."); +
[jira] [Commented] (HUDI-1325) Handle the corner case with syncing completed compaction from data timeline to metadata timeline.
[ https://issues.apache.org/jira/browse/HUDI-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209938#comment-17209938 ] Prashant Wason commented on HUDI-1325: -- [~vinoth] One way to handle this can be to merge (in-memory) the changes from the compaction instant. So on the reader side, we read: 1. From the hfile base file 2. From the log file blocks upto the last-sync instant 3. The in-memory changes Since this issue only arises for async compaction and async clean operation, it should not be too much overhead. Also, this is probably not required for async clean as queries should not care about files which may be asnc cleaned anyways. > Handle the corner case with syncing completed compaction from data timeline > to metadata timeline. > -- > > Key: HUDI-1325 > URL: https://issues.apache.org/jira/browse/HUDI-1325 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Prashant Wason >Priority: Major > > Here is a corner case with syncing completed compaction from data timeline to > metadata timeline. Consider the following sequence of events > t0: writer schedules compaction at time instant c > t1: Compactor starts processing c's plan > t2: compaction finishes with c.commit published on the data timeline (not yet > synced to metadata timeline) > t3: Next round of writing, writer opens metadata table, which adds the base > file produced in c.commit to metadata table. > Any queries running between t2 and t3, cannot rely on metadata since the new > base file will not be present in metadata table. The timeline will indicate > that the compaction completed, and the latest file slice will be computed as > simply the logs written to the file groups since compaction. This will lead > to incorrect results. > If we consider just writer alone, we may be okay since we first sync the > metadata table before we do anything for the delta commit at t3. But in > general for queries, we should advise enabling metadata table based listings > only, after all writers/cleaner/compactor have been enabled to use metadata > and been successfully using it to publish new/deleted files directly to the > metadata table. In short, queries cannot rely on metadata table, with the > syncing mechanism as the main thing that keeps data and metadata timelines > together. -- This message was sent by Atlassian Jira (v8.3.4#803005)
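To make the read path proposed in the comment above concrete, here is a small self-contained sketch of the merge order (base file, then log blocks up to the last synced instant, then the in-memory changes from the not-yet-synced compaction instant). Every method in it is a hypothetical stand-in for illustration, not a Hudi API.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataReadSketch {
  // Hypothetical stand-ins for the three sources described in the comment.
  static Map<String, String> readHFileBaseFile()                 { return new HashMap<>(); }
  static Map<String, String> readLogBlocksUpTo(String instant)   { return new HashMap<>(); }
  static Map<String, String> inMemoryChangesFrom(String instant) { return new HashMap<>(); }

  // Later sources overwrite earlier ones, so the un-synced compaction results win.
  static Map<String, String> mergedView(String lastSyncedInstant, String compactionInstant) {
    Map<String, String> view = new HashMap<>(readHFileBaseFile());   // 1. HFile base file
    view.putAll(readLogBlocksUpTo(lastSyncedInstant));               // 2. log blocks up to last sync
    view.putAll(inMemoryChangesFrom(compactionInstant));             // 3. in-memory, un-synced changes
    return view;
  }
}
```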
[jira] [Created] (HUDI-1325) Handle the corner case with syncing completed compaction from data timeline to metadata timeline.
Prashant Wason created HUDI-1325: Summary: Handle the corner case with syncing completed compaction from data timeline to metadata timeline. Key: HUDI-1325 URL: https://issues.apache.org/jira/browse/HUDI-1325 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason Here is a corner case with syncing completed compaction from data timeline to metadata timeline. Consider the following sequence of events t0: writer schedules compaction at time instant c t1: Compactor starts processing c's plan t2: compaction finishes with c.commit published on the data timeline (not yet synced to metadata timeline) t3: Next round of writing, writer opens metadata table, which adds the base file produced in c.commit to metadata table. Any queries running between t2 and t3, cannot rely on metadata since the new base file will not be present in metadata table. The timeline will indicate that the compaction completed, and the latest file slice will be computed as simply the logs written to the file groups since compaction. This will lead to incorrect results. If we consider just writer alone, we may be okay since we first sync the metadata table before we do anything for the delta commit at t3. But in general for queries, we should advise enabling metadata table based listings only, after all writers/cleaner/compactor have been enabled to use metadata and been successfully using it to publish new/deleted files directly to the metadata table. In short, queries cannot rely on metadata table, with the syncing mechanism as the main thing that keeps data and metadata timelines together. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason commented on a change in pull request #2064: URL: https://github.com/apache/hudi/pull/2064#discussion_r501375567 ## File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java ## @@ -174,8 +174,17 @@ public static String getRelativePartitionPath(Path basePath, Path fullPartitionP /** * Obtain all the partition paths, that are present in this table, denoted by presence of * {@link HoodiePartitionMetadata#HOODIE_PARTITION_METAFILE}. + * + * If thee basePathStr is a subdirectory of .hoodie folder then we assume that the partitions of an internal + * table (a hoodie table within the .hoodie directory) are to be obtained. + * + * @param fs FileSystem instance + * @param basePathStr base directory */ public static List getAllFoldersWithPartitionMetaFile(FileSystem fs, String basePathStr) throws IOException { +// If the basePathStr is a folder within the .hoodie directory then we are listing partitions within an +// internal table. +final boolean isInternalTable = basePathStr.contains(HoodieTableMetaClient.METAFOLDER_NAME); Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
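The review comment above concerns detecting, from the base path alone, whether partitions are being listed for an "internal" table that lives under the `.hoodie` folder. A minimal sketch of that kind of check is shown below; it assumes the standard `.hoodie` metafolder name and is not the actual patch.

```java
import org.apache.hadoop.fs.Path;

public class InternalTableCheck {
  private static final String METAFOLDER_NAME = ".hoodie";

  // A base path that has the .hoodie metafolder as one of its components is treated as an
  // internal table (e.g. a metadata table located under <base>/.hoodie/...). Walking the
  // path components avoids false positives from a plain substring match.
  public static boolean isInternalTable(String basePathStr) {
    for (Path p = new Path(basePathStr); p != null; p = p.getParent()) {
      if (METAFOLDER_NAME.equals(p.getName())) {
        return true;
      }
    }
    return false;
  }
}
```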
[jira] [Created] (HUDI-1324) FSUtils: rename:isMetadataTable
Prashant Wason created HUDI-1324: Summary: FSUtils: rename:isMetadataTable Key: HUDI-1324 URL: https://issues.apache.org/jira/browse/HUDI-1324 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason Assignee: Prashant Wason -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1323) Fence metadata reads using latest data timeline commit times!
Prashant Wason created HUDI-1323: Summary: Fence metadata reads using latest data timeline commit times! Key: HUDI-1323 URL: https://issues.apache.org/jira/browse/HUDI-1323 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason Problem D: We need to fence metadata reads using latest data timeline commit times! and limit to only handing out files that belong to a committed instant on the data timeline. Otherwise, metadata table can hand uncommitted files to cleaner etc and cause us to delete legit latest file slices i.e data loss -- This message was sent by Atlassian Jira (v8.3.4#803005)
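HUDI-1323 asks that the metadata table only hand out files belonging to instants that are committed on the data timeline, so uncommitted files never reach the cleaner. A rough sketch of that fencing idea follows; the commit-time extraction and the `committedInstants` set are assumptions for illustration, not Hudi's actual implementation.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class MetadataFencingSketch {
  // Keep only files whose embedded commit time is present on the completed data timeline.
  public static List<String> fence(List<String> filesFromMetadata, Set<String> committedInstants) {
    return filesFromMetadata.stream()
        .filter(f -> committedInstants.contains(extractCommitTime(f)))
        .collect(Collectors.toList());
  }

  // Hypothetical helper: assumes the commit time is the last underscore-separated token
  // before the file extension, as in Hudi base file names.
  private static String extractCommitTime(String fileName) {
    int dot = fileName.lastIndexOf('.');
    String base = dot >= 0 ? fileName.substring(0, dot) : fileName;
    return base.substring(base.lastIndexOf('_') + 1);
  }
}
```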
[jira] [Updated] (HUDI-1319) Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer
[ https://issues.apache.org/jira/browse/HUDI-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1319: - Description: Two cases: 1. Async process / pipeline (Handle out-of-process Clean and Compaction) 2. AsyncClean service NOTE: We need to decide where to log update for async operations. If we do it at Async operation completion then we may end up in multi writer scenario. was: Two cases: 1. Async process / pipeline (Handle out-of-process Clean and Compaction) 2. AsyncClean service > Async Clean and Async Compaction - how will they work with metadata table > updates - check multi writer > -- > > Key: HUDI-1319 > URL: https://issues.apache.org/jira/browse/HUDI-1319 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Prashant Wason >Priority: Major > > Two cases: > 1. Async process / pipeline (Handle out-of-process Clean and Compaction) > 2. AsyncClean service > NOTE: We need to decide where to log update for async operations. If we do it > at Async operation completion then we may end up in multi writer scenario. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason commented on a change in pull request #2064: URL: https://github.com/apache/hudi/pull/2064#discussion_r501374106 ## File path: hudi-client/src/main/java/org/apache/hudi/table/action/clean/CleanActionExecutor.java ## @@ -244,6 +245,10 @@ private HoodieCleanMetadata runClean(HoodieTable table, HoodieInstant cleanIn inflightInstant = cleanInstant; } + // Update Metadata Table before even finishing clean. This ensures that an async clean operation in the + // background does not lead to stale metadata being returned from Metadata Table. + HoodieMetadata.update(config, cleanerPlan, cleanInstant.getTimestamp()); Review comment: Created HUDI-1319 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] prashantwason commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.
prashantwason commented on a change in pull request #2064: URL: https://github.com/apache/hudi/pull/2064#discussion_r501373801 ## File path: hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java ## @@ -121,6 +122,9 @@ public boolean commitStats(String instantTime, List stats, Opti metadata.setOperationType(operationType); try { + // Update Metadata Table + HoodieMetadata.update(config, metadata, instantTime); Review comment: Created HUDI-1320. Will handle there. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1319) Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer
[ https://issues.apache.org/jira/browse/HUDI-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prashant Wason updated HUDI-1319: - Description: Two cases: 1. Async process / pipeline (Handle out-of-process Clean and Compaction) 2. AsyncClean service was: Two cases: 1. Async process / pipeline 2. AsyncClean service > Async Clean and Async Compaction - how will they work with metadata table > updates - check multi writer > -- > > Key: HUDI-1319 > URL: https://issues.apache.org/jira/browse/HUDI-1319 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Prashant Wason >Priority: Major > > Two cases: > 1. Async process / pipeline (Handle out-of-process Clean and Compaction) > 2. AsyncClean service -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1322) Refactor into Reader & Writer side for Metadata
Prashant Wason created HUDI-1322: Summary: Refactor into Reader & Writer side for Metadata Key: HUDI-1322 URL: https://issues.apache.org/jira/browse/HUDI-1322 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1321) Support properties for metadata table via a properties.file
Prashant Wason created HUDI-1321: Summary: Support properties for metadata table via a properties.file Key: HUDI-1321 URL: https://issues.apache.org/jira/browse/HUDI-1321 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason metadata properties should be in its own namespace -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1320) Move static invocations of HoodieMetadata.xxx to HoodieTable
Prashant Wason created HUDI-1320: Summary: Move static invocations of HoodieMetadata.xxx to HoodieTable Key: HUDI-1320 URL: https://issues.apache.org/jira/browse/HUDI-1320 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason Also take care to guard against multi invocations -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1319) Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer
Prashant Wason created HUDI-1319: Summary: Async Clean and Async Compaction - how will they work with metadata table updates - check multi writer Key: HUDI-1319 URL: https://issues.apache.org/jira/browse/HUDI-1319 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason Two cases: 1. Async process / pipeline 2. AsyncClean service -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1318) Check if MergedBlockReader will neglect log blocks based on uncommitted commits.
Prashant Wason created HUDI-1318: Summary: Check if MergedBlockReader will neglect log blocks based on uncommitted commits. Key: HUDI-1318 URL: https://issues.apache.org/jira/browse/HUDI-1318 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason This is required to ensure that metadata readers always get a consistent view of the metadata table. Case: Metadata table commit completed but dataset commit failed. Before the rollover, the readers should get a view which does not include the log block written by the last update. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1317) Fix initialization when Async jobs are scheduled - these jobs have older timestamp than INIT timestamp on metadata table
Prashant Wason created HUDI-1317: Summary: Fix initialization when Async jobs are scheduled - these jobs have older timestamp than INIT timestamp on metadata table Key: HUDI-1317 URL: https://issues.apache.org/jira/browse/HUDI-1317 Project: Apache Hudi Issue Type: Sub-task Reporter: Prashant Wason Assignee: Prashant Wason -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1316) Enhance logging of configs such as timeline layout, upgrade/downgrade, backwards compatible configs
Nishith Agarwal created HUDI-1316: - Summary: Enhance logging of configs such as timeline layout, upgrade/downgrade, backwards compatible configs Key: HUDI-1316 URL: https://issues.apache.org/jira/browse/HUDI-1316 Project: Apache Hudi Issue Type: Improvement Components: Code Cleanup Reporter: Nishith Agarwal Recently we ran into an issue where the code had a bug around backwards compatibility for the timeline. We should log these kinds of configs wherever they are exercised in the code. -- This message was sent by Atlassian Jira (v8.3.4#803005)
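As a small illustration of the kind of logging the issue asks for, the sketch below logs the effective value of a backwards-compatibility-sensitive config at the point where it changes behaviour. It assumes a log4j 1.x logger (what Hudi used at the time) and uses the timeline layout version as the example; the property key shown is only indicative.

```java
import org.apache.log4j.LogManager;
import org.apache.log4j.Logger;

public class TimelineLayoutLoader {
  private static final Logger LOG = LogManager.getLogger(TimelineLayoutLoader.class);

  // Indicative key: whatever property selects the timeline layout version in a given release.
  public static int resolveLayoutVersion(java.util.Properties props) {
    int layoutVersion = Integer.parseInt(props.getProperty("hoodie.timeline.layout.version", "1"));
    // Log the exercised value so backwards-compatibility issues are easy to diagnose from logs.
    LOG.info("Using timeline layout version " + layoutVersion);
    return layoutVersion;
  }
}
```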
[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer
ashishmgofficial edited a comment on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-70517 I found this Error message in the logs : ``` INFO DAGScheduler: Job 11 finished: sum at DeltaSync.java:406, took 0.108824 s 20/10/07 20:27:07 ERROR DeltaSync: Delta Sync found errors when writing. Errors/Total=16/16 20/10/07 20:27:07 ERROR DeltaSync: Printing out the top 100 errors ``` Im able to see the transformed dataset properly when i sysout the dataset from transformer : ``` +---+-+--++---+--+--+++ |_op| _ts_ms|inc_id|year| violation_desc|violation_code|case_individual_id|flag|last_modified_ts| +---+-+--++---+--+--+++ | r|1601531771650| 1|2016| DRIVING WHILE INTOXICATED| 11923| 17475366| I|160094538000| | r|1601531771650| 3|2016|AGGRAVATED UNLIC OPER 2ND/PREV CONV| 5112A1| 17475367| U|16009596| | r|1601531771650| 4|2019| AGGRAVATED UNLIC OPER 2ND/PREV| 5112A2| 17475368| I|15693372| | r|1601531771651| 5|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475569| I|16013742| | r|1601531771651| 7|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475571| I|16013754| | r|1601531771651| 2|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 2180F| 17475569| U|16013772| | r|1601531771651| 6|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 2180F| 17475570| I|16013748| | r|1601531771651| 8|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475572| I|16013766| | r|1601531771652| 9|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475573| I|16013772| | r|1601531771652|10|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180D| 17475574| I|16013778| | r|1601531771652|11|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180D| 17475574| I|16013814| | r|1601531771652|12|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475574| I|16013850| | r|1601531771652|13|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475574| I|16013886| | r|1601531771653|34|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475574| I|16013922| | c|1601531913459|35|2020| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475574| I|16013958| | r|1601532941011| 1|2016| DRIVING WHILE INTOXICATED| 11923| 17475366| I|160094538000| | r|1601532941011| 3|2016|AGGRAVATED UNLIC OPER 2ND/PREV CONV| 5112A1| 17475367| U|16009596| | r|1601532941012| 4|2019| AGGRAVATED UNLIC OPER 2ND/PREV| 5112A2| 17475368| I|15693372| | r|1601532941012| 5|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475569| I|16013742| | r|1601532941012| 7|2018| UNREASONABLE SPEED/SPECIAL HAZARDS| 1180E| 17475571| I|16013754| +---+-+--++---+--+--+++ ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer
ashishmgofficial commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-705148134 ![hudi-kafka](https://user-images.githubusercontent.com/40498599/95378569-b880d400-0901-11eb-8f5c-9268b4b3a92f.JPG) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer
ashishmgofficial commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-705121899 @bvaradar I changed the code to as previous and ran the deltastreamer . But some reason is causing error and data is getting rolled back : ``` 20/10/07 18:34:18 ERROR Client: Application diagnostics message: User class threw exception: org.apache.hudi.exception.HoodieException: Commit 20201007183359 failed and rolled-back ! at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:449) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:249) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:163) at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:161) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:466) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685) Exception in thread "main" org.apache.spark.SparkException: Application application_1601158208025_9743 finished with failed status at org.apache.spark.deploy.yarn.Client.run(Client.scala:1149) at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] tandonraghav opened a new issue #2151: [SUPPORT] How to run Periodic Compaction? Multiple Tables - When no Upserts
tandonraghav opened a new issue #2151: URL: https://github.com/apache/hudi/issues/2151

I have a use case where MongoDB oplogs are ingested into a Kafka topic via Debezium. These oplogs come from N collections, and since Hudi doesn't support writing multiple tables from a single DataSource write, I am taking these oplogs (Spark + Kafka DStreams), bucketing them by collection name, and then inserting them into N Hudi tables (one per collection).

- AFAIU, Hudi does not support async compaction with Spark when not using Structured Streaming. So, is there any other way to run async or periodic compaction? Is there any tool which can be run periodically to do compactions?
- If I am using inline compaction, how do I compact every 15 minutes, or periodically, if there are no upserts on the table?
- There can be around 100 tables for oplogs.

Code snippet (Spark Streaming):

```java
// Read from Kafka
List<Row> db_names = df.select("source.ns").distinct().collectAsList();
for (int i = 0; i < db_names.size(); i++) {
    // dbName / sanitizedDBName / tablePath are derived per collection (elided in the original post)
    Dataset<Row> ds = kafkaDf.select("*").where(kafkaDf.col("source.ns").equalTo(dbName));
    // Few other Transformations
    persistDFInHudi(ds, sanitizedDBName, tablePath);
}

private void persistDFInHudi(Dataset<Row> ds, String dbName, String tablePath) {
    ds.write().format("org.apache.hudi")
        .options(QuickstartUtils.getQuickstartWriteConfigs())
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY(), DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL())
        .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY(), DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL())
        .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(), "ts_ms")
        .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(), "_id")
        .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY(), MergeHudiPayload.class.getName())
        .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(), "db_name")
        .option(HoodieWriteConfig.TABLE_NAME, dbName)
        .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, true)
        .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY(), "true")
        .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY(), dbName)
        .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY(), "db_name")
        .option(DataSourceWriteOptions.DEFAULT_HIVE_ASSUME_DATE_PARTITION_OPT_VAL(), "false")
        .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 1)
        .option(DataSourceWriteOptions.ASYNC_COMPACT_ENABLE_OPT_KEY(), false)
        .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY(), MultiPartKeysValueExtractor.class.getName())
        .mode(SaveMode.Append)
        .save(tablePath);
}
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
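On the question of running compaction outside the ingestion job: the usual offline options are hudi-cli's `compaction schedule` / `compaction run` commands or the `HoodieCompactor` utility, but a compaction can also be scheduled and executed programmatically from a periodic job. The sketch below follows the 0.5.x/0.6.x write-client API; method names and commit semantics (e.g. whether `compact` auto-commits) vary between releases, so verify against the version in use.

```java
import org.apache.hudi.client.HoodieWriteClient;
import org.apache.hudi.common.util.Option;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.spark.api.java.JavaSparkContext;

public class PeriodicCompactionJob {
  // Sketch only: schedule a compaction on a MERGE_ON_READ table and run it in a separate job.
  public static void compactOnce(JavaSparkContext jsc, String basePath, String tableName) {
    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .forTable(tableName)
        .build();
    HoodieWriteClient client = new HoodieWriteClient(jsc, config);
    try {
      // Ask the client to schedule a compaction; an instant time is returned if a plan was created.
      Option<String> instant = client.scheduleCompaction(Option.empty());
      if (instant.isPresent()) {
        // Execute the plan. Depending on the release, the returned write statuses may also
        // need an explicit commitCompaction(...) call afterwards.
        client.compact(instant.get());
      }
    } finally {
      client.close();
    }
  }
}
```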
[GitHub] [hudi] bvaradar commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer
bvaradar commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-705086014 @ashishmgofficial : You dont need _hoodie_is_deleted if you are using the custom transformer. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
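For context on the comment above: the thread is about ingesting Debezium change events with DeltaStreamer, where deletes can be signalled to Hudi via a `_hoodie_is_deleted` column produced by a custom transformer. Below is a minimal sketch of such a transformer that derives the flag from the Debezium op code; the column names follow the earlier output in this thread (e.g. `_op`), and the interface/import locations should be checked against the hudi-utilities version in use.

```java
import org.apache.hudi.common.config.TypedProperties;
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

// Sketch of a Debezium-style transformer: rows whose op code is "d" (delete) are flagged so
// Hudi can hard-delete them; all other ops ("c", "u", "r") are kept as regular upserts.
public class DebeziumDeleteFlagTransformer implements Transformer {
  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession spark,
                            Dataset<Row> rows, TypedProperties props) {
    return rows.withColumn("_hoodie_is_deleted", col("_op").equalTo("d"));
  }
}
```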
[GitHub] [hudi] ashishmgofficial edited a comment on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer
ashishmgofficial edited a comment on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-704918685 @bvaradar : Thanks for the code . I followed your instructions but tried to add _is_hoodie_deleted column to the dataset using following code for testing Im getting the following error with the code mentioned in the post ``` Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:967) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2264) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2213) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2202) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:778) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082) at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101) at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1409) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:385) at org.apache.spark.rdd.RDD.take(RDD.scala:1382) at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1517) at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1517) at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1517) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:385) at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1516) at org.apache.spark.api.java.JavaRDDLike$class.isEmpty(JavaRDDLike.scala:544) at org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45) at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:349) at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:238) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:163) at org.apache.hudi.common.util.Option.ifPresent(Option.java:96) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:161) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:466) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.lang.RuntimeException: Error while decoding: java.lang.NegativeArraySizeException c
[GitHub] [hudi] ashishmgofficial commented on issue #2149: Help with Reading Kafka topic written using Debezium Connector - Deltastreamer
ashishmgofficial commented on issue #2149: URL: https://github.com/apache/hudi/issues/2149#issuecomment-704918685

@bvaradar : I followed your instructions and tried to add the _is_hoodie_deleted column to the dataset using the following code for testing, but I'm getting the error below with the code mentioned in the post:

```
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:967)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2264)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2213)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2202)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:778)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1409)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    at org.apache.spark.rdd.RDD.take(RDD.scala:1382)
    at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1517)
    at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1517)
    at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1517)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1516)
    at org.apache.spark.api.java.JavaRDDLike$class.isEmpty(JavaRDDLike.scala:544)
    at org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:349)
    at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:238)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.lambda$sync$2(HoodieDeltaStreamer.java:163)
    at org.apache.hudi.common.util.Option.ifPresent(Option.java:96)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:161)
    at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:466)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.RuntimeException: Error while decoding: java.lang.NegativeArraySizeException
createexternalrow(input[0, big
```
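For context, a minimal sketch of the kind of Deltastreamer Transformer that can add such a delete-flag column before writing, passed to HoodieDeltaStreamer via `--transformer-class`. The Debezium payload field name ("op"), the flag column name (written as _is_hoodie_deleted in the comment above; Hudi's built-in soft-delete field is _hoodie_is_deleted), and the TypedProperties package (it moved between Hudi versions) are assumptions to adapt to your setup:

```
// Hypothetical sketch, not from the thread above.
import org.apache.hudi.common.config.TypedProperties; // org.apache.hudi.common.util.TypedProperties on older Hudi versions
import org.apache.hudi.utilities.transform.Transformer;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class DebeziumDeleteFlagTransformer implements Transformer {

  @Override
  public Dataset<Row> apply(JavaSparkContext jsc, SparkSession sparkSession,
                            Dataset<Row> rowDataset, TypedProperties properties) {
    // Flag rows coming from Debezium delete events ("op" == "d") with a boolean column,
    // so the payload/write path can treat them as deletes during the upsert.
    return rowDataset.withColumn("_hoodie_is_deleted", col("op").equalTo("d"));
  }
}
```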
[GitHub] [hudi] nsivabalan merged pull request #2128: [HUDI-1303] Some improvements for the HUDI Test Suite.
nsivabalan merged pull request #2128: URL: https://github.com/apache/hudi/pull/2128 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 788d236  [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)
788d236 is described below

commit 788d236c443eb4ced819f9305ed8e0460b5984b7
Author: Prashant Wason
AuthorDate: Wed Oct 7 05:33:51 2020 -0700

    [HUDI-1303] Some improvements for the HUDI Test Suite. (#2128)

    1. Use the DAG Node's label from the yaml as its name instead of UUID names, which are not descriptive when debugging issues from logs.
    2. Fix the CleanNode constructor, which was not correctly implemented.
    3. When generating upserts, allow more granular control over the number of inserts and upserts - zero or more inserts and upserts can be specified instead of always requiring both.
    4. Fix generation of records of a specific size. The previous code used a class variable "shouldAddMore" which was reset to false after the first record generation, causing subsequent records to be of minimum size. In this change, we pre-calculate the extra size of the complex fields; when generating records, for complex fields we read the field size from this map.
    5. Refresh the timeline of the DeltaSync service before calling readFromSource. This ensures that only the newest generated data is read and data generated in the older DAG Nodes is ignored (as their AVRO files will have an older timestamp).
    6. Make --workload-generator-classname an optional parameter, as most probably the default will be used.
---
 docker/compose/hadoop.env                               |   3 +-
 .../testsuite/HoodieDeltaStreamerWrapper.java           |   1 +
 .../hudi/integ/testsuite/HoodieTestSuiteJob.java        |   6 +-
 .../integ/testsuite/configuration/DeltaConfig.java      |   5 +
 .../apache/hudi/integ/testsuite/dag/DagUtils.java       |  12 +-
 .../hudi/integ/testsuite/dag/nodes/CleanNode.java       |   4 +-
 .../integ/testsuite/generator/DeltaGenerator.java       |  65 +
 .../GenericRecordFullPayloadGenerator.java              | 152 ++---
 .../GenericRecordPartialPayloadGenerator.java           |   2 +-
 .../reader/DFSHoodieDatasetInputReader.java             |   4 +-
 .../hudi/utilities/deltastreamer/DeltaSync.java         |   2 +-
 .../utilities/sources/helpers/DFSPathSelector.java      |   1 +
 12 files changed, 134 insertions(+), 123 deletions(-)

diff --git a/docker/compose/hadoop.env b/docker/compose/hadoop.env
index 474c3db..4e8a942 100644
--- a/docker/compose/hadoop.env
+++ b/docker/compose/hadoop.env
@@ -21,12 +21,13 @@ HIVE_SITE_CONF_javax_jdo_option_ConnectionUserName=hive
 HIVE_SITE_CONF_javax_jdo_option_ConnectionPassword=hive
 HIVE_SITE_CONF_datanucleus_autoCreateSchema=false
 HIVE_SITE_CONF_hive_metastore_uris=thrift://hivemetastore:9083
-HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
+HDFS_CONF_dfs_namenode_datanode_registration_ip___hostname___check=false
 HDFS_CONF_dfs_webhdfs_enabled=true
 HDFS_CONF_dfs_permissions_enabled=false
 #HDFS_CONF_dfs_client_use_datanode_hostname=true
 #HDFS_CONF_dfs_namenode_use_datanode_hostname=true
+HDFS_CONF_dfs_replication=1
 CORE_CONF_fs_defaultFS=hdfs://namenode:8020
 CORE_CONF_hadoop_http_staticuser_user=root

diff --git a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
index 5179e89..6e5027b 100644
--- a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
+++ b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
@@ -66,6 +66,7 @@ public class HoodieDeltaStreamerWrapper extends HoodieDeltaStreamer {
   public Pair>> fetchSource() throws Exception {
     DeltaSync service = deltaSyncService.get().getDeltaSync();
+    service.refreshTimeline();
     return service.readFromSource(service.getCommitTimelineOpt());
   }

diff --git a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
index 2c4b73a..c2c242a 100644
--- a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
+++ b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieTestSuiteJob.java
@@ -156,8 +156,7 @@ public class HoodieTestSuiteJob {
     public String inputBasePath;

     @Parameter(names = {
-        "--workload-generator-classname"}, description = "WorkflowDag of operations to generate the workload",
-        required = true)
+        "--workload-generator-classname"}, description = "WorkflowDag of operations to generate the workload")
     public String workloadDagGenerator = WorkflowDagGenerator.class.getName();

     @Par
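As a rough illustration of the record-size fix described in item 4 above - pre-computing the extra bytes each complex field should carry once, then consulting that map for every generated record instead of a mutable flag that stops working after the first record - here is a small standalone sketch. The class and field names are hypothetical; this is not the actual GenericRecordFullPayloadGenerator code:

```
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExtraSizeSketch {

  // Pre-compute, once, how many filler bytes each complex field should contribute so that
  // every generated record reaches the target size.
  static Map<String, Integer> precomputeExtraSizes(List<String> complexFields,
                                                   int targetRecordSize,
                                                   int minimalRecordSize) {
    Map<String, Integer> extraBytesPerField = new HashMap<>();
    int remaining = Math.max(0, targetRecordSize - minimalRecordSize);
    int perField = complexFields.isEmpty() ? 0 : remaining / complexFields.size();
    for (String field : complexFields) {
      extraBytesPerField.put(field, perField);
    }
    return extraBytesPerField;
  }

  public static void main(String[] args) {
    // Records #1..N all consult the same map, so none of them collapse to the minimum size.
    Map<String, Integer> extra =
        precomputeExtraSizes(Arrays.asList("long_text_field", "array_field"), 1024, 200);
    System.out.println(extra); // e.g. {array_field=412, long_text_field=412} (map order not guaranteed)
  }
}
```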
[jira] [Commented] (HUDI-1310) Corruption Block Handling too slow in S3
[ https://issues.apache.org/jira/browse/HUDI-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17209457#comment-17209457 ]

Steve Loughran commented on HUDI-1310:
--------------------------------------

how are you enumerating the blocks? listStatus calls? treewalks?

> Corruption Block Handling too slow in S3
> ----------------------------------------
>
>             Key: HUDI-1310
>             URL: https://issues.apache.org/jira/browse/HUDI-1310
>         Project: Apache Hudi
>      Issue Type: Sub-task
>      Components: Writer Core
>        Reporter: Balaji Varadarajan
>        Priority: Major
>
> The logic to figure out the next valid starting block offset is too slow when run in S3.
> I have bolded the log message that takes a long time to appear.
>
> 36589 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Scanning log file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0}
> 36590 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Found corrupted block in file HoodieLogFile\{pathStr='s3a://robinhood-encrypted-hudi-data-cove/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} with block size(3723305) running past EOF
> 36684 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Log HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} has a corrupted block at 14
> *44515 [Spring Shell] INFO org.apache.hudi.common.table.log.HoodieLogFileReader - Next available block in* HoodieLogFile\{pathStr='s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045', fileLen=0} starts at 3723319
> 44566 [Spring Shell] INFO org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner - Found a corrupt block in s3a://x/dev_hudi_tables/balaji_varadarajan/benchmark_1M_10K_partitions/.hoodie/metadata/metadata_partition/.f02585bd-bb02-43f6-8bc8-cec71df87d1e-0_00.log.1_0-23-206045

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
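For context on why the gap between the "has a corrupted block" and "Next available block" log lines is so large on S3: recovering from a corrupt block means probing forward through the log stream for the next block header, and each small, overlapping positioned read can translate into seeks or re-opened connections against the object store. The sketch below is only an illustration of that access pattern (the magic bytes and method names are hypothetical; it is not the actual HoodieLogFileReader code):

```
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.FSDataInputStream;

public class NextBlockScanSketch {

  // Hypothetical 6-byte marker assumed to start every log block.
  private static final byte[] MAGIC = new byte[] {'#', 'H', 'U', 'D', 'I', '#'};

  // Probe forward one byte at a time until the marker is found; each readFully is a small
  // positioned read, which is cheap on HDFS but expensive when issued thousands of times on s3a.
  static long scanForNextBlock(FSDataInputStream in, long corruptBlockStart, long fileLen)
      throws IOException {
    byte[] probe = new byte[MAGIC.length];
    for (long pos = corruptBlockStart + 1; pos <= fileLen - MAGIC.length; pos++) {
      in.readFully(pos, probe);
      if (Arrays.equals(probe, MAGIC)) {
        return pos; // next available block starts here
      }
    }
    return fileLen; // no further valid block in this file
  }
}
```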
[GitHub] [hudi] SteNicholas commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation
SteNicholas commented on pull request #2111: URL: https://github.com/apache/hudi/pull/2111#issuecomment-704795298 @bvaradar @vinothchandar Could you please help to review this pull request? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] satishkotha commented on pull request #2129: [HUDI-1302] Add support for timestamp field in HiveSync
satishkotha commented on pull request #2129: URL: https://github.com/apache/hudi/pull/2129#issuecomment-704748964 @pratyakshsharma @n3nash Please take a look. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org