[GitHub] [hudi] bvaradar commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-25 Thread GitBox


bvaradar commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r495422383



##
File path: hudi-client/src/main/java/org/apache/hudi/metadata/HoodieMetadataImpl.java
##
@@ -0,0 +1,1104 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.metadata;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.avro.model.HoodieCleanerPlan;
+import org.apache.hudi.avro.model.HoodieRestoreMetadata;
+import org.apache.hudi.avro.model.HoodieRollbackMetadata;
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.utils.ClientUtils;
+import org.apache.hudi.common.config.SerializableConfiguration;
+import org.apache.hudi.common.fs.ConsistencyGuardConfig;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.FileSlice;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieFileFormat;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
+import org.apache.hudi.common.table.view.TableFileSystemView.SliceView;
+import org.apache.hudi.common.util.CleanerUtils;
+import org.apache.hudi.common.util.FileIOUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.SpillableMapUtils;
+import org.apache.hudi.common.util.ValidationUtils;
+import org.apache.hudi.config.HoodieMetricsConfig;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.exception.HoodieMetadataException;
+import org.apache.hudi.io.storage.HoodieFileReader;
+import org.apache.hudi.io.storage.HoodieFileReaderFactory;
+import org.apache.hudi.metrics.HoodieMetrics;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import com.codahale.metrics.Timer;
+
+import scala.Tuple2;
+
+/**
+ * Metadata implementation which saves partition and file listing within an internal MOR table
+ * called Metadata Table. This table is created by listing files and partitions (first time) and kept in sync
+ * using the instants on the main dataset.
+ */
+public class HoodieMetadataImpl {
+  private static final Logger LOG = LogManager.getLogger(HoodieMetadataImpl.class);
+
+  // Table name suffix
+  private static final String METADATA_TABLE_NAME_SUFFIX = "_metadata";
+  // Timestamp for a commit when the base dataset had not had any commits yet.
+  private static final String SOLO_COMMIT_TIMESTAMP = "00";
+
+  // Name

[GitHub] [hudi] yanghua commented on a change in pull request #1968: [HUDI-1192] Make create hive database automatically configurable

2020-09-25 Thread GitBox


yanghua commented on a change in pull request #1968:
URL: https://github.com/apache/hudi/pull/1968#discussion_r495422071



##
File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala
##
@@ -290,6 +290,7 @@ object DataSourceWriteOptions {
   val HIVE_ASSUME_DATE_PARTITION_OPT_KEY = "hoodie.datasource.hive_sync.assume_date_partitioning"
   val HIVE_USE_PRE_APACHE_INPUT_FORMAT_OPT_KEY = "hoodie.datasource.hive_sync.use_pre_apache_input_format"
   val HIVE_USE_JDBC_OPT_KEY = "hoodie.datasource.hive_sync.use_jdbc"
+  val HIVE_AUTO_CREATE_DATABASE_OPT_KEY = "hoodie.datasource.hive_sync_auto_create_database"

Review comment:
   Can you follow the pattern used by the other keys (`hoodie.datasource.hive_sync.xxx`)? Please check it carefully.









[GitHub] [hudi] yanghua commented on pull request #2074: [HUDI-1233] Deltastreamer Kafka consumption delay reporting indicators

2020-09-25 Thread GitBox


yanghua commented on pull request #2074:
URL: https://github.com/apache/hudi/pull/2074#issuecomment-699436698


   @wangxianghu Can you help review this PR first?







[GitHub] [hudi] bvaradar commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-25 Thread GitBox


bvaradar commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r495419947



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/MetadataCommand.java
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.utils.SparkUtil;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieMetadata;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * CLI commands to operate on the Metadata Table.
+ */
+@Component
+public class MetadataCommand implements CommandMarker {
+  private JavaSparkContext jsc;
+
+  @CliCommand(value = "metadata set", help = "Set options for Metadata Table")
+  public String set(@CliOption(key = {"metadataDir"},
+      help = "Directory to read/write metadata table (can be different from dataset)", unspecifiedDefaultValue = "")
+      final String metadataDir) {
+    if (!metadataDir.isEmpty()) {
+      HoodieMetadata.setMetadataBaseDirectory(metadataDir);
+    }
+
+    return String.format("Ok");
+  }
+
+  @CliCommand(value = "metadata create", help = "Create the Metadata Table if it does not exist")
+  public String create() throws IOException {
+    HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+    Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+    try {
+      FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+      if (statuses.length > 0) {
+        throw new RuntimeException("Metadata directory (" + metadataPath.toString() + ") not empty.");
+      }
+    } catch (FileNotFoundException e) {
+      // Metadata directory does not exist yet
+      HoodieCLI.fs.mkdirs(metadataPath);
+    }
+
+    long t1 = System.currentTimeMillis();
+    HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().withPath(HoodieCLI.basePath)
+        .withUseFileListingMetadata(true).build();
+    initJavaSparkContext();
+    HoodieMetadata.init(jsc, writeConfig);
+    long t2 = System.currentTimeMillis();
+
+    return String.format("Created Metadata Table in %s (duration=%.2fsec)", metadataPath, (t2 - t1) / 1000.0);
+  }
+
+  @CliCommand(value = "metadata delete", help = "Remove the Metadata Table")
+  public String delete() throws Exception {
+    HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+    Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+    try {
+      FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+      if (statuses.length > 0) {
+        HoodieCLI.fs.delete(metadataPath, true);
+      }
+    } catch (FileNotFoundException e) {
+      // Metadata directory does not exist
+    }
+
+    HoodieMetadata.remove(HoodieCLI.basePath);
+
+    return String.format("Removed Metadata Table from %s", metadataPath);
+  }
+
+  @CliCommand(value = "metadata init", help = "Update the metadata table from commits since the creation")
+  public String init(@CliOption(key = {"readonly"}, unspecifiedDefaultValue = "false",
+      help = "Open in read-only mode") final boolean readOnly) throws Exception {
+    HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+    Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+    try {
+      FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+    } catch (FileNotFoundException e) {
+      // Metadata directory does not exist
+      throw new RuntimeException("Metadata directory (" + metadataPath.toString() + ") does not exist.");
+    }

[GitHub] [hudi] liujinhui1994 commented on pull request #2074: [HUDI-1233] Deltastreamer Kafka consumption delay reporting indicators

2020-09-25 Thread GitBox


liujinhui1994 commented on pull request #2074:
URL: https://github.com/apache/hudi/pull/2074#issuecomment-699388007


   @yanghua Please help review.







[GitHub] [hudi] bvaradar commented on a change in pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-25 Thread GitBox


bvaradar commented on a change in pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#discussion_r495413600



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/MetadataCommand.java
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.utils.SparkUtil;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.metadata.HoodieMetadata;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * CLI commands to operate on the Metadata Table.
+ */
+@Component
+public class MetadataCommand implements CommandMarker {
+  private JavaSparkContext jsc;
+
+  @CliCommand(value = "metadata set", help = "Set options for Metadata Table")
+  public String set(@CliOption(key = {"metadataDir"},
+      help = "Directory to read/write metadata table (can be different from dataset)", unspecifiedDefaultValue = "")
+      final String metadataDir) {
+    if (!metadataDir.isEmpty()) {
+      HoodieMetadata.setMetadataBaseDirectory(metadataDir);
+    }
+
+    return String.format("Ok");
+  }
+
+  @CliCommand(value = "metadata create", help = "Create the Metadata Table if it does not exist")
+  public String create() throws IOException {
+    HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+    Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+    try {
+      FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+      if (statuses.length > 0) {
+        throw new RuntimeException("Metadata directory (" + metadataPath.toString() + ") not empty.");
+      }
+    } catch (FileNotFoundException e) {
+      // Metadata directory does not exist yet
+      HoodieCLI.fs.mkdirs(metadataPath);
+    }
+
+    long t1 = System.currentTimeMillis();
+    HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder().withPath(HoodieCLI.basePath)
+        .withUseFileListingMetadata(true).build();
+    initJavaSparkContext();
+    HoodieMetadata.init(jsc, writeConfig);
+    long t2 = System.currentTimeMillis();
+
+    return String.format("Created Metadata Table in %s (duration=%.2fsec)", metadataPath, (t2 - t1) / 1000.0);
+  }
+
+  @CliCommand(value = "metadata delete", help = "Remove the Metadata Table")
+  public String delete() throws Exception {
+    HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+    Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+    try {
+      FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+      if (statuses.length > 0) {
+        HoodieCLI.fs.delete(metadataPath, true);
+      }
+    } catch (FileNotFoundException e) {
+      // Metadata directory does not exist
+    }
+
+    HoodieMetadata.remove(HoodieCLI.basePath);
+
+    return String.format("Removed Metadata Table from %s", metadataPath);
+  }
+
+  @CliCommand(value = "metadata init", help = "Update the metadata table from commits since the creation")
+  public String init(@CliOption(key = {"readonly"}, unspecifiedDefaultValue = "false",
+      help = "Open in read-only mode") final boolean readOnly) throws Exception {
+    HoodieTableMetaClient metaClient = HoodieCLI.getTableMetaClient();
+    Path metadataPath = new Path(HoodieMetadata.getMetadataTableBasePath(HoodieCLI.basePath));
+    try {
+      FileStatus[] statuses = HoodieCLI.fs.listStatus(metadataPath);
+    } catch (FileNotFoundException e) {
+      // Metadata directory does not exist
+      throw new RuntimeException("Metadata directory (" + metadataPath.toString() + ") does not exist.");
+    }

[GitHub] [hudi] dugenkui03 opened a new pull request #2115: [MINOR] Mark started and shutdownRequested with volatile.

2020-09-25 Thread GitBox


dugenkui03 opened a new pull request #2115:
URL: https://github.com/apache/hudi/pull/2115


   Mark `started` and `shutdownRequested` as volatile.
   
   Add a getter method for `started`.
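
   A minimal sketch of why `volatile` matters for these flags, assuming a typical start/stop polling loop (the class and method names below are illustrative, not the actual Hudi code):

```java
// Illustrative sketch, not the actual Hudi class: without volatile, the
// worker thread may never observe the write to shutdownRequested made by
// the thread that calls shutdown(), since it can cache the field's value.
public class PollingService {
  private volatile boolean started = false;
  private volatile boolean shutdownRequested = false;

  public void start() {
    started = true;
    new Thread(() -> {
      while (!shutdownRequested) {
        // poll and process work ...
      }
    }).start();
  }

  public void shutdown() {
    shutdownRequested = true; // guaranteed visible to the worker thread
  }

  public boolean isStarted() { // the getter the PR adds for `started`
    return started;
  }
}
```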







[GitHub] [hudi] dugenkui03 closed issue #2114: [SUPPORT]Replace org.apache.hudi.common.util.StringUtils with JDK and org.apache.commons.lang3.StringUtils

2020-09-25 Thread GitBox


dugenkui03 closed issue #2114:
URL: https://github.com/apache/hudi/issues/2114


   







[GitHub] [hudi] dugenkui03 opened a new issue #2114: [SUPPORT]Replace org.apache.hudi.common.util.StringUtils with JDK and org.apache.commons.lang3.StringUtils

2020-09-25 Thread GitBox


dugenkui03 opened a new issue #2114:
URL: https://github.com/apache/hudi/issues/2114


   **Describe the problem you faced**
   I find that most methods in `hudi.StringUtils` can easily be replaced by the JDK and `lang3.StringUtils`, and there is no special logic in `hudi.StringUtils`. I think it would be a good choice to replace `hudi.StringUtils` with the JDK and `lang3.StringUtils`.
   
   Another important reason is that the behavior of `hudi.StringUtils.objToString` differs from `Objects.toString()`; a user who is accustomed to `Objects.toString()` can easily call `StringUtils.objToString` by mistake.
   
   **jira**
   https://issues.apache.org/jira/browse/HUDI-1300
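
   A small sketch of the difference described above. The `objToString` body below is an assumption about Hudi's implementation (a null-in, null-out helper), while `Objects.toString(null)` returns the four-character string "null":

```java
import java.util.Objects;

public class ObjToStringDemo {
  // Assumed behavior of hudi.StringUtils.objToString: null in, null out.
  static String objToString(Object obj) {
    return obj == null ? null : obj.toString();
  }

  public static void main(String[] args) {
    String a = Objects.toString(null); // the 4-character string "null"
    String b = objToString(null);      // a null reference
    System.out.println(a.length());    // prints 4
    System.out.println(b == null);     // prints true; b.length() would throw NPE
  }
}
```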
   







[jira] [Updated] (HUDI-1300) Replace org.apache.hudi.common.util.StringUtils with JDK and org.apache.commons.lang3.StringUtils

2020-09-25 Thread dugenkui (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dugenkui updated HUDI-1300:
---------------------------
Description: 
I find that most methods in hudi.StringUtils can easily be replaced by the JDK and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I think it would be a good choice to replace hudi.StringUtils with the JDK and lang3.StringUtils.

Another important reason is that the behavior of hudi.StringUtils.objToString differs from Objects.toString(); a user who is accustomed to Objects.toString() can easily call StringUtils.objToString by mistake.

  was:
I find that most methods in hudi.StringUtils can easily be replaced by the JDK and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I think it would be a good choice to remove hudi.StringUtils.

Another important reason is that the behavior of hudi.StringUtils.objToString differs from Objects.toString(); a user who is accustomed to Objects.toString() can easily call StringUtils.objToString by mistake.


> Replace org.apache.hudi.common.util.StringUtils with JDK and
> org.apache.commons.lang3.StringUtils
> ------------------------------------------------------------
>
>                 Key: HUDI-1300
>                 URL: https://issues.apache.org/jira/browse/HUDI-1300
>             Project: Apache Hudi
>          Issue Type: Wish
>            Reporter: dugenkui
>            Priority: Major
>
> I find that most methods in hudi.StringUtils can easily be replaced by the JDK
> and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I
> think it would be a good choice to replace hudi.StringUtils with the JDK and
> lang3.StringUtils.
> Another important reason is that the behavior of hudi.StringUtils.objToString
> differs from Objects.toString(); a user who is accustomed to Objects.toString()
> can easily call StringUtils.objToString by mistake.





[jira] [Updated] (HUDI-1300) Replace org.apache.hudi.common.util.StringUtils with JDK and org.apache.commons.lang3.StringUtils

2020-09-25 Thread dugenkui (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dugenkui updated HUDI-1300:
---------------------------
Description: 
I find that most methods in hudi.StringUtils can easily be replaced by the JDK and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I think it would be a good choice to remove hudi.StringUtils.

Another important reason is that the behavior of hudi.StringUtils.objToString differs from Objects.toString(); a user who is accustomed to Objects.toString() can easily call StringUtils.objToString by mistake.

  was:
I find that hudi.StringUtils can easily be replaced by the JDK and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I think it would be a good choice to remove hudi.StringUtils.

Another important reason is that the behavior of hudi.StringUtils.objToString differs from Objects.toString(); a user who is accustomed to Objects.toString() can easily call StringUtils.objToString by mistake.


> Replace org.apache.hudi.common.util.StringUtils with JDK and
> org.apache.commons.lang3.StringUtils
> ------------------------------------------------------------
>
>                 Key: HUDI-1300
>                 URL: https://issues.apache.org/jira/browse/HUDI-1300
>             Project: Apache Hudi
>          Issue Type: Wish
>            Reporter: dugenkui
>            Priority: Major
>
> I find that most methods in hudi.StringUtils can easily be replaced by the JDK
> and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I
> think it would be a good choice to remove hudi.StringUtils.
> Another important reason is that the behavior of hudi.StringUtils.objToString
> differs from Objects.toString(); a user who is accustomed to Objects.toString()
> can easily call StringUtils.objToString by mistake.





[jira] [Updated] (HUDI-1300) Replace org.apache.hudi.common.util.StringUtils with JDK and org.apache.commons.lang3.StringUtils

2020-09-25 Thread dugenkui (Jira)


 [ https://issues.apache.org/jira/browse/HUDI-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dugenkui updated HUDI-1300:
---------------------------
Description: 
I find that hudi.StringUtils can easily be replaced by the JDK and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I think it would be a good choice to remove hudi.StringUtils.

Another important reason is that the behavior of hudi.StringUtils.objToString differs from Objects.toString(); a user who is accustomed to Objects.toString() can easily call StringUtils.objToString by mistake.

  was:
I find that hudi.StringUtils can easily be completely replaced by the JDK and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I think it would be a good choice to remove hudi.StringUtils.

Another important reason is that the behavior of hudi.StringUtils.objToString differs from Objects.toString(); a user who is accustomed to Objects.toString() can easily call StringUtils.objToString by mistake.


> Replace org.apache.hudi.common.util.StringUtils with JDK and
> org.apache.commons.lang3.StringUtils
> ------------------------------------------------------------
>
>                 Key: HUDI-1300
>                 URL: https://issues.apache.org/jira/browse/HUDI-1300
>             Project: Apache Hudi
>          Issue Type: Wish
>            Reporter: dugenkui
>            Priority: Major
>
> I find that hudi.StringUtils can easily be replaced by the JDK and
> lang3.StringUtils, and there is no special logic in hudi.StringUtils. I
> think it would be a good choice to remove hudi.StringUtils.
> Another important reason is that the behavior of hudi.StringUtils.objToString
> differs from Objects.toString(); a user who is accustomed to Objects.toString()
> can easily call StringUtils.objToString by mistake.





[jira] [Created] (HUDI-1300) Replace org.apache.hudi.common.util.StringUtils with JDK and org.apache.commons.lang3.StringUtils

2020-09-25 Thread dugenkui (Jira)
dugenkui created HUDI-1300:
-------------------------------

            Summary: Replace org.apache.hudi.common.util.StringUtils with JDK and org.apache.commons.lang3.StringUtils
                Key: HUDI-1300
                URL: https://issues.apache.org/jira/browse/HUDI-1300
            Project: Apache Hudi
         Issue Type: Wish
           Reporter: dugenkui


I find that hudi.StringUtils can easily be completely replaced by the JDK and lang3.StringUtils, and there is no special logic in hudi.StringUtils. I think it would be a good choice to remove hudi.StringUtils.

Another important reason is that the behavior of hudi.StringUtils.objToString differs from Objects.toString(); a user who is accustomed to Objects.toString() can easily call StringUtils.objToString by mistake.





[GitHub] [hudi] wangxianghu commented on pull request #2105: [MINOR] Fix ClassCastException when use QuickstartUtils generate data

2020-09-25 Thread GitBox


wangxianghu commented on pull request #2105:
URL: https://github.com/apache/hudi/pull/2105#issuecomment-699251448


   > LGTM. I believe this change was introduced a week back or so from this pr - #2071
   
   Yes, it's a small oversight.
   Thanks for your review!







[GitHub] [hudi] nsivabalan commented on pull request #2092: [HUDI-1285] Fix merge on read DAG to make docker demo pass

2020-09-25 Thread GitBox


nsivabalan commented on pull request #2092:
URL: https://github.com/apache/hudi/pull/2092#issuecomment-699219984


   @n3nash : I tested this patch and it looks good. Once you remove the rollback node, we can merge it.







[jira] [Resolved] (HUDI-1213) Set Default for the bootstrap config : hoodie.bootstrap.full.input.provider

2020-09-25 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra resolved HUDI-1213.
-
Resolution: Fixed

> Set Default for the bootstrap config : hoodie.bootstrap.full.input.provider
> ---
>
> Key: HUDI-1213
> URL: https://issues.apache.org/jira/browse/HUDI-1213
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: bootstrap
>Reporter: Balaji Varadarajan
>Assignee: Likai Yu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> hoodie.bootstrap.full.input.provider needs to have the default set as
> "org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider"
>  
> There are no defaults currently, and it is a pain for users to have to set
> this configuration every time they need to do a FULL_RECORD bootstrap.
>  





[GitHub] [hudi] tooptoop4 commented on issue #2110: [SUPPORT] Executor memory recommendation

2020-09-25 Thread GitBox


tooptoop4 commented on issue #2110:
URL: https://github.com/apache/hudi/issues/2110#issuecomment-699071257


   @n3nash " if your target file is not very large (<256 MB)" do you mean the 
size of the incoming CSV or the size of the existing table? Does the size of 
the existing table matter at all if you mention 2GB could be ok on 20GB table?







[GitHub] [hudi] satishkotha commented on a change in pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-25 Thread GitBox


satishkotha commented on a change in pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#discussion_r495137989



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##
@@ -586,24 +602,39 @@ public String startCommit() {
    * @param instantTime Instant time to be generated
    */
   public void startCommitWithTime(String instantTime) {
+    HoodieTableMetaClient metaClient = createMetaClient(true);
+    startCommitWithTime(instantTime, metaClient.getCommitActionType(), metaClient);
+  }
+
+  /**
+   * Completes a new commit time for a write operation (insert/update/delete) with specified action.
+   */
+  public void startCommitWithTime(String instantTime, String actionType) {
+    HoodieTableMetaClient metaClient = createMetaClient(true);
+    startCommitWithTime(instantTime, actionType, metaClient);
+  }
+
+  /**
+   * Completes a new commit time for a write operation (insert/update/delete) with specified action.
+   */
+  private void startCommitWithTime(String instantTime, String actionType, HoodieTableMetaClient metaClient) {
     // NOTE : Need to ensure that rollback is done before a new commit is started
     if (rollbackPending) {
       // Only rollback inflight commit/delta-commits. Do not touch compaction commits
       rollbackPendingCommits();
     }
-    startCommit(instantTime);
+    startCommit(instantTime, actionType, metaClient);
   }
 
-  private void startCommit(String instantTime) {
-    LOG.info("Generate a new instant time " + instantTime);
-    HoodieTableMetaClient metaClient = createMetaClient(true);
+  private void startCommit(String instantTime, String actionType, HoodieTableMetaClient metaClient) {

Review comment:
   This is a private method. Do you want to make it public static? Personally, I think having all the startCommit methods in HoodieWriteClient makes more sense, because the user workflow is:
   
   1) writeClient#startCommit
   2) writeClient#upsert
   3) writeClient#commit
   
   But if you have a strong preference to make this part of CommitUtils, I can move it. Let me know.
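
   A minimal sketch of that three-step workflow, assuming a client and records constructed elsewhere (method names per HoodieWriteClient at the time of this PR):

```java
import org.apache.hudi.client.HoodieWriteClient;
import org.apache.hudi.client.WriteStatus;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.common.model.HoodieRecordPayload;
import org.apache.spark.api.java.JavaRDD;

public class StartCommitWorkflow {
  // Sketch of the user workflow from the comment above; the client and records
  // are assumed to be built elsewhere (table path, schema, Spark context).
  static <T extends HoodieRecordPayload> void writeOnce(
      HoodieWriteClient<T> client, JavaRDD<HoodieRecord<T>> records) {
    String instantTime = client.startCommit();                           // 1) start a commit
    JavaRDD<WriteStatus> statuses = client.upsert(records, instantTime); // 2) write the records
    client.commit(instantTime, statuses);                                // 3) complete the commit
  }
}
```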









[hudi] branch master updated (2eaba09 -> 1dd6635)

2020-09-25 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 2eaba09  [HUDI-544] Archived commits command code cleanup (#1242)
 add 1dd6635  [MINOR] Fix ClassCastException when use QuickstartUtils 
generate data (#2105)

No new revisions were added by this update.

Summary of changes:
 hudi-spark/src/main/java/org/apache/hudi/QuickstartUtils.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [hudi] bhasudha merged pull request #2105: [MINOR] Fix ClassCastException when use QuickstartUtils generate data

2020-09-25 Thread GitBox


bhasudha merged pull request #2105:
URL: https://github.com/apache/hudi/pull/2105


   







[GitHub] [hudi] n3nash commented on issue #1958: [SUPPORT] Global Indexes return old partition value when querying Hive tables

2020-09-25 Thread GitBox


n3nash commented on issue #1958:
URL: https://github.com/apache/hudi/issues/1958#issuecomment-699031170


   We're waiting on the PR to land, after which we can close this.







[hudi] branch master updated (6837118 -> 2eaba09)

2020-09-25 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 6837118  [MINOR] Improve description (#2113)
 add 2eaba09  [HUDI-544] Archived commits command code cleanup (#1242)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/cli/commands/ArchivedCommitsCommand.java  |  9 ++---
 .../java/org/apache/hudi/integ/ITTestHoodieSanity.java|  7 +++
 .../main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 15 +++
 3 files changed, 24 insertions(+), 7 deletions(-)



[GitHub] [hudi] n3nash merged pull request #1242: [HUDI-544] Archived commits command code cleanup

2020-09-25 Thread GitBox


n3nash merged pull request #1242:
URL: https://github.com/apache/hudi/pull/1242


   







[GitHub] [hudi] n3nash commented on pull request #1978: [HUDI-1184] Fix the support of hbase index partition path change

2020-09-25 Thread GitBox


n3nash commented on pull request #1978:
URL: https://github.com/apache/hudi/pull/1978#issuecomment-699029537


   @hj2016 Can you rebase and squash your commits, please?







[GitHub] [hudi] n3nash commented on issue #2104: [SUPPORT] MOR Hive sync - _rt table read issue

2020-09-25 Thread GitBox


n3nash commented on issue #2104:
URL: https://github.com/apache/hudi/issues/2104#issuecomment-699028582


   Okay, which jar have you deployed on the Hive server? Is it [hudi-hadoop-mr-bundle](https://github.com/apache/hudi/tree/master/packaging/hudi-hadoop-mr-bundle)? For the Hive server to know how to load the Hudi classes, you need to drop this jar into the Hive server classpath.







[GitHub] [hudi] n3nash commented on issue #2101: [SUPPORT]Unable to interpret Child JSON fields value as a separate columns rather it is loaded as one single field value. Any way to interpret that.

2020-09-25 Thread GitBox


n3nash commented on issue #2101:
URL: https://github.com/apache/hudi/issues/2101#issuecomment-699027752


   @getniz Those are good questions, we're all learners here :) You can definitely create a schema in the Confluent schema registry, but the flattening will depend on your schema structure. If you create an AVRO schema similar to your JSON structure, then you will have a nested column in your Avro schema, and you will have the same problem. If you want a flat schema in Avro, then your JSON has to be flattened as well.
   I would recommend trying option 2, which is straightforward, and seeing what performance penalty (if any) there is.







[GitHub] [hudi] ashishmgofficial commented on issue #2104: [SUPPORT] MOR Hive sync - _rt table read issue

2020-09-25 Thread GitBox


ashishmgofficial commented on issue #2104:
URL: https://github.com/apache/hudi/issues/2104#issuecomment-699027470


   @n3nash I'm using Hudi 0.6.0 from Maven.







[GitHub] [hudi] n3nash commented on issue #2110: [SUPPORT] Executor memory recommendation

2020-09-25 Thread GitBox


n3nash commented on issue #2110:
URL: https://github.com/apache/hudi/issues/2110#issuecomment-699025467


   @tooptoop4 If you're talking about executor memory, in this particular scenario you can start with 2GB. In general, the upsert code performs a hash merge, for which it uses a spillable map with the following default setting -> https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java#L89.
   Ideally, if your target file is not very large (<256 MB), you should be able to use 2GB of executor memory. The driver memory depends on how many records you're ingesting and how large they are; in the case you described, I think you can even set the driver memory to 2GB and see if it OOMs.
   For better analysis of memory usage, you can use open source tools such as https://github.com/uber-common/jvm-profiler or https://github.com/linkedin/dr-elephant
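
   A hedged sketch of overriding that spillable-map budget in a write config. The key name `hoodie.memory.merge.max.size` and the `withProps` builder method are assumptions taken from HoodieMemoryConfig/HoodieWriteConfig and should be verified against your Hudi version:

```java
import java.util.Collections;

import org.apache.hudi.config.HoodieWriteConfig;

public class MemoryTuningSketch {
  // Assumption: "hoodie.memory.merge.max.size" is the merge-time spillable map
  // budget defined in HoodieMemoryConfig; records beyond it spill to disk.
  public static HoodieWriteConfig buildConfig(String basePath) {
    return HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withProps(Collections.singletonMap(
            "hoodie.memory.merge.max.size", String.valueOf(1024L * 1024 * 1024))) // 1 GB
        .build();
  }
}
```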







[GitHub] [hudi] n3nash commented on issue #2104: [SUPPORT] MOR Hive sync - _rt table read issue

2020-09-25 Thread GitBox


n3nash commented on issue #2104:
URL: https://github.com/apache/hudi/issues/2104#issuecomment-699023400


   @ashishmgofficial Looks like a `class not found` issue. Can you tell me which Hudi jar you have deployed on your Hive server?







[GitHub] [hudi] n3nash commented on issue #2108: [SUPPORT]Submit rollback -->Pending job --> kill YARN --> lost data

2020-09-25 Thread GitBox


n3nash commented on issue #2108:
URL: https://github.com/apache/hudi/issues/2108#issuecomment-699022804


   @JiaDe-Wu Can you please list the entire `.hoodie` folder and show the contents?







[hudi] branch master updated (83d2e03 -> 6837118)

2020-09-25 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 83d2e03  [MINOR] Adding scripts to checkout and push to PRs (#2109)
 add 6837118  [MINOR] Improve description (#2113)

No new revisions were added by this update.

Summary of changes:
 hudi-client/src/main/java/org/apache/hudi/metrics/Metrics.java  | 3 +--
 .../org/apache/hudi/examples/common/ExampleDataSchemaProvider.java  | 2 +-
 .../org/apache/hudi/examples/spark/HoodieWriteClientExample.java| 6 --
 3 files changed, 6 insertions(+), 5 deletions(-)



[GitHub] [hudi] leesf merged pull request #2113: [MINOR] fix typo

2020-09-25 Thread GitBox


leesf merged pull request #2113:
URL: https://github.com/apache/hudi/pull/2113


   







[GitHub] [hudi] bradleyhurley closed issue #2068: [SUPPORT]Deltastreamer Upsert Very Slow / Never Completes After Initial Data Load

2020-09-25 Thread GitBox


bradleyhurley closed issue #2068:
URL: https://github.com/apache/hudi/issues/2068


   







[GitHub] [hudi] leesf commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-09-25 Thread GitBox


leesf commented on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-698882595


   @SteNicholas Would you please also add some tests for the new changes?







[GitHub] [hudi] yanghua commented on a change in pull request #2112: [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable

2020-09-25 Thread GitBox


yanghua commented on a change in pull request #2112:
URL: https://github.com/apache/hudi/pull/2112#discussion_r494952442



##
File path: hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestCommitsCommand.java
##
@@ -168,10 +170,12 @@ public void testShowArchivedCommits() throws IOException {
     data.put("102", new Integer[] {25, 45});
     data.put("101", new Integer[] {35, 15});
 
-    data.forEach((key, value) -> {
+    for (Map.Entry<String, Integer[]> entry : data.entrySet()) {

Review comment:
   ditto

##
File path: hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestCommitsCommand.java
##
@@ -76,17 +76,19 @@ public void init() throws IOException {
         "", TimelineLayoutVersion.VERSION_1, "org.apache.hudi.common.model.HoodieAvroPayload");
   }
 
-  private LinkedHashMap<String, Integer[]> generateData() {
+  private LinkedHashMap<String, Integer[]> generateData() throws Exception {
     // generate data and metadata
     LinkedHashMap<String, Integer[]> data = new LinkedHashMap<>();
     data.put("102", new Integer[] {15, 10});
     data.put("101", new Integer[] {20, 10});
     data.put("100", new Integer[] {15, 15});
 
-    data.forEach((key, value) -> {
+    for (Map.Entry<String, Integer[]> entry : data.entrySet()) {

Review comment:
   Why should we do this refactor?









[GitHub] [hudi] getniz commented on issue #2101: [SUPPORT]Unable to interpret Child JSON fields value as a separate columns rather it is loaded as one single field value. Any way to interpret that.

2020-09-25 Thread GitBox


getniz commented on issue #2101:
URL: https://github.com/apache/hudi/issues/2101#issuecomment-698761326


   @n3nash thanks for the detailed response. Options 1 & 3 I may not be able to consider, as I need to build this layer as immediate target tables for further consumption in the reporting layer. If I use option 2, can I consume the topic and flatten the schema in DeltaStreamer without staging, and then load directly into the immediate target layer using the Spark submit batch command above? Also, I came to know that Hudi supports the Confluent schema registry; in that case, if I get the JSON schema from the source and register it with the Confluent registry, can I achieve flattening of the file? Sorry, my questions may be silly sometimes, please bear with me, I'm a learner here : ) The objective of what I'm trying to do is to consume data from several topics in near real-time (all the topics' data are formatted/structured) and push it to the DataLake using Hudi. If I stage and transform it, then I may end up eating time.







[GitHub] [hudi] SteNicholas commented on a change in pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-09-25 Thread GitBox


SteNicholas commented on a change in pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#discussion_r494938396



##
File path: hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java
##
@@ -54,13 +55,23 @@
    */
   private final WorkloadStat globalStat;
 
+  /**
+   * Write operation type.
+   */
+  private WriteOperationType operationType;

Review comment:
   @leesf Yes, WriteOperationType should be Serializable; I forgot to check this. I would like to add `implements Serializable`.









[GitHub] [hudi] leesf commented on a change in pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-09-25 Thread GitBox


leesf commented on a change in pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#discussion_r494930532



##
File path: hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java
##
@@ -54,13 +55,23 @@
    */
   private final WorkloadStat globalStat;
 
+  /**
+   * Write operation type.
+   */
+  private WriteOperationType operationType;

Review comment:
   WriteOperationType should be Serializable?









[GitHub] [hudi] xushiyan commented on a change in pull request #2112: [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable

2020-09-25 Thread GitBox


xushiyan commented on a change in pull request #2112:
URL: https://github.com/apache/hudi/pull/2112#discussion_r494635687



##
File path: hudi-cli/src/test/java/org/apache/hudi/cli/testutils/HoodieTestCommitMetadataGenerator.java
##
@@ -62,67 +62,53 @@
   /**
    * Create a commit file with default CommitMetadata.
    */
-  public static void createCommitFileWithMetadata(String basePath, String commitTime, Configuration configuration) {
+  public static void createCommitFileWithMetadata(String basePath, String commitTime, Configuration configuration) throws Exception {
     createCommitFileWithMetadata(basePath, commitTime, configuration, Option.empty(), Option.empty());
   }
 
   public static void createCommitFileWithMetadata(String basePath, String commitTime, Configuration configuration,
-      Option<Integer> writes, Option<Integer> updates) {
+      Option<Integer> writes, Option<Integer> updates) throws Exception {
     createCommitFileWithMetadata(basePath, commitTime, configuration, UUID.randomUUID().toString(),
         UUID.randomUUID().toString(), writes, updates);
   }
 
   public static void createCommitFileWithMetadata(String basePath, String commitTime, Configuration configuration,
-      String fileId1, String fileId2, Option<Integer> writes, Option<Integer> updates) {
-    Arrays.asList(HoodieTimeline.makeCommitFileName(commitTime), HoodieTimeline.makeInflightCommitFileName(commitTime),
-        HoodieTimeline.makeRequestedCommitFileName(commitTime))
-        .forEach(f -> {
-          Path commitFile = new Path(
-              basePath + "/" + HoodieTableMetaClient.METAFOLDER_NAME + "/" + f);
-          FSDataOutputStream os = null;
-          try {
-            FileSystem fs = FSUtils.getFs(basePath, configuration);
-            os = fs.create(commitFile, true);
-            // Generate commitMetadata
-            HoodieCommitMetadata commitMetadata =
-                generateCommitMetadata(basePath, commitTime, fileId1, fileId2, writes, updates);
-            // Write empty commit metadata
-            os.writeBytes(new String(commitMetadata.toJsonString().getBytes(StandardCharsets.UTF_8)));
-          } catch (IOException ioe) {
-            throw new HoodieIOException(ioe.getMessage(), ioe);
-          } finally {
-            if (null != os) {
-              try {
-                os.close();
-              } catch (IOException e) {
-                throw new HoodieIOException(e.getMessage(), e);
-              }
-            }
-          }
-        });
+      String fileId1, String fileId2, Option<Integer> writes, Option<Integer> updates) throws Exception {
+    List<String> commitFileNames = Arrays.asList(HoodieTimeline.makeCommitFileName(commitTime), HoodieTimeline.makeInflightCommitFileName(commitTime),
+        HoodieTimeline.makeRequestedCommitFileName(commitTime));
+    for (String name : commitFileNames) {
+      Path commitFilePath = new Path(Paths.get(basePath, HoodieTableMetaClient.METAFOLDER_NAME, name).toString());
+      try (FSDataOutputStream os = FSUtils.getFs(basePath, configuration).create(commitFilePath, true)) {
+        // Generate commitMetadata
+        HoodieCommitMetadata commitMetadata =
+            generateCommitMetadata(basePath, commitTime, fileId1, fileId2, writes, updates);
+        // Write empty commit metadata
+        os.writeBytes(new String(commitMetadata.toJsonString().getBytes(StandardCharsets.UTF_8)));
+      }
+    }

Review comment:
   Changed stream#forEach to a for-loop to avoid the noisy try-catch block.









[GitHub] [hudi] Karl-WangSK removed a comment on pull request #2106: [HUDI-1284] preCombine all HoodieRecords and update all fields according to orderingVal

2020-09-25 Thread GitBox


Karl-WangSK removed a comment on pull request #2106:
URL: https://github.com/apache/hudi/pull/2106#issuecomment-698291855











[GitHub] [hudi] ashishmgofficial edited a comment on issue #2104: [SUPPORT] MOR Hive sync - _rt table read issue

2020-09-25 Thread GitBox


ashishmgofficial edited a comment on issue #2104:
URL: https://github.com/apache/hudi/issues/2104#issuecomment-698567986


   @n3nash : The following is the stack trace I got when I queried via the Hive CLI:
   
   `2020-09-24 20:17:49,028 ERROR [39f399ee-de3f-4d33-a1cd-407d2e252f20 main] log.AbstractHoodieLogRecordScanner: Got exception when reading log file
   org.apache.hudi.exception.HoodieException: Unable to load class
       at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:42)
       at org.apache.hudi.common.util.ReflectionUtils.loadPayload(ReflectionUtils.java:62)
       at org.apache.hudi.common.util.SpillableMapUtils.convertToHoodieRecordPayload(SpillableMapUtils.java:110)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processAvroDataBlock(AbstractHoodieLogRecordScanner.java:274)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:303)
       at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:236)
       at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:79)
       at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.getMergedLogRecordScanner(RealtimeCompactedRecordReader.java:67)
       at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:50)
       at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
       at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
       at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:242)
       at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:776)
       at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:344)
       at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:540)
       at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:509)
       at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
       at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2691)
       at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)
       at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
       at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188)
       at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402)
       at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
       at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
       at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
       at java.lang.reflect.Method.invoke(Method.java:498)
       at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
       at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
   
   Caused by: java.lang.ClassNotFoundException: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
       at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
       at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
       at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
       at java.lang.Class.forName0(Native Method)
       at java.lang.Class.forName(Class.java:264)
       at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:39)
       ... 30 more
   Failed with exception java.io.IOException:org.apache.hudi.exception.HoodieIOException: IOException when reading log file`







[GitHub] [hudi] wangxianghu commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-25 Thread GitBox


wangxianghu commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r494100797



##
File path: hudi-cli/pom.xml
##
@@ -148,7 +148,14 @@
 
     <dependency>
       <groupId>org.apache.hudi</groupId>
-      <artifactId>hudi-client</artifactId>
+      <artifactId>hudi-client-common</artifactId>
+      <version>${project.version}</version>
+      <scope>test</scope>
+      <type>test-jar</type>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.hudi</groupId>
+      <artifactId>hudi-spark-client</artifactId>

Review comment:
   > cc @wangxianghu probably `hudi-client-spark` is easier on the eyes?
   
   I am OK with either :) Let's rename it to `hudi-client-spark`.









[GitHub] [hudi] bvaradar commented on pull request #1567: [HUDI-840]Clean blank file created by HoodieLogFormatWriter

2020-09-25 Thread GitBox


bvaradar commented on pull request #1567:
URL: https://github.com/apache/hudi/pull/1567#issuecomment-698715649


   @hddong : I went ahead and redid this change in the interest of time :)
   Instead of deleting the file on close, I have made changes to lazily create the log 
file when appending the next block. This should avoid creating empty files. 
   (cc @vinothchandar )
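   A minimal sketch of the lazy-creation idea (hypothetical `LazyLogWriter` class, not 
the actual HoodieLogFormatWriter API): the underlying file is only created on the 
first append, so a writer that is opened and closed without appending leaves nothing 
behind.
   
   import java.io.IOException;
   import java.io.OutputStream;
   import java.nio.file.Files;
   import java.nio.file.Path;
   
   class LazyLogWriter implements AutoCloseable {
     private final Path logFile;
     private OutputStream out; // stays null until the first block is appended
   
     LazyLogWriter(Path logFile) {
       this.logFile = logFile;
     }
   
     void appendBlock(byte[] block) throws IOException {
       if (out == null) {
         // Create the file lazily, on the first real write.
         out = Files.newOutputStream(logFile);
       }
       out.write(block);
     }
   
     @Override
     public void close() throws IOException {
       // If nothing was appended, no file was ever created,
       // so there is no empty file to clean up.
       if (out != null) {
         out.close();
       }
     }
   }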
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Karl-WangSK commented on pull request #2106: [HUDI-1284] preCombine all HoodieRecords and update all fields according to orderingVal

2020-09-25 Thread GitBox


Karl-WangSK commented on pull request #2106:
URL: https://github.com/apache/hudi/pull/2106#issuecomment-698291855







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar merged pull request #2109: [MINOR] Adding scripts to checkout and push to PRs

2020-09-25 Thread GitBox


bvaradar merged pull request #2109:
URL: https://github.com/apache/hudi/pull/2109


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu edited a comment on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-25 Thread GitBox


wangxianghu edited a comment on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-698161234







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #2085: [HUDI-1209] Properties File must be optional when running deltastreamer

2020-09-25 Thread GitBox


shenh062326 commented on a change in pull request #2085:
URL: https://github.com/apache/hudi/pull/2085#discussion_r494256445



##
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
##
@@ -112,9 +112,14 @@ public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, Con
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, Configuration conf,
  TypedProperties props) throws IOException {
 // Resolving the properties first in a consistent way
-this.properties = props != null ? props : UtilHelpers.readConfig(
-FSUtils.getFs(cfg.propsFilePath, jssc.hadoopConfiguration()),
-new Path(cfg.propsFilePath), cfg.configs).getConfig();
+if (props != null) {
+  this.properties = props;
+} else {

Review comment:
   sure
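   
   For illustration, a standalone sketch of the resolution order the hunk above 
implements (plain java.util.Properties and a local file stand in for TypedProperties 
and the FSUtils/UtilHelpers calls in the real code; the empty-path check is an 
assumption about how the properties file is made optional):
   
   import java.io.FileInputStream;
   import java.io.IOException;
   import java.util.List;
   import java.util.Properties;
   
   final class PropsResolver {
     static Properties resolve(Properties explicit, String propsFilePath,
                               List<String> overrides) throws IOException {
       if (explicit != null) {
         return explicit; // caller-supplied props always win
       }
       Properties props = new Properties();
       if (propsFilePath != null && !propsFilePath.isEmpty()) {
         // Properties file is optional: only loaded when a path was given.
         try (FileInputStream in = new FileInputStream(propsFilePath)) {
           props.load(in);
         }
       }
       // Apply --hoodie-conf style key=value overrides last.
       for (String kv : overrides) {
         String[] parts = kv.split("=", 2);
         props.setProperty(parts[0], parts.length > 1 ? parts[1] : "");
       }
       return props;
     }
   }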





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on a change in pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-25 Thread GitBox


satishkotha commented on a change in pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#discussion_r494727484



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##
@@ -586,24 +602,39 @@ public String startCommit() {
* @param instantTime Instant time to be generated
*/
   public void startCommitWithTime(String instantTime) {
+HoodieTableMetaClient metaClient = createMetaClient(true);

Review comment:
   This is not calling the function on the next line (it calls the method after 
that), so we only create the meta client once. Please double-check and let me know 
if I'm misinterpreting your suggestion.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-25 Thread GitBox


vinothchandar commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-698151969







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] SteNicholas commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-09-25 Thread GitBox


SteNicholas commented on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-698658606


   @leesf @bvaradar Could you please review this pull request?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial edited a comment on issue #2072: [SUPPORT] Hudi Pyspark Application Example

2020-09-25 Thread GitBox


ashishmgofficial edited a comment on issue #2072:
URL: https://github.com/apache/hudi/issues/2072#issuecomment-698472403


   @bvaradar @n3nash I can see that the rollback commands are looking for 
.rollback files in the .hoodie folder, but all I can see are .restore files.
   
   The above scenario happens when I issue show rollback --instant 
   
   But show rollbacks returns an empty dataframe.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] prashantwason commented on pull request #2064: WIP - [HUDI-842] Implementation of HUDI RFC-15.

2020-09-25 Thread GitBox


prashantwason commented on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-698099305


   @umehrot2  Directly using the hudi datasource or delta streamer for testing
   should work too. I haven't tested this yet, so please let me know if it
   doesn't work.
   
   Query-side changes are not implemented yet, so as of today this is an
   ingestion-side improvement.
   
   On Wed, Sep 23, 2020 at 7:09 PM Udit Mehrotra 
   wrote:
   
   > @umehrot2  I have updated the RFC doc with 
details
   > on how to test RFC-15
   > 
.
   > Please take a look and let me know if I can help in any way.
   >
   > @prashantwason  I missed your earlier
   > pings on this PR. We will start the testing of this PR with S3. Looking at
   > the testing details you provided, I am a bit confused, and have a couple of
   > questions:
   >
   >-
   >
   >Do we need to directly use HoodieWriteClient from spark-shell to be
   >able to test this ? Can't we directly use hudi datasource or delta
   >streamer for testing with following options set to true while writing:
   >hoodie.metadata.file.listings.enable,
   >hoodie.metadata.file.listings.verify ?
   >-
   >
   >Also for testing query performance using spark datasource, spark-sql,
   >hive and presto I am assuming it will detect that metadata table is
   >present and automatically use that for getting the list ?
   >
   > Will review the PR this week as well to understand the implementation
   > details.
   >
   > —
   > You are receiving this because you were mentioned.
   > Reply to this email directly, view it on GitHub
   > , or
   > unsubscribe
   > 

   > .
   >
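   
   For reference, a hedged sketch of what enabling the two options above through the 
datasource could look like (the option keys are copied verbatim from the quoted 
email; the table path, table name, and input frame are placeholders, and a real 
write would also need the usual record-key/partition options):
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;
   
   public class MetadataListingExample {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().appName("rfc15-test").getOrCreate();
       Dataset<Row> df = spark.read().format("parquet").load("/tmp/input"); // placeholder input
   
       df.write().format("hudi")
           .option("hoodie.table.name", "test_table")
           // keys taken from the email above: maintain and verify metadata-based listings
           .option("hoodie.metadata.file.listings.enable", "true")
           .option("hoodie.metadata.file.listings.verify", "true")
           .mode(SaveMode.Append)
           .save("/tmp/hudi/test_table");
     }
   }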
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu edited a comment on pull request #2105: [MINOR] Fix ClassCastException when use QuickstartUtils generate data

2020-09-25 Thread GitBox


wangxianghu edited a comment on pull request #2105:
URL: https://github.com/apache/hudi/pull/2105#issuecomment-698074779


   @bhasudha This exception occurs because the data-generation methods in 
`QuickstartUtils` treat the `ts` field as `long`, while the schema provided by 
`QuickstartUtils` declares `ts` as `double`. 
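   
   A minimal illustration of that kind of mismatch (the schema string below is a 
stand-in, not the actual QuickstartUtils schema): a field declared as double but 
populated with a long blows up when the value is later read back as a Double.
   
   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericData;
   import org.apache.avro.generic.GenericRecord;
   
   public class TsTypeMismatch {
     public static void main(String[] args) {
       Schema schema = new Schema.Parser().parse(
           "{\"type\":\"record\",\"name\":\"trip\",\"fields\":["
               + "{\"name\":\"ts\",\"type\":\"double\"}]}");
       GenericRecord rec = new GenericData.Record(schema);
       rec.put("ts", System.currentTimeMillis()); // a long goes in...
       // ...and the read side fails: ClassCastException, Long cannot be cast to Double
       double ts = (Double) rec.get("ts");
       System.out.println(ts);
     }
   }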



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ShortFinger commented on issue #143: Tracking ticket for folks to be added to slack group

2020-09-25 Thread GitBox


ShortFinger commented on issue #143:
URL: https://github.com/apache/hudi/issues/143#issuecomment-698296175


   please add linfour@gmail.com
   Thanks!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-25 Thread GitBox


wangxianghu commented on pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#issuecomment-698161234


   > @wangxianghu Looks like we have much less class splitting now. I want to 
try and reduce this further if possible.
   > If it's alright with you, I can take over from here, make some changes and 
push another commit on top of yours, to try and get this across the finish 
line. Want to coordinate so that we are not making parallel changes.
   
   @vinothchandar, yeah, of course you can take over from here; this will 
greatly speed up the process.
   thanks 👍 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-09-25 Thread GitBox


bvaradar commented on a change in pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#discussion_r493957561



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##
@@ -554,14 +608,16 @@ protected HoodieBaseFile addBootstrapBaseFileIfPresent(HoodieFileGroupId fileGro
   readLock.lock();
   String partition = formatPartitionKey(partitionStr);
   ensurePartitionLoadedCorrectly(partition);
-  return fetchAllStoredFileGroups(partition).map(fileGroup -> {
-    Option<FileSlice> fileSlice = fileGroup.getLatestFileSliceBeforeOrOn(maxInstantTime);
-    // if the file-group is under construction, pick the latest before compaction instant time.
-    if (fileSlice.isPresent()) {
-      fileSlice = Option.of(fetchMergedFileSlice(fileGroup, fileSlice.get()));
-    }
-    return fileSlice;
-  }).filter(Option::isPresent).map(Option::get).map(this::addBootstrapBaseFileIfPresent);
+  return fetchAllStoredFileGroups(partition)
+      .filter(fg -> !isFileGroupReplaced(fg.getFileGroupId()))

Review comment:
   Same case here, we need to use the maxInstantTime passed in here instead 
of the timeline's maxInstant.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##
@@ -425,10 +459,14 @@ protected HoodieBaseFile addBootstrapBaseFileIfPresent(HoodieFileGroupId fileGro
   readLock.lock();
   String partitionPath = formatPartitionKey(partitionStr);
   ensurePartitionLoadedCorrectly(partitionPath);
-  return fetchHoodieFileGroup(partitionPath, fileId).map(fileGroup -> fileGroup.getAllBaseFiles()
-      .filter(baseFile -> HoodieTimeline.compareTimestamps(baseFile.getCommitTime(), HoodieTimeline.EQUALS, instantTime))
-      .filter(df -> !isBaseFileDueToPendingCompaction(df)).findFirst().orElse(null))
-      .map(df -> addBootstrapBaseFileIfPresent(new HoodieFileGroupId(partitionPath, fileId), df));
+  if (isFileGroupReplaced(partitionPath, fileId)) {

Review comment:
   @satishkotha : Don't we need to use the instant time when checking for the 
replaced file here?
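   
   A sketch of the instant-aware check being asked for (the helper, map, and 
constant names here are assumptions for illustration, not the PR's actual code): a 
file group only counts as replaced, as of maxInstantTime, if the instant that 
replaced it is not after that time.
   
   import java.util.Map;
   import org.apache.hudi.common.model.HoodieFileGroupId;
   import org.apache.hudi.common.table.timeline.HoodieInstant;
   import org.apache.hudi.common.table.timeline.HoodieTimeline;
   
   final class ReplacedFileGroupCheck {
     static boolean isReplacedBeforeOrOn(Map<HoodieFileGroupId, HoodieInstant> replacedFileGroups,
                                         HoodieFileGroupId fgId, String maxInstantTime) {
       HoodieInstant replacedBy = replacedFileGroups.get(fgId); // replacing instant, if any
       return replacedBy != null
           && HoodieTimeline.compareTimestamps(replacedBy.getTimestamp(),
               HoodieTimeline.LESSER_THAN_OR_EQUALS, maxInstantTime);
     }
   }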

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/view/AbstractTableFileSystemView.java
##
@@ -727,6 +795,26 @@ private String formatPartitionKey(String partitionStr) {
*/
   abstract Stream<HoodieFileGroup> fetchAllStoredFileGroups();
 
+  /**
+   * Track instant time for file groups replaced.
+   */
+  protected abstract void resetReplacedFileGroups(final Map<HoodieFileGroupId, HoodieInstant> replacedFileGroups);
+
+  /**
+   * Track instant time for new file groups replaced.
+   */
+  protected abstract void addReplacedFileGroups(final Map<HoodieFileGroupId, HoodieInstant> replacedFileGroups);
+
+  /**
+   * Remove file groups that are replaced in any of the specified instants.
+   */
+  protected abstract void removeReplacedFileIds(Set<String> instants);

Review comment:
   rename to removeReplacedFileIdsAtInstants?

##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##
@@ -586,24 +602,39 @@ public String startCommit() {
* @param instantTime Instant time to be generated
*/
   public void startCommitWithTime(String instantTime) {
+HoodieTableMetaClient metaClient = createMetaClient(true);

Review comment:
   We are creating the metaclient and loading the timeline once here and again in 
the function called on the next line. Can you make sure you create the metaclient 
only once, without loading the timeline?

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/CommitUtils.java
##
@@ -0,0 +1,96 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.model.HoodieReplaceCommitMetadata;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.model.HoodieWriteStat;
+import org.apache.hudi.common.model.WriteOperationType;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.exception.Hood

[GitHub] [hudi] vinothchandar commented on a change in pull request #1827: [HUDI-1089] Refactor hudi-client to support multi-engine

2020-09-25 Thread GitBox


vinothchandar commented on a change in pull request #1827:
URL: https://github.com/apache/hudi/pull/1827#discussion_r494073875



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java
##
@@ -134,7 +138,7 @@ private void init(HoodieRecord record) {
   writeStatus.setPartitionPath(partitionPath);
   writeStatus.getStat().setPartitionPath(partitionPath);
   writeStatus.getStat().setFileId(fileId);
-  averageRecordSize = SizeEstimator.estimate(record);
+  averageRecordSize = sizeEstimator.sizeEstimate(record);
   try {

Review comment:
   Should be okay. 

##
File path: hudi-cli/pom.xml
##
@@ -148,7 +148,14 @@
 
 
   <groupId>org.apache.hudi</groupId>
-  <artifactId>hudi-client</artifactId>
+  <artifactId>hudi-client-common</artifactId>
+  <version>${project.version}</version>
+  <scope>test</scope>
+  <type>test-jar</type>
+</dependency>
+<dependency>
+  <groupId>org.apache.hudi</groupId>
+  <artifactId>hudi-spark-client</artifactId>

Review comment:
   cc @wangxianghu probably `hudi-client-spark` is easier on the eyes?  





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial commented on issue #2104: [SUPPORT] MOR Hive sync - _rt table read issue

2020-09-25 Thread GitBox


ashishmgofficial commented on issue #2104:
URL: https://github.com/apache/hudi/issues/2104#issuecomment-698567986


   @n3nash : The following is the stack trace I got when I queried the table from the Hive CLI:
   
   `2020-09-24 20:17:49,028 ERROR [39f399ee-de3f-4d33-a1cd-407d2e252f20 main] log.AbstractHoodieLogRecordScanner: Got exception when reading log file
   org.apache.hudi.exception.HoodieException: Unable to load class
   at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:42)
   at org.apache.hudi.common.util.ReflectionUtils.loadPayload(ReflectionUtils.java:62)
   at org.apache.hudi.common.util.SpillableMapUtils.convertToHoodieRecordPayload(SpillableMapUtils.java:110)
   at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processAvroDataBlock(AbstractHoodieLogRecordScanner.java:274)
   at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.processQueuedBlocksForInstant(AbstractHoodieLogRecordScanner.java:303)
   at org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner.scan(AbstractHoodieLogRecordScanner.java:236)
   at org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner.<init>(HoodieMergedLogRecordScanner.java:79)
   at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.getMergedLogRecordScanner(RealtimeCompactedRecordReader.java:67)
   at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:50)
   at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67)
   at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45)
   at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:242)
   at org.apache.hadoop.hive.ql.exec.FetchOperator$FetchInputFormatSplit.getRecordReader(FetchOperator.java:776)
   at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:344)
   at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:540)
   at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:509)
   at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:146)
   at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2691)
   at org.apache.hadoop.hive.ql.reexec.ReExecDriver.getResults(ReExecDriver.java:229)
   at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:259)
   at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188)
   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402)
   at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821)
   at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759)
   at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:498)
   at org.apache.hadoop.util.RunJar.run(RunJar.java:323)
   at org.apache.hadoop.util.RunJar.main(RunJar.java:236)
   Caused by: java.lang.ClassNotFoundException: org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:264)
   at org.apache.hudi.common.util.ReflectionUtils.getClass(ReflectionUtils.java:39)
   ... 30 more
   Failed with exception java.io.IOException:org.apache.hudi.exception.HoodieIOException: IOException when reading log file `



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] ashishmgofficial commented on issue #2072: [SUPPORT] Hudi Pyspark Application Example

2020-09-25 Thread GitBox


ashishmgofficial commented on issue #2072:
URL: https://github.com/apache/hudi/issues/2072#issuecomment-698449771







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2112: [HUDI-995] Migrate HoodieTestUtils APIs to HoodieTestTable

2020-09-25 Thread GitBox


yanghua commented on a change in pull request #2112:
URL: https://github.com/apache/hudi/pull/2112#discussion_r494952442



##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestCommitsCommand.java
##
@@ -168,10 +170,12 @@ public void testShowArchivedCommits() throws IOException {
 data.put("102", new Integer[] {25, 45});
 data.put("101", new Integer[] {35, 15});
 
-data.forEach((key, value) -> {
+for (Map.Entry<String, Integer[]> entry : data.entrySet()) {

Review comment:
   ditto

##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestCommitsCommand.java
##
@@ -76,17 +76,19 @@ public void init() throws IOException {
 "", TimelineLayoutVersion.VERSION_1, "org.apache.hudi.common.model.HoodieAvroPayload");
   }
 
-  private LinkedHashMap<String, Integer[]> generateData() {
+  private LinkedHashMap<String, Integer[]> generateData() throws Exception {
 // generate data and metadata
 LinkedHashMap<String, Integer[]> data = new LinkedHashMap<>();
 data.put("102", new Integer[] {15, 10});
 data.put("101", new Integer[] {20, 10});
 data.put("100", new Integer[] {15, 15});
 
-data.forEach((key, value) -> {
+for (Map.Entry<String, Integer[]> entry : data.entrySet()) {

Review comment:
   Why should we do this refactor?
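   
   A likely motivation, inferred from the hunk above where generateData() now 
declares `throws Exception`: a checked exception cannot propagate out of a forEach 
lambda, while a plain for loop lets it bubble up to the method signature.
   
   import java.io.IOException;
   import java.util.LinkedHashMap;
   import java.util.Map;
   
   public class ForEachVsFor {
     static void write(String key) throws IOException { /* may throw */ }
   
     public static void main(String[] args) throws IOException {
       LinkedHashMap<String, Integer[]> data = new LinkedHashMap<>();
       data.put("100", new Integer[] {15, 15});
   
       // data.forEach((k, v) -> write(k)); // does not compile: unhandled IOException
   
       for (Map.Entry<String, Integer[]> entry : data.entrySet()) {
         write(entry.getKey()); // fine: main declares throws IOException
       }
     }
   }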





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] dugenkui03 opened a new pull request #2113: [MINOR] fix typo

2020-09-25 Thread GitBox


dugenkui03 opened a new pull request #2113:
URL: https://github.com/apache/hudi/pull/2113


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
 - [ ] Has a corresponding JIRA in PR title & commit
 - [ ] Commit message is descriptive of the change
 - [ ] CI is green
 - [ ] Necessary doc changes done or have another open PR
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] SteNicholas commented on a change in pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-09-25 Thread GitBox


SteNicholas commented on a change in pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#discussion_r494938396



##
File path: hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java
##
@@ -54,13 +55,23 @@
*/
   private final WorkloadStat globalStat;
 
+  /**
+   * Write operation type.
+   */
+  private WriteOperationType operationType;

Review comment:
   @leesf Yes, WriteOperationType should be Serializable; I forgot to check 
this. I would like to add `implements Serializable`.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-09-25 Thread GitBox


leesf commented on pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#issuecomment-698882595


   @SteNicholas would you please also add some tests to the new changes?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #2111: [HUDI-1234] Insert new records regardless of small file when using insert operation

2020-09-25 Thread GitBox


leesf commented on a change in pull request #2111:
URL: https://github.com/apache/hudi/pull/2111#discussion_r494930532



##
File path: hudi-client/src/main/java/org/apache/hudi/table/WorkloadProfile.java
##
@@ -54,13 +55,23 @@
*/
   private final WorkloadStat globalStat;
 
+  /**
+   * Write operation type.
+   */
+  private WriteOperationType operationType;

Review comment:
   WriteOperationType should be Serializable?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1161) Support update partial fields for MoR table

2020-09-25 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf reassigned HUDI-1161:
---

Assignee: Nicholas Jiang  (was: leesf)

> Support update partial fields for MoR table
> ---
>
> Key: HUDI-1161
> URL: https://issues.apache.org/jira/browse/HUDI-1161
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: leesf
>Assignee: Nicholas Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] Karl-WangSK removed a comment on pull request #2106: [HUDI-1284] preCombine all HoodieRecords and update all fields according to orderingVal

2020-09-25 Thread GitBox


Karl-WangSK removed a comment on pull request #2106:
URL: https://github.com/apache/hudi/pull/2106#issuecomment-698291855







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Karl-WangSK commented on pull request #2106: [HUDI-1284] preCombine all HoodieRecords and update all fields according to orderingVal

2020-09-25 Thread GitBox


Karl-WangSK commented on pull request #2106:
URL: https://github.com/apache/hudi/pull/2106#issuecomment-698824053


   cc @yanghua @leesf  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] getniz commented on issue #2101: [SUPPORT]Unable to interpret Child JSON fields value as a separate columns rather it is loaded as one single field value. Any way to interpret that.

2020-09-25 Thread GitBox


getniz commented on issue #2101:
URL: https://github.com/apache/hudi/issues/2101#issuecomment-698761326


   @n3nash thanks for the detailed response. Options 1 & 3 I may not be able 
to consider, as I need to build this layer as immediate target tables for 
further consumption in the reporting layer. If I use option 2, can I consume 
the topic and flatten the schema in deltastreamer without staging, and then 
load directly to the immediate target layer using the above spark-submit batch 
command? Also, I came to know that Hudi supports the Confluent schema registry; 
in that case, if I get the JSON schema from the source and register it with the 
Confluent registry, can I achieve flattening of the file that way? Sorry, my 
questions may be silly sometimes, please bear with me, I'm a learner here : ) 
The objective of what I'm trying to do is to consume data from several topics 
in near real-time (all the topics' data are formatted/structured) and push it 
to the DataLake using Hudi. If I stage and transform it, I may end up eating time. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org