[GitHub] [hudi] hudi-bot commented on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#issuecomment-1008247208


   
   ## CI report:
   
   * 00221c82e8b1693280fd72625eafcd503d54323c UNKNOWN
   * 46053bb143d1fd1274ac466197cc9361708e738b UNKNOWN
   * 020118c4b169d47e668f37240388e8d1bbdfad70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4861)
 
   * 2722bfcfd29a95f27338c1c8b026185472eefba0 UNKNOWN
   * d361b823d0bf09c7ba103070d43cb81c6eb9467d UNKNOWN
   
   
   Bot commands: @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-3158) Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-08 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471296#comment-17471296
 ] 

Raymond Xu commented on HUDI-3158:
--

[~shivnarayan] [~dongkelun] the needed changes look bigger than I thought; we 
may skip this for the minor release.

> Reduce warn logs in Spark SQL INSERT OVERWRITE
> --
>
> Key: HUDI-3158
> URL: https://issues.apache.org/jira/browse/HUDI-3158
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.10.1
>
>
> {code:java}
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]{code}
> The goal is to reduce these repeated warn logs.
>  
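
One possible direction, as a hedged sketch only (the once-per-instant guard 
here is hypothetical, not the actual patch), is to deduplicate the warning:
{code:java}
// Hypothetical sketch: report each instant at most once instead of on every call.
// HoodieInstant is a real Hudi class; the WARNED_INSTANTS cache is illustrative.
private static final java.util.Set<String> WARNED_INSTANTS =
    java.util.concurrent.ConcurrentHashMap.newKeySet();

private static void warnOnceNoContent(HoodieInstant instant) {
  // Set.add returns true only the first time this instant's timestamp is seen.
  if (WARNED_INSTANTS.add(instant.getTimestamp())) {
    LOG.warn("No content found in requested file for instant " + instant);
  }
}
{code}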



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot removed a comment on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#issuecomment-1008245348


   
   ## CI report:
   
   * 00221c82e8b1693280fd72625eafcd503d54323c UNKNOWN
   * 46053bb143d1fd1274ac466197cc9361708e738b UNKNOWN
   * 020118c4b169d47e668f37240388e8d1bbdfad70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4861)
 
   * 2722bfcfd29a95f27338c1c8b026185472eefba0 UNKNOWN
   
   
   Bot commands: @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3158) Reduce warn logs in Spark SQL INSERT OVERWRITE

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3158:
-
Story Points: 1  (was: 0.5)

> Reduce warn logs in Spark SQL INSERT OVERWRITE
> --
>
> Key: HUDI-3158
> URL: https://issues.apache.org/jira/browse/HUDI-3158
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: 董可伦
>Priority: Major
>  Labels: pull-request-available, sev:normal
> Fix For: 0.10.1
>
>
> {code:java}
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]
> 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file 
> for instant [==>20220103192919722__replacecommit__REQUESTED]{code}
> The goal is to reduce these repeated warn logs.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[hudi] branch master updated (0d8ca8d -> 3679070)

2022-01-08 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 0d8ca8d  [HUDI-3104] Kafka-connect support of hadoop config 
environments and properties (#4451)
 add 3679070  [HUDI-3125] spark-sql write timestamp directly (#4471)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/keygen/RowKeyGeneratorHelper.java  |  19 +++-
 .../org/apache/hudi/AvroConversionHelper.scala |  15 ++-
 .../hudi/keygen/TestRowGeneratorHelper.scala   | 102 +
 .../apache/spark/sql/hudi/TestCreateTable.scala|  27 ++
 .../apache/spark/sql/hudi/TestInsertTable.scala|  32 +++
 5 files changed, 188 insertions(+), 7 deletions(-)
 create mode 100644 
hudi-client/hudi-spark-client/src/test/scala/org/apache/hudi/keygen/TestRowGeneratorHelper.scala


[jira] [Closed] (HUDI-3125) Spark SQL writing timestamp type doesn't need to disable `spark.sql.datetime.java8API.enabled` manually

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3125.

Resolution: Fixed

> Spark SQL writing timestamp type doesn't need to disable 
> `spark.sql.datetime.java8API.enabled` manually
> -
>
> Key: HUDI-3125
> URL: https://issues.apache.org/jira/browse/HUDI-3125
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Major
>  Labels: pull-request-available, sev:critical, user-support-issues
> Fix For: 0.10.1
>
>
> {code:java}
> create table h0_p(id int, name string, price double, dt timestamp) using hudi 
> partitioned by(dt) options(type = 'cow', primaryKey = 'id');
> insert into h0_p values (3, 'a1', 10, cast('2021-05-08 00:00:00' as 
> timestamp)); {code}
> By default, running the SQL above will throw an exception:
> {code:java}
> Caused by: java.lang.ClassCastException: java.time.Instant cannot be cast to 
> java.sql.Timestamp
>     at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8(AvroConversionHelper.scala:306)
>     at 
> org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8$adapted(AvroConversionHelper.scala:306)
>     at scala.Option.map(Option.scala:230) {code}
> We need to disable `spark.sql.datetime.java8API.enabled` manually to make it 
> work:
> {code:java}
> set spark.sql.datetime.java8API.enabled=false; {code}
> And the command must be executed at runtime. It doesn't work if this is 
> provided via the spark-sql command line: `spark-sql --conf 
> spark.sql.datetime.java8API.enabled=false`. That's because this config is 
> forcibly enabled when spark-sql launches.
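
A minimal hedged sketch of the fix direction (accepting both timestamp 
representations in the converter; illustrative only, not the actual patch in 
AvroConversionHelper):
{code:java}
// Illustrative: with spark.sql.datetime.java8API.enabled=true Spark hands back
// java.time.Instant, otherwise java.sql.Timestamp; both convert to epoch micros.
static long toEpochMicros(Object sparkTimestamp) {
  if (sparkTimestamp instanceof java.time.Instant) {
    java.time.Instant i = (java.time.Instant) sparkTimestamp;
    return i.getEpochSecond() * 1_000_000L + i.getNano() / 1_000L;
  }
  if (sparkTimestamp instanceof java.sql.Timestamp) {
    java.sql.Timestamp t = (java.sql.Timestamp) sparkTimestamp;
    return t.getTime() * 1_000L + (t.getNanos() % 1_000_000) / 1_000L;
  }
  throw new IllegalArgumentException("Unsupported timestamp type: " + sparkTimestamp.getClass());
}
{code}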



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] xushiyan merged pull request #4471: [HUDI-3125] spark-sql write timestamp directly

2022-01-08 Thread GitBox


xushiyan merged pull request #4471:
URL: https://github.com/apache/hudi/pull/4471


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#issuecomment-1003742831


   
   ## CI report:
   
   * 00221c82e8b1693280fd72625eafcd503d54323c UNKNOWN
   * 46053bb143d1fd1274ac466197cc9361708e738b UNKNOWN
   * 020118c4b169d47e668f37240388e8d1bbdfad70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4861)
 
   
   
   Bot commands: @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#issuecomment-1008245348


   
   ## CI report:
   
   * 00221c82e8b1693280fd72625eafcd503d54323c UNKNOWN
   * 46053bb143d1fd1274ac466197cc9361708e738b UNKNOWN
   * 020118c4b169d47e668f37240388e8d1bbdfad70 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4861)
 
   * 2722bfcfd29a95f27338c1c8b026185472eefba0 UNKNOWN
   
   
   Bot commands: @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-3198) Spark SQL create table should check partition fields

2022-01-08 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-3198:


 Summary: Spark SQL create table should check partition fields
 Key: HUDI-3198
 URL: https://issues.apache.org/jira/browse/HUDI-3198
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Spark Integration
Reporter: Raymond Xu


{code:sql}
create table hudi_cow_pt_tbl (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl';{code}
 

The following SQL should throw an exception about the invalid partition field `name`:
{code:sql}
create table hudi_cow_existing_tbl2 using hudi 
partitioned by (dt, name) 
location 'file:///tmp/hudi/hudi_cow_pt_tbl';
{code}
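
A hedged sketch of the kind of check this asks for (names and exception type 
illustrative; the real change would live in Hudi's Spark SQL create-table path):
{code:java}
// Illustrative: when creating a table over an existing Hudi table location,
// the declared partition fields must match the partition fields recorded in
// the existing table's hoodie.properties.
static void validatePartitionFields(java.util.List<String> declared,
                                    java.util.List<String> existing) {
  if (!declared.equals(existing)) {
    throw new IllegalArgumentException("Declared partition fields " + declared
        + " do not match the table's partition fields " + existing);
  }
}
// e.g. validatePartitionFields(Arrays.asList("dt", "name"), Arrays.asList("dt", "hh"))
// would throw for the second CREATE TABLE above.
{code}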



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3198) Spark SQL create table should check partition fields

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3198:
-
Fix Version/s: 0.11.0

> Spark SQL create table should check partition fields
> 
>
> Key: HUDI-3198
> URL: https://issues.apache.org/jira/browse/HUDI-3198
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Major
> Fix For: 0.11.0
>
>
> {code:sql}
> create table hudi_cow_pt_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts'
>  )
> partitioned by (dt, hh)
> location '/tmp/hudi/hudi_cow_pt_tbl';{code}
>  
> The following SQL should throw an exception about the invalid partition field `name`:
> {code:sql}
> create table hudi_cow_existing_tbl2 using hudi 
> partitioned by (dt, name) 
> location 'file:///tmp/hudi/hudi_cow_pt_tbl';
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3198) Spark SQL create table should check partition fields

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-3198:


Assignee: Yann Byron

> Spark SQL create table should check partition fields
> 
>
> Key: HUDI-3198
> URL: https://issues.apache.org/jira/browse/HUDI-3198
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Raymond Xu
>Assignee: Yann Byron
>Priority: Major
>
> {code:sql}
> create table hudi_cow_pt_tbl (
>   id bigint,
>   name string,
>   ts bigint,
>   dt string,
>   hh string
> ) using hudi
> tblproperties (
>   type = 'cow',
>   primaryKey = 'id',
>   preCombineField = 'ts'
>  )
> partitioned by (dt, hh)
> location '/tmp/hudi/hudi_cow_pt_tbl';{code}
>  
> The following SQL should throw an exception about the invalid partition field `name`:
> {code:sql}
> create table hudi_cow_existing_tbl2 using hudi 
> partitioned by (dt, name) 
> location 'file:///tmp/hudi/hudi_cow_pt_tbl';
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] yihua commented on pull request #3420: [HUDI-2283] Support Clustering Command For Spark Sql

2022-01-08 Thread GitBox


yihua commented on pull request #3420:
URL: https://github.com/apache/hudi/pull/3420#issuecomment-1008243291


   @pengzhiwei2018 Could you rebase the PR on the latest master to resolve 
the conflicts?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-3104] Kafka-connect support of hadoop config environments and properties (#4451)

2022-01-08 Thread yihua
This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 0d8ca8d  [HUDI-3104] Kafka-connect support of hadoop config 
environments and properties (#4451)
0d8ca8d is described below

commit 0d8ca8da4e0f6651bc1f06dba5e7e37881225fdc
Author: Thinking Chen 
AuthorDate: Sun Jan 9 15:10:17 2022 +0800

[HUDI-3104] Kafka-connect support of hadoop config environments and 
properties (#4451)
---
 .../hudi/connect/utils/KafkaConnectUtils.java  | 68 +
 .../hudi/connect/writers/KafkaConnectConfigs.java  | 29 +
 .../apache/hudi/connect/TestHdfsConfiguration.java | 69 ++
 .../src/test/resources/hadoop_conf/core-site.xml   | 33 +++
 .../src/test/resources/hadoop_conf/hdfs-site.xml   | 30 ++
 .../resources/hadoop_home/etc/hadoop/core-site.xml | 33 +++
 .../resources/hadoop_home/etc/hadoop/hdfs-site.xml | 30 ++
 7 files changed, 292 insertions(+)

diff --git 
a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
 
b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
index cf60b9e..cc37de2 100644
--- 
a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
+++ 
b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
@@ -49,9 +49,14 @@ import org.apache.log4j.Logger;
 
 import java.io.IOException;
 import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.FileVisitOption;
+import java.nio.file.Path;
+import java.nio.file.Paths;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.util.Arrays;
+import java.util.ArrayList;
 import java.util.List;
 import java.util.Map;
 import java.util.Objects;
@@ -65,6 +70,52 @@ public class KafkaConnectUtils {
 
   private static final Logger LOG = 
LogManager.getLogger(KafkaConnectUtils.class);
   private static final String HOODIE_CONF_PREFIX = "hoodie.";
+  public static final String HADOOP_CONF_DIR = "HADOOP_CONF_DIR";
+  public static final String HADOOP_HOME = "HADOOP_HOME";
+  private static final List<Path> DEFAULT_HADOOP_CONF_FILES;
+
+  static {
+DEFAULT_HADOOP_CONF_FILES = new ArrayList<>();
+try {
+  String hadoopConfigPath = System.getenv(HADOOP_CONF_DIR);
+  String hadoopHomePath = System.getenv(HADOOP_HOME);
+  DEFAULT_HADOOP_CONF_FILES.addAll(getHadoopConfigFiles(hadoopConfigPath, 
hadoopHomePath));
+  if (!DEFAULT_HADOOP_CONF_FILES.isEmpty()) {
+LOG.info(String.format("Found Hadoop default config files %s", 
DEFAULT_HADOOP_CONF_FILES));
+  }
+} catch (IOException e) {
+  LOG.error("An error occurred while getting the default Hadoop 
configuration. "
+  + "Please use hadoop.conf.dir or hadoop.home to configure Hadoop 
environment variables", e);
+}
+  }
+
+  /**
+   * Get hadoop config files by HADOOP_CONF_DIR or HADOOP_HOME
+   */
+  public static List<Path> getHadoopConfigFiles(String hadoopConfigPath, 
String hadoopHomePath)
+  throws IOException {
+List<Path> hadoopConfigFiles = new ArrayList<>();
+if (!StringUtils.isNullOrEmpty(hadoopConfigPath)) {
+  hadoopConfigFiles.addAll(walkTreeForXml(Paths.get(hadoopConfigPath)));
+}
+if (hadoopConfigFiles.isEmpty() && 
!StringUtils.isNullOrEmpty(hadoopHomePath)) {
+  hadoopConfigFiles.addAll(walkTreeForXml(Paths.get(hadoopHomePath, "etc", 
"hadoop")));
+}
+return hadoopConfigFiles;
+  }
+
+  /**
+   * Files walk to find xml
+   */
+  private static List<Path> walkTreeForXml(Path basePath) throws IOException {
+if (Files.notExists(basePath)) {
+  return new ArrayList<>();
+}
+return Files.walk(basePath, FileVisitOption.FOLLOW_LINKS)
+.filter(path -> path.toFile().isFile())
+.filter(path -> path.toString().endsWith(".xml"))
+.collect(Collectors.toList());
+  }
 
   public static int getLatestNumPartitions(String bootstrapServers, String 
topicName) {
 Properties props = new Properties();
@@ -89,6 +140,23 @@ public class KafkaConnectUtils {
*/
   public static Configuration getDefaultHadoopConf(KafkaConnectConfigs 
connectConfigs) {
 Configuration hadoopConf = new Configuration();
+
+// add hadoop config files
+if (!StringUtils.isNullOrEmpty(connectConfigs.getHadoopConfDir())
+|| !StringUtils.isNullOrEmpty(connectConfigs.getHadoopConfHome())) 
{
+  try {
+List<Path> configFiles = 
getHadoopConfigFiles(connectConfigs.getHadoopConfDir(),
+connectConfigs.getHadoopConfHome());
+configFiles.forEach(f ->
+hadoopConf.addResource(new 
org.apache.hadoop.fs.Path(f.toAbsolutePath().toUri(;
+  } catch (Exception e) {
+throw new HoodieException("
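
A hedged usage sketch of the helper added above (directory paths illustrative; 
getHadoopConfigFiles throws IOException, so call it from a throwing context):

    List<java.nio.file.Path> configFiles =
        KafkaConnectUtils.getHadoopConfigFiles("/etc/hadoop/conf", "/opt/hadoop");
    org.apache.hadoop.conf.Configuration hadoopConf =
        new org.apache.hadoop.conf.Configuration();
    // Register each discovered *-site.xml, mirroring getDefaultHadoopConf above.
    configFiles.forEach(f ->
        hadoopConf.addResource(new org.apache.hadoop.fs.Path(f.toAbsolutePath().toUri())));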

[GitHub] [hudi] yihua merged pull request #4451: [HUDI-3104] Kafka-connect support hadoop config environments and properties

2022-01-08 Thread GitBox


yihua merged pull request #4451:
URL: https://github.com/apache/hudi/pull/4451


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on a change in pull request #4458: [HUDI-3112] Fix KafkaConnect can not sync to Hive Problem

2022-01-08 Thread GitBox


yihua commented on a change in pull request #4458:
URL: https://github.com/apache/hudi/pull/4458#discussion_r780742854



##
File path: 
hudi-kafka-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectTransactionServices.java
##
@@ -185,20 +187,50 @@ private void syncMeta() {
   }
 
   private void syncHive() {
-HiveSyncConfig hiveSyncConfig = DataSourceUtils.buildHiveSyncConfig(
-new TypedProperties(connectConfigs.getProps()),
-tableBasePath,
-"PARQUET");
+HiveSyncConfig hiveSyncConfig = buildSyncConfig(new 
TypedProperties(connectConfigs.getProps()), tableBasePath);
+String url;
+if (!StringUtils.isNullOrEmpty(hiveSyncConfig.syncMode) && 
HiveSyncMode.of(hiveSyncConfig.syncMode) == HiveSyncMode.HMS) {
+  url = hadoopConf.get(KafkaConnectConfigs.HIVE_METASTORE_URIS);
+} else {
+  url = hiveSyncConfig.jdbcUrl;
+}
+
 LOG.info("Syncing target hoodie table with hive table("
 + hiveSyncConfig.tableName
 + "). Hive metastore URL :"
-+ hiveSyncConfig.jdbcUrl
++ url
 + ", basePath :" + tableBasePath);
-LOG.info("Hive Sync Conf => " + hiveSyncConfig.toString());
+LOG.info("Hive Sync Conf => " + hiveSyncConfig);
 FileSystem fs = FSUtils.getFs(tableBasePath, hadoopConf);
 HiveConf hiveConf = new HiveConf();
 hiveConf.addResource(fs.getConf());
 LOG.info("Hive Conf => " + hiveConf.getAllProperties().toString());
 new HiveSyncTool(hiveSyncConfig, hiveConf, fs).syncHoodieTable();
   }
+
+  /**
+   * Build Hive Sync Config
+   */
+  public HiveSyncConfig buildSyncConfig(TypedProperties props, String 
tableBasePath) {

Review comment:
   Let's move this util method to `KafkaConnectUtils` class.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


dongkelun commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780742793



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -90,6 +90,11 @@
   // It is here so that both the client and deltastreamer use the same 
reference
   public static final String DELTASTREAMER_CHECKPOINT_KEY = 
"deltastreamer.checkpoint.key";
 
+  public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
+  .key("hoodie.database.name")
+  .defaultValue("default")

Review comment:
   Would it be better if databaseName had a default value, like hive?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on a change in pull request #4458: [HUDI-3112] Fix KafkaConnect can not sync to Hive Problem

2022-01-08 Thread GitBox


yihua commented on a change in pull request #4458:
URL: https://github.com/apache/hudi/pull/4458#discussion_r780742608



##
File path: 
hudi-kafka-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectTransactionServices.java
##
@@ -185,20 +187,50 @@ private void syncMeta() {
   }
 
   private void syncHive() {
-HiveSyncConfig hiveSyncConfig = DataSourceUtils.buildHiveSyncConfig(
-new TypedProperties(connectConfigs.getProps()),
-tableBasePath,
-"PARQUET");
+HiveSyncConfig hiveSyncConfig = buildSyncConfig(new 
TypedProperties(connectConfigs.getProps()), tableBasePath);
+String url;
+if (!StringUtils.isNullOrEmpty(hiveSyncConfig.syncMode) && 
HiveSyncMode.of(hiveSyncConfig.syncMode) == HiveSyncMode.HMS) {
+  url = hadoopConf.get(KafkaConnectConfigs.HIVE_METASTORE_URIS);
+} else {
+  url = hiveSyncConfig.jdbcUrl;
+}
+
 LOG.info("Syncing target hoodie table with hive table("
 + hiveSyncConfig.tableName
 + "). Hive metastore URL :"
-+ hiveSyncConfig.jdbcUrl
++ url
 + ", basePath :" + tableBasePath);
-LOG.info("Hive Sync Conf => " + hiveSyncConfig.toString());
+LOG.info("Hive Sync Conf => " + hiveSyncConfig);
 FileSystem fs = FSUtils.getFs(tableBasePath, hadoopConf);
 HiveConf hiveConf = new HiveConf();
 hiveConf.addResource(fs.getConf());
 LOG.info("Hive Conf => " + hiveConf.getAllProperties().toString());
 new HiveSyncTool(hiveSyncConfig, hiveConf, fs).syncHoodieTable();
   }
+
+  /**
+   * Build Hive Sync Config
+   */
+  public HiveSyncConfig buildSyncConfig(TypedProperties props, String 
tableBasePath) {

Review comment:
   @cdmikechen Understood. I'm thinking about only moving util methods 
related to Hive sync configs, not the Hive sync logic, to a separate util 
class. The worry I have is that Hive sync configs are spread across different 
places now, and they may diverge if we forget to update all of them 
consistently.
   
   We can keep this PR as is for now. @cdmikechen could you create a Jira 
ticket to track the Hive sync config unification, to be done in a separate PR 
in the future?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


dongkelun commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780742376



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java
##
@@ -117,9 +124,10 @@ private void parseInputPaths(Path[] inputPaths, 
List<String> incrementalTables)
 }
   }
 
-  private void tagAsIncrementalOrSnapshot(Path inputPath, String tableName,
+  private void tagAsIncrementalOrSnapshot(Path inputPath, String databaseName, 
String tableName,

Review comment:
   Yes, but the outer layer also needs databaseName and tableName. I'm not 
sure which is better.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


dongkelun commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780741795



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -90,6 +90,11 @@
   // It is here so that both the client and deltastreamer use the same 
reference
   public static final String DELTASTREAMER_CHECKPOINT_KEY = 
"deltastreamer.checkpoint.key";
 
+  public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
+  .key("hoodie.database.name")
+  .defaultValue("default")

Review comment:
   If there is no default value, must it be set, like tableName?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


dongkelun commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780741275



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala
##
@@ -33,6 +33,10 @@ import scala.collection.JavaConverters._
 class TestCreateTable extends TestHoodieSqlBase {
 
   test("Test Create Managed Hoodie Table") {
+val databaseName = "test_incremental"

Review comment:
   ok




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


dongkelun commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780741202



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -90,6 +90,11 @@
   // It is here so that both the client and deltastreamer use the same 
reference
   public static final String DELTASTREAMER_CHECKPOINT_KEY = 
"deltastreamer.checkpoint.key";
 
+  public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty

Review comment:
   Yes, just to keep consistent with the other existing parameters. If not, 
should we leave the other parameters unchanged for the time being, or is it 
better to revise them uniformly?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3197) Validate partition pruning with Hudi

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3197:
-
Reviewers: Raymond Xu, sivabalan narayanan  (was: sivabalan narayanan)

> Validate partition pruning with Hudi
> 
>
> Key: HUDI-3197
> URL: https://issues.apache.org/jira/browse/HUDI-3197
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: sivabalan narayanan
>Priority: Major
> Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot 
> 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, 
> Screen Shot 2022-01-08 at 3.26.53 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (HUDI-3197) Validate partition pruning with Hudi

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-3197:


Assignee: sivabalan narayanan  (was: Raymond Xu)

> Validate partition pruning with Hudi
> 
>
> Key: HUDI-3197
> URL: https://issues.apache.org/jira/browse/HUDI-3197
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: sivabalan narayanan
>Priority: Major
> Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot 
> 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, 
> Screen Shot 2022-01-08 at 3.26.53 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3195) optimize spark3 pom and modify build command

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3195.

Resolution: Fixed

> optimize spark3 pom and modify build command
> 
>
> Key: HUDI-3195
> URL: https://issues.apache.org/jira/browse/HUDI-3195
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Yann Byron
>Assignee: Yann Byron
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.10.1
>
>
> When using `mvn clean install -P "scala-2.12,spark3,spark3.1.x"` to build 
> and then writing data with Spark, the error is as follows: 
>  
>  
> {code:java}
> ERROR Executor: Exception in task 0.0 in stage 26.0 (TID 2005)
> java.lang.RuntimeException: org.apache.hudi.exception.HoodieException: 
> org.apache.hudi.exception.HoodieException: 
> java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: 
> org/apache/parquet/schema/LogicalTypeAnnotation$LogicalTypeAnnotationVisitor
>     at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>     at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
>     at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
>     at 
> org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221)
>  {code}
>  
>  
> That's due to the use of Parquet 1.12.1.
>  
> Also, we use this build command in CI, which can mislead users.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3190) Validate and certify partition pruning for hudi tables w/ spark queries

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3190:
-
Fix Version/s: (was: 0.10.1)

> Validate and certify partition pruning for hudi tables w/ spark queries
> ---
>
> Key: HUDI-3190
> URL: https://issues.apache.org/jira/browse/HUDI-3190
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
>
> Validate and certify partition pruning for hudi tables w/ spark queries



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3190) Validate and certify partition pruning for hudi tables w/ spark queries

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3190.

  Assignee: (was: Raymond Xu)
Resolution: Duplicate

> Validate and certify partition pruning for hudi tables w/ spark queries
> ---
>
> Key: HUDI-3190
> URL: https://issues.apache.org/jira/browse/HUDI-3190
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Spark Integration
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.10.1
>
>
> Validate and certify partition pruning for hudi tables w/ spark queries



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (HUDI-3197) Validate partition pruning with Hudi

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3197.

 Reviewers: sivabalan narayanan
Resolution: Done

> Validate partition pruning with Hudi
> 
>
> Key: HUDI-3197
> URL: https://issues.apache.org/jira/browse/HUDI-3197
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot 
> 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, 
> Screen Shot 2022-01-08 at 3.26.53 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 1/9/22, 6:11 AM:
--

Hi [~shivnarayan], a basic question:

I am trying to use just the base path without wildcards, but I'm facing this 
issue.

The table effectively has two columns with the same name.

The table is created using a timestamp column for key generation, mapped to a 
date partition using hoodie.datasource.write.partitionpath.field=entrydate:timestamp.

So the partition path is entrydate=yyyy/mm/dd. 


{code:java}
import org.apache.hudi.DataSourceReadOptions
import org.apache.hudi.common.config.HoodieMetadataConfig

val df = spark.
  read.
  format("org.apache.hudi").
  option(HoodieMetadataConfig.ENABLE.key(), "true").
  option(DataSourceReadOptions.QUERY_TYPE.key(), 
DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL).
  load("s3a://datalake-hudi/sessions_by_entrydate/")

df.createOrReplaceTempView("sessions")

spark.sql("SELECT count(*) FROM sessions").show() {code}
Without wildcards, Spark infers the column type and the query fails with: 
{code:java}
Caused by: java.lang.ClassCastException: 
org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long
  at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42)
  at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42)
  at 
org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195)
  at 
org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:98)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:230)
  at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:249)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:331)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169)
  at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
  at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
 Source)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
  at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132)
  at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
  at org.apache.spark.scheduler.Task.run(Task.scala:131)
  at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
 {code}


was (Author: h7kanna):
Hi [~shivnarayan], a basic question:

I am trying to use just the base path without wildcards, but I'm facing this 
issue.

The table effectively has two columns with the same name.

The table is created using a timestamp column for key generation, mapped to a 
date partition using hoodie.datasource.write.partitionpath.field=entrydate:timestamp.

So the partition path is entrydate=yyyy/mm/dd. 

Without wildcards, Spark infers the column type and the query fails with: 

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Long

 

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> --

[GitHub] [hudi] YannByron commented on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#issuecomment-1008236642


   @dongkelun LGTM, just left some minor comments. @xushiyan please further 
review whether this strategy makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780737792



##
File path: 
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieHFileInputFormat.java
##
@@ -235,6 +235,50 @@ public void testIncrementalSimple() throws IOException {
 FileStatus[] files = inputFormat.listStatus(jobConf);
 assertEquals(0, files.length,
 "We should exclude commit 100 when returning incremental pull with 
start commit time as 100");
+
+InputFormatTestUtil.setupIncremental(jobConf, "100", 1, true);
+
+files = inputFormat.listStatus(jobConf);
+assertEquals(10, files.length,
+"When hoodie.incremental.use.database is true and the incremental 
database name is not set,"

Review comment:
   More indentation is needed here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780737251



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java
##
@@ -117,9 +124,10 @@ private void parseInputPaths(Path[] inputPaths, 
List<String> incrementalTables)
 }
   }
 
-  private void tagAsIncrementalOrSnapshot(Path inputPath, String tableName,
+  private void tagAsIncrementalOrSnapshot(Path inputPath, String databaseName, 
String tableName,

Review comment:
   Can we change the definition of this method to 
tagAsIncrementalOrSnapshot(Path inputPath, HoodieTableMetaClient metaClient, 
List<String> incrementalTables)?
   Inside, get the database name and the table name.
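   
   A hedged sketch of that suggestion (body illustrative; assumes the database 
name accessor this PR adds to HoodieTableConfig):
   
       private void tagAsIncrementalOrSnapshot(Path inputPath,
           HoodieTableMetaClient metaClient, List<String> incrementalTables) {
         // Derive both names from the metaClient instead of passing them in.
         String databaseName = metaClient.getTableConfig().getDatabaseName(); // assumed accessor
         String tableName = metaClient.getTableConfig().getTableName();
         // ... existing incremental-vs-snapshot tagging logic, keyed by database + table ...
       }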




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780736806



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java
##
@@ -175,4 +177,9 @@ private static HoodieTimeline filterIfInstantExists(String 
tableName, HoodieTime
 }
 return timeline.findInstantsBeforeOrEquals(maxCommit);
   }
+
+  public static boolean isIncrementalUseDatabase(JobContext job) {
+return job.getConfiguration().get(HOODIE_INCREMENTAL_USE_DATABASE, 
String.valueOf(DEFAULT_INCREMENTAL_USE_DATABASE))

Review comment:
   Could this be `job.getConfiguration().getBoolean(HOODIE_INCREMENTAL_USE_DATABASE, 
false)`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780736465



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java
##
@@ -46,6 +46,7 @@
  */
 public class HoodieTestUtils {
 
+  public static final String INCREMENTAL_DATABASE_NAME = "test_incremental";

Review comment:
   Can we use a more common database name, like hoodie_database?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan edited a comment on issue #4170: [SUPPORT] Understanding Clustering Behavior

2022-01-08 Thread GitBox


nsivabalan edited a comment on issue #4170:
URL: https://github.com/apache/hudi/issues/4170#issuecomment-1008232146


   @rubenssoto : hey, any updates on this, please? Unless we get more 
logs, we can't do much here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4170: [SUPPORT] Understanding Clustering Behavior

2022-01-08 Thread GitBox


nsivabalan commented on issue #4170:
URL: https://github.com/apache/hudi/issues/4170#issuecomment-1008232146


   @rubenssoto : hey, any updates on this, please? Unless we get more 
logs, we can't help much here. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4200: spark-sql query timestamp partition error

2022-01-08 Thread GitBox


nsivabalan commented on issue #4200:
URL: https://github.com/apache/hudi/issues/4200#issuecomment-1008232053


   thanks for confirming. appreciate it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan closed issue #4200: spark-sql query timestamp partition error

2022-01-08 Thread GitBox


nsivabalan closed issue #4200:
URL: https://github.com/apache/hudi/issues/4200


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780735886



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -90,6 +90,11 @@
   // It is here so that both the client and deltastreamer use the same 
reference
   public static final String DELTASTREAMER_CHECKPOINT_KEY = 
"deltastreamer.checkpoint.key";
 
+  public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
+  .key("hoodie.database.name")
+  .defaultValue("default")

Review comment:
   Do not set a default value, to keep compatibility.
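   
   A hedged sketch of the no-default variant (assuming ConfigProperty's 
noDefaultValue() builder, which other Hudi configs use; doc string illustrative):
   
       public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty
           .key("hoodie.database.name")
           .noDefaultValue() // no default, per this suggestion, to keep compatibility
           .withDocumentation("Database name for this Hudi table");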




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4208: [SUPPORT] On Hudi 0.9.0 - Alter table throws java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.alterTable(java.lang.String,

2022-01-08 Thread GitBox


nsivabalan commented on issue #4208:
URL: https://github.com/apache/hudi/issues/4208#issuecomment-1008232003


   @YannByron @xushiyan : can you folks please follow up on this? 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4230: [SUPPORT] org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file

2022-01-08 Thread GitBox


nsivabalan commented on issue #4230:
URL: https://github.com/apache/hudi/issues/4230#issuecomment-1008231940


   @yihua : gentle ping to follow up on the issue. If there is some regression, 
we might want to fix it in 0.10.1. Would appreciate it if you can follow up on this. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4439: [BUG] ROLLBACK meet Cannot use marker based rollback strategy on completed error

2022-01-08 Thread GitBox


nsivabalan commented on issue #4439:
URL: https://github.com/apache/hudi/issues/4439#issuecomment-1008231844


   Hey @waywtdcc : let us know if you are looking for any more assistance. If 
not, feel free to close out the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780735803



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/catalyst/catalog/HoodieCatalogTable.scala
##
@@ -164,9 +169,14 @@ class HoodieCatalogTable(val spark: SparkSession, val 
table: CatalogTable) exten
 val properties = new Properties()
 properties.putAll(tableConfigs.asJava)
 
+val newDatabaseName = if (hoodieTableExists) databaseName else

Review comment:
   newDatabaseName  => hoodieDatabaseName




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4429: [SUPPORT] Spark SQL CTAS command doesn't work with 0.10.0 version and Spark 3.1.1

2022-01-08 Thread GitBox


nsivabalan commented on issue #4429:
URL: https://github.com/apache/hudi/issues/4429#issuecomment-1008231688


   Hey folks, if the issue is resolved, can we close out the GitHub issue? 
Thanks to Yann for the quick turnaround.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan closed issue #4419: [SUPPORT] Not An Avro File (flink)

2022-01-08 Thread GitBox


nsivabalan closed issue #4419:
URL: https://github.com/apache/hudi/issues/4419


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4419: [SUPPORT] Not An Avro File (flink)

2022-01-08 Thread GitBox


nsivabalan commented on issue #4419:
URL: https://github.com/apache/hudi/issues/4419#issuecomment-1008231564


   thanks for confirming. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4457: [SUPPORT] Hudi archive stopped working

2022-01-08 Thread GitBox


nsivabalan commented on issue #4457:
URL: https://github.com/apache/hudi/issues/4457#issuecomment-1008231502


   @zuyanton : hey, do you have any updates for us? 
   CC @prashantwason: does anything pop up for you?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4461: [SUPPORT]Hudi(0.10.0) write to Aliyun oss using metadata table warning

2022-01-08 Thread GitBox


nsivabalan commented on issue #4461:
URL: https://github.com/apache/hudi/issues/4461#issuecomment-1008231420


   @nikenfls : do you have any updates for us? If the issue is resolved, can we 
close out the GitHub issue? Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780735473



##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala
##
@@ -33,6 +33,10 @@ import scala.collection.JavaConverters._
 class TestCreateTable extends TestHoodieSqlBase {
 
   test("Test Create Managed Hoodie Table") {
+val databaseName = "test_incremental"

Review comment:
   test_incremental => hudi_database?

##
File path: 
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala
##
@@ -332,14 +344,19 @@ class TestCreateTable extends TestHoodieSqlBase {
 
   test("Test Create Table From Exist Hoodie Table") {
 withTempDir { tmp =>
+  val databaseName = "test_incremental"

Review comment:
   ditto




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4456: [SUPPORT] MultiWriter w/ DynamoDB - Unable to acquire lock, lock object null

2022-01-08 Thread GitBox


nsivabalan commented on issue #4456:
URL: https://github.com/apache/hudi/issues/4456#issuecomment-1008231311


   @nochimow : a gentle reminder to respond to the above question. The commenter 
above is a Hudi committer who added the DynamoDB lock provider, so he should be 
able to help in your case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4434: [SUPPORT]why are there many files under the Hoodie file?

2022-01-08 Thread GitBox


nsivabalan commented on issue #4434:
URL: https://github.com/apache/hudi/issues/4434#issuecomment-1008231118


   @tieke1121 : hey, are you looking for more info? Let us know; if not, feel 
free to close out the GitHub issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4477: [SUPPORT]using spark on TimestampBasedKeyGenerator has no result when query by partition column

2022-01-08 Thread GitBox


nsivabalan commented on issue #4477:
URL: https://github.com/apache/hudi/issues/4477#issuecomment-1008231035


   @YannByron : may I know what the tracking ticket is? If there isn't one, can 
we create one for the issue reported in this GitHub issue?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan closed issue #4474: [SUPPORT] Should we shade all aws dependencies to avoid class conflicts?

2022-01-08 Thread GitBox


nsivabalan closed issue #4474:
URL: https://github.com/apache/hudi/issues/4474


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4474: [SUPPORT] Should we shade all aws dependencies to avoid class conflicts?

2022-01-08 Thread GitBox


nsivabalan commented on issue #4474:
URL: https://github.com/apache/hudi/issues/4474#issuecomment-1008230937


   Closing the GitHub issue as we have a tracking JIRA. Thank you folks for 
chiming in. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4541: [SUPPORT] NullPointerException while writing Bulk ingest table

2022-01-08 Thread GitBox


nsivabalan commented on issue #4541:
URL: https://github.com/apache/hudi/issues/4541#issuecomment-1008230765


   Let's try removing some advanced configs and test whether a simple job 
succeeds; then we can add configs back to pinpoint the issue.
   
   - I see you have added a lot of custom index configs. Can we remove them 
for now?
   ```
   'hoodie.bloom.index.bucketized.checking': True,
   'hoodie.bloom.index.keys.per.bucket': 5000,
   'hoodie.index.bloom.num_entries': 100,
   'hoodie.bloom.index.use.caching': True,
   'hoodie.bloom.index.use.treebased.filter': True,
   'hoodie.bloom.index.filter.type': 'DYNAMIC_V0',
   'hoodie.bloom.index.filter.dynamic.max.entries': 100,
   'hoodie.bloom.index.prune.by.ranges': True,
   ```
   - 'write.parquet.block.size': 256 seems very low. Can we remove this for 
now?
   - I see the exception arises from the clustering code. Let's remove the 
clustering configs for now:
   ```
   'hoodie.clustering.inline': True,
   'hoodie.clustering.inline.max.commits': '1',
   'hoodie.clustering.plan.strategy.small.file.limit': '1073741824',
   'hoodie.clustering.plan.strategy.target.file.max.bytes': 
'2147483648',
   'hoodie.clustering.execution.strategy.class':
   'org.apache.hudi.client.clustering.run.strategy'
   '.SparkSortAndSizeExecutionStrategy',
   'hoodie.clustering.plan.strategy.sort.columns': sort_cols,
   ```
   
   Let's see if the job succeeds after making the above modifications, and we 
can go from there.
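
   For reference, a trimmed bulk_insert sketch (spark-shell; table/path/field 
names are placeholders, not from your job):
   
   ```scala
   // Minimal Hudi bulk_insert with the advanced index/clustering knobs removed,
   // to isolate the NPE; add config groups back one at a time afterwards.
   import org.apache.spark.sql.SaveMode
   
   val df = spark.range(0, 1000).selectExpr("id", "cast(id % 10 as string) as dt")
   
   df.write.format("hudi").
     option("hoodie.table.name", "npe_repro_tbl").
     option("hoodie.datasource.write.operation", "bulk_insert").
     option("hoodie.datasource.write.recordkey.field", "id").
     option("hoodie.datasource.write.precombine.field", "id").
     option("hoodie.datasource.write.partitionpath.field", "dt").
     mode(SaveMode.Overwrite).
     save("/tmp/hudi/npe_repro_tbl")
   ```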
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL

2022-01-08 Thread GitBox


YannByron commented on a change in pull request #4083:
URL: https://github.com/apache/hudi/pull/4083#discussion_r780735306



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -90,6 +90,11 @@
   // It is here so that both the client and deltastreamer use the same 
reference
   public static final String DELTASTREAMER_CHECKPOINT_KEY = 
"deltastreamer.checkpoint.key";
 
+  public static final ConfigProperty<String> DATABASE_NAME = ConfigProperty

Review comment:
   Better to point to the definition of `HoodieTableConfig.DATABASE_NAME` 
directly, to avoid defining it repeatedly.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #4539: [SUPPORT] spark 2.4.0 write data to hudi ERROR (0.10.0)

2022-01-08 Thread GitBox


nsivabalan commented on issue #4539:
URL: https://github.com/apache/hudi/issues/4539#issuecomment-1008230027


   Spark 2.4.0 is not supported. Can you try with 2.4.3 or a higher Spark version?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-3197) Validate partition pruning with Hudi

2022-01-08 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471283#comment-17471283
 ] 

Raymond Xu edited comment on HUDI-3197 at 1/9/22, 4:49 AM:
---

{code:java}
-- create a partitioned, preCombineField-provided cow table
create table hudi_cow_pt_tbl (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl';

-- insert sample data across partitions
insert into hudi_cow_pt_tbl
(id, name, ts, dt, hh)
values
(1, 'foo1', 1000, '20210701', '11'),
(2, 'foo2', 1001, '20210701', '12'),
(3, 'foo3', 1003, '20210701', '13'),
(4, 'foo4', 1004, '20210701', '14');

-- create an external Hudi table
create table hudi_cow_existing_tbl using hudi
partitioned by (dt, hh)
location 'file:///tmp/hudi/hudi_cow_pt_tbl';

-- query with partition pruning
select * from hudi_cow_existing_tbl where dt = '20210701' and hh = '13'; {code}


This validates that a Spark SQL-created table works with partition pruning.
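
A quick way to confirm the pruning from spark-shell (a sketch; assumes only the 
table created by the SQL above):
{code:scala}
// Inspect the physical plan: with pruning working, the dt/hh filters are
// pushed down and only the dt=20210701/hh=13 path is listed.
spark.sql(
  "select * from hudi_cow_existing_tbl where dt = '20210701' and hh = '13'"
).explain(true)
{code}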


was (Author: xushiyan):
{code:java}
-- create a partitioned, preCombineField-provided cow table
create table hudi_cow_pt_tbl (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl';

-- insert sample data across partitions
insert into hudi_cow_pt_tbl
(id, name, ts, dt, hh)
values
(1, 'foo1', 1000, '20210701', '11'),
(2, 'foo2', 1001, '20210701', '12'),
(3, 'foo3', 1003, '20210701', '13'),
(4, 'foo4', 1004, '20210701', '14');

-- create an external Hudi table
create table hudi_cow_existing_tbl using hudi
partitioned by (dt, hh)
location 'file:///tmp/hudi/hudi_cow_pt_tbl';

-- query with partition pruning
select * from hudi_cow_existing_tbl where dt = '20210701' and hh = '13'; {code}

> Validate partition pruning with Hudi
> 
>
> Key: HUDI-3197
> URL: https://issues.apache.org/jira/browse/HUDI-3197
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot 
> 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, 
> Screen Shot 2022-01-08 at 3.26.53 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-44) Compaction must preserve commit timestamps of merged records #376

2022-01-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-44:

Status: Resolved  (was: Patch Available)

> Compaction must preserve commit timestamps of merged records #376
> -
>
> Key: HUDI-44
> URL: https://issues.apache.org/jira/browse/HUDI-44
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Compaction
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: core-flow-ds, help-requested, pull-request-available, 
> sev:critical
> Fix For: 0.10.1
>
>
> https://github.com/uber/hudi/issues/376



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-3197) Validate partition pruning with Hudi

2022-01-08 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471283#comment-17471283
 ] 

Raymond Xu commented on HUDI-3197:
--

{code:java}
-- create a partitioned, preCombineField-provided cow table
create table hudi_cow_pt_tbl (
  id bigint,
  name string,
  ts bigint,
  dt string,
  hh string
) using hudi
tblproperties (
  type = 'cow',
  primaryKey = 'id',
  preCombineField = 'ts'
 )
partitioned by (dt, hh)
location '/tmp/hudi/hudi_cow_pt_tbl';

-- insert sample data across partitions
insert into hudi_cow_pt_tbl
(id, name, ts, dt, hh)
values
(1, 'foo1', 1000, '20210701', '11'),
(2, 'foo2', 1001, '20210701', '12'),
(3, 'foo3', 1003, '20210701', '13'),
(4, 'foo4', 1004, '20210701', '14');

-- create an external Hudi table
create table hudi_cow_existing_tbl using hudi
partitioned by (dt, hh)
location 'file:///tmp/hudi/hudi_cow_pt_tbl';

-- query with partition pruning
select * from hudi_cow_existing_tbl where dt = '20210701' and hh = '13'; {code}

> Validate partition pruning with Hudi
> 
>
> Key: HUDI-3197
> URL: https://issues.apache.org/jira/browse/HUDI-3197
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot 
> 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, 
> Screen Shot 2022-01-08 at 3.26.53 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (HUDI-3197) Validate partition pruning with Hudi

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3197:
-
Attachment: (was: Screen Shot 2022-01-08 at 8.43.50 PM.png)

> Validate partition pruning with Hudi
> 
>
> Key: HUDI-3197
> URL: https://issues.apache.org/jira/browse/HUDI-3197
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot 
> 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, 
> Screen Shot 2022-01-08 at 3.26.53 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (HUDI-2947) HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config from CLI in continuous mode

2022-01-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-2947.
---

> HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config 
> from CLI in continuous mode
> --
>
> Key: HUDI-2947
> URL: https://issues.apache.org/jira/browse/HUDI-2947
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> *Problem:*
> When deltastreamer is started with a given checkpoint, e.g., `--checkpoint 
> 0`, in the continuous mode, the deltastreamer job may pick up the wrong 
> checkpoint later on.  The wrong checkpoint (for 20211206203551080 commit) 
> happens after the replacecommit and clean, which is reset to "0", instead of 
> "5" after 20211206202728233.commit.  More details below.
>  
> The bug is due to the check here: 
> [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L335]
> {code:java}
> if (cfg.checkpoint != null && 
> (StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))  
>   || 
> !cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY)))) {
> resumeCheckpointStr = Option.of(cfg.checkpoint);
> } {code}
> In this case of resuming after a clustering commit, "cfg.checkpoint != null" 
> and 
> "StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))" 
>  are both true as "--checkpoint 0" is configured and last commit is 
> replacecommit without checkpoint keys.  This leads to the resume checkpoint 
> string being reset to the configured checkpoint, skipping the timeline 
> walk-back logic below, which is wrong.  
>  
> Timeline:
>  
> {code:java}
>  189069 Dec  6 12:19 20211206201238649.commit
>       0 Dec  6 12:12 20211206201238649.commit.requested
>       0 Dec  6 12:12 20211206201238649.inflight
>  189069 Dec  6 12:27 20211206201959151.commit
>       0 Dec  6 12:20 20211206201959151.commit.requested
>       0 Dec  6 12:20 20211206201959151.inflight
>  189069 Dec  6 12:34 20211206202728233.commit
>       0 Dec  6 12:27 20211206202728233.commit.requested
>       0 Dec  6 12:27 20211206202728233.inflight
>   36662 Dec  6 12:35 20211206203449899.replacecommit
>       0 Dec  6 12:35 20211206203449899.replacecommit.inflight
>   34656 Dec  6 12:35 20211206203449899.replacecommit.requested
>   28013 Dec  6 12:35 20211206203503574.clean
>   19024 Dec  6 12:35 20211206203503574.clean.inflight
>   19024 Dec  6 12:35 20211206203503574.clean.requested
>  189069 Dec  6 12:43 20211206203551080.commit
>       0 Dec  6 12:35 20211206203551080.commit.requested
>       0 Dec  6 12:35 20211206203551080.inflight
>  189069 Dec  6 12:50 20211206204311612.commit
>       0 Dec  6 12:43 20211206204311612.commit.requested
>       0 Dec  6 12:43 20211206204311612.inflight
>       0 Dec  6 12:50 20211206205044595.commit.requested
>       0 Dec  6 12:50 20211206205044595.inflight
>     128 Dec  6 12:56 archived
>     483 Dec  6 11:52 hoodie.properties
>  {code}
>  
> Checkpoints in commits:
>  
> {code:java}
> grep "deltastreamer.checkpoint.key" *
> 20211206201238649.commit:    "deltastreamer.checkpoint.key" : "2"
> 20211206201959151.commit:    "deltastreamer.checkpoint.key" : "3"
> 20211206202728233.commit:    "deltastreamer.checkpoint.key" : "4"
> 20211206203551080.commit:    "deltastreamer.checkpoint.key" : "1"
> 20211206204311612.commit:    "deltastreamer.checkpoint.key" : "2" {code}
>  
> *Steps to reproduce:*
> Run HoodieDeltaStreamer in the continuous mode, by providing both 
> "--checkpoint 0" and "--continuous", with inline clustering and sync clean 
> enabled (some configs are masked).
>  
> {code:java}
> spark-submit \
>   --master yarn \
>   --driver-memory 8g --executor-memory 8g --num-executors 3 --executor-cores 
> 4 \
>   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
>   --conf 
> spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
>  \
>   --conf spark.speculation=true \
>   --conf spark.speculation.multiplier=1.0 \
>   --conf spark.speculation.quantile=0.5 \
>   --packages org.apache.spark:spark-avro_2.12:3.2.0 \
>   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
>   file:/home/hadoop/ethan/hudi-utilities-bundle_2.12-0.10.0-rc3.jar \
>   --props file:/home/hadoop/ethan/test.properties \
>   --source-class ... \
>   --source-ordering-field ts \
>   --target-base-path s3a://hudi-testing/test_hoodie_table_11/ \
>   --target-table test_table \
>   --table-type COPY_ON_WRITE \
>   --op BULK_INSERT 

[jira] [Updated] (HUDI-3197) Validate partition pruning with Hudi

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3197:
-
Attachment: Screen Shot 2022-01-08 at 8.43.50 PM.png

> Validate partition pruning with Hudi
> 
>
> Key: HUDI-3197
> URL: https://issues.apache.org/jira/browse/HUDI-3197
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot 
> 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, 
> Screen Shot 2022-01-08 at 3.26.53 PM.png, Screen Shot 2022-01-08 at 8.43.50 
> PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-08 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471282#comment-17471282
 ] 

sivabalan narayanan commented on HUDI-2909:
---

[~codope] : can you help Harsha?

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has timebased keygen config show below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM <SRC> a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot removed a comment on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4518:
URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008222910


   
   ## CI report:
   
   * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921)
 
   * 108b27f73f4656423be54bf4b20ba9dad8a26647 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5022)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4518:
URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008228966


   
   ## CI report:
   
   * 108b27f73f4656423be54bf4b20ba9dad8a26647 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5022)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on pull request #4446: [HUDI-2917] rollback insert data appended to log file when using Hbase Index

2022-01-08 Thread GitBox


danny0405 commented on pull request #4446:
URL: https://github.com/apache/hudi/pull/4446#issuecomment-1008228935


   Generally I think we should figure out a way for the global index to 
distinguish between `INSERT` and `UPDATE` for input records, instead of hacking 
the partitioner for write stats. That is too tricky for me.
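
   For what it's worth, a simplified sketch of that idea (plain Scala, not 
Hudi's actual API): after global-index tagging, a record with a known current 
location is an `UPDATE`, otherwise an `INSERT`:
   
   ```scala
   // Toy model: partition tagged records into updates vs inserts up front,
   // instead of inferring the split inside the partitioner.
   case class TaggedRecord(key: String, currentLocation: Option[String])
   
   def isUpdate(r: TaggedRecord): Boolean = r.currentLocation.isDefined
   
   val records = Seq(TaggedRecord("a", Some("file-1")), TaggedRecord("b", None))
   val (updates, inserts) = records.partition(isUpdate)
   println(s"updates=${updates.map(_.key)}, inserts=${inserts.map(_.key)}")
   ```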


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #4446: [HUDI-2917] rollback insert data appended to log file when using Hbase Index

2022-01-08 Thread GitBox


danny0405 commented on a change in pull request #4446:
URL: https://github.com/apache/hudi/pull/4446#discussion_r780733934



##
File path: 
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/BaseJavaCommitActionExecutor.java
##
@@ -90,27 +90,29 @@ public BaseJavaCommitActionExecutor(HoodieEngineContext 
context,
    public HoodieWriteMetadata<List<WriteStatus>> execute(List<HoodieRecord<T>> inputRecords) {
  HoodieWriteMetadata<List<WriteStatus>> result = new HoodieWriteMetadata<>();
 
-WorkloadProfile profile = null;
+WorkloadProfile inputProfile = null;
 if (isWorkloadProfileNeeded()) {
-  profile = new WorkloadProfile(buildProfile(inputRecords));
-  LOG.info("Workload profile :" + profile);
+  inputProfile = new WorkloadProfile(buildProfile(inputRecords));
+  LOG.info("Input workload profile :" + inputProfile);
+}
+
+final Partitioner partitioner = getPartitioner(inputProfile);
+try {
+  WorkloadProfile executionProfile = 
partitioner.getExecutionWorkloadProfile();
+  LOG.info("Execution workload profile :" + inputProfile);
+  saveWorkloadProfileMetadataToInflight(executionProfile, instantTime);

Review comment:
   And why must we use the execution profile here? I know the original 
profile also works only for the bloom filter index, but we should fix the 
profile building instead of fetching it from the partitioner, if we have a way 
to distinguish between `INSERT`s and `UPDATE`s before the write.

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##
@@ -182,14 +182,28 @@ public abstract void preCompact(
 .withOperationField(config.allowOperationMetadataField())
 .withPartition(operation.getPartitionPath())
 .build();
-if (!scanner.iterator().hasNext()) {
-  scanner.close();
-  return new ArrayList<>();
-}
 
 Option<HoodieBaseFile> oldDataFileOpt =
 operation.getBaseFile(metaClient.getBasePath(), operation.getPartitionPath());
 
+// Considering following scenario: if all log blocks in this fileSlice is 
rollback, it returns an empty scanner.
+// But in this case, we need to give it a base file. Otherwise, it will 
lose base file in following fileSlice.
+if (!scanner.iterator().hasNext()) {
+  if (!oldDataFileOpt.isPresent()) {
+scanner.close();
+return new ArrayList<>();
+  } else {
+// TODO: we may directly rename original parquet file if there is not 
evolution/devolution of schema

Review comment:
   If the file slice only has parquet files, why do we still trigger 
compaction?

##
File path: 
hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/BaseJavaCommitActionExecutor.java
##
@@ -90,27 +90,29 @@ public BaseJavaCommitActionExecutor(HoodieEngineContext 
context,
    public HoodieWriteMetadata<List<WriteStatus>> execute(List<HoodieRecord<T>> inputRecords) {
  HoodieWriteMetadata<List<WriteStatus>> result = new HoodieWriteMetadata<>();
 
-WorkloadProfile profile = null;
+WorkloadProfile inputProfile = null;
 if (isWorkloadProfileNeeded()) {
-  profile = new WorkloadProfile(buildProfile(inputRecords));
-  LOG.info("Workload profile :" + profile);
+  inputProfile = new WorkloadProfile(buildProfile(inputRecords));
+  LOG.info("Input workload profile :" + inputProfile);
+}
+
+final Partitioner partitioner = getPartitioner(inputProfile);
+try {
+  WorkloadProfile executionProfile = 
partitioner.getExecutionWorkloadProfile();
+  LOG.info("Execution workload profile :" + inputProfile);
+  saveWorkloadProfileMetadataToInflight(executionProfile, instantTime);

Review comment:
   Did you mean `executionProfile`?

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/Partitioner.java
##
@@ -18,10 +18,14 @@
 
 package org.apache.hudi.table.action.commit;
 
+import org.apache.hudi.table.WorkloadProfile;
+
 import java.io.Serializable;
 
 public interface Partitioner extends Serializable {
   int getNumPartitions();
 
   int getPartition(Object key);
+
+  WorkloadProfile getExecutionWorkloadProfile();
 }

Review comment:
   Why does a `Partitioner` return the profile? Let's not put this method on 
the interface here.

##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java
##
@@ -97,15 +98,25 @@ void saveWorkloadProfileMetadataToInflight(WorkloadProfile 
profile, String insta
 insertStat.setFileId("");
 insertStat.setPrevCommit(HoodieWriteStat.NULL_COMMIT);
 metadata.addWriteStat(path, insertStat);
-
-partitionStat.getUpdateLocationToCount().forEach((key, value) -> {
-  HoodieWriteStat writeStat = new HoodieWriteStat();
-  writeStat.setFileId(key);
-  // TODO : Write baseCommitTime is possible here ?
-  writeStat.setPrevCommit(value.getKey());
-  writeStat

[GitHub] [hudi] cdmikechen commented on a change in pull request #4451: [HUDI-3104] Kafka-connect support hadoop config environments and properties

2022-01-08 Thread GitBox


cdmikechen commented on a change in pull request #4451:
URL: https://github.com/apache/hudi/pull/4451#discussion_r780733280



##
File path: 
hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
##
@@ -89,6 +140,23 @@ public static int getLatestNumPartitions(String 
bootstrapServers, String topicNa
*/
   public static Configuration getDefaultHadoopConf(KafkaConnectConfigs 
connectConfigs) {
 Configuration hadoopConf = new Configuration();
+
+// add hadoop config files
+if (!StringUtils.isNullOrEmpty(connectConfigs.getHadoopConfDir())

Review comment:
   @codope 
   The default Hadoop configuration can solve the problem for a single 
environment, but we may also need to let users manually configure 
`hadoop.conf.dir` or `hadoop.home` if different tasks need to write to 
different HDFS clusters. 
   That is why I also added separate Hadoop environment config parameters 
in `KafkaConnectConfigs`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] cdmikechen commented on a change in pull request #4451: [HUDI-3104] Kafka-connect support hadoop config environments and properties

2022-01-08 Thread GitBox


cdmikechen commented on a change in pull request #4451:
URL: https://github.com/apache/hudi/pull/4451#discussion_r780733010



##
File path: 
hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
##
@@ -65,6 +70,52 @@
 
   private static final Logger LOG = 
LogManager.getLogger(KafkaConnectUtils.class);
   private static final String HOODIE_CONF_PREFIX = "hoodie.";
+  public static final String HADOOP_CONF_DIR = "HADOOP_CONF_DIR";
+  public static final String HADOOP_HOME = "HADOOP_HOME";
+  private static final List DEFAULT_HADOOP_CONF_FILES;
+
+  static {
+DEFAULT_HADOOP_CONF_FILES = new ArrayList<>();
+try {
+  String hadoopConfigPath = System.getenv(HADOOP_CONF_DIR);
+  String hadoopHomePath = System.getenv(HADOOP_HOME);
+  DEFAULT_HADOOP_CONF_FILES.addAll(getHadoopConfigFiles(hadoopConfigPath, 
hadoopHomePath));
+  if (!DEFAULT_HADOOP_CONF_FILES.isEmpty()) {
+LOG.info(String.format("Found Hadoop default config files %s", 
DEFAULT_HADOOP_CONF_FILES));
+  }

Review comment:
   @codope 
   My idea was: because the Hadoop environment is picked up by default, users 
need to see that Kafka Connect has obtained the correct information. That way 
they can judge whether the environment they set is wrong, or choose to declare 
the Hadoop configuration path manually when registering the task.
   
   Because the default log level is INFO, logging this information at INFO 
makes it visible to users; a rough sketch of the resolution order follows.
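
   For illustration, a rough sketch of that resolution order (not the PR's 
exact code; assumes the usual core-site.xml/hdfs-site.xml file names):
   
   ```scala
   // Resolve Hadoop config files from HADOOP_CONF_DIR, falling back to
   // HADOOP_HOME/etc/hadoop, and add them to the Configuration.
   import java.nio.file.{Files, Paths}
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   
   def defaultHadoopConf(): Configuration = {
     val conf = new Configuration()
     val dirs = Seq(
       Option(System.getenv("HADOOP_CONF_DIR")),
       Option(System.getenv("HADOOP_HOME")).map(_ + "/etc/hadoop")
     ).flatten
     for {
       dir  <- dirs
       name <- Seq("core-site.xml", "hdfs-site.xml")
       p = Paths.get(dir, name) if Files.exists(p)
     } {
       println(s"Found Hadoop config file $p") // the PR logs this at INFO
       conf.addResource(new Path(p.toUri))
     }
     conf
   }
   ```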




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280
 ] 

Harsha Teja Kanna edited comment on HUDI-3066 at 1/9/22, 4:15 AM:
--

Hi [~shivnarayan] Basic question

I am trying to use just the base-path without the wildcards. But facing this 
issue.

The table effectively has two columns with the same name.

table is created using a timestamp column for key generation and mapped to date 
partition.
using hoodie.datasource.write.partitionpath.field=entrydate:timestamp

So the partition is entrydate=yyyy/mm/dd. 

Without wildcards, Spark infers the column type and the query fails with: 

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Long
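
For illustration, a sketch of the read pattern that hits this (base path and 
partition column from above; behavior as described here, not verified):
{code:scala}
// Load by base path only (no wildcards); Spark infers a type for the
// partition column from the path values, which then clashes with the
// column of the same name in the data files.
val df = spark.read
  .format("org.apache.hudi")
  .load("s3a://datalake-hudi/sessions_by_entrydate")

// filtering on the partition column triggers the ClassCastException
df.where("entrydate = '2021/01/01'").count()
{code}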

 


was (Author: h7kanna):
Hi [~shivnarayan] Basic question

I am trying to use just the base-path without the wildcards. But facing this 
issue.

Table effectively has two columns with same name.



table is created using a timestamp column for key generation and mapped to date 
partition.
using hoodie.datasource.write.partitionpath.field=entrydate:timestamp

So the partition is entrydate=yyyy/mm/dd. 

Without wildcards. spark inferring the column type is query fails with 

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Long

 

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877

[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280
 ] 

Harsha Teja Kanna commented on HUDI-3066:
-

Hi [~shivnarayan] Basic question

I am trying to use just the base-path without the wildcards. But facing this 
issue.

The table effectively has two columns with the same name.



table is created using a timestamp column for key generation and mapped to date 
partition.
using hoodie.datasource.write.partitionpath.field=entrydate:timestamp

So the partition is entrydate=yyyy/mm/dd. 

Without wildcards, Spark infers the column type and the query fails with: 

java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot 
be cast to java.lang.Long

 

> Very slow file listing after enabling metadata for existing tables in 0.10.0 
> release
> 
>
> Key: HUDI-3066
> URL: https://issues.apache.org/jira/browse/HUDI-3066
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.0
> Environment: EMR 6.4.0
> Hudi version : 0.10.0
>Reporter: Harsha Teja Kanna
>Assignee: sivabalan narayanan
>Priority: Critical
>  Labels: performance, pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot 
> 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, 
> Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 
> PM.png, metadata_files.txt, metadata_files_compacted.txt, 
> metadata_timeline.txt, metadata_timeline_archived.txt, 
> metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, 
> timeline.txt, writer_log.txt
>
>
> After 'metadata table' is enabled, File listing takes long time.
> If metadata is enabled on Reader side(as shown below), it is taking even more 
> time per file listing task
> {code:java}
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.common.config.HoodieMetadataConfig
> val hadoopConf = spark.conf
> hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true")
> val basePath = "s3a://datalake-hudi"
> val sessions = spark
> .read
> .format("org.apache.hudi")
> .option(DataSourceReadOptions.QUERY_TYPE.key(), 
> DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL)
> .option(DataSourceReadOptions.READ_PATHS.key(), 
> s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*")
> .load()
> sessions.createOrReplaceTempView("sessions") {code}
> Existing tables (COW) have inline clustering on and have many replace commits.
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView 
> resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata
> Also many log messages in AbstractHoodieLogRecordReader
>  
> 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms 
> to read  136 instants, 9731 replaced file groups
> 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515
>  at instant 20211217035105329
> 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613',
>  fileLen=0}
> 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek 
> policy
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a 
> data block from file 
> s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377
>  at instant 20211217022049877
> 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of 
> remaining logblocks to merge 1
> 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next 
> reader for logfile 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log 
> file 
> HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663',
>  fileLen=0}
> 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream

[GitHub] [hudi] hudi-bot commented on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4542:
URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008223765


   
   ## CI report:
   
   * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5021)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4542:
URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217433


   
   ## CI report:
   
   * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5021)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4518:
URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008222649


   
   ## CI report:
   
   * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921)
 
   * 108b27f73f4656423be54bf4b20ba9dad8a26647 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4518:
URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008222910


   
   ## CI report:
   
   * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921)
 
   * 108b27f73f4656423be54bf4b20ba9dad8a26647 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5022)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4518:
URL: https://github.com/apache/hudi/pull/4518#issuecomment-1006204405


   
   ## CI report:
   
   * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4518:
URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008222649


   
   ## CI report:
   
   * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921)
 
   * 108b27f73f4656423be54bf4b20ba9dad8a26647 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4542:
URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217107


   
   ## CI report:
   
   * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4542:
URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217433


   
   ## CI report:
   
   * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5021)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4542:
URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217107


   
   ## CI report:
   
   * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] boneanxs commented on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles

2022-01-08 Thread GitBox


boneanxs commented on pull request #4542:
URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217012


   @xushiyan @nsivabalan pls take a look.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3157) Remove aws jars from hudi bundles

2022-01-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-3157:
-
Labels: pull-request-available sev:critical user-support-issues  (was: 
sev:critical user-support-issues)

> Remove aws jars from hudi bundles
> -
>
> Key: HUDI-3157
> URL: https://issues.apache.org/jira/browse/HUDI-3157
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Raymond Xu
>Assignee: Hui An
>Priority: Critical
>  Labels: pull-request-available, sev:critical, user-support-issues
> Fix For: 0.10.1
>
>
> ref: 
> [https://github.com/apache/hudi/issues/4474]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] boneanxs opened a new pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles

2022-01-08 Thread GitBox


boneanxs opened a new pull request #4542:
URL: https://github.com/apache/hudi/pull/4542


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   Remove aws jars from hudi bundles
   ref: https://github.com/apache/hudi/issues/4474
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-3157) Remove aws jars from hudi bundles

2022-01-08 Thread Hui An (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471269#comment-17471269
 ] 

Hui An commented on HUDI-3157:
--

Working on this now.

> Remove aws jars from hudi bundles
> -
>
> Key: HUDI-3157
> URL: https://issues.apache.org/jira/browse/HUDI-3157
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Raymond Xu
>Assignee: Hui An
>Priority: Critical
>  Labels: sev:critical, user-support-issues
> Fix For: 0.10.1
>
>
> ref: 
> [https://github.com/apache/hudi/issues/4474]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008208353


   
   ## CI report:
   
   * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013)
 
   * 8896e81ac168348d66de6c8cf444c4a7e2c9826e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5020)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008215773


   
   ## CI report:
   
   * 8896e81ac168348d66de6c8cf444c4a7e2c9826e Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5020)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Comment Edited] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264
 ] 

Harsha Teja Kanna edited comment on HUDI-2909 at 1/9/22, 2:05 AM:
--

I am not able to determine if I fall under user type c or a/b :) from the 
Github issue or the above description.

Can you please help me understand if I have to recreate the dataset?


was (Author: h7kanna):
I am not able to determine if I fall under user type c or a/b :)

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has time-based keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM  a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
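
For readers trying to place the quoted configuration, the DeltaStreamer keygen properties above map directly onto Spark datasource write options. Below is a minimal spark-shell sketch of the reported setup, which on 0.10.0 is said to throw the quoted exception; the table name, base path, and sample data are illustrative assumptions, not taken from the report:

```scala
// Paste into spark-shell with the Hudi Spark bundle on the classpath.
import org.apache.spark.sql.SaveMode
import spark.implicits._

// Toy frame standing in for the reporter's data; names are illustrative.
val df = Seq(("s-1", java.sql.Timestamp.valueOf("2021-12-01 10:13:34.702")))
  .toDF("id", "lastdate")

df.write.format("hudi").
  option("hoodie.table.name", "sessions").  // assumed table name
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "lastdate:timestamp").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.TimestampBasedKeyGenerator").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.type", "SCALAR").
  option("hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit", "MICROSECONDS").
  option("hoodie.deltastreamer.keygen.timebased.output.dateformat", "yyyy/MM/dd").
  option("hoodie.deltastreamer.keygen.timebased.input.timezone", "GMT").
  option("hoodie.deltastreamer.keygen.timebased.output.timezone", "GMT").
  mode(SaveMode.Append).
  save("/tmp/hudi/sessions")  // assumed base path
```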


[jira] [Comment Edited] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264
 ] 

Harsha Teja Kanna edited comment on HUDI-2909 at 1/9/22, 2:05 AM:
--

I am not able to determine if I fall under user type c or a/b :) from the 
Github issue or the above description.

Can you please help me understand if I have to recreate the dataset?


was (Author: h7kanna):
I am not able to determine if I fall under user type c or a/b :) from the 
Github issue or the above description.

I can you please help understand if I have to recreate the dataset?

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has time-based keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM  a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type

2022-01-08 Thread Harsha Teja Kanna (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264
 ] 

Harsha Teja Kanna commented on HUDI-2909:
-

I am not able to determine if I fall under user type c or a/b :)

> Partition field parsing fails due to KeyGenerator giving inconsistent value 
> for logical timestamp type
> --
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Harsha Teja Kanna
>Assignee: Sagar Sumit
>Priority: Blocker
>  Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.10.1
>
>
> Existing table has time-based keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM  a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[GitHub] [hudi] aznwarmonkey opened a new issue #4541: [SUPPORT] NullPointerException while writing Bulk ingest table

2022-01-08 Thread GitBox


aznwarmonkey opened a new issue #4541:
URL: https://github.com/apache/hudi/issues/4541


   Hello,
   
   I am currently getting an exception while writing a `hudi` table in 
`bulk_insert` mode. Please see below for the stacktrace along with the snippet 
of code I am using to write the data. I am new to `hudi` and this stacktrace 
doesn't provide much insight as to why it is happening. Any help with this 
issue is greatly appreciated.
   
   ```bash
   py4j.protocol.Py4JJavaError: An error occurred while calling o195.save.
   : org.apache.spark.SparkException: Writing job failed.
   at 
org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:383)
   at 
org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:336)
   at 
org.apache.spark.sql.execution.datasources.v2.AppendDataExec.writeWithV2(WriteToDataSourceV2Exec.scala:218)
   at 
org.apache.spark.sql.execution.datasources.v2.AppendDataExec.run(WriteToDataSourceV2Exec.scala:225)
   at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40)
   at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40)
   at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.doExecute(V2CommandExec.scala:55)
   at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
   at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
   at 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
   at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
   at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
   at 
org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
   at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   at 
org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135)
   at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107)
   at 
org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135)
   at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253)
   at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134)
   at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
   at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68)
   at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
   at 
org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:370)
   at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
   at 
org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:302)
   at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:127)
   at 
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134)
   at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
   at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
   at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
   at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
   at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
   at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220)
   at 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181)
   at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134)
   at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133)
   ```

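The reporter's code snippet did not survive the digest, but the trace passes through `HoodieSparkSqlWriter.bulkInsertAsRow`, i.e. the row-writer bulk insert path. Here is a minimal spark-shell sketch of a write that exercises that path; the table name, schema, and base path are assumptions, not the reporter's code:

```scala
import org.apache.spark.sql.SaveMode

// Illustrative frame; column names are assumptions, not from the report.
val df = spark.range(0, 100).selectExpr("id", "cast(id % 10 as string) as part")

df.write.format("hudi").
  option("hoodie.table.name", "events").  // assumed table name
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "part").
  option("hoodie.datasource.write.operation", "bulk_insert").
  // bulkInsertAsRow, seen in the trace, is the row-writer path:
  option("hoodie.datasource.write.row.writer.enable", "true").
  mode(SaveMode.Overwrite).
  save("/tmp/hudi/events")  // assumed base path
```

Null values in the record key or partition path columns are a plausible first thing to rule out when this path throws a NullPointerException.
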
[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008207981


   
   ## CI report:
   
   * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013)
 
   * 8896e81ac168348d66de6c8cf444c4a7e2c9826e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008208353


   
   ## CI report:
   
   * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013)
 
   * 8896e81ac168348d66de6c8cf444c4a7e2c9826e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5020)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1007952877


   
   ## CI report:
   
   * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4489:
URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008207981


   
   ## CI report:
   
   * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013)
 
   * 8896e81ac168348d66de6c8cf444c4a7e2c9826e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot removed a comment on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-08 Thread GitBox


hudi-bot removed a comment on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1008177446


   
   ## CI report:
   
   * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN
   * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN
   * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN
   * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN
   * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN
   * bfca6fe7dccb87d9f823173fa965193b1e3c0b79 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4980)
 
   * b7b4aca36444784913b61c62bf9e24a99e8ffbd8 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5019)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation

2022-01-08 Thread GitBox


hudi-bot commented on pull request #4514:
URL: https://github.com/apache/hudi/pull/4514#issuecomment-1008194396


   
   ## CI report:
   
   * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN
   * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN
   * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN
   * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN
   * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN
   * b7b4aca36444784913b61c62bf9e24a99e8ffbd8 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5019)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-3197) Validate partition pruning with Hudi

2022-01-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3197:
--
Summary: Validate partition pruning with Hudi  (was: Validate partition 
pruning with Spark SQL)

> Validate partition pruning with Hudi
> 
>
> Key: HUDI-3197
> URL: https://issues.apache.org/jira/browse/HUDI-3197
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot 
> 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, 
> Screen Shot 2022-01-08 at 3.26.53 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)
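
Partition pruning can be validated from spark-shell by filtering on the partition column and reading the physical plan: with pruning in effect, the predicate should appear against the scan's partition filters rather than as a filter applied after scanning every partition. A small sketch under assumed paths and column names:

```scala
// Read back a partitioned Hudi table; path and column are assumptions.
val df = spark.read.format("hudi").load("/tmp/hudi/events")

// With pruning working, the scan node of the physical plan should carry
// the predicate in its PartitionFilters instead of scanning all partitions.
df.filter("part = '7'").explain(true)
```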


[jira] [Closed] (HUDI-2682) Spark schema not updated with new columns on hive sync

2022-01-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-2682.

 Reviewers: Tao Meng
Resolution: Fixed

> Spark schema not updated with new columns on hive sync
> --
>
> Key: HUDI-2682
> URL: https://issues.apache.org/jira/browse/HUDI-2682
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration, Spark Integration
>Affects Versions: 0.9.0
>Reporter: Charlie Briggs
>Assignee: 董可伦
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.1
>
>
> When syncing hive schema, new columns added from the source dataset are not 
> propagated to the `spark.sql.sources.schema` metadata on the hive table. This 
> leads to columns not being available when querying the dataset via spark SQL.
> Tested with both the Spark data writer and DeltaStreamer.
> The column we observed this on was a struct column, but it seems like it 
> would be independent of datatype.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
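
The schema Spark SQL consults for such a table is kept in Hive table properties (`spark.sql.sources.schema.numParts` plus `spark.sql.sources.schema.part.N`), so the symptom can be checked directly. A quick spark-shell sketch with assumed database and table names:

```scala
// Inspect the Spark-facing schema properties written by hive sync.
spark.sql("SHOW TBLPROPERTIES default.my_hudi_table").show(100, false)

// If the bug bites, a newly added column is present in the data files but
// missing from the spark.sql.sources.schema.* properties and from here:
spark.table("default.my_hudi_table").printSchema()
```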

