[GitHub] [hudi] hudi-bot commented on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
hudi-bot commented on pull request #4083: URL: https://github.com/apache/hudi/pull/4083#issuecomment-1008247208 ## CI report: * 00221c82e8b1693280fd72625eafcd503d54323c UNKNOWN * 46053bb143d1fd1274ac466197cc9361708e738b UNKNOWN * 020118c4b169d47e668f37240388e8d1bbdfad70 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4861) * 2722bfcfd29a95f27338c1c8b026185472eefba0 UNKNOWN * d361b823d0bf09c7ba103070d43cb81c6eb9467d UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-3158) Reduce warn logs in Spark SQL INSERT OVERWRITE
[ https://issues.apache.org/jira/browse/HUDI-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471296#comment-17471296 ] Raymond Xu commented on HUDI-3158: -- [~shivnarayan][~dongkelun] the needed changes look bigger than I thought, we may skip this for the minor release. > Reduce warn logs in Spark SQL INSERT OVERWRITE > -- > > Key: HUDI-3158 > URL: https://issues.apache.org/jira/browse/HUDI-3158 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Raymond Xu >Assignee: 董可伦 >Priority: Major > Labels: pull-request-available, sev:normal > Fix For: 0.10.1 > > > {code:java} > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED]{code} > To reduce the repeated warn logs > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
hudi-bot removed a comment on pull request #4083: URL: https://github.com/apache/hudi/pull/4083#issuecomment-1008245348 ## CI report: * 00221c82e8b1693280fd72625eafcd503d54323c UNKNOWN * 46053bb143d1fd1274ac466197cc9361708e738b UNKNOWN * 020118c4b169d47e668f37240388e8d1bbdfad70 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4861) * 2722bfcfd29a95f27338c1c8b026185472eefba0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3158) Reduce warn logs in Spark SQL INSERT OVERWRITE
[ https://issues.apache.org/jira/browse/HUDI-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3158: - Story Points: 1 (was: 0.5) > Reduce warn logs in Spark SQL INSERT OVERWRITE > -- > > Key: HUDI-3158 > URL: https://issues.apache.org/jira/browse/HUDI-3158 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Raymond Xu >Assignee: 董可伦 >Priority: Major > Labels: pull-request-available, sev:normal > Fix For: 0.10.1 > > > {code:java} > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED] > 22/01/03 19:35:12 WARN ClusteringUtils: No content found in requested file > for instant [==>20220103192919722__replacecommit__REQUESTED]{code} > To reduce the repeated warn logs > -- This message was sent by Atlassian Jira (v8.20.1#820001)
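The eventual fix lives in Hudi's ClusteringUtils, but the general pattern for silencing repeated identical warnings like the ones above is a warn-once guard keyed by the offending instant. A minimal sketch of that pattern (class and method names are illustrative, not the actual Hudi code):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class WarnOnce {
    // Remembers which keys have already been warned about, so each distinct
    // instant produces at most one WARN line instead of one per lookup.
    private static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

    // Returns true if the warning was emitted, false if it was suppressed
    // because the same key was already seen.
    static boolean warnOnce(String key, String message) {
        if (SEEN.add(key)) {
            System.out.println("WARN " + message);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String instant = "20220103192919722__replacecommit__REQUESTED";
        // First call logs; the next eight identical calls are suppressed.
        for (int i = 0; i < 9; i++) {
            warnOnce(instant, "No content found in requested file for instant [==>" + instant + "]");
        }
    }
}
```

Whether Hudi suppresses the duplicate lookups or downgrades the log level is an implementation choice; the sketch only shows the deduplication idea.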
[hudi] branch master updated (0d8ca8d -> 3679070)
This is an automated email from the ASF dual-hosted git repository. xushiyan pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from 0d8ca8d [HUDI-3104] Kafka-connect support of hadoop config environments and properties (#4451)
add 3679070 [HUDI-3125] spark-sql write timestamp directly (#4471)

No new revisions were added by this update.

Summary of changes:
 .../apache/hudi/keygen/RowKeyGeneratorHelper.java |  19 +++-
 .../org/apache/hudi/AvroConversionHelper.scala    |  15 ++-
 .../hudi/keygen/TestRowGeneratorHelper.scala      | 102 +
 .../apache/spark/sql/hudi/TestCreateTable.scala   |  27 ++
 .../apache/spark/sql/hudi/TestInsertTable.scala   |  32 +++
 5 files changed, 188 insertions(+), 7 deletions(-)
 create mode 100644 hudi-client/hudi-spark-client/src/test/scala/org/apache/hudi/keygen/TestRowGeneratorHelper.scala
[jira] [Closed] (HUDI-3125) Spark SQL writing timestamp type doesn't need to disable `spark.sql.datetime.java8API.enabled` manually
[ https://issues.apache.org/jira/browse/HUDI-3125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3125. Resolution: Fixed > Spark SQL writing timestamp type doesn't need to disable > `spark.sql.datetime.java8API.enabled` manually > - > > Key: HUDI-3125 > URL: https://issues.apache.org/jira/browse/HUDI-3125 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Yann Byron >Assignee: Yann Byron >Priority: Major > Labels: pull-request-available, sev:critical, user-support-issues > Fix For: 0.10.1 > > > {code:java} > create table h0_p(id int, name string, price double, dt timestamp) using hudi > partitioned by(dt) options(type = 'cow', primaryKey = 'id'); > insert into h0_p values (3, 'a1', 10, cast('2021-05-08 00:00:00' as > timestamp)); {code} > By default, running the SQL above throws an exception: > {code:java} > Caused by: java.lang.ClassCastException: java.time.Instant cannot be cast to > java.sql.Timestamp > at > org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8(AvroConversionHelper.scala:306) > at > org.apache.hudi.AvroConversionHelper$.$anonfun$createConverterToAvro$8$adapted(AvroConversionHelper.scala:306) > at scala.Option.map(Option.scala:230) {code} > We need to disable `spark.sql.datetime.java8API.enabled` manually to make it > work: > {code:java} > set spark.sql.datetime.java8API.enabled=false; {code} > And the command must be executed at runtime. It doesn't work if this config is > provided via the spark-sql command: `spark-sql --conf > spark.sql.datetime.java8API.enabled=false`. That's because this config is > forcibly enabled when spark-sql is launched. -- This message was sent by Atlassian Jira (v8.20.1#820001)
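The merged fix (#4471) lets the write path handle the Java 8 time API directly; the core idea behind such a fix is to branch on the runtime type instead of blindly casting to `java.sql.Timestamp`. A minimal sketch with illustrative names (this is not the actual Hudi converter code):

```java
import java.sql.Timestamp;
import java.time.Instant;

public class InstantToTimestamp {
    // Accepts either representation of a Spark timestamp value:
    // java.sql.Timestamp (java8API disabled) or java.time.Instant (enabled).
    static Timestamp toSqlTimestamp(Object value) {
        if (value instanceof Timestamp) {
            return (Timestamp) value;
        }
        if (value instanceof Instant) {
            // Timestamp.from avoids the ClassCastException from the stack trace above.
            return Timestamp.from((Instant) value);
        }
        throw new IllegalArgumentException("Unsupported timestamp type: " + value.getClass());
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2021-05-08T00:00:00Z");
        Timestamp ts = toSqlTimestamp(instant);
        System.out.println(ts.toInstant().equals(instant)); // prints "true"
    }
}
```

With a type check like this, users no longer have to set `spark.sql.datetime.java8API.enabled=false` before inserting timestamp values.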
[GitHub] [hudi] xushiyan merged pull request #4471: [HUDI-3125] spark-sql write timestamp directly
xushiyan merged pull request #4471: URL: https://github.com/apache/hudi/pull/4471 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
hudi-bot removed a comment on pull request #4083: URL: https://github.com/apache/hudi/pull/4083#issuecomment-1003742831 ## CI report: * 00221c82e8b1693280fd72625eafcd503d54323c UNKNOWN * 46053bb143d1fd1274ac466197cc9361708e738b UNKNOWN * 020118c4b169d47e668f37240388e8d1bbdfad70 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4861) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot commented on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
hudi-bot commented on pull request #4083: URL: https://github.com/apache/hudi/pull/4083#issuecomment-1008245348 ## CI report: * 00221c82e8b1693280fd72625eafcd503d54323c UNKNOWN * 46053bb143d1fd1274ac466197cc9361708e738b UNKNOWN * 020118c4b169d47e668f37240388e8d1bbdfad70 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4861) * 2722bfcfd29a95f27338c1c8b026185472eefba0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-3198) Spark SQL create table should check partition fields
Raymond Xu created HUDI-3198: Summary: Spark SQL create table should check partition fields Key: HUDI-3198 URL: https://issues.apache.org/jira/browse/HUDI-3198 Project: Apache Hudi Issue Type: Improvement Components: Spark Integration Reporter: Raymond Xu {code:sql} create table hudi_cow_pt_tbl ( id bigint, name string, ts bigint, dt string, hh string ) using hudi tblproperties ( type = 'cow', primaryKey = 'id', preCombineField = 'ts' ) partitioned by (dt, hh) location '/tmp/hudi/hudi_cow_pt_tbl';{code} The following SQL should throw an exception about the invalid partition field `name` {code:sql} create table hudi_cow_existing_tbl2 using hudi partitioned by (dt, name) location 'file:///tmp/hudi/hudi_cow_pt_tbl'; {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3198) Spark SQL create table should check partition fields
[ https://issues.apache.org/jira/browse/HUDI-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3198: - Fix Version/s: 0.11.0 > Spark SQL create table should check partition fields > > > Key: HUDI-3198 > URL: https://issues.apache.org/jira/browse/HUDI-3198 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Major > Fix For: 0.11.0 > > > {code:sql} > create table hudi_cow_pt_tbl ( > id bigint, > name string, > ts bigint, > dt string, > hh string > ) using hudi > tblproperties ( > type = 'cow', > primaryKey = 'id', > preCombineField = 'ts' > ) > partitioned by (dt, hh) > location '/tmp/hudi/hudi_cow_pt_tbl';{code} > > The following sql should throw exception about invalid partition `name` > {code:sql} > create table hudi_cow_existing_tbl2 using hudi > partitioned by (dt, name) > location 'file:///tmp/hudi/hudi_cow_pt_tbl'; > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3198) Spark SQL create table should check partition fields
[ https://issues.apache.org/jira/browse/HUDI-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-3198: Assignee: Yann Byron > Spark SQL create table should check partition fields > > > Key: HUDI-3198 > URL: https://issues.apache.org/jira/browse/HUDI-3198 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: Raymond Xu >Assignee: Yann Byron >Priority: Major > > {code:sql} > create table hudi_cow_pt_tbl ( > id bigint, > name string, > ts bigint, > dt string, > hh string > ) using hudi > tblproperties ( > type = 'cow', > primaryKey = 'id', > preCombineField = 'ts' > ) > partitioned by (dt, hh) > location '/tmp/hudi/hudi_cow_pt_tbl';{code} > > The following sql should throw exception about invalid partition `name` > {code:sql} > create table hudi_cow_existing_tbl2 using hudi > partitioned by (dt, name) > location 'file:///tmp/hudi/hudi_cow_pt_tbl'; > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
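A validation along the lines requested in HUDI-3198 only needs to compare the partition columns in the `create table` statement against the partition fields recorded for the existing table. A minimal sketch under that assumption (names are illustrative, not the eventual Hudi implementation):

```java
import java.util.Arrays;
import java.util.List;

public class PartitionFieldCheck {
    // Throws if any requested partition column is absent from the table's
    // recorded partition fields, mirroring the check the ticket asks for.
    static void validatePartitionFields(List<String> recorded, List<String> requested) {
        for (String field : requested) {
            if (!recorded.contains(field)) {
                throw new IllegalArgumentException("Invalid partition field: " + field);
            }
        }
    }

    public static void main(String[] args) {
        List<String> recorded = Arrays.asList("dt", "hh");
        validatePartitionFields(recorded, Arrays.asList("dt", "hh")); // matches, no exception
        try {
            // `name` is a data column, not a partition field, so this must fail.
            validatePartitionFields(recorded, Arrays.asList("dt", "name"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "Invalid partition field: name"
        }
    }
}
```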
[GitHub] [hudi] yihua commented on pull request #3420: [HUDI-2283] Support Clustering Command For Spark Sql
yihua commented on pull request #3420: URL: https://github.com/apache/hudi/pull/3420#issuecomment-1008243291 @pengzhiwei2018 Could you rebase the PR on latest master to resolve the conflicts? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[hudi] branch master updated: [HUDI-3104] Kafka-connect support of hadoop config environments and properties (#4451)
This is an automated email from the ASF dual-hosted git repository. yihua pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
 new 0d8ca8d [HUDI-3104] Kafka-connect support of hadoop config environments and properties (#4451)

0d8ca8d is described below

commit 0d8ca8da4e0f6651bc1f06dba5e7e37881225fdc
Author: Thinking Chen
AuthorDate: Sun Jan 9 15:10:17 2022 +0800

    [HUDI-3104] Kafka-connect support of hadoop config environments and properties (#4451)
---
 .../hudi/connect/utils/KafkaConnectUtils.java      | 68 +
 .../hudi/connect/writers/KafkaConnectConfigs.java  | 29 +
 .../apache/hudi/connect/TestHdfsConfiguration.java | 69 ++
 .../src/test/resources/hadoop_conf/core-site.xml   | 33 +++
 .../src/test/resources/hadoop_conf/hdfs-site.xml   | 30 ++
 .../resources/hadoop_home/etc/hadoop/core-site.xml | 33 +++
 .../resources/hadoop_home/etc/hadoop/hdfs-site.xml | 30 ++
 7 files changed, 292 insertions(+)

diff --git a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
index cf60b9e..cc37de2 100644
--- a/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
+++ b/hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java
@@ -49,9 +49,14 @@ import org.apache.log4j.Logger;
 
 import java.io.IOException;
 import java.nio.charset.StandardCharsets;
+import java.nio.file.Files;
+import java.nio.file.FileVisitOption;
+import java.nio.file.Path;
+import java.nio.file.Paths;
 import java.security.MessageDigest;
 import java.security.NoSuchAlgorithmException;
 import java.util.Arrays;
+import java.util.ArrayList;
 import java.util.List;
 import java.util.Map;
 import java.util.Objects;
@@ -65,6 +70,52 @@ public class KafkaConnectUtils {
 
   private static final Logger LOG = LogManager.getLogger(KafkaConnectUtils.class);
   private static final String HOODIE_CONF_PREFIX = "hoodie.";
+  public static final String HADOOP_CONF_DIR = "HADOOP_CONF_DIR";
+  public static final String HADOOP_HOME = "HADOOP_HOME";
+  private static final List<Path> DEFAULT_HADOOP_CONF_FILES;
+
+  static {
+    DEFAULT_HADOOP_CONF_FILES = new ArrayList<>();
+    try {
+      String hadoopConfigPath = System.getenv(HADOOP_CONF_DIR);
+      String hadoopHomePath = System.getenv(HADOOP_HOME);
+      DEFAULT_HADOOP_CONF_FILES.addAll(getHadoopConfigFiles(hadoopConfigPath, hadoopHomePath));
+      if (!DEFAULT_HADOOP_CONF_FILES.isEmpty()) {
+        LOG.info(String.format("Found Hadoop default config files %s", DEFAULT_HADOOP_CONF_FILES));
+      }
+    } catch (IOException e) {
+      LOG.error("An error occurred while getting the default Hadoop configuration. "
+          + "Please use hadoop.conf.dir or hadoop.home to configure Hadoop environment variables", e);
+    }
+  }
+
+  /**
+   * Get hadoop config files by HADOOP_CONF_DIR or HADOOP_HOME
+   */
+  public static List<Path> getHadoopConfigFiles(String hadoopConfigPath, String hadoopHomePath)
+      throws IOException {
+    List<Path> hadoopConfigFiles = new ArrayList<>();
+    if (!StringUtils.isNullOrEmpty(hadoopConfigPath)) {
+      hadoopConfigFiles.addAll(walkTreeForXml(Paths.get(hadoopConfigPath)));
+    }
+    if (hadoopConfigFiles.isEmpty() && !StringUtils.isNullOrEmpty(hadoopHomePath)) {
+      hadoopConfigFiles.addAll(walkTreeForXml(Paths.get(hadoopHomePath, "etc", "hadoop")));
+    }
+    return hadoopConfigFiles;
+  }
+
+  /**
+   * Files walk to find xml
+   */
+  private static List<Path> walkTreeForXml(Path basePath) throws IOException {
+    if (Files.notExists(basePath)) {
+      return new ArrayList<>();
+    }
+    return Files.walk(basePath, FileVisitOption.FOLLOW_LINKS)
+        .filter(path -> path.toFile().isFile())
+        .filter(path -> path.toString().endsWith(".xml"))
+        .collect(Collectors.toList());
+  }
 
   public static int getLatestNumPartitions(String bootstrapServers, String topicName) {
     Properties props = new Properties();
@@ -89,6 +140,23 @@ public class KafkaConnectUtils {
    */
   public static Configuration getDefaultHadoopConf(KafkaConnectConfigs connectConfigs) {
     Configuration hadoopConf = new Configuration();
+
+    // add hadoop config files
+    if (!StringUtils.isNullOrEmpty(connectConfigs.getHadoopConfDir())
+        || !StringUtils.isNullOrEmpty(connectConfigs.getHadoopConfHome())) {
+      try {
+        List<Path> configFiles = getHadoopConfigFiles(connectConfigs.getHadoopConfDir(),
+            connectConfigs.getHadoopConfHome());
+        configFiles.forEach(f ->
+            hadoopConf.addResource(new org.apache.hadoop.fs.Path(f.toAbsolutePath().toUri())));
+      } catch (Exception e) {
+        throw new HoodieException("
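The `walkTreeForXml` helper in the commit above boils down to a `Files.walk` over the config directory, keeping regular `*.xml` files. A self-contained sketch of the same idea (the class name and the unchecked-exception wrapping are my own, not part of the commit):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileVisitOption;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class XmlConfigScan {

    // Recursively collects *.xml files under a base directory, following
    // symlinks -- the same filtering the commit applies to HADOOP_CONF_DIR
    // or $HADOOP_HOME/etc/hadoop.
    static List<Path> findXmlFiles(Path baseDir) {
        if (Files.notExists(baseDir)) {
            return Collections.emptyList(); // e.g. the env variable is unset
        }
        try (Stream<Path> walk = Files.walk(baseDir, FileVisitOption.FOLLOW_LINKS)) {
            return walk.filter(Files::isRegularFile)
                .filter(p -> p.toString().endsWith(".xml"))
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("hadoop_conf_demo");
        Files.write(tmp.resolve("core-site.xml"), "<configuration/>".getBytes());
        Files.write(tmp.resolve("README.txt"), "not a config".getBytes());
        System.out.println(findXmlFiles(tmp).size()); // prints "1"
    }
}
```

Note the try-with-resources around `Files.walk`: the returned stream holds open directory handles and must be closed, which the commit's one-expression return leaves to the garbage collector.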
[GitHub] [hudi] yihua merged pull request #4451: [HUDI-3104] Kafka-connect support hadoop config environments and properties
yihua merged pull request #4451: URL: https://github.com/apache/hudi/pull/4451 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a change in pull request #4458: [HUDI-3112] Fix KafkaConnect can not sync to Hive Problem
yihua commented on a change in pull request #4458: URL: https://github.com/apache/hudi/pull/4458#discussion_r780742854 ## File path: hudi-kafka-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectTransactionServices.java ## @@ -185,20 +187,50 @@ private void syncMeta() { } private void syncHive() { -HiveSyncConfig hiveSyncConfig = DataSourceUtils.buildHiveSyncConfig( -new TypedProperties(connectConfigs.getProps()), -tableBasePath, -"PARQUET"); +HiveSyncConfig hiveSyncConfig = buildSyncConfig(new TypedProperties(connectConfigs.getProps()), tableBasePath); +String url; +if (!StringUtils.isNullOrEmpty(hiveSyncConfig.syncMode) && HiveSyncMode.of(hiveSyncConfig.syncMode) == HiveSyncMode.HMS) { + url = hadoopConf.get(KafkaConnectConfigs.HIVE_METASTORE_URIS); +} else { + url = hiveSyncConfig.jdbcUrl; +} + LOG.info("Syncing target hoodie table with hive table(" + hiveSyncConfig.tableName + "). Hive metastore URL :" -+ hiveSyncConfig.jdbcUrl ++ url + ", basePath :" + tableBasePath); -LOG.info("Hive Sync Conf => " + hiveSyncConfig.toString()); +LOG.info("Hive Sync Conf => " + hiveSyncConfig); FileSystem fs = FSUtils.getFs(tableBasePath, hadoopConf); HiveConf hiveConf = new HiveConf(); hiveConf.addResource(fs.getConf()); LOG.info("Hive Conf => " + hiveConf.getAllProperties().toString()); new HiveSyncTool(hiveSyncConfig, hiveConf, fs).syncHoodieTable(); } + + /** + * Build Hive Sync Config + */ + public HiveSyncConfig buildSyncConfig(TypedProperties props, String tableBasePath) { Review comment: Let's move this util method to `KafkaConnectUtils` class. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
dongkelun commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780742793 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java ## @@ -90,6 +90,11 @@ // It is here so that both the client and deltastreamer use the same reference public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty + .key("hoodie.database.name") + .defaultValue("default") Review comment: Would it be better if databaseName had a default value, like `hive`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yihua commented on a change in pull request #4458: [HUDI-3112] Fix KafkaConnect can not sync to Hive Problem
yihua commented on a change in pull request #4458: URL: https://github.com/apache/hudi/pull/4458#discussion_r780742608 ## File path: hudi-kafka-connect/src/main/java/org/apache/hudi/connect/writers/KafkaConnectTransactionServices.java ## @@ -185,20 +187,50 @@ private void syncMeta() { } private void syncHive() { -HiveSyncConfig hiveSyncConfig = DataSourceUtils.buildHiveSyncConfig( -new TypedProperties(connectConfigs.getProps()), -tableBasePath, -"PARQUET"); +HiveSyncConfig hiveSyncConfig = buildSyncConfig(new TypedProperties(connectConfigs.getProps()), tableBasePath); +String url; +if (!StringUtils.isNullOrEmpty(hiveSyncConfig.syncMode) && HiveSyncMode.of(hiveSyncConfig.syncMode) == HiveSyncMode.HMS) { + url = hadoopConf.get(KafkaConnectConfigs.HIVE_METASTORE_URIS); +} else { + url = hiveSyncConfig.jdbcUrl; +} + LOG.info("Syncing target hoodie table with hive table(" + hiveSyncConfig.tableName + "). Hive metastore URL :" -+ hiveSyncConfig.jdbcUrl ++ url + ", basePath :" + tableBasePath); -LOG.info("Hive Sync Conf => " + hiveSyncConfig.toString()); +LOG.info("Hive Sync Conf => " + hiveSyncConfig); FileSystem fs = FSUtils.getFs(tableBasePath, hadoopConf); HiveConf hiveConf = new HiveConf(); hiveConf.addResource(fs.getConf()); LOG.info("Hive Conf => " + hiveConf.getAllProperties().toString()); new HiveSyncTool(hiveSyncConfig, hiveConf, fs).syncHoodieTable(); } + + /** + * Build Hive Sync Config + */ + public HiveSyncConfig buildSyncConfig(TypedProperties props, String tableBasePath) { Review comment: @cdmikechen Understood. I'm thinking about only moving the util methods related to Hive sync configs, not the Hive sync logic, to a separate Util class. The worry I have is that the Hive sync configs are now spread across different places and they may diverge if we forget to update all of them consistently. We can keep this PR as is for now. @cdmikechen could you create a Jira ticket to track the Hive sync config unification, which will be done in a different PR in the future? 
-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
dongkelun commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780742376 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java ## @@ -117,9 +124,10 @@ private void parseInputPaths(Path[] inputPaths, List incrementalTables) } } - private void tagAsIncrementalOrSnapshot(Path inputPath, String tableName, + private void tagAsIncrementalOrSnapshot(Path inputPath, String databaseName, String tableName, Review comment: Yes, but the outer layer also needs databaseName and tableName. I'm not sure which is better -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
dongkelun commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780741795 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java ## @@ -90,6 +90,11 @@ // It is here so that both the client and deltastreamer use the same reference public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty + .key("hoodie.database.name") + .defaultValue("default") Review comment: If there is no default value, must it be set, like tableName? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
dongkelun commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780741275 ## File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala ## @@ -33,6 +33,10 @@ import scala.collection.JavaConverters._ class TestCreateTable extends TestHoodieSqlBase { test("Test Create Managed Hoodie Table") { +val databaseName = "test_incremental" Review comment: ok -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] dongkelun commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
dongkelun commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780741202 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java ## @@ -90,6 +90,11 @@ // It is here so that both the client and deltastreamer use the same reference public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty Review comment: Yes, just to keep it consistent with the other existing parameters. If not, should we leave the other parameters unchanged for the time being, or is it better to revise them all uniformly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-3197) Validate partition pruning with Hudi
[ https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3197: - Reviewers: Raymond Xu, sivabalan narayanan (was: sivabalan narayanan) > Validate partition pruning with Hudi > > > Key: HUDI-3197 > URL: https://issues.apache.org/jira/browse/HUDI-3197 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Assignee: sivabalan narayanan >Priority: Major > Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot > 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, > Screen Shot 2022-01-08 at 3.26.53 PM.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (HUDI-3197) Validate partition pruning with Hudi
[ https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu reassigned HUDI-3197: Assignee: sivabalan narayanan (was: Raymond Xu) > Validate partition pruning with Hudi > > > Key: HUDI-3197 > URL: https://issues.apache.org/jira/browse/HUDI-3197 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Assignee: sivabalan narayanan >Priority: Major > Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot > 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, > Screen Shot 2022-01-08 at 3.26.53 PM.png > >
[jira] [Closed] (HUDI-3195) optimize spark3 pom and modify build command
[ https://issues.apache.org/jira/browse/HUDI-3195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3195. Resolution: Fixed > optimize spark3 pom and modify build command > > > Key: HUDI-3195 > URL: https://issues.apache.org/jira/browse/HUDI-3195 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Yann Byron >Assignee: Yann Byron >Priority: Critical > Labels: pull-request-available > Fix For: 0.10.1 > > > When using `mvn clean install -P "scala-2.12,spark3,spark3.1.x"` to build the Spark 3 support > and then writing data with Spark, the following error occurs: > > > {code:java} > ERROR Executor: Exception in task 0.0 in stage 26.0 (TID 2005) > java.lang.RuntimeException: org.apache.hudi.exception.HoodieException: > org.apache.hudi.exception.HoodieException: > java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: > org/apache/parquet/schema/LogicalTypeAnnotation$LogicalTypeAnnotationVisitor > at > org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121) > at > scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44) > at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) > at > org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:221) > {code} > > > This is due to the use of Parquet 1.12.1. > > Also, we use this build command in CI, which can mislead users.
[jira] [Updated] (HUDI-3190) Validate and certify partition pruning for hudi tables w/ spark queries
[ https://issues.apache.org/jira/browse/HUDI-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3190: - Fix Version/s: (was: 0.10.1) > Validate and certify partition pruning for hudi tables w/ spark queries > --- > > Key: HUDI-3190 > URL: https://issues.apache.org/jira/browse/HUDI-3190 > Project: Apache Hudi > Issue Type: Task > Components: Spark Integration >Reporter: sivabalan narayanan >Priority: Major > > Validate and certify partition pruning for hudi tables w/ spark queries
[jira] [Closed] (HUDI-3190) Validate and certify partition pruning for hudi tables w/ spark queries
[ https://issues.apache.org/jira/browse/HUDI-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3190. Assignee: (was: Raymond Xu) Resolution: Duplicate > Validate and certify partition pruning for hudi tables w/ spark queries > --- > > Key: HUDI-3190 > URL: https://issues.apache.org/jira/browse/HUDI-3190 > Project: Apache Hudi > Issue Type: Task > Components: Spark Integration >Reporter: sivabalan narayanan >Priority: Major > Fix For: 0.10.1 > > > Validate and certify partition pruning for hudi tables w/ spark queries
[jira] [Closed] (HUDI-3197) Validate partition pruning with Hudi
[ https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-3197. Reviewers: sivabalan narayanan Resolution: Done > Validate partition pruning with Hudi > > > Key: HUDI-3197 > URL: https://issues.apache.org/jira/browse/HUDI-3197 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot > 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, > Screen Shot 2022-01-08 at 3.26.53 PM.png > >
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280 ] Harsha Teja Kanna edited comment on HUDI-3066 at 1/9/22, 6:11 AM: -- Hi [~shivnarayan] Basic question: I am trying to use just the base path without wildcards, but I am facing this issue. The table effectively has two columns with the same name: it was created using a timestamp column for key generation, and that same column is mapped to a date partition via hoodie.datasource.write.partitionpath.field=entrydate:timestamp, so the partition path is entrydate=yyyy/mm/dd. {code:java} import org.apache.hudi.DataSourceReadOptions import org.apache.hudi.common.config.HoodieMetadataConfig val df = spark. read. format("org.apache.hudi"). option(HoodieMetadataConfig.ENABLE.key(), "true"). option(DataSourceReadOptions.QUERY_TYPE.key(), DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL). load("s3a://datalake-hudi/sessions_by_entrydate/") df.createOrReplaceTempView("sessions") spark.sql("SELECT count(*) FROM sessions").show() {code} Without wildcards.
spark inferring the column type and query fails with {code:java} Caused by: java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:107) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong(rows.scala:42) at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getLong$(rows.scala:42) at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getLong(rows.scala:195) at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:98) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:230) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initBatch(VectorizedParquetRecordReader.java:249) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:331) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:116) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:169) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:132) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} was (Author: h7kanna): Hi [~shivnarayan] Basic question I am trying to use just the base-path without the wildcards. But facing this issue. Table effectively has two columns with same name. table is created using a timestamp column for key generation and mapped to date partition. using hoodie.datasource.write.partitionpath.field=entrydate:timestamp So the partition is entrydate=/mm/dd. Without wildcards. spark inferring the column type and query fails with java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > --
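A plausible cause of the ClassCastException above is Spark's partition-column type inference: when loading by base path, Spark infers a type for the `entrydate` partition column from the directory values while the rows carry a different type. As a workaround sketch (an assumption on our part, not a fix confirmed in this thread), inference can be disabled so the partition column is read back as a plain string:

```python
# Hypothetical workaround sketch: keep Spark from inferring a numeric/date
# type for the `entrydate` partition column. Requires a running SparkSession
# with the Hudi bundle on the classpath; not a confirmed fix for HUDI-3066.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-partition-type-workaround")
    # read partition directory values back as plain strings
    .config("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
    .getOrCreate()
)

df = (
    spark.read.format("org.apache.hudi")
    .option("hoodie.metadata.enable", "true")
    .load("s3a://datalake-hudi/sessions_by_entrydate/")
)
df.createOrReplaceTempView("sessions")
spark.sql("SELECT count(*) FROM sessions").show()
```

`spark.sql.sources.partitionColumnTypeInference.enabled` is a standard Spark SQL setting; disabling it trades typed partition columns for predictable string behavior, so filters on `entrydate` would then compare strings.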
[GitHub] [hudi] YannByron commented on pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on pull request #4083: URL: https://github.com/apache/hudi/pull/4083#issuecomment-1008236642 @dongkelun LGTM, just left some minor comments. @xushiyan please take a further look at whether this strategy makes sense.
[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780737792 ## File path: hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/TestHoodieHFileInputFormat.java ## @@ -235,6 +235,50 @@ public void testIncrementalSimple() throws IOException { FileStatus[] files = inputFormat.listStatus(jobConf); assertEquals(0, files.length, "We should exclude commit 100 when returning incremental pull with start commit time as 100"); + +InputFormatTestUtil.setupIncremental(jobConf, "100", 1, true); + +files = inputFormat.listStatus(jobConf); +assertEquals(10, files.length, +"When hoodie.incremental.use.database is true and the incremental database name is not set," Review comment: more indent here.
[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780737251 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/InputPathHandler.java ## @@ -117,9 +124,10 @@ private void parseInputPaths(Path[] inputPaths, List incrementalTables) } } - private void tagAsIncrementalOrSnapshot(Path inputPath, String tableName, + private void tagAsIncrementalOrSnapshot(Path inputPath, String databaseName, String tableName, Review comment: Can we change the signature of this method to tagAsIncrementalOrSnapshot(Path inputPath, HoodieTableMetaClient metaClient, List incrementalTables), and get the database name and the table name inside the method?
[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780736806 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieHiveUtils.java ## @@ -175,4 +177,9 @@ private static HoodieTimeline filterIfInstantExists(String tableName, HoodieTime } return timeline.findInstantsBeforeOrEquals(maxCommit); } + + public static boolean isIncrementalUseDatabase(JobContext job) { +return job.getConfiguration().get(HOODIE_INCREMENTAL_USE_DATABASE, String.valueOf(DEFAULT_INCREMENTAL_USE_DATABASE)) Review comment: job.getConfiguration().getBoolean(HOODIE_INCREMENTAL_USE_DATABASE, false) ?
[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780736465 ## File path: hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestUtils.java ## @@ -46,6 +46,7 @@ */ public class HoodieTestUtils { + public static final String INCREMENTAL_DATABASE_NAME = "test_incremental"; Review comment: can use a more common database name, like hoodie_database?
[GitHub] [hudi] nsivabalan edited a comment on issue #4170: [SUPPORT] Understanding Clustering Behavior
nsivabalan edited a comment on issue #4170: URL: https://github.com/apache/hudi/issues/4170#issuecomment-1008232146 @rubenssoto : hey, any updates in this regard please. unless we get more logs, we can't do much here.
[GitHub] [hudi] nsivabalan commented on issue #4170: [SUPPORT] Understanding Clustering Behavior
nsivabalan commented on issue #4170: URL: https://github.com/apache/hudi/issues/4170#issuecomment-1008232146 @rubenssoto : hey, any updates in this regard please. unless we get more logs, we can't help much here.
[GitHub] [hudi] nsivabalan commented on issue #4200: spark-sql query timestamp partition error
nsivabalan commented on issue #4200: URL: https://github.com/apache/hudi/issues/4200#issuecomment-1008232053 thanks for confirming. appreciate it.
[GitHub] [hudi] nsivabalan closed issue #4200: spark-sql query timestamp partition error
nsivabalan closed issue #4200: URL: https://github.com/apache/hudi/issues/4200
[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780735886 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java ## @@ -90,6 +90,11 @@ // It is here so that both the client and deltastreamer use the same reference public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty + .key("hoodie.database.name") + .defaultValue("default") Review comment: Do not set the default value to keep compatibility.
[GitHub] [hudi] nsivabalan commented on issue #4208: [SUPPORT] On Hudi 0.9.0 - Alter table throws java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.alterTable(java.lang.String,
nsivabalan commented on issue #4208: URL: https://github.com/apache/hudi/issues/4208#issuecomment-1008232003 @YannByron @xushiyan : can you folks please follow up on this.
[GitHub] [hudi] nsivabalan commented on issue #4230: [SUPPORT] org.apache.hudi.exception.HoodieRemoteException: Failed to create marker file
nsivabalan commented on issue #4230: URL: https://github.com/apache/hudi/issues/4230#issuecomment-1008231940 @yihua : gentle ping to follow up on the issue. If there is some regression, we might want to fix in 0.10.1. would appreciate if you can follow up on this.
[GitHub] [hudi] nsivabalan commented on issue #4439: [BUG] ROLLBACK meet Cannot use marker based rollback strategy on completed error
nsivabalan commented on issue #4439: URL: https://github.com/apache/hudi/issues/4439#issuecomment-1008231844 Hey @waywtdcc : let us know if you are looking for any more assistance. If not, feel free to close out the issue.
[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780735803 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/catalyst/catalog/HoodieCatalogTable.scala ## @@ -164,9 +169,14 @@ class HoodieCatalogTable(val spark: SparkSession, val table: CatalogTable) exten val properties = new Properties() properties.putAll(tableConfigs.asJava) +val newDatabaseName = if (hoodieTableExists) databaseName else Review comment: newDatabaseName => hoodieDatabaseName
[GitHub] [hudi] nsivabalan commented on issue #4429: [SUPPORT] Spark SQL CTAS command doesn't work with 0.10.0 version and Spark 3.1.1
nsivabalan commented on issue #4429: URL: https://github.com/apache/hudi/issues/4429#issuecomment-1008231688 hey folks. if the issue is resolved, can we close out the github issue. thanks to Yann for quick turn around.
[GitHub] [hudi] nsivabalan closed issue #4419: [SUPPORT] Not An Avro File (flink)
nsivabalan closed issue #4419: URL: https://github.com/apache/hudi/issues/4419
[GitHub] [hudi] nsivabalan commented on issue #4419: [SUPPORT] Not An Avro File (flink)
nsivabalan commented on issue #4419: URL: https://github.com/apache/hudi/issues/4419#issuecomment-1008231564 thanks for confirming.
[GitHub] [hudi] nsivabalan commented on issue #4457: [SUPPORT] Hudi archive stopped working
nsivabalan commented on issue #4457: URL: https://github.com/apache/hudi/issues/4457#issuecomment-1008231502 @zuyanton : hey, do you have any updates for us. CC @prashantwason does something pop up for you.
[GitHub] [hudi] nsivabalan commented on issue #4461: [SUPPORT]Hudi(0.10.0) write to Aliyun oss using metadata table warning
nsivabalan commented on issue #4461: URL: https://github.com/apache/hudi/issues/4461#issuecomment-1008231420 @nikenfls : do you have any updates for us. if the issue is resolved, can we close out the github issue. thanks.
[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780735473 ## File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala ## @@ -33,6 +33,10 @@ import scala.collection.JavaConverters._ class TestCreateTable extends TestHoodieSqlBase { test("Test Create Managed Hoodie Table") { +val databaseName = "test_incremental" Review comment: test_incremental => hudi_database? ## File path: hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestCreateTable.scala ## @@ -332,14 +344,19 @@ class TestCreateTable extends TestHoodieSqlBase { test("Test Create Table From Exist Hoodie Table") { withTempDir { tmp => + val databaseName = "test_incremental" Review comment: ditto
[GitHub] [hudi] nsivabalan commented on issue #4456: [SUPPORT] MultiWriter w/ DynamoDB - Unable to acquire lock, lock object null
nsivabalan commented on issue #4456: URL: https://github.com/apache/hudi/issues/4456#issuecomment-1008231311 @nochimow : a gentle reminder to respond to the question above. The commenter above is a Hoodie committer who added the DynamoDB lock provider, so he should be able to help in your case.
[GitHub] [hudi] nsivabalan commented on issue #4434: [SUPPORT]why are there many files under the Hoodie file?
nsivabalan commented on issue #4434: URL: https://github.com/apache/hudi/issues/4434#issuecomment-1008231118 @tieke1121 : hey are you looking for more info. let us know. if not, feel free to close out the github issue.
[GitHub] [hudi] nsivabalan commented on issue #4477: [SUPPORT]using spark on TimestampBasedKeyGenerator has no result when query by partition column
nsivabalan commented on issue #4477: URL: https://github.com/apache/hudi/issues/4477#issuecomment-1008231035 @YannByron : may I know what's the tracking ticket? If there isn't one, can we create one for the issue reported in this GitHub issue.
[GitHub] [hudi] nsivabalan closed issue #4474: [SUPPORT] Should we shade all aws dependencies to avoid class conflicts?
nsivabalan closed issue #4474: URL: https://github.com/apache/hudi/issues/4474
[GitHub] [hudi] nsivabalan commented on issue #4474: [SUPPORT] Should we shade all aws dependencies to avoid class conflicts?
nsivabalan commented on issue #4474: URL: https://github.com/apache/hudi/issues/4474#issuecomment-1008230937 Closing the github issue as we have a tracking jira. thank you folks for chiming in.
[GitHub] [hudi] nsivabalan commented on issue #4541: [SUPPORT] NullPointerException while writing Bulk ingest table
nsivabalan commented on issue #4541: URL: https://github.com/apache/hudi/issues/4541#issuecomment-1008230765 let's try to remove some advanced configs, and test if we can make a simple job succeed, and then we can add back more configs to deduce the issue. - I see you have added a lot of custom configs for the index. can we remove them for now. ``` 'hoodie.bloom.index.bucketized.checking': True, 'hoodie.bloom.index.keys.per.bucket': 5000, 'hoodie.index.bloom.num_entries': 100, 'hoodie.bloom.index.use.caching': True, 'hoodie.bloom.index.use.treebased.filter': True, 'hoodie.bloom.index.filter.type': 'DYNAMIC_V0', 'hoodie.bloom.index.filter.dynamic.max.entries': 100, 'hoodie.bloom.index.prune.by.ranges': True, ``` - 'write.parquet.block.size': 256 seems very low. Can we remove this for now. - I see the exception arises from clustering code. let's try to remove the clustering configs for now. ``` 'hoodie.clustering.inline': True, 'hoodie.clustering.inline.max.commits': '1', 'hoodie.clustering.plan.strategy.small.file.limit': '1073741824', 'hoodie.clustering.plan.strategy.target.file.max.bytes': '2147483648', 'hoodie.clustering.execution.strategy.class': 'org.apache.hudi.client.clustering.run.strategy' '.SparkSortAndSizeExecutionStrategy', 'hoodie.clustering.plan.strategy.sort.columns': sort_cols, ``` Let's see if the job succeeds after making the above modifications, and we can go from there.
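The debugging approach suggested in the comment above — strip the advanced tuning options, get a minimal job passing, then add them back — can be sketched as a small helper. The helper and its name are illustrative, not a Hudi API; the key prefixes are taken from the configs quoted in the comment, and the table name is a made-up placeholder.

```python
# Illustrative helper (not part of Hudi): drop the advanced bloom-index and
# clustering tuning keys so a minimal writer job can be tested first.
ADVANCED_PREFIXES = (
    "hoodie.bloom.index.",
    "hoodie.index.bloom.",
    "hoodie.clustering.",
)

def strip_advanced_configs(options):
    """Return a copy of the writer options without the advanced tuning keys."""
    return {
        key: value
        for key, value in options.items()
        if not key.startswith(ADVANCED_PREFIXES)
    }

minimal = strip_advanced_configs({
    "hoodie.table.name": "bulk_ingest_tbl",          # hypothetical table name
    "hoodie.bloom.index.bucketized.checking": True,  # advanced: dropped
    "hoodie.clustering.inline": True,                # advanced: dropped
    "write.parquet.block.size": 256,                 # kept; review separately
})
```

Once the minimal option set succeeds, the dropped keys can be reintroduced one group at a time to isolate which one triggers the failure.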
[GitHub] [hudi] YannByron commented on a change in pull request #4083: [HUDI-2837] The original hoodie.table.name should be maintained in Spark SQL
YannByron commented on a change in pull request #4083: URL: https://github.com/apache/hudi/pull/4083#discussion_r780735306 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java ## @@ -90,6 +90,11 @@ // It is here so that both the client and deltastreamer use the same reference public static final String DELTASTREAMER_CHECKPOINT_KEY = "deltastreamer.checkpoint.key"; + public static final ConfigProperty DATABASE_NAME = ConfigProperty Review comment: better to point to the definition of `HoodieTableConfig.DATABASE_NAME` directly, to avoid defining it repeatedly.
[GitHub] [hudi] nsivabalan commented on issue #4539: [SUPPORT] spark 2.4.0 write data to hudi ERROR (0.10.0)
nsivabalan commented on issue #4539: URL: https://github.com/apache/hudi/issues/4539#issuecomment-1008230027 Spark 2.4.0 is not supported. Can you try with Spark 2.4.3 or a higher version?
[jira] [Comment Edited] (HUDI-3197) Validate partition pruning with Hudi
[ https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471283#comment-17471283 ] Raymond Xu edited comment on HUDI-3197 at 1/9/22, 4:49 AM: --- {code:java} -- create a partitioned, preCombineField-provided cow table create table hudi_cow_pt_tbl ( id bigint, name string, ts bigint, dt string, hh string ) using hudi tblproperties ( type = 'cow', primaryKey = 'id', preCombineField = 'ts' ) partitioned by (dt, hh) location '/tmp/hudi/hudi_cow_pt_tbl'; -- insert sample data across partitions insert into hudi_cow_pt_tbl (id, name, ts, dt, hh) values (1, 'foo1', 1000, '20210701', '11'), (2, 'foo2', 1001, '20210701', '12'), (3, 'foo3', 1003, '20210701', '13'), (4, 'foo4', 1004, '20210701', '14'); -- create an external Hudi table create table hudi_cow_existing_tbl using hudi partitioned by (dt, hh) location 'file:///tmp/hudi/hudi_cow_pt_tbl'; -- query with partition pruning select * from hudi_cow_existing_tbl where dt = '20210701' and hh = '13'; {code} This validates that a table created via Spark SQL works with partition pruning. > Validate partition pruning with Hudi > > > Key:
HUDI-3197 > URL: https://issues.apache.org/jira/browse/HUDI-3197 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot > 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, > Screen Shot 2022-01-08 at 3.26.53 PM.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
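As a rough illustration of what partition pruning buys in the query above — skipping whole partition directories whose `dt`/`hh` values fail the filter, so their files are never read — here is a toy sketch. This is not Hudi's or Spark's actual pruning code; the path layout mirrors the `partitioned by (dt, hh)` table in the validation.

```python
def prune_partitions(paths, dt, hh):
    """Keep only file paths under the partition directory matching dt/hh."""
    prefix = f"dt={dt}/hh={hh}/"
    return [p for p in paths if p.startswith(prefix)]

paths = [
    "dt=20210701/hh=11/file1.parquet",
    "dt=20210701/hh=13/file2.parquet",
    "dt=20210701/hh=14/file3.parquet",
]
# Only the hh=13 partition's file is scanned for the validation query.
print(prune_partitions(paths, "20210701", "13"))
```

In practice one would confirm pruning by inspecting the query plan (e.g. via `EXPLAIN`) and checking that the partition filters appear as pushed-down partition predicates rather than post-scan filters.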
[jira] [Updated] (HUDI-44) Compaction must preserve commit timestamps of merged records #376
[ https://issues.apache.org/jira/browse/HUDI-44?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-44: Status: Resolved (was: Patch Available) > Compaction must preserve commit timestamps of merged records #376 > - > > Key: HUDI-44 > URL: https://issues.apache.org/jira/browse/HUDI-44 > Project: Apache Hudi > Issue Type: Bug > Components: Compaction >Reporter: Vinoth Chandar >Assignee: sivabalan narayanan >Priority: Critical > Labels: core-flow-ds, help-requested, pull-request-available, > sev:critical > Fix For: 0.10.1 > > > https://github.com/uber/hudi/issues/376 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (HUDI-3197) Validate partition pruning with Hudi
[ https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3197: - Attachment: (was: Screen Shot 2022-01-08 at 8.43.50 PM.png) > Validate partition pruning with Hudi > > > Key: HUDI-3197 > URL: https://issues.apache.org/jira/browse/HUDI-3197 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot > 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, > Screen Shot 2022-01-08 at 3.26.53 PM.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (HUDI-2947) HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config from CLI in continuous mode
[ https://issues.apache.org/jira/browse/HUDI-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan resolved HUDI-2947. --- > HoodieDeltaStreamer/DeltaSync can improperly pick up the checkpoint config > from CLI in continuous mode > -- > > Key: HUDI-2947 > URL: https://issues.apache.org/jira/browse/HUDI-2947 > Project: Apache Hudi > Issue Type: Bug >Reporter: Ethan Guo >Assignee: sivabalan narayanan >Priority: Critical > Labels: pull-request-available, sev:critical > Fix For: 0.10.1 > > > *Problem:* > When deltastreamer is started with a given checkpoint, e.g., `--checkpoint > 0`, in the continuous mode, the deltastreamer job may pick up the wrong > checkpoint later on. The wrong checkpoint (for 20211206203551080 commit) > happens after the replacecommit and clean, which is reset to "0", instead of > "5" after 20211206202728233.commit. More details below. > > The bug is due to the check here: > [https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L335] > {code:java} > if (cfg.checkpoint != null && > (StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY)) > || > !cfg.checkpoint.equals(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY { > resumeCheckpointStr = Option.of(cfg.checkpoint); > } {code} > In this case of resuming after a clustering commit, "cfg.checkpoint != null" > and > "StringUtils.isNullOrEmpty(commitMetadata.getMetadata(CHECKPOINT_RESET_KEY))" > are both true as "--checkpoint 0" is configured and last commit is > replacecommit without checkpoint keys. This leads to the resume checkpoint > string being reset to the configured checkpoint, skipping the timeline > walk-back logic below, which is wrong. 
> > Timeline: > > {code:java} > 189069 Dec 6 12:19 20211206201238649.commit > 0 Dec 6 12:12 20211206201238649.commit.requested > 0 Dec 6 12:12 20211206201238649.inflight > 189069 Dec 6 12:27 20211206201959151.commit > 0 Dec 6 12:20 20211206201959151.commit.requested > 0 Dec 6 12:20 20211206201959151.inflight > 189069 Dec 6 12:34 20211206202728233.commit > 0 Dec 6 12:27 20211206202728233.commit.requested > 0 Dec 6 12:27 20211206202728233.inflight > 36662 Dec 6 12:35 20211206203449899.replacecommit > 0 Dec 6 12:35 20211206203449899.replacecommit.inflight > 34656 Dec 6 12:35 20211206203449899.replacecommit.requested > 28013 Dec 6 12:35 20211206203503574.clean > 19024 Dec 6 12:35 20211206203503574.clean.inflight > 19024 Dec 6 12:35 20211206203503574.clean.requested > 189069 Dec 6 12:43 20211206203551080.commit > 0 Dec 6 12:35 20211206203551080.commit.requested > 0 Dec 6 12:35 20211206203551080.inflight > 189069 Dec 6 12:50 20211206204311612.commit > 0 Dec 6 12:43 20211206204311612.commit.requested > 0 Dec 6 12:43 20211206204311612.inflight > 0 Dec 6 12:50 20211206205044595.commit.requested > 0 Dec 6 12:50 20211206205044595.inflight > 128 Dec 6 12:56 archived > 483 Dec 6 11:52 hoodie.properties > {code} > > Checkpoints in commits: > > {code:java} > grep "deltastreamer.checkpoint.key" * > 20211206201238649.commit: "deltastreamer.checkpoint.key" : "2" > 20211206201959151.commit: "deltastreamer.checkpoint.key" : "3" > 20211206202728233.commit: "deltastreamer.checkpoint.key" : "4" > 20211206203551080.commit: "deltastreamer.checkpoint.key" : "1" > 20211206204311612.commit: "deltastreamer.checkpoint.key" : "2" {code} > > *Steps to reproduce:* > Run HoodieDeltaStreamer in the continuous mode, by providing both > "--checkpoint 0" and "--continuous", with inline clustering and sync clean > enabled (some configs are masked). 
> > {code:java} > spark-submit \ > --master yarn \ > --driver-memory 8g --executor-memory 8g --num-executors 3 --executor-cores > 4 \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --conf > spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain > \ > --conf spark.speculation=true \ > --conf spark.speculation.multiplier=1.0 \ > --conf spark.speculation.quantile=0.5 \ > --packages org.apache.spark:spark-avro_2.12:3.2.0 \ > --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \ > file:/home/hadoop/ethan/hudi-utilities-bundle_2.12-0.10.0-rc3.jar \ > --props file:/home/hadoop/ethan/test.properties \ > --source-class ... \ > --source-ordering-field ts \ > --target-base-path s3a://hudi-testing/test_hoodie_table_11/ \ > --target-table test_table \ > --table-type COPY_ON_WRITE \ > --op BULK_INSERT
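The timeline walk-back that the ticket says is being skipped can be illustrated with a toy model. This is a hypothetical simplification, not DeltaSync's actual code: the idea is that resuming should prefer the most recent commit carrying a checkpoint key, falling back to the CLI-provided `--checkpoint` only when no commit has one (and the CLI value was not already applied).

```python
def resume_checkpoint(timeline, cli_checkpoint):
    """Walk the timeline backwards and resume from the most recent commit
    that recorded a checkpoint key; fall back to the CLI checkpoint only
    when no commit carries one. Entries are (instant, checkpoint_or_None)."""
    for instant, checkpoint in reversed(timeline):
        if checkpoint is not None:
            return checkpoint
    return cli_checkpoint

timeline = [
    ("20211206201959151", "3"),
    ("20211206202728233", "4"),
    ("20211206203449899", None),  # replacecommit: no checkpoint key
]
# The bug resets to the CLI value "0" here; walking back yields "4" instead.
print(resume_checkpoint(timeline, "0"))
```

This mirrors the failure mode in the ticket: after the replacecommit (which has no checkpoint metadata), the buggy condition short-circuits to the configured checkpoint instead of walking back to the last commit that recorded one.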
[jira] [Updated] (HUDI-3197) Validate partition pruning with Hudi
[ https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu updated HUDI-3197: - Attachment: Screen Shot 2022-01-08 at 8.43.50 PM.png > Validate partition pruning with Hudi > > > Key: HUDI-3197 > URL: https://issues.apache.org/jira/browse/HUDI-3197 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot > 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, > Screen Shot 2022-01-08 at 3.26.53 PM.png, Screen Shot 2022-01-08 at 8.43.50 > PM.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type
[ https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471282#comment-17471282 ] sivabalan narayanan commented on HUDI-2909: --- [~codope] : can you help Harsha. > Partition field parsing fails due to KeyGenerator giving inconsistent value > for logical timestamp type > -- > > Key: HUDI-2909 > URL: https://issues.apache.org/jira/browse/HUDI-2909 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Harsha Teja Kanna >Assignee: Sagar Sumit >Priority: Blocker > Labels: core-flow-ds, pull-request-available, sev:critical > Fix For: 0.10.1 > > > Existing table has timebased keygen config show below > hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR > hoodie.deltastreamer.keygen.timebased.output.timezone=GMT > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd > hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS > hoodie.deltastreamer.keygen.timebased.input.timezone=GMT > hoodie.datasource.write.partitionpath.field=lastdate:timestamp > hoodie.datasource.write.operation=upsert > hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, > session.mid, to_timestamp(session.lastdate) as lastdate, > to_timestamp(session.updatedate) as updatedate FROM a > > Upgrading to 0.10.0 from 0.9.0 fails with exception > org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input > partition field :2021-12-01 10:13:34.702 > Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected > type for partition field: java.sql.Timestamp > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211) > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133) > *Workaround fix:* > Reverting this > 
https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation
hudi-bot removed a comment on pull request #4518: URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008222910 ## CI report: * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921) * 108b27f73f4656423be54bf4b20ba9dad8a26647 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5022) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
[GitHub] [hudi] hudi-bot commented on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation
hudi-bot commented on pull request #4518: URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008228966 ## CI report: * 108b27f73f4656423be54bf4b20ba9dad8a26647 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5022)
[GitHub] [hudi] danny0405 commented on pull request #4446: [HUDI-2917] rollback insert data appended to log file when using Hbase Index
danny0405 commented on pull request #4446: URL: https://github.com/apache/hudi/pull/4446#issuecomment-1008228935 Generally, I think we should figure out a way for the global index to distinguish between `INSERT` and `UPDATE` input records, instead of hacking the partitioner for write stats. That is too tricky for me.
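The distinction being asked for — classifying input records as `INSERT` vs `UPDATE` before write, based on whether the global index already knows the record key — could look roughly like the following. This is illustrative pseudologic under my own assumptions, not Hudi's actual index-tagging implementation.

```python
def tag_records(records, global_index):
    """Tag each incoming record as UPDATE when the global index already maps
    its key to an existing file group, otherwise INSERT.
    `records` is a list of (key, payload); `global_index` maps key -> file id."""
    tagged = []
    for key, payload in records:
        op = "UPDATE" if key in global_index else "INSERT"
        tagged.append((key, payload, op))
    return tagged

index = {"k1": "file-001"}
print(tag_records([("k1", "a"), ("k2", "b")], index))
```

With records tagged up front, the workload profile could be built from the tags directly instead of being derived inside the partitioner.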
[GitHub] [hudi] danny0405 commented on a change in pull request #4446: [HUDI-2917] rollback insert data appended to log file when using Hbase Index
danny0405 commented on a change in pull request #4446: URL: https://github.com/apache/hudi/pull/4446#discussion_r780733934 ## File path: hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/BaseJavaCommitActionExecutor.java ## @@ -90,27 +90,29 @@ public BaseJavaCommitActionExecutor(HoodieEngineContext context, public HoodieWriteMetadata> execute(List> inputRecords) { HoodieWriteMetadata> result = new HoodieWriteMetadata<>(); -WorkloadProfile profile = null; +WorkloadProfile inputProfile = null; if (isWorkloadProfileNeeded()) { - profile = new WorkloadProfile(buildProfile(inputRecords)); - LOG.info("Workload profile :" + profile); + inputProfile = new WorkloadProfile(buildProfile(inputRecords)); + LOG.info("Input workload profile :" + inputProfile); +} + +final Partitioner partitioner = getPartitioner(inputProfile); +try { + WorkloadProfile executionProfile = partitioner.getExecutionWorkloadProfile(); + LOG.info("Execution workload profile :" + inputProfile); + saveWorkloadProfileMetadataToInflight(executionProfile, instantTime); Review comment: Any why we must use the execution profile here ? I know the original profile also works only for bloomfilter index but we should fix the profile building instead of fetch it from the partitioner, if we have a way to distinguish between `INSERT`s and `UPDATE`s before write. ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java ## @@ -182,14 +182,28 @@ public abstract void preCompact( .withOperationField(config.allowOperationMetadataField()) .withPartition(operation.getPartitionPath()) .build(); -if (!scanner.iterator().hasNext()) { - scanner.close(); - return new ArrayList<>(); -} Option oldDataFileOpt = operation.getBaseFile(metaClient.getBasePath(), operation.getPartitionPath()); +// Considering following scenario: if all log blocks in this fileSlice is rollback, it returns an empty scanner. 
+// But in this case, we need to give it a base file. Otherwise, it will lose base file in following fileSlice. +if (!scanner.iterator().hasNext()) { + if (!oldDataFileOpt.isPresent()) { +scanner.close(); +return new ArrayList<>(); + } else { +// TODO: we may directly rename original parquet file if there is not evolution/devolution of schema Review comment: If the file slice only has parquet files, why we still trigger compaction ? ## File path: hudi-client/hudi-java-client/src/main/java/org/apache/hudi/table/action/commit/BaseJavaCommitActionExecutor.java ## @@ -90,27 +90,29 @@ public BaseJavaCommitActionExecutor(HoodieEngineContext context, public HoodieWriteMetadata> execute(List> inputRecords) { HoodieWriteMetadata> result = new HoodieWriteMetadata<>(); -WorkloadProfile profile = null; +WorkloadProfile inputProfile = null; if (isWorkloadProfileNeeded()) { - profile = new WorkloadProfile(buildProfile(inputRecords)); - LOG.info("Workload profile :" + profile); + inputProfile = new WorkloadProfile(buildProfile(inputRecords)); + LOG.info("Input workload profile :" + inputProfile); +} + +final Partitioner partitioner = getPartitioner(inputProfile); +try { + WorkloadProfile executionProfile = partitioner.getExecutionWorkloadProfile(); + LOG.info("Execution workload profile :" + inputProfile); + saveWorkloadProfileMetadataToInflight(executionProfile, instantTime); Review comment: Did you mean `executionProfile` ? ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/Partitioner.java ## @@ -18,10 +18,14 @@ package org.apache.hudi.table.action.commit; +import org.apache.hudi.table.WorkloadProfile; + import java.io.Serializable; public interface Partitioner extends Serializable { int getNumPartitions(); int getPartition(Object key); + + WorkloadProfile getExecutionWorkloadProfile(); } Review comment: Why a `Partitioner` returns the profile ? Let's not put the interface here. 
## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/BaseCommitActionExecutor.java ## @@ -97,15 +98,25 @@ void saveWorkloadProfileMetadataToInflight(WorkloadProfile profile, String insta insertStat.setFileId(""); insertStat.setPrevCommit(HoodieWriteStat.NULL_COMMIT); metadata.addWriteStat(path, insertStat); - -partitionStat.getUpdateLocationToCount().forEach((key, value) -> { - HoodieWriteStat writeStat = new HoodieWriteStat(); - writeStat.setFileId(key); - // TODO : Write baseCommitTime is possible here ? - writeStat.setPrevCommit(value.getKey()); - writeStat
[GitHub] [hudi] cdmikechen commented on a change in pull request #4451: [HUDI-3104] Kafka-connect support hadoop config environments and properties
cdmikechen commented on a change in pull request #4451: URL: https://github.com/apache/hudi/pull/4451#discussion_r780733280 ## File path: hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java ## @@ -89,6 +140,23 @@ public static int getLatestNumPartitions(String bootstrapServers, String topicNa */ public static Configuration getDefaultHadoopConf(KafkaConnectConfigs connectConfigs) { Configuration hadoopConf = new Configuration(); + +// add hadoop config files +if (!StringUtils.isNullOrEmpty(connectConfigs.getHadoopConfDir()) Review comment: @codope The default Hadoop configuration solves the single-environment case, but we may also need to support manually configuring `hadoop.conf.dir` or `hadoop.home` when different tasks need to write to different HDFS clusters. That is why I also added separate Hadoop environment config parameters to `KafkaConnectConfigs`.
[GitHub] [hudi] cdmikechen commented on a change in pull request #4451: [HUDI-3104] Kafka-connect support hadoop config environments and properties
cdmikechen commented on a change in pull request #4451: URL: https://github.com/apache/hudi/pull/4451#discussion_r780733010 ## File path: hudi-kafka-connect/src/main/java/org/apache/hudi/connect/utils/KafkaConnectUtils.java ## @@ -65,6 +70,52 @@ private static final Logger LOG = LogManager.getLogger(KafkaConnectUtils.class); private static final String HOODIE_CONF_PREFIX = "hoodie."; + public static final String HADOOP_CONF_DIR = "HADOOP_CONF_DIR"; + public static final String HADOOP_HOME = "HADOOP_HOME"; + private static final List DEFAULT_HADOOP_CONF_FILES; + + static { +DEFAULT_HADOOP_CONF_FILES = new ArrayList<>(); +try { + String hadoopConfigPath = System.getenv(HADOOP_CONF_DIR); + String hadoopHomePath = System.getenv(HADOOP_HOME); + DEFAULT_HADOOP_CONF_FILES.addAll(getHadoopConfigFiles(hadoopConfigPath, hadoopHomePath)); + if (!DEFAULT_HADOOP_CONF_FILES.isEmpty()) { +LOG.info(String.format("Found Hadoop default config files %s", DEFAULT_HADOOP_CONF_FILES)); + } Review comment: @codope My idea was: since the Hadoop environment is picked up by default, users need to know that Kafka Connect has obtained the correct information. That way they can judge whether the environment they set is wrong, or manually declare the Hadoop configuration path when registering the task. Because the default log level is INFO, it is easiest for users to see this information at the INFO level.
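The env-based resolution being discussed — reading config file locations from `HADOOP_CONF_DIR`, with `HADOOP_HOME/etc/hadoop` as a fallback — can be sketched as below. This is a toy under my own assumptions (the precedence order and file names are my reading of the thread, not the PR's Java code); the `exists` parameter is injected so the lookup can be exercised without a real filesystem.

```python
HADOOP_CONF_FILES = ("core-site.xml", "hdfs-site.xml")

def resolve_hadoop_conf_files(env, exists):
    """Return candidate Hadoop config files, checking HADOOP_CONF_DIR first
    and then HADOOP_HOME/etc/hadoop. `env` is a dict of environment
    variables; `exists` is a predicate over paths."""
    roots = []
    if env.get("HADOOP_CONF_DIR"):
        roots.append(env["HADOOP_CONF_DIR"])
    if env.get("HADOOP_HOME"):
        roots.append(env["HADOOP_HOME"] + "/etc/hadoop")
    found = []
    for root in roots:
        for name in HADOOP_CONF_FILES:
            path = f"{root}/{name}"
            if exists(path):
                found.append(path)
    return found

# Simulate a filesystem where each location holds one of the config files.
fake_fs = {"/conf/core-site.xml", "/hadoop/etc/hadoop/hdfs-site.xml"}
print(resolve_hadoop_conf_files(
    {"HADOOP_CONF_DIR": "/conf", "HADOOP_HOME": "/hadoop"},
    exists=fake_fs.__contains__))
```

Logging whichever files were resolved (as the review comment argues) gives users a way to verify the environment was picked up correctly before any task writes to HDFS.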
[jira] [Comment Edited] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280 ] Harsha Teja Kanna edited comment on HUDI-3066 at 1/9/22, 4:15 AM: -- Hi [~shivnarayan] Basic question I am trying to use just the base-path without the wildcards. But facing this issue. Table effectively has two columns with same name. table is created using a timestamp column for key generation and mapped to date partition. using hoodie.datasource.write.partitionpath.field=entrydate:timestamp So the partition is entrydate=/mm/dd. Without wildcards. spark inferring the column type and query fails with java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long was (Author: h7kanna): Hi [~shivnarayan] Basic question I am trying to use just the base-path without the wildcards. But facing this issue. Table effectively has two columns with same name. table is created using a timestamp column for key generation and mapped to date partition. using hoodie.datasource.write.partitionpath.field=entrydate:timestamp So the partition is entrydate=/mm/dd. Without wildcards. 
spark inferring the column type is query fails with java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 >Reporter: Harsha Teja Kanna >Assignee: sivabalan narayanan >Priority: Critical > Labels: performance, pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, > Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 > PM.png, metadata_files.txt, metadata_files_compacted.txt, > metadata_timeline.txt, metadata_timeline_archived.txt, > metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, > timeline.txt, writer_log.txt > > > After 'metadata table' is enabled, File listing takes long time. > If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. 
> Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877
[jira] [Commented] (HUDI-3066) Very slow file listing after enabling metadata for existing tables in 0.10.0 release
[ https://issues.apache.org/jira/browse/HUDI-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471280#comment-17471280 ] Harsha Teja Kanna commented on HUDI-3066: - Hi [~shivnarayan] Basic question I am trying to use just the base-path without the wildcards. But facing this issue. Table effectively has two columns with same name. table is created using a timestamp column for key generation and mapped to date partition. using hoodie.datasource.write.partitionpath.field=entrydate:timestamp So the partition is entrydate=/mm/dd. Without wildcards. spark inferring the column type is query fails with java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Long > Very slow file listing after enabling metadata for existing tables in 0.10.0 > release > > > Key: HUDI-3066 > URL: https://issues.apache.org/jira/browse/HUDI-3066 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 0.10.0 > Environment: EMR 6.4.0 > Hudi version : 0.10.0 >Reporter: Harsha Teja Kanna >Assignee: sivabalan narayanan >Priority: Critical > Labels: performance, pull-request-available > Fix For: 0.11.0 > > Attachments: Screen Shot 2021-12-18 at 6.16.29 PM.png, Screen Shot > 2021-12-20 at 10.05.50 PM.png, Screen Shot 2021-12-20 at 10.17.44 PM.png, > Screen Shot 2021-12-21 at 10.22.54 PM.png, Screen Shot 2021-12-21 at 10.24.12 > PM.png, metadata_files.txt, metadata_files_compacted.txt, > metadata_timeline.txt, metadata_timeline_archived.txt, > metadata_timeline_compacted.txt, stderr_part1.txt, stderr_part2.txt, > timeline.txt, writer_log.txt > > > After 'metadata table' is enabled, File listing takes long time. 
> If metadata is enabled on Reader side(as shown below), it is taking even more > time per file listing task > {code:java} > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.common.config.HoodieMetadataConfig > val hadoopConf = spark.conf > hadoopConf.set(HoodieMetadataConfig.ENABLE.key(), "true") > val basePath = "s3a://datalake-hudi" > val sessions = spark > .read > .format("org.apache.hudi") > .option(DataSourceReadOptions.QUERY_TYPE.key(), > DataSourceReadOptions.QUERY_TYPE_SNAPSHOT_OPT_VAL) > .option(DataSourceReadOptions.READ_PATHS.key(), > s"${basePath}/sessions_by_entrydate/entrydate=2021/*/*/*") > .load() > sessions.createOrReplaceTempView("sessions") {code} > Existing tables (COW) have inline clustering on and have many replace commits. > Logs seem to suggest the delay is in view.AbstractTableFileSystemView > resetFileGroupsReplaced function or metadata.HoodieBackedTableMetadata > Also many log messages in AbstractHoodieLogRecordReader > > 2021-12-18 23:17:54,056 INFO view.AbstractTableFileSystemView: Took 4118 ms > to read 136 instants, 9731 replaced file groups > 2021-12-18 23:37:46,086 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.76_0-20-515 > at instant 20211217035105329 > 2021-12-18 23:37:46,090 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,094 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,095 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > 
HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.20_0-35-613', > fileLen=0} > 2021-12-18 23:37:46,095 INFO s3a.S3AInputStream: Switching to Random IO seek > policy > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Reading a > data block from file > s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.62_0-34-377 > at instant 20211217022049877 > 2021-12-18 23:37:46,096 INFO log.AbstractHoodieLogRecordReader: Number of > remaining logblocks to merge 1 > 2021-12-18 23:37:46,105 INFO log.HoodieLogFormatReader: Moving to the next > reader for logfile > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.86_0-20-362', > fileLen=0} > 2021-12-18 23:37:46,109 INFO log.AbstractHoodieLogRecordReader: Scanning log > file > HoodieLogFile\{pathStr='s3a://datalake-hudi/sessions/.hoodie/metadata/files/.files-_20211216144130775001.log.121_0-57-663', > fileLen=0} > 2021-12-18 23:37:46,109 INFO s3a.S3AInputStream
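One hedged mitigation for the UTF8String-to-Long cast error hit by the reader-side snippet above (an assumption worth verifying, not a confirmed fix for HUDI-3066) is to stop Spark from inferring partition column types, so partition values stay strings instead of being read as longs. As a spark-defaults configuration fragment:

```
# spark.sql.sources.partitionColumnTypeInference.enabled is a standard
# Spark SQL flag; whether it resolves this particular table's cast error
# is an assumption to be tested against the dataset in question.
spark.sql.sources.partitionColumnTypeInference.enabled=false
```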
[GitHub] [hudi] hudi-bot commented on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles
hudi-bot commented on pull request #4542: URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008223765 ## CI report: * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5021) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot removed a comment on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles
hudi-bot removed a comment on pull request #4542: URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217433 ## CI report: * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5021)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation
hudi-bot removed a comment on pull request #4518: URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008222649 ## CI report: * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921) * 108b27f73f4656423be54bf4b20ba9dad8a26647 UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation
hudi-bot commented on pull request #4518: URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008222910 ## CI report: * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921) * 108b27f73f4656423be54bf4b20ba9dad8a26647 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5022)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation
hudi-bot removed a comment on pull request #4518: URL: https://github.com/apache/hudi/pull/4518#issuecomment-1006204405 ## CI report: * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921)
[GitHub] [hudi] hudi-bot commented on pull request #4518: [HUDI-2477] Removing rollbacks instants from timeline for restore operation
hudi-bot commented on pull request #4518: URL: https://github.com/apache/hudi/pull/4518#issuecomment-1008222649 ## CI report: * 2b82de3c867cddb3af7f2edd7f48f662defda372 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4921) * 108b27f73f4656423be54bf4b20ba9dad8a26647 UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles
hudi-bot removed a comment on pull request #4542: URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217107 ## CI report: * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles
hudi-bot commented on pull request #4542: URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217433 ## CI report: * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5021)
[GitHub] [hudi] hudi-bot commented on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles
hudi-bot commented on pull request #4542: URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217107 ## CI report: * 5c4d7cfecd25cea19567f897fa0dac3f5f784baf UNKNOWN
[GitHub] [hudi] boneanxs commented on pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles
boneanxs commented on pull request #4542: URL: https://github.com/apache/hudi/pull/4542#issuecomment-1008217012 @xushiyan @nsivabalan please take a look.
[jira] [Updated] (HUDI-3157) Remove aws jars from hudi bundles
[ https://issues.apache.org/jira/browse/HUDI-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-3157: - Labels: pull-request-available sev:critical user-support-issues (was: sev:critical user-support-issues) > Remove aws jars from hudi bundles > - > > Key: HUDI-3157 > URL: https://issues.apache.org/jira/browse/HUDI-3157 > Project: Apache Hudi > Issue Type: Bug >Reporter: Raymond Xu >Assignee: Hui An >Priority: Critical > Labels: pull-request-available, sev:critical, user-support-issues > Fix For: 0.10.1 > > > ref: > [https://github.com/apache/hudi/issues/4474] > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[GitHub] [hudi] boneanxs opened a new pull request #4542: [HUDI-3157] Remove aws jars from hudi bundles
boneanxs opened a new pull request #4542: URL: https://github.com/apache/hudi/pull/4542 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contribute/how-to-contribute before opening a pull request.* ## What is the purpose of the pull request Remove aws jars from hudi bundles ref: https://github.com/apache/hudi/issues/4474 ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Commented] (HUDI-3157) Remove aws jars from hudi bundles
[ https://issues.apache.org/jira/browse/HUDI-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471269#comment-17471269 ] Hui An commented on HUDI-3157: -- Working on this now. > Remove aws jars from hudi bundles > - > > Key: HUDI-3157 > URL: https://issues.apache.org/jira/browse/HUDI-3157 > Project: Apache Hudi > Issue Type: Bug >Reporter: Raymond Xu >Assignee: Hui An >Priority: Critical > Labels: sev:critical, user-support-issues > Fix For: 0.10.1 > > > ref: > [https://github.com/apache/hudi/issues/4474] > -- This message was sent by Atlassian Jira (v8.20.1#820001)
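For context on what "removing aws jars from a bundle" typically means in practice, the following is a generic Maven shade-plugin illustration, not the actual patch in PR #4542: the bundle's shade configuration simply stops including the AWS SDK artifacts in the shaded jar.

```xml
<!-- Generic sketch: exclude AWS SDK artifacts from a shaded bundle jar.
     The group id com.amazonaws is the AWS SDK v1 coordinate; the exact
     artifact set excluded by the real PR may differ. -->
<artifactSet>
  <excludes>
    <exclude>com.amazonaws:*</exclude>
  </excludes>
</artifactSet>
```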
[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot removed a comment on pull request #4489: URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008208353 ## CI report: * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013) * 8896e81ac168348d66de6c8cf444c4a7e2c9826e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5020)
[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot commented on pull request #4489: URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008215773 ## CI report: * 8896e81ac168348d66de6c8cf444c4a7e2c9826e Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5020)
[jira] [Comment Edited] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type
[ https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264 ] Harsha Teja Kanna edited comment on HUDI-2909 at 1/9/22, 2:05 AM: -- I am not able to determine if I fall under user type c or a/b :) from the Github issue or the above description. I can you please help understand if I have to recreate the dataset? was (Author: h7kanna): I am not able to determine if I fall under user type c or a/b :) > Partition field parsing fails due to KeyGenerator giving inconsistent value > for logical timestamp type > -- > > Key: HUDI-2909 > URL: https://issues.apache.org/jira/browse/HUDI-2909 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Harsha Teja Kanna >Assignee: Sagar Sumit >Priority: Blocker > Labels: core-flow-ds, pull-request-available, sev:critical > Fix For: 0.10.1 > > > Existing table has timebased keygen config show below > hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR > hoodie.deltastreamer.keygen.timebased.output.timezone=GMT > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd > hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS > hoodie.deltastreamer.keygen.timebased.input.timezone=GMT > hoodie.datasource.write.partitionpath.field=lastdate:timestamp > hoodie.datasource.write.operation=upsert > hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, > session.mid, to_timestamp(session.lastdate) as lastdate, > to_timestamp(session.updatedate) as updatedate FROM a > > Upgrading to 0.10.0 from 0.9.0 fails with exception > org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input > partition field :2021-12-01 10:13:34.702 > Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected > type for partition field: java.sql.Timestamp > at > 
org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211) > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133) > *Workaround fix:* > Reverting this > https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type
[ https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264 ] Harsha Teja Kanna edited comment on HUDI-2909 at 1/9/22, 2:05 AM: -- I am not able to determine if I fall under user type c or a/b :) from the Github issue or the above description. Can you please help understand if I have to recreate the dataset? was (Author: h7kanna): I am not able to determine if I fall under user type c or a/b :) from the Github issue or the above description. I can you please help understand if I have to recreate the dataset? > Partition field parsing fails due to KeyGenerator giving inconsistent value > for logical timestamp type > -- > > Key: HUDI-2909 > URL: https://issues.apache.org/jira/browse/HUDI-2909 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Harsha Teja Kanna >Assignee: Sagar Sumit >Priority: Blocker > Labels: core-flow-ds, pull-request-available, sev:critical > Fix For: 0.10.1 > > > Existing table has timebased keygen config show below > hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR > hoodie.deltastreamer.keygen.timebased.output.timezone=GMT > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd > hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS > hoodie.deltastreamer.keygen.timebased.input.timezone=GMT > hoodie.datasource.write.partitionpath.field=lastdate:timestamp > hoodie.datasource.write.operation=upsert > hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, > session.mid, to_timestamp(session.lastdate) as lastdate, > to_timestamp(session.updatedate) as updatedate FROM a > > Upgrading to 0.10.0 from 0.9.0 fails with exception > org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input > partition field :2021-12-01 10:13:34.702 > Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected > type for partition field: 
java.sql.Timestamp > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211) > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133) > *Workaround fix:* > Reverting this > https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (HUDI-2909) Partition field parsing fails due to KeyGenerator giving inconsistent value for logical timestamp type
[ https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471264#comment-17471264 ] Harsha Teja Kanna commented on HUDI-2909: - I am not able to determine if I fall under user type c or a/b :) > Partition field parsing fails due to KeyGenerator giving inconsistent value > for logical timestamp type > -- > > Key: HUDI-2909 > URL: https://issues.apache.org/jira/browse/HUDI-2909 > Project: Apache Hudi > Issue Type: Bug > Components: DeltaStreamer >Reporter: Harsha Teja Kanna >Assignee: Sagar Sumit >Priority: Blocker > Labels: core-flow-ds, pull-request-available, sev:critical > Fix For: 0.10.1 > > > Existing table has timebased keygen config show below > hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR > hoodie.deltastreamer.keygen.timebased.output.timezone=GMT > hoodie.deltastreamer.keygen.timebased.output.dateformat=/MM/dd > hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS > hoodie.deltastreamer.keygen.timebased.input.timezone=GMT > hoodie.datasource.write.partitionpath.field=lastdate:timestamp > hoodie.datasource.write.operation=upsert > hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, > session.mid, to_timestamp(session.lastdate) as lastdate, > to_timestamp(session.updatedate) as updatedate FROM a > > Upgrading to 0.10.0 from 0.9.0 fails with exception > org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input > partition field :2021-12-01 10:13:34.702 > Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected > type for partition field: java.sql.Timestamp > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211) > at > org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133) > *Workaround fix:* > Reverting this > 
https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543 > -- This message was sent by Atlassian Jira (v8.20.1#820001)
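The key generator configuration quoted in this thread expects a SCALAR timestamp in MICROSECONDS, GMT in and out, rendered through a date pattern; the failure occurs when it receives a `java.sql.Timestamp` instead. As a hypothetical illustration of the configured conversion only (not Hudi's `TimestampBasedAvroKeyGenerator` itself, and with "yyyy/MM/dd" as an assumed stand-in for the elided output dateformat):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class ScalarPartitionPath {
    // Hypothetical sketch of the conversion the keygen config describes:
    // a scalar epoch value in microseconds, formatted as a GMT date path.
    static String partitionPath(long scalarMicros, String pattern) {
        // Microseconds -> milliseconds; sub-millisecond precision is dropped,
        // which is fine for a day-granularity partition path.
        Instant instant = Instant.ofEpochMilli(scalarMicros / 1000L);
        return DateTimeFormatter.ofPattern(pattern)
                .withZone(ZoneOffset.UTC)
                .format(instant);
    }

    public static void main(String[] args) {
        // 1638353614702000L is 2021-12-01 10:13:34.702 GMT in epoch microseconds,
        // the value behind the timestamp string in the reported parse failure.
        System.out.println(partitionPath(1638353614702000L, "yyyy/MM/dd"));
    }
}
```

The reported exception is consistent with this expectation being violated: once the transformer emits a logical timestamp, the keygen sees a `Timestamp` object rather than a scalar long and cannot apply a conversion like the one above.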
[GitHub] [hudi] aznwarmonkey opened a new issue #4541: [SUPPORT] NullPointerException while writing Bulk ingest table
aznwarmonkey opened a new issue #4541: URL: https://github.com/apache/hudi/issues/4541 Hello, I am currently getting an exception while writing a `hudi` table in `bulk_ingest` mode. Please see below for the stacktrace along with the snippet of code I am using to write the data. I am new to `hudi` and this stacktrace doesn't provide much insight as to why it is happening. Any help with this issue is greatly appreciated. ```bash py4j.protocol.Py4JJavaError: An error occurred while calling o195.save. : org.apache.spark.SparkException: Writing job failed. at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:383) at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:336) at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.writeWithV2(WriteToDataSourceV2Exec.scala:218) at org.apache.spark.sql.execution.datasources.v2.AppendDataExec.run(WriteToDataSourceV2Exec.scala:225) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:40) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:40) at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.doExecute(V2CommandExec.scala:55) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185) at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133) at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989) at
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.executeQuery$1(SQLExecution.scala:110) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:135) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:107) at org.apache.spark.sql.execution.SQLExecution$.withTracker(SQLExecution.scala:232) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:135) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:253) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:134) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:68) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:370) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301) at org.apache.hudi.HoodieSparkSqlWriter$.bulkInsertAsRow(HoodieSparkSqlWriter.scala:302) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:127) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:134) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90) at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:134) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:133) at
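The stack trace above passes through `HoodieSparkSqlWriter.bulkInsertAsRow`, the row-writer bulk insert path. As a generic point of comparison (these are standard Hudi write options, not the reporter's elided snippet), a bulk insert write on that path is typically configured along these lines:

```
# Generic Hudi write options for the row-writer bulk insert path;
# the reporter's actual options are not shown in the issue.
hoodie.datasource.write.operation=bulk_insert
hoodie.datasource.write.row.writer.enable=true
```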
[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot removed a comment on pull request #4489: URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008207981 ## CI report: * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013) * 8896e81ac168348d66de6c8cf444c4a7e2c9826e UNKNOWN
[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot commented on pull request #4489: URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008208353 ## CI report: * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013) * 8896e81ac168348d66de6c8cf444c4a7e2c9826e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5020)
[GitHub] [hudi] hudi-bot removed a comment on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot removed a comment on pull request #4489: URL: https://github.com/apache/hudi/pull/4489#issuecomment-1007952877 ## CI report: * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013)
[GitHub] [hudi] hudi-bot commented on pull request #4489: [HUDI-3135] Fix Delete partitions with metadata table and fix show partitions in spark sql
hudi-bot commented on pull request #4489: URL: https://github.com/apache/hudi/pull/4489#issuecomment-1008207981 ## CI report: * fa5894fba7cf168250fe52b70d8131ce0877f285 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5013) * 8896e81ac168348d66de6c8cf444c4a7e2c9826e UNKNOWN
[GitHub] [hudi] hudi-bot removed a comment on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation
hudi-bot removed a comment on pull request #4514: URL: https://github.com/apache/hudi/pull/4514#issuecomment-1008177446 ## CI report: * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN * bfca6fe7dccb87d9f823173fa965193b1e3c0b79 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=4980) * b7b4aca36444784913b61c62bf9e24a99e8ffbd8 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5019)
[GitHub] [hudi] hudi-bot commented on pull request #4514: [HUDI-3172] Refactor hudi existing modules to make more code reuse in V2 Implementation
hudi-bot commented on pull request #4514: URL: https://github.com/apache/hudi/pull/4514#issuecomment-1008194396 ## CI report: * ddc3af0c32bafef6b10c32c43132df32a5f7d83c UNKNOWN * e1ba726105dfa7ae07d802546c71a0cf1ad8b172 UNKNOWN * 306e7d462959e0249e230f60c2e9ea6602342e08 UNKNOWN * 15122772d9430d91807053555e12afaeda30e688 UNKNOWN * 0a64e4175cbc20c63ebc5723389ed98ac55c9c0c UNKNOWN * b7b4aca36444784913b61c62bf9e24a99e8ffbd8 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=5019)
[jira] [Updated] (HUDI-3197) Validate partition pruning with Hudi
[ https://issues.apache.org/jira/browse/HUDI-3197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sivabalan narayanan updated HUDI-3197: -- Summary: Validate partition pruning with Hudi (was: Validate partition pruning with Spark SQL) > Validate partition pruning with Hudi > > > Key: HUDI-3197 > URL: https://issues.apache.org/jira/browse/HUDI-3197 > Project: Apache Hudi > Issue Type: Task >Reporter: Raymond Xu >Assignee: Raymond Xu >Priority: Major > Attachments: Screen Shot 2022-01-08 at 3.22.54 PM.png, Screen Shot > 2022-01-08 at 3.23.04 PM.png, Screen Shot 2022-01-08 at 3.26.13 PM.png, > Screen Shot 2022-01-08 at 3.26.53 PM.png > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (HUDI-2682) Spark schema not updated with new columns on hive sync
[ https://issues.apache.org/jira/browse/HUDI-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Raymond Xu closed HUDI-2682. Reviewers: Tao Meng Resolution: Fixed > Spark schema not updated with new columns on hive sync > -- > > Key: HUDI-2682 > URL: https://issues.apache.org/jira/browse/HUDI-2682 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration, Spark Integration >Affects Versions: 0.9.0 >Reporter: Charlie Briggs >Assignee: 董可伦 >Priority: Blocker > Labels: pull-request-available > Fix For: 0.10.1 > > > When syncing the hive schema, new columns added from the source dataset are not > propagated to the `spark.sql.sources.schema` metadata on the hive table. This > leads to columns not being available when querying the dataset via Spark SQL. > (Tested with both the Spark data writer and DeltaStreamer.) > The column we observed this on was a struct column, but it seems like it > would be independent of datatype. -- This message was sent by Atlassian Jira (v8.20.1#820001)