[jira] [Created] (HUDI-4382) Add logger for HoodieCopyOnWriteTableInputFormat
Wenning Ding created HUDI-4382: -- Summary: Add logger for HoodieCopyOnWriteTableInputFormat Key: HUDI-4382 URL: https://issues.apache.org/jira/browse/HUDI-4382 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding When querying the ro and rt bootstrap MOR tables with Presto, I observed that both fail with the following exception:
{code:java}
java.lang.NoSuchFieldError: LOG
    at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.makeExternalFileSplit(HoodieCopyOnWriteTableInputFormat.java:199)
    at org.apache.hudi.hadoop.HoodieCopyOnWriteTableInputFormat.makeSplit(HoodieCopyOnWriteTableInputFormat.java:100)
    at org.apache.hudi.hadoop.realtime.HoodieMergeOnReadTableInputFormat.doMakeSplitForRealtimePath(HoodieMergeOnReadTableInputFormat.java:266)
    at org.apache.hudi.hadoop.realtime.HoodieMergeOnReadTableInputFormat.makeSplit(HoodieMergeOnReadTableInputFormat.java:211)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:345)
    at org.apache.hudi.hadoop.realtime.HoodieMergeOnReadTableInputFormat.getSplits(HoodieMergeOnReadTableInputFormat.java:79)
    at org.apache.hudi.hadoop.HoodieParquetInputFormatBase.getSplits(HoodieParquetInputFormatBase.java:68)
    at com.facebook.presto.hive.StoragePartitionLoader.loadPartition(StoragePartitionLoader.java:278)
    at com.facebook.presto.hive.DelegatingPartitionLoader.loadPartition(DelegatingPartitionLoader.java:81)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader.loadSplits(BackgroundHiveSplitLoader.java:224)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader.access$700(BackgroundHiveSplitLoader.java:50)
    at com.facebook.presto.hive.BackgroundHiveSplitLoader$HiveSplitLoaderTask.process(BackgroundHiveSplitLoader.java:153)
    at com.facebook.presto.hive.util.ResumableTasks.safeProcessTask(ResumableTasks.java:47)
    at com.facebook.presto.hive.util.ResumableTasks.access$000(ResumableTasks.java:20)
    at com.facebook.presto.hive.util.ResumableTasks$1.run(ResumableTasks.java:35)
    at com.facebook.airlift.concurrent.BoundedExecutor.drainQueue(BoundedExecutor.java:78)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
The reason we see {{java.lang.NoSuchFieldError: LOG}} during the Presto query is that {{HoodieCopyOnWriteTableInputFormat}} inherits the {{LOG}} field from its parent class {{FileInputFormat}}, which is a Hadoop class. At compile time, the bytecode therefore references this field through {{FileInputFormat.class}}. At runtime, however, Presto does not have all the Hadoop classes on its classpath; it uses its own Hadoop dependency, e.g. {{hadoop-apache2:jar}}. I checked that {{hadoop-apache2}} does not have the {{FileInputFormat}} class shaded, which causes this runtime error. -- This message was sent by Atlassian Jira (v8.20.10#820010)
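The fix pattern the issue title implies (giving the class its own logger) can be sketched in a few lines. This is an illustrative stand-in, not Hudi's actual code: the class names below are invented to show why declaring the field locally avoids the compile-time reference into the parent class.

```java
import java.util.logging.Logger;

// Stand-in for Hadoop's FileInputFormat, which declares a LOG field that
// subclasses may inherit. If this class is shaded away at runtime, any
// compiled reference to its LOG field fails with NoSuchFieldError.
class FileInputFormatStandIn {
    protected static final Logger LOG = Logger.getLogger("parent");
}

// The fix pattern: declare LOG on the subclass itself. The compiled field
// reference then targets this class, which is always present regardless of
// how the parent jar is shaded on the query engine's classpath.
class CopyOnWriteInputFormatStandIn extends FileInputFormatStandIn {
    private static final Logger LOG =
        Logger.getLogger(CopyOnWriteInputFormatStandIn.class.getName());

    static String makeSplit() {
        LOG.fine("creating external file split");
        return "external-split";
    }
}
```

The subclass field intentionally hides the inherited one, so no bytecode in the subclass points at the parent's field anymore.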
[jira] [Created] (HUDI-4331) Allow loading external config file from class loader
Wenning Ding created an issue: Apache Hudi / HUDI-4331 Allow loading external config file from class loader Issue Type: Improvement Assignee: Unassigned Created: 27/Jun/22 22:57 Priority: Major Reporter: Wenning Ding
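A minimal sketch of what "loading from the class loader" could look like, under stated assumptions: the file name {{hudi-defaults.conf}} and the properties-style format are illustrative, not necessarily what Hudi actually uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical loader: fall back to the context class loader when the
// external config file is bundled on the classpath (e.g. inside an
// application jar) rather than sitting on the local filesystem.
class ClassLoaderConfigLoader {
    static final String CONFIG_FILE_NAME = "hudi-defaults.conf"; // assumption

    static Properties loadFromClassLoader() {
        Properties props = new Properties();
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        try (InputStream in = cl.getResourceAsStream(CONFIG_FILE_NAME)) {
            if (in != null) {
                props.load(in); // java.util.Properties-style parsing
            }
        } catch (IOException e) {
            // A missing or unreadable external config degrades to empty props.
        }
        return props;
    }
}
```

When the resource is absent, the loader simply returns an empty set of properties instead of failing, which matches the "optional global config" behavior described in the related issues.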
[jira] [Updated] (HUDI-4278) Add skip archive option when syncing to AWS Glue tables
[ https://issues.apache.org/jira/browse/HUDI-4278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-4278: --- Description: The issue is that each time Hudi upserts records, it syncs to the catalog and updates {{last_commit_time_sync}} on the Glue table. Each time this property is updated, Glue by default creates a new table version and archives the old versions. So if customers update the Hudi table frequently, they will eventually hit the Glue table version limit. Inside Hudi, we therefore pass a parameter {{skipGlueArchive}} through the environment context down to the AWS Glue metadata service, so the Glue client has an option to decide whether to skip archiving.
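The "environment context" mechanism described above can be sketched as a string-property bag attached to the sync call. The stand-in class below is illustrative (not Hive's real {{EnvironmentContext}}); the property name {{skipGlueArchive}} is taken from this issue, and whether the AWS Glue metadata service honors it is up to the service side.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for an environment context: a bag of string properties
// carried through the metastore alter-table call chain.
class EnvironmentContextStandIn {
    final Map<String, String> properties = new HashMap<>();
}

class GlueSyncSketch {
    // Attach the skip-archive hint so the downstream Glue client can decide
    // whether to create an archived table version on this sync.
    static EnvironmentContextStandIn contextWithSkipArchive(boolean skip) {
        EnvironmentContextStandIn ctx = new EnvironmentContextStandIn();
        ctx.properties.put("skipGlueArchive", String.valueOf(skip));
        return ctx;
    }
}
```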
[jira] [Created] (HUDI-4278) Add skip archive option when syncing to AWS Glue tables
Wenning Ding created HUDI-4278: -- Summary: Add skip archive option when syncing to AWS Glue tables Key: HUDI-4278 URL: https://issues.apache.org/jira/browse/HUDI-4278 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding
[jira] [Created] (HUDI-3764) Allow loading external configs while querying Hudi tables with Spark
Wenning Ding created HUDI-3764: -- Summary: Allow loading external configs while querying Hudi tables with Spark Key: HUDI-3764 URL: https://issues.apache.org/jira/browse/HUDI-3764 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Currently, Hudi queries with Spark do not load the external configurations. Say customers enabled metadata listing in their global config file; they would then actually be querying without the metadata feature enabled.
[jira] [Created] (HUDI-3450) Avoid passing empty string spark master to hudi cli
Wenning Ding created HUDI-3450: -- Summary: Avoid passing empty string spark master to hudi cli Key: HUDI-3450 URL: https://issues.apache.org/jira/browse/HUDI-3450 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding When using Hudi CLI, if SparkMaster is not passed, Hudi CLI should by default use [SparkUtil.DEFAULT_SPARK_MASTER|https://github.com/apache/hudi/blob/release-0.10.0/hudi-cli/src/main/java/org/apache/hudi/cli/utils/SparkUtil.java#L44]. However, with a recent [code change|https://github.com/apache/hudi/commit/445208a0d20b457daeeb5f70995302c92dd19f31] in OSS, when SparkMaster is not passed, the Spark master is set to {{""}}, which causes the following exception when initializing a Hudi CLI job:
{code:java}
org.apache.spark.SparkException: Could not parse Master URL: ''
    at org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2999)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:115)
    at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:110)
    at org.apache.hudi.cli.commands.SparkMain.main(SparkMain.java:88)
{code}
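The fix amounts to treating a null or blank master string as "not provided" and falling back to the default before constructing the SparkContext. A minimal sketch, with the caveat that the default value below is an assumption (the real one lives in {{SparkUtil.DEFAULT_SPARK_MASTER}}):

```java
// Sketch: never hand "" to SparkContext (it throws
// "Could not parse Master URL: ''"); resolve blanks to the default first.
class SparkMasterResolver {
    static final String DEFAULT_SPARK_MASTER = "yarn"; // assumed default

    static String resolve(String master) {
        return (master == null || master.trim().isEmpty())
            ? DEFAULT_SPARK_MASTER
            : master;
    }
}
```

Resolving before the SparkContext is built keeps the validation in one place instead of scattering empty-string checks through the CLI commands.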
[jira] [Updated] (HUDI-3395) Allow passing rollbackUsingMarkers to Hudi CLI rollback command
[ https://issues.apache.org/jira/browse/HUDI-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-3395: --- Description: In a recent OSS [change|https://github.com/apache/hudi/commit/ce7d2333078e4e1f16de1bce6d448c5eef1e4111], {{hoodie.rollback.using.markers}} is enabled by default. However, if you try to roll back a completed instant, Hudi throws this exception: {{Caused by: java.lang.IllegalArgumentException: Cannot use marker based rollback strategy on completed instant:[20211228041514616__commit__COMPLETED] at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)}} The problem is that our integration test {{testHudiCliRollback}} tries to roll back a completed instant. But right now, the Hudi CLI {{commit rollback}} command does not give customers any way to disable {{hoodie.rollback.using.markers}}, which means customers can never use {{commit rollback}} to roll back a completed commit. So in this CR, I added a parameter called {{usingMarkers}} so that customers can disable this feature.
[jira] [Created] (HUDI-3395) Allow passing rollbackUsingMarkers to Hudi CLI rollback command
Wenning Ding created HUDI-3395: -- Summary: Allow passing rollbackUsingMarkers to Hudi CLI rollback command Key: HUDI-3395 URL: https://issues.apache.org/jira/browse/HUDI-3395 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding In a recent OSS [change|https://github.com/apache/hudi/commit/ce7d2333078e4e1f16de1bce6d448c5eef1e4111], {{hoodie.rollback.using.markers}} is enabled by default. However, if you try to roll back a completed instant, Hudi throws this exception: {{Caused by: java.lang.IllegalArgumentException: Cannot use marker based rollback strategy on completed instant:[20211228041514616__commit__COMPLETED] at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)}} The problem is that our integration test {{testHudiCliRollback}} tries to roll back a completed instant. But right now, the Hudi CLI {{commit rollback}} command does not give customers any way to disable {{hoodie.rollback.using.markers}}, which means customers can never use {{commit rollback}} to roll back a completed commit. So in this CR, I added a parameter called {{usingMarkers}} so that customers can disable this feature.
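The guard behind that exception can be sketched as follows. The class and method names here are illustrative, not Hudi's actual code: the point is that marker-based rollback only applies to inflight instants, so the CLI flag ({{usingMarkers}} in this issue) must be able to select the listing-based strategy for completed commits.

```java
// Sketch of strategy selection for rollback: a completed instant has no
// marker files left, so marker-based rollback must be rejected for it.
class RollbackStrategyChooser {
    static String choose(boolean usingMarkers, boolean instantCompleted) {
        if (usingMarkers && instantCompleted) {
            // Mirrors the ValidationUtils.checkArgument failure in the issue.
            throw new IllegalArgumentException(
                "Cannot use marker based rollback strategy on completed instant");
        }
        return usingMarkers ? "MARKER_BASED" : "LISTING_BASED";
    }
}
```

With the CLI flag exposed, a user rolling back a completed commit can pass {{usingMarkers=false}} and take the listing-based path instead of hitting the exception.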
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466279#comment-17466279 ] Wenning Ding commented on HUDI-3122: Thanks, I will give it a shot.
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466275#comment-17466275 ] Wenning Ding commented on HUDI-3122: I see your point, but [https://github.com/prestodb/presto/blob/release-0.261/presto-hive/pom.xml] is just a dependency declaration, right? So the Presto classpath still does not have the hbase-related jars. We need hudi-presto-bundle.jar because at runtime, say when you query a Hudi bootstrap table, it requires {{hbase/io/hfile/CacheConfig}}, which is a class in hbase-server, but by default Presto does not have any hbase jar on its classpath.
[jira] [Comment Edited] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466269#comment-17466269 ] Wenning Ding edited comment on HUDI-3122 at 12/29/21, 12:30 AM: I checked hudi-presto-bundle.jar; the hbase-server related class is not there. That is why I saw the ClassNotFound exception. was (Author: wenningd): I checked the hudi-presto-bundle.jar, the hbase server related class is not there.
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466269#comment-17466269 ] Wenning Ding commented on HUDI-3122: I checked hudi-presto-bundle.jar; the hbase-server related class is not there.
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466266#comment-17466266 ] Wenning Ding commented on HUDI-3122: Presto 0.261
[jira] [Commented] (HUDI-3122) presto query failed for bootstrap tables
[ https://issues.apache.org/jira/browse/HUDI-3122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466261#comment-17466261 ] Wenning Ding commented on HUDI-3122: [~zhangyue19921010] Is removing hbase-server from hudi-presto-bundle.jar intended?
[jira] [Created] (HUDI-3122) presto query failed for bootstrap tables
Wenning Ding created HUDI-3122: -- Summary: presto query failed for bootstrap tables Key: HUDI-3122 URL: https://issues.apache.org/jira/browse/HUDI-3122 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding
{code:java}
java.lang.NoClassDefFoundError: org/apache/hudi/org/apache/hadoop/hbase/io/hfile/CacheConfig
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:181)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.access$400(HFileBootstrapIndex.java:76)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.partitionIndexReader(HFileBootstrapIndex.java:272)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.fetchBootstrapIndexInfo(HFileBootstrapIndex.java:262)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.initIndexInfo(HFileBootstrapIndex.java:252)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex$HFileBootstrapIndexReader.<init>(HFileBootstrapIndex.java:243)
    at org.apache.hudi.common.bootstrap.index.HFileBootstrapIndex.createReader(HFileBootstrapIndex.java:191)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$addFilesToView$2(AbstractTableFileSystemView.java:137)
    at java.util.HashMap.forEach(HashMap.java:1290)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.addFilesToView(AbstractTableFileSystemView.java:134)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.lambda$ensurePartitionLoadedCorrectly$9(AbstractTableFileSystemView.java:294)
    at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
    at org.apache.hudi.common.table.view.AbstractTableFileSystemView.ensurePartitionLoadedCorrectly(AbstractTableFileSystemView.java:281)
{code}
[jira] [Created] (HUDI-2946) Upgrade maven plugins to make Hudi compatible with higher Java versions
Wenning Ding created HUDI-2946: -- Summary: Upgrade maven plugins to make Hudi compatible with higher Java versions Key: HUDI-2946 URL: https://issues.apache.org/jira/browse/HUDI-2946 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding I saw several issues while building Hudi with Java 11:
{code:java}
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar (default) on project hudi-common: Execution default of goal org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar failed: An API incompatibility was encountered while executing org.apache.maven.plugins:maven-jar-plugin:2.6:test-jar: java.lang.ExceptionInInitializerError: null
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-shade-plugin:3.1.1:shade (default) on project hudi-hadoop-mr-bundle: Error creating shaded jar: Problem shading JAR /workspace/workspace/rchertar.bigtop.hudi-rpm-mainline-6.x-0.9.0/build/hudi/rpm/BUILD/hudi-0.9.0-amzn-1-SNAPSHOT/packaging/hudi-hadoop-mr-bundle/target/hudi-hadoop-mr-bundle-0.9.0-amzn-1-SNAPSHOT.jar entry org/apache/hudi/hadoop/bundle/Main.class: java.lang.IllegalArgumentException -> [Help 1]
{code}
We need to upgrade the maven plugin versions to make Hudi compatible with Java 11. Also upgrade dockerfile-maven-plugin to the latest version to support Java 11: [https://github.com/spotify/dockerfile-maven/pull/230]
[jira] [Created] (HUDI-2884) Allow loading external configs while querying Hudi tables with Spark
Wenning Ding created HUDI-2884: -- Summary: Allow loading external configs while querying Hudi tables with Spark Key: HUDI-2884 URL: https://issues.apache.org/jira/browse/HUDI-2884 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding Currently, Hudi queries with Spark do not load the external configurations. Say customers enabled metadata listing in their global config file; they would then actually be querying without the metadata feature enabled. This CR fixes this issue and allows loading global configs during the Hudi reading phase.
[jira] [Created] (HUDI-2775) Add documentation for external configuration support
Wenning Ding created HUDI-2775: -- Summary: Add documentation for external configuration support Key: HUDI-2775 URL: https://issues.apache.org/jira/browse/HUDI-2775 Project: Apache Hudi Issue Type: Sub-task Reporter: Wenning Ding We added external configuration support in this PR: [https://github.com/apache/hudi/pull/3486], but we need more documentation to let customers know how to use it.
[jira] [Updated] (HUDI-2364) Run compaction without user schema file provided
[ https://issues.apache.org/jira/browse/HUDI-2364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-2364: --- Description: Currently to run Hudi compaction manually, customers have to pass the avsc file of data schema by themselves, e.g. in Hudi CLI, {{}} {code:java} compaction run --compactionInstant 20201203005420 \ --parallelism 2 --sparkMemory 2G \ --schemaFilePath s3://xxx/hudi/mor_schema.avsc \ --propsFilePath file:///home/hadoop/config.properties --retry 1 {code} Let customers provide avsc file is not a good option. Some customers don’t know how to generate this schema file, and some customers pass the wrong schema file and get other exceptions. We should handle this logic inside Hudi if possible. was: Currently to run Hudi compaction manually, customers have to pass the avsc file of data schema by themselves, e.g. in Hudi CLI, {{}} {code:java} compaction run --compactionInstant 20201203005420 \ --parallelism 2 --sparkMemory 2G \ --schemaFilePath s3://wenningd-emr-dev/oncall/hudi/mor_delete_2_schema.avsc \ --propsFilePath file:///home/hadoop/config.properties --retry 1 {code} Let customers provide avsc file is not a good option. Some customers don’t know how to generate this schema file, and some customers pass the wrong schema file and get other exceptions. We should handle this logic inside Hudi if possible. > Run compaction without user schema file provided > > > Key: HUDI-2364 > URL: https://issues.apache.org/jira/browse/HUDI-2364 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Wenning Ding >Priority: Major > > Currently to run Hudi compaction manually, customers have to pass the avsc > file of data schema by themselves, > e.g. 
in Hudi CLI: > > {code:java} > compaction run --compactionInstant 20201203005420 \ --parallelism 2 > --sparkMemory 2G \ --schemaFilePath s3://xxx/hudi/mor_schema.avsc \ > --propsFilePath file:///home/hadoop/config.properties --retry 1 > {code} > Letting customers provide the avsc file is not a good option. Some customers don’t > know how to generate this schema file, and some customers pass the wrong > schema file and hit other exceptions. We should handle this logic inside Hudi > if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2364) Run compaction without user schema file provided
Wenning Ding created HUDI-2364: -- Summary: Run compaction without user schema file provided Key: HUDI-2364 URL: https://issues.apache.org/jira/browse/HUDI-2364 Project: Apache Hudi Issue Type: New Feature Reporter: Wenning Ding Currently, to run Hudi compaction manually, customers have to pass the avsc file of the data schema themselves, e.g. in Hudi CLI: {code:java} compaction run --compactionInstant 20201203005420 \ --parallelism 2 --sparkMemory 2G \ --schemaFilePath s3://wenningd-emr-dev/oncall/hudi/mor_delete_2_schema.avsc \ --propsFilePath file:///home/hadoop/config.properties --retry 1 {code} Letting customers provide the avsc file is not a good option. Some customers don’t know how to generate this schema file, and some customers pass the wrong schema file and hit other exceptions. We should handle this logic inside Hudi if possible. -- This message was sent by Atlassian Jira (v8.3.4#803005)
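The direction proposed here (resolving the schema inside Hudi instead of requiring --schemaFilePath) can be sketched as below. Hudi records the writer schema in commit metadata, which compaction could fall back to; the class, method, and metadata-key names in this sketch are hypothetical illustrations, not Hudi's actual API.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: prefer an explicitly supplied schema file, otherwise
// fall back to the writer schema recorded in the latest commit's metadata.
public class CompactionSchemaResolver {
    public static String resolveSchema(Optional<String> userSchemaFile,
                                       Map<String, String> latestCommitExtraMetadata) {
        // an explicitly provided --schemaFilePath still wins
        if (userSchemaFile.isPresent()) {
            return userSchemaFile.get();
        }
        // "schema" is an assumed metadata key for illustration
        String recorded = latestCommitExtraMetadata.get("schema");
        if (recorded == null) {
            throw new IllegalStateException(
                "No schema found in commit metadata; pass --schemaFilePath explicitly");
        }
        return recorded;
    }
}
```

With this fallback in place, the CLI invocation shown above would work without the `--schemaFilePath` argument whenever the table has at least one commit.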
[jira] [Created] (HUDI-2362) Hudi external configuration file support
Wenning Ding created HUDI-2362: -- Summary: Hudi external configuration file support Key: HUDI-2362 URL: https://issues.apache.org/jira/browse/HUDI-2362 Project: Apache Hudi Issue Type: New Feature Reporter: Wenning Ding Many big data applications like Hadoop and Hive have an XML configuration file that gives users a central place to set all their parameters. Also, to support Spark SQL, it might be easier for Hudi to have a configuration file, which would avoid setting Hudi parameters inside the Hive CLI or Spark SQL CLI. Here is an example: {code:java} # Enable optimistic concurrency control by default, to disable it, remove the following two configs hoodie.write.concurrency.mode optimistic_concurrency_control hoodie.cleaner.policy.failed.writes LAZY hoodie.write.lock.provider org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider hoodie.write.lock.zookeeper.url ip-192-168-1-239.ec2.internal hoodie.write.lock.zookeeper.port 2181 hoodie.write.lock.zookeeper.base_path hudi_occ_lock hoodie.index.type BLOOM # Only applies if index type is HBASE hoodie.index.hbase.zkquorum ip-192-168-1-239.ec2.internal hoodie.index.hbase.zkport 2181 # Only applies if Hive sync is enabled hoodie.datasource.hive_sync.jdbcurl jdbc:hive2://ip-192-168-1-239.ec2.internal:1 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
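For illustration, parsing the whitespace-separated format shown above is straightforward. This sketch (the class and method names are made up, not part of Hudi) reads `key value` lines while skipping comments and blank lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser for the "key value" configuration format above:
// '#' lines are comments, blank lines are ignored, and the first run of
// whitespace separates the key from the value.
public class ExternalConfParser {
    public static Map<String, String> parse(String content) {
        Map<String, String> props = new LinkedHashMap<>();
        for (String line : content.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // comments and blank lines carry no configuration
            }
            // split on the first whitespace run: key, then the rest as value
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length == 2) {
                props.put(parts[0], parts[1]);
            }
        }
        return props;
    }
}
```

Because later `put` calls overwrite earlier ones, a user-level file could simply be parsed after a cluster-level file to get the usual override semantics.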
[jira] [Created] (HUDI-2314) Add DynamoDb based lock provider
Wenning Ding created HUDI-2314: -- Summary: Add DynamoDb based lock provider Key: HUDI-2314 URL: https://issues.apache.org/jira/browse/HUDI-2314 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Similar to the ZooKeeper and Hive metastore based lock providers, we need to add a DynamoDB based lock provider. The benefit of a DynamoDB based lock provider is that customers who use AWS EMR can share lock information across different EMR clusters, and it is easier for them to configure. -- This message was sent by Atlassian Jira (v8.3.4#803005)
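The core idea behind such a provider is that DynamoDB supports conditional writes: acquiring the lock is a conditional put that succeeds only when no item with the lock key exists yet (in DynamoDB terms, a PutItem with an `attribute_not_exists` condition). Below is a minimal in-memory model of that idea; everything here is an illustrative stand-in, not the real provider, with a shared map standing in for the DynamoDB table:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative model of a conditional-write lock: acquisition wins only if
// no entry for the lock key exists, mirroring DynamoDB's conditional PutItem.
public class ConditionalWriteLock {
    private final ConcurrentMap<String, String> table = new ConcurrentHashMap<>();

    // returns true iff the caller won the conditional write
    public boolean tryAcquire(String lockKey, String ownerId) {
        return table.putIfAbsent(lockKey, ownerId) == null;
    }

    public void release(String lockKey, String ownerId) {
        // conditional delete: only the current owner may remove the lock item
        table.remove(lockKey, ownerId);
    }
}
```

A real implementation would also need a TTL or lease expiry on the item so that a crashed writer cannot hold the lock forever.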
[jira] [Created] (HUDI-2255) Refactor DataSourceOptions
Wenning Ding created HUDI-2255: -- Summary: Refactor DataSourceOptions Key: HUDI-2255 URL: https://issues.apache.org/jira/browse/HUDI-2255 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding As discussed with Vinoth, we can rename the DataSourceOptions constants from xxx_OPT_KEY to xxx_OPT. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-2242) Add inference logic to few Hudi configs
[ https://issues.apache.org/jira/browse/HUDI-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-2242: -- Assignee: Wenning Ding > Add inference logic to few Hudi configs > --- > > Key: HUDI-2242 > URL: https://issues.apache.org/jira/browse/HUDI-2242 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > Since we already refactored the Hudi config framework, we can add inference > logic to a few Hudi configs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-2242) Add inference logic to few Hudi configs
Wenning Ding created HUDI-2242: -- Summary: Add inference logic to few Hudi configs Key: HUDI-2242 URL: https://issues.apache.org/jira/browse/HUDI-2242 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Since we already refactored the Hudi config framework, we can add inference logic to a few Hudi configs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
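As a sketch of what "inference logic" could mean here: when a config is unset, derive its value from other settings before falling back to the static default. All class and method names below are hypothetical illustrations, not Hudi's actual config API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical config option with an inference function: an explicit setting
// wins, otherwise the value is inferred from the rest of the configuration,
// otherwise the static default applies.
public class InferredOption {
    private final String key;
    private final String defaultValue;
    private final Function<Map<String, String>, Optional<String>> inferFunc;

    public InferredOption(String key, String defaultValue,
                          Function<Map<String, String>, Optional<String>> inferFunc) {
        this.key = key;
        this.defaultValue = defaultValue;
        this.inferFunc = inferFunc;
    }

    public String resolve(Map<String, String> conf) {
        if (conf.containsKey(key)) {
            return conf.get(key); // explicit setting always wins
        }
        return inferFunc.apply(conf).orElse(defaultValue); // else infer, else default
    }
}
```

For example (an assumed inference rule, purely for illustration), the Hive partition extractor class could be inferred to be a non-partitioned extractor whenever the partition-path field is set to an empty string.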
[jira] [Comment Edited] (HUDI-2146) Concurrent writes loss data
[ https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377525#comment-17377525 ] Wenning Ding edited comment on HUDI-2146 at 7/8/21, 5:37 PM: - [~nishith29] [~vinoth] thanks for taking a look. `df3` is different for insert1 and insert2 with different primary keys. insert1: (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), (401, "event_name_22", "2125-02-01T12:14:58.597216Z", "type2"), (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") insert2: val df3 = Seq( (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") The expected output (this is the content of parquet file at 20210706171250): {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 20210706171250,20210706171250_1_1,400,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,400,event_name_11,2125-02-01T13:51:39.340396Z,type1 
20210706171250,20210706171250_1_2,404,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,404,event_name_333433,2126-01-01T12:15:00.512679Z,type1 {code} The output I saw (this is the content of parquet file at 20210706171252): {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 {code} I started two spark-shell in the same cluster. was (Author: wenningd): [~nishith29] [~vinoth] thanks for taking a look. `df3` is different for insert1 and insert2 with different primary keys. 
insert1: (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), (401, "event_name_22", "2125-02-01T12:14:58.597216Z", "type2"), (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") insert2: val df3 = Seq( (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") The expected output: {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 20210706171250,20210706171250_1_1,400,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,400,event_name_11,2125-02-01T13:51:39.340396Z,type1 20210706171250,20210706171250_1_2,404,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,404,event_name_333433,2126-01-01T12:15:00.512679Z,type1 {code} The output
[jira] [Commented] (HUDI-2146) Concurrent writes loss data
[ https://issues.apache.org/jira/browse/HUDI-2146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377525#comment-17377525 ] Wenning Ding commented on HUDI-2146: [~nishith29] [~vinoth] thanks for taking a look. `df3` is different for insert1 and insert2 with different primary keys. insert1: (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), (401, "event_name_22", "2125-02-01T12:14:58.597216Z", "type2"), (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") insert2: val df3 = Seq( (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") The expected output: {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 20210706171250,20210706171250_1_1,400,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,400,event_name_11,2125-02-01T13:51:39.340396Z,type1 
20210706171250,20210706171250_1_2,404,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-32-12035_20210706171250.parquet,404,event_name_333433,2126-01-01T12:15:00.512679Z,type1 {code} The output I saw: {code:java} 20210706170824,20210706170824_1_1,104,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,104,event_name_123,2015-01-01T12:15:00.512679Z,type1 20210706170824,20210706170824_1_2,100,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-5-16_20210706170824.parquet,100,event_name_16,2015-01-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_1,300,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,300,event_name_1,2035-02-01T13:51:39.340396Z,type1 20210706171252,20210706171252_1_2,304,type1,218e0866-e815-4e08-84ee-dbb3df2155cc-0_1-21-12012_20210706171252.parquet,304,event_name_3,2036-01-01T12:15:00.512679Z,type1 {code} I started two spark-shell in the same cluster. > Concurrent writes loss data > > > Key: HUDI-2146 > URL: https://issues.apache.org/jira/browse/HUDI-2146 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Priority: Blocker > Fix For: 0.9.0 > > Attachments: image-2021-07-08-00-49-30-730.png > > > Reproduction steps: > Create a Hudi table: > {code:java} > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.spark.sql.SaveMode > import org.apache.hudi.AvroConversionUtils > val df = Seq( > (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), > (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), > (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), > (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2") > ).toDF("event_id", "event_name", "event_ts", "event_type") > var tableName = "hudi_test" > var tablePath = "s3://.../" + tableName > // write hudi dataset > df.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > 
.option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor") > .mode(SaveMode.Overwrite) > .save(tablePath) > {code} > Perform two insert operations almost in the same time, each insertion > c
[jira] [Created] (HUDI-2146) Concurrent writes loss data
Wenning Ding created HUDI-2146: -- Summary: Concurrent writes loss data Key: HUDI-2146 URL: https://issues.apache.org/jira/browse/HUDI-2146 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding Attachments: image-2021-07-08-00-49-30-730.png Reproduction steps: Create a Hudi table: {code:java} import org.apache.hudi.DataSourceWriteOptions import org.apache.hudi.config.HoodieWriteConfig import org.apache.spark.sql.SaveMode import org.apache.hudi.AvroConversionUtils val df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") var tableName = "hudi_test" var tablePath = "s3://.../" + tableName // write hudi dataset df.write.format("org.apache.hudi") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor") .mode(SaveMode.Overwrite) .save(tablePath) {code} Perform two insert operations almost in the same time, each insertion contains different data: Insert 1: {code:java} val df3 = Seq( (400, "event_name_11", "2125-02-01T13:51:39.340396Z", "type1"), (401, "event_name_22", 
"2125-02-01T12:14:58.597216Z", "type2"), (404, "event_name_333433", "2126-01-01T12:15:00.512679Z", "type1"), (405, "event_name_666378", "2125-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", "event_name", "event_ts", "event_type") // update hudi dataset df3.write.format("org.apache.hudi") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false") .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor") .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control") .option("hoodie.cleaner.policy.failed.writes", "LAZY") .option("hoodie.write.lock.provider", "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider") .option("hoodie.write.lock.zookeeper.url", "ip-***.ec2.internal") .option("hoodie.write.lock.zookeeper.port", "2181") .option("hoodie.write.lock.zookeeper.lock_key", tableName) .option("hoodie.write.lock.zookeeper.base_path", "/occ_lock") .mode(SaveMode.Append) .save(tablePath) {code} Insert 2: {code:java} val df3 = Seq( (300, "event_name_1", "2035-02-01T13:51:39.340396Z", "type1"), (301, "event_name_2", "2035-02-01T12:14:58.597216Z", "type2"), (304, "event_name_3", "2036-01-01T12:15:00.512679Z", "type1"), (305, "event_name_66678", "2035-07-01T13:51:42.248818Z", "type2") ).toDF("event_id", 
"event_name", "event_ts", "event_type") // update hudi dataset df3.write.format("org.apache.hudi") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOpti
[jira] [Created] (HUDI-1899) Auto sync Hudi configuration description with website
Wenning Ding created HUDI-1899: -- Summary: Auto sync Hudi configuration description with website Key: HUDI-1899 URL: https://issues.apache.org/jira/browse/HUDI-1899 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Based on the change in https://issues.apache.org/jira/browse/HUDI-89, we can further add a feature to auto-sync Hudi configuration descriptions with the website. Flink has a similar implementation: https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/configuration/description/Description.java. -- This message was sent by Atlassian Jira (v8.3.4#803005)
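A rough sketch of the Flink-style idea: build config descriptions from structured blocks so that the same source text can be rendered both as plain text/markdown for code docs and as HTML for the website. The class below is purely illustrative, not Flink's or Hudi's actual API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative structured description builder: text and code fragments are
// collected as blocks and rendered per output target, so docs never need to
// be hand-copied to the website.
public class ConfigDescription {
    private final List<String> blocks = new ArrayList<>();

    public ConfigDescription text(String t) { blocks.add(t); return this; }
    public ConfigDescription code(String c) { blocks.add("`" + c + "`"); return this; }

    public String toMarkdown() {
        return String.join(" ", blocks);
    }

    public String toHtml() {
        // the website renderer swaps backtick spans for <code> tags
        return toMarkdown().replaceAll("`([^`]*)`", "<code>$1</code>");
    }
}
```

A site generator could then walk all config options at build time and emit the HTML rendering, keeping the website in sync with the code automatically.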
[jira] [Comment Edited] (HUDI-874) Schema evolution does not work with AWS Glue catalog
[ https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342805#comment-17342805 ] Wenning Ding edited comment on HUDI-874 at 5/11/21, 6:52 PM: - It would be good if you could share some reproduction steps. Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. I didn't run into any error in this case. was (Author: wenningd): It would be good if you could share some reproduction steps. Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. I didn't run into any error in this case. > Schema evolution does not work with AWS Glue catalog > > > Key: HUDI-874 > URL: https://issues.apache.org/jira/browse/HUDI-874 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: aws-emr, sev:critical, user-support-issues > > This issue has been discussed here > [https://github.com/apache/incubator-hudi/issues/1581] and at other places as > well. The Glue catalog currently does not support *cascade* for *ALTER TABLE* > statements. As a result, features like adding new columns to an existing table > do not work with the Glue catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (HUDI-874) Schema evolution does not work with AWS Glue catalog
[ https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342805#comment-17342805 ] Wenning Ding edited comment on HUDI-874 at 5/11/21, 6:52 PM: - It would be good if you could share some reproduction steps. Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. I didn't run into any error in this case. was (Author: wenningd): Can you share some reproduction steps? Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. > Schema evolution does not work with AWS Glue catalog > > > Key: HUDI-874 > URL: https://issues.apache.org/jira/browse/HUDI-874 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: aws-emr, sev:critical, user-support-issues > > This issue has been discussed here > [https://github.com/apache/incubator-hudi/issues/1581] and at other places as > well. The Glue catalog currently does not support *cascade* for *ALTER TABLE* > statements. As a result, features like adding new columns to an existing table > do not work with the Glue catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-874) Schema evolution does not work with AWS Glue catalog
[ https://issues.apache.org/jira/browse/HUDI-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17342805#comment-17342805 ] Wenning Ding commented on HUDI-874: --- Can you share some reproduction steps? Here is what I tried on EMR 6.1.0: # Created a Hudi table with 4 columns. # Appended a new column at the end (5 columns total), upserted the Hudi table. > Schema evolution does not work with AWS Glue catalog > > > Key: HUDI-874 > URL: https://issues.apache.org/jira/browse/HUDI-874 > Project: Apache Hudi > Issue Type: Improvement > Components: Hive Integration >Reporter: Udit Mehrotra >Assignee: Udit Mehrotra >Priority: Major > Labels: aws-emr, sev:critical, user-support-issues > > This issue has been discussed here > [https://github.com/apache/incubator-hudi/issues/1581] and at other places as > well. The Glue catalog currently does not support *cascade* for *ALTER TABLE* > statements. As a result, features like adding new columns to an existing table > do not work with the Glue catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1627) DeltaStreamer does not work with SQLtransformation
Wenning Ding created HUDI-1627: -- Summary: DeltaStreamer does not work with SQLtransformation Key: HUDI-1627 URL: https://issues.apache.org/jira/browse/HUDI-1627 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding Running Hudi DeltaStreamer with a SQL transformation fails: 21/02/04 17:19:50 INFO SqlQueryBasedTransformer: SQL Query for transformation : (select order_date from `default`.ordertest) org.apache.spark.sql.AnalysisException: Table or view not found: `default`.`ordertest`; line 1 pos 23; 'Project ['order_date] +- 'UnresolvedRelation `default`.`ordertest` This is because DeltaStreamer does not enable Hive support when initiating the SparkSession, so tables in the Hive catalog cannot be resolved: this.sparkSession = SparkSession.builder().config(jssc.getConf()).getOrCreate(); line 526 in [https://github.com/a0x8o/hudi/blob/b8d0747959bc6f101b5b90b8e3ad323aafa2aa6e/hudi-u[…]rg/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java|https://github.com/a0x8o/hudi/blob/b8d0747959bc6f101b5b90b8e3ad323aafa2aa6e/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1240) Simplify config classes
[ https://issues.apache.org/jira/browse/HUDI-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282006#comment-17282006 ] Wenning Ding commented on HUDI-1240: Makes sense to me! This is similar to what I thought. Just one question: if we start using the ConfigOption structure, do we still keep the existing configuration files? If we remove the existing configuration files, then users won't be able to use something like this any more? {code:java} .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id"){code} > Simplify config classes > --- > > Key: HUDI-1240 > URL: https://issues.apache.org/jira/browse/HUDI-1240 > Project: Apache Hudi > Issue Type: Improvement > Components: Code Cleanup >Reporter: sivabalan narayanan >Priority: Major > > Clean up config classes across the board with a {{HoodieConfig}} class in > hudi-common that wraps a single key, value, default, doc, and fallback keys (the old > keys this config points to). > The notion of fallback keys is very important, so that the user can still > use the old config names and we can map them to how we have now > renamed them. > i.e. say there was `hoodie.a.b.c` and we think `hoodie.x.y.z` is a better > name, then `hoodie.a.b.c` should be marked as deprecated (annotation or a > boolean in HoodieConfig) and list `hoodie.x.y.z` as the fallback key. Users > should be able to set `hoodie.a.b.c` for the next 3-6 months at least, and > internally we translate it to `hoodie.x.y.z`. None of our code should directly > access `hoodie.a.b.c` anymore. > We can see the Apache Flink project for examples. > Once this is done, in a first pass we should move all our existing configs to > something like below. 
> {code:java} > public static final String EMBEDDED_TIMELINE_SERVER_ENABLED = > "hoodie.embed.timeline.server"; > public static final String DEFAULT_EMBEDDED_TIMELINE_SERVER_ENABLED = > "true"; > {code} > becomes > {code:java} >public static HoodieConfig timelineServerEnabled = new HoodieConfig( > "hoodie.embed.timeline.server", > // property name > Boolean.class, // type > true, //default val > Option.empty(), // fallback key > false, //deprecated > > "Enables/Disables the timeline > server on the write client.." //doc > ) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
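The fallback-key behavior described above can be sketched as a lookup that tries the new key first, then the deprecated keys, then the default. This is an illustration of the proposal, not Hudi's actual implementation; the class and method names are made up:

```java
import java.util.Properties;

// Illustrative fallback-key lookup: read the new key first, then any
// deprecated (fallback) keys, then the default, so a config renamed from
// `hoodie.a.b.c` to `hoodie.x.y.z` keeps honoring the old name.
public class FallbackConfig {
    public static String get(Properties props, String key, String defaultValue,
                             String... fallbackKeys) {
        if (props.containsKey(key)) {
            return props.getProperty(key);
        }
        for (String oldKey : fallbackKeys) {
            if (props.containsKey(oldKey)) {
                return props.getProperty(oldKey); // deprecated name still honored
            }
        }
        return defaultValue;
    }
}
```

Callers would always ask for the new key; only this lookup layer ever sees the deprecated names, which matches the note above that no other code should access the old key directly.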
[jira] [Commented] (HUDI-89) Clean up placement, naming, defaults of HoodieWriteConfig
[ https://issues.apache.org/jira/browse/HUDI-89?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280356#comment-17280356 ] Wenning Ding commented on HUDI-89: -- Just one question [~shivnarayan] [~vinoth]: if we start using the ConfigOption structure, then users won't be able to use something like this any more? {code:java} .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id") {code} > Clean up placement, naming, defaults of HoodieWriteConfig > - > > Key: HUDI-89 > URL: https://issues.apache.org/jira/browse/HUDI-89 > Project: Apache Hudi > Issue Type: Improvement > Components: Code Cleanup, Usability, Writer Core >Reporter: Vinoth Chandar >Assignee: Vinoth Chandar >Priority: Major > > # Rename HoodieWriteConfig to HoodieClientConfig > # Move a bunch of configs from CompactionConfig to StorageConfig > # Introduce a new HoodieCleanConfig > # Should we consider lombok or something to automate the > defaults/getters/setters? > # Consistent naming of properties/defaults > # Enforce bounds more strictly -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1512) Fix hudi-spark2 unit tests failure with Spark 3.0.0
[ https://issues.apache.org/jira/browse/HUDI-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1512: --- Description: This bug was introduced by https://github.com/apache/hudi/pull/2328. hudi-spark2 unit tests failed when running with Spark 3.0.0: {code:java} mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2{code} This threw a class-not-found error: {code:java} java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97) {code} When the spark3 profile is enabled, some of the hudi-spark2 test code is compiled against Spark 3 and then run with Spark 2; the above error is caused by this incompatibility. Solving this compatibility issue properly is hard, but we can skip hudi-spark2 unit testing when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment. was: hudi-spark2 unit tests failed when running with Spark 3.0.0: {code:java} mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2{code} This threw a class-not-found error: {code:java} java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97) {code} When the spark3 profile is enabled, some of the hudi-spark2 test code is compiled against Spark 3 and run with Spark 2; the above error is caused by this incompatibility. Solving this compatibility issue properly is hard, but we can skip hudi-spark2 unit testing when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment. 
> Fix hudi-spark2 unit tests failure with Spark 3.0.0 > > > Key: HUDI-1512 > URL: https://issues.apache.org/jira/browse/HUDI-1512 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > This bug was introduced by https://github.com/apache/hudi/pull/2328. > > hudi-spark2 unit tests failed when running with Spark 3.0.0: > {code:java} > mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2{code} > This threw a class-not-found error: > {code:java} > java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder > at > org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97) > {code} > When the spark3 profile is enabled, some of the hudi-spark2 test code is > compiled against Spark 3 and then run with Spark 2; the above error is caused > by this incompatibility. > Solving this compatibility issue properly is hard, but we can skip > hudi-spark2 unit testing when the spark3 profile is enabled, since hudi-spark2 > code is not actually used in a Spark 3.0.0 environment. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1512) Fix hudi-spark2 unit tests failure with Spark 3.0.0
[ https://issues.apache.org/jira/browse/HUDI-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1512: --- Description:
hudi-spark2 unit tests fail when running with Spark 3.0.0:
{code:java}
mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2
{code}
The build throws a class-not-found error:
{code:java}
java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder
  at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97)
{code}
When the spark3 profile is enabled, some of the test code in the hudi-spark2 module is compiled with Spark 3 but then run with Spark 2; the error above is caused by this incompatibility.
Solving the compatibility issue itself is hard, but we can skip the hudi-spark2 unit tests when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment.

was: hudi-spark2 unit tests failed with Spark 3.0.0: {code:java} mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2 java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97) {code}

> Fix hudi-spark2 unit tests failure with Spark 3.0.0
>
> Key: HUDI-1512
> URL: https://issues.apache.org/jira/browse/HUDI-1512
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Wenning Ding
> Priority: Major
>
> hudi-spark2 unit tests fail when running with Spark 3.0.0:
> {code:java}
> mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2
> {code}
> The build throws a class-not-found error:
> {code:java}
> java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder
> at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97)
> {code}
> When the spark3 profile is enabled, some of the test code in the hudi-spark2 module is compiled with Spark 3 but then run with Spark 2; the error above is caused by this incompatibility.
> Solving the compatibility issue itself is hard, but we can skip the hudi-spark2 unit tests when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment.
[jira] [Assigned] (HUDI-1512) Fix hudi-spark2 unit tests failure with Spark 3.0.0
[ https://issues.apache.org/jira/browse/HUDI-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1512: -- Assignee: Wenning Ding
> Fix hudi-spark2 unit tests failure with Spark 3.0.0
>
> Key: HUDI-1512
> URL: https://issues.apache.org/jira/browse/HUDI-1512
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Wenning Ding
> Assignee: Wenning Ding
> Priority: Major
>
> hudi-spark2 unit tests fail when running with Spark 3.0.0:
> {code:java}
> mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2
> {code}
> The build throws a class-not-found error:
> {code:java}
> java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder
> at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97)
> {code}
> When the spark3 profile is enabled, some of the test code in the hudi-spark2 module is compiled with Spark 3 but then run with Spark 2; the error above is caused by this incompatibility.
> Solving the compatibility issue itself is hard, but we can skip the hudi-spark2 unit tests when the spark3 profile is enabled, since hudi-spark2 code is not actually used in a Spark 3.0.0 environment.
[jira] [Created] (HUDI-1512) Fix hudi-spark2 unit tests failure with Spark 3.0.0
Wenning Ding created HUDI-1512: -- Summary: Fix hudi-spark2 unit tests failure with Spark 3.0.0 Key: HUDI-1512 URL: https://issues.apache.org/jira/browse/HUDI-1512 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding
hudi-spark2 unit tests failed with Spark 3.0.0:
{code:java}
mvn clean install -Dspark3 -pl hudi-spark-datasource/hudi-spark2

java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheBuilder
  at org.apache.hudi.internal.TestHoodieBulkInsertDataInternalWriter.testGlobalFailure(TestHoodieBulkInsertDataInternalWriter.java:97)
{code}
[jira] [Updated] (HUDI-1489) Not able to read after updating bootstrap table with written table
[ https://issues.apache.org/jira/browse/HUDI-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1489: --- Description:
After updating a Hudi table that was written via bootstrap, reading the latest version of the bootstrap table fails.
h3. Reproduction steps
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieTableType
import org.apache.hudi.config.HoodieBootstrapConfig
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

val bucketName = "wenningd-dev"
val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8"
val recordKeyName = "event_id"
val partitionKeyName = "event_type"
val precombineKeyName = "event_time"
val verificationRecordKey = "4"
val verificationColumn = "event_name"
val originalVerificationValue = "event_d"
val updatedVerificationValue = "event_test"
val sourceTableLocation = "s3://wenningd-dev/hudi/test-data/source_table/"
val tableType = HoodieTableType.COPY_ON_WRITE.name()
val verificationSqlQuery = "select " + verificationColumn + " from " + tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'"
val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName
val loadTablePath = tablePath + "/*/*"

// Create table and sync with hive
val df = spark.emptyDataFrame
val tableType = HoodieTableType.COPY_ON_WRITE.name
df.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, sourceTableLocation)
  .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, partitionKeyName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// Verify create with spark sql query
val result0 = spark.sql(verificationSqlQuery)
if (!(result0.count == 1) || !result0.collect.mkString.contains(originalVerificationValue)) {
  throw new TestFailureException("Create table verification failed!")
}

val df3 = spark.read.format("org.apache.hudi").load(loadTablePath)
val df4 = df3.filter(col(recordKeyName) === verificationRecordKey)
val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue))
df5.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKeyName)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, partitionKeyName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)

val result1 = spark.sql(verificationSqlQuery)
val df6 = spark.read.format("org.apache.hudi").load(loadTablePath)
df6.show
{code}
df6.show would return:
{code:java}
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2030)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:967)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGSchedul
[jira] [Created] (HUDI-1489) Not able to read after updating bootstrap table with written table
Wenning Ding created HUDI-1489: -- Summary: Not able to read after updating bootstrap table with written table Key: HUDI-1489 URL: https://issues.apache.org/jira/browse/HUDI-1489 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding
After updating a Hudi table that was written via bootstrap, reading the latest version of the bootstrap table fails.
h3. Reproduction steps
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.common.model.HoodieTableType
import org.apache.hudi.config.HoodieBootstrapConfig
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

val bucketName = "wenningd-emr-dev"
val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8"
val recordKeyName = "event_id"
val partitionKeyName = "event_type"
val precombineKeyName = "event_time"
val verificationRecordKey = "4"
val verificationColumn = "event_name"
val originalVerificationValue = "event_d"
val updatedVerificationValue = "event_test"
// val sourceTableWithoutHiveStylePartition = "s3://wenningd-emr-dev/hudi/test-data/source_table/"
// new parameters
val sourceTableLocation = "s3://wenningd-emr-dev/hudi/test-data/source_table/"
val tableType = HoodieTableType.COPY_ON_WRITE.name()
val verificationSqlQuery = "select " + verificationColumn + " from " + tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'"
val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName
val loadTablePath = tablePath + "/*/*"

// Create table and sync with hive
val df = spark.emptyDataFrame
val tableType = HoodieTableType.COPY_ON_WRITE.name
// val tableType = HoodieTableType.MERGE_ON_READ.name
df.write
  .format("hudi")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, sourceTableLocation)
  .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, "org.apache.hudi.keygen.SimpleKeyGenerator")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, partitionKeyName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// Verify create with spark sql query
val result0 = spark.sql(verificationSqlQuery)
if (!(result0.count == 1) || !result0.collect.mkString.contains(originalVerificationValue)) {
  throw new TestFailureException("Create table verification failed!")
}

val df3 = spark.read.format("org.apache.hudi").load(loadTablePath)
val df4 = df3.filter(col(recordKeyName) === verificationRecordKey)
val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue))
df5.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName)
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, partitionKeyName)
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName)
  // .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, partitionKeyName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)

val result1 = spark.sql(verificationSqlQuery)
val df6 = spark.read.format("org.apache.hudi").load(loadTablePath)
df6.show
{code}
df6.show would return:
{code:java}
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2043)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2031)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2030)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayB
[jira] [Assigned] (HUDI-1489) Not able to read after updating bootstrap table with written table
[ https://issues.apache.org/jira/browse/HUDI-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1489: -- Assignee: Wenning Ding > Not able to read after updating bootstrap table with written table > -- > > Key: HUDI-1489 > URL: https://issues.apache.org/jira/browse/HUDI-1489 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > After updating Hudi table with the written bootstrap table, it would fail to > read the latest bootstrap table. > h3. Reproduction steps > {code:java} > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.common.model.HoodieTableType > import org.apache.hudi.config.HoodieBootstrapConfig > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.spark.sql.SaveMode > import org.apache.spark.sql.SparkSession > val bucketName = "wenningd-emr-dev" > val tableName = "hudi_bootstrap_test_cow_5c1a5147_888e_4b638bef8" > val recordKeyName = "event_id" > val partitionKeyName = "event_type" > val precombineKeyName = "event_time" > val verificationRecordKey = "4" > val verificationColumn = "event_name" > val originalVerificationValue = "event_d" > val updatedVerificationValue = "event_test" > // val sourceTableWithoutHiveStylePartition = > "s3://wenningd-emr-dev/hudi/test-data/source_table/" > // new parameters > val sourceTableLocation = > "s3://wenningd-emr-dev/hudi/test-data/source_table/" > val tableType = HoodieTableType.COPY_ON_WRITE.name() > val verificationSqlQuery = "select " + verificationColumn + " from " + > tableName + " where " + recordKeyName + " = '" + verificationRecordKey + "'" > val tablePath = "s3://" + bucketName + "/hudi/tables/" + tableName > val loadTablePath = tablePath + "/*/*" > // Create table and sync with hive > val df = spark.emptyDataFrame > val tableType = HoodieTableType.COPY_ON_WRITE.name > // val tableType = HoodieTableType.MERGE_ON_READ.name > df.write > .format("hudi") > 
.option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(HoodieBootstrapConfig.BOOTSTRAP_BASE_PATH_PROP, > sourceTableLocation) > .option(HoodieBootstrapConfig.BOOTSTRAP_KEYGEN_CLASS, > "org.apache.hudi.keygen.SimpleKeyGenerator") > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.BOOTSTRAP_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, > recordKeyName) > .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, > partitionKeyName) > > .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor") > .mode(SaveMode.Overwrite) > .save(tablePath) > // Verify create with spark sql query > val result0 = spark.sql(verificationSqlQuery) > if (!(result0.count == 1) || > !result0.collect.mkString.contains(originalVerificationValue)) { > throw new TestFailureException("Create table verification failed!") > } > val df3 = spark.read.format("org.apache.hudi").load(loadTablePath) > val df4 = df3.filter(col(recordKeyName) === verificationRecordKey) > val df5 = df4.withColumn(verificationColumn, lit(updatedVerificationValue)) > df5.write.format("org.apache.hudi") > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, recordKeyName) > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, > partitionKeyName) > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, precombineKeyName) > // .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true") > .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") > 
.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) > .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, > partitionKeyName) > .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, > "org.apache.hudi.hive.MultiPartKeysValueExtractor") > .mode(SaveMode.Append) > .save(tablePath) > val result1 = spark.sql(verificationSqlQuery) > val df6 = spark.read.format("org.apache.hudi").load(loadTablePath) > df6.show > {code} > df6.show would return: > {code:java} > Driver stacktrace: > at > org.
[jira] [Updated] (HUDI-1461) Bulk insert v2 creates additional small files
[ https://issues.apache.org/jira/browse/HUDI-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1461: --- Description:
I took a look at the data preparation step for bulk insert and found that the current logic creates additional small files when performing bulk insert v2, which hurts performance.
The current logic first sorts the input dataframe and then coalesces it: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106]
For example, suppose we set BulkInsertShuffleParallelism to 2 and have the following df as input:
{code:java}
val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
{code}
(Here I added a new column partitionID for better understanding.)
Based on the current logic, after the sort and coalesce the dataframe becomes:
{code:java}
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
You can see that the coalesce result does not actually depend on the sorting result: each Spark partition id contains 3 different Hudi partitions. During the writing phase each Spark executor picks up its partition id and creates 3 files under 3 Hudi partitions, so we end up with two parquet files under each Hudi partition. With such a small dataset we should ideally have a single file under each Hudi partition.
If I change the sort to a repartition:
{code:java}
val df3 = df.repartition(functions.col("event_type")).coalesce(2)
df3.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
In this case we get a single file under each Hudi partition. But as we understand it, we still need the sort so that we can benefit from the min/max record key index. So the problem is how to handle this logic correctly. Repartitioning and then sorting within each partition might be a way, though sorting within each partition could cause OOM issues if the data is unbalanced.

was: I took a look at the data preparation step for bulk insert, I found that current logic will create additional small files when performing bulk insert v2 which will hurt the performance. Current logic is to first sort the input dataframe and then do coalesce: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106] For example, the input df is: {code:java} val df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "t
[jira] [Updated] (HUDI-1461) Bulk insert v2 creates additional small files
[ https://issues.apache.org/jira/browse/HUDI-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1461: --- Description:
I took a look at the data preparation step for bulk insert and found that the current logic creates additional small files when performing bulk insert v2, which hurts performance.
The current logic first sorts the input dataframe and then coalesces it: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106]
For example, the input df is:
{code:java}
val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
{code}
(Here I added a new column partitionID for better understanding.)
Based on the current logic, after the sort and coalesce the dataframe becomes:
{code:java}
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
You can see that the coalesce result does not actually depend on the sorting result: each Spark partition id contains 3 different Hudi partitions. During the writing phase each Spark executor picks up its partition id and creates 3 files under 3 Hudi partitions, so we end up with two parquet files under each Hudi partition. With such a small dataset we should ideally have a single file under each Hudi partition.
If I change the sort to a repartition:
{code:java}
val df3 = df.repartition(functions.col("event_type")).coalesce(2)
df3.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
In this case we get a single file under each Hudi partition. But as we understand it, we still need the sort so that we can benefit from the min/max record key index. So the problem is how to handle this logic correctly. Repartitioning and then sorting within each partition might be a way.

was: I took a look at the data preparation step for bulk insert, I found that current logic will create additional small files when performing bulk insert v2 which will hurt the performance. Current logic is to first sort the input dataframe and then do coalesce: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106] For example, the input df is: {code:java} val df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"), (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
[jira] [Updated] (HUDI-1461) Bulk insert v2 creates additional small files
[ https://issues.apache.org/jira/browse/HUDI-1461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1461: --- Description:
I took a look at the data preparation step for bulk insert and found that the current logic creates additional small files when performing bulk insert v2, which hurts performance.
The current logic first sorts the input dataframe and then coalesces it: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106]
For example, the input df is:
{code:java}
val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
{code}
(Here I added a new column partitionID for better understanding.)
Based on the current logic, after the sort and coalesce the dataframe becomes:
{code:java}
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
You can see that the coalesce result does not actually depend on the sorting result: each Spark partition id contains 3 different Hudi partitions. During the writing phase each Spark executor picks up its partition id and creates 3 files under 3 Hudi partitions, so we end up with two parquet files under each Hudi partition. With such a small dataset we should ideally have a single file under each Hudi partition.
If I change the sort to a repartition:
{code:java}
val df3 = df.repartition(functions.col("event_type")).coalesce(2)
df3.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
In this case we get a single file under each Hudi partition. But as we understand it, we still need the sort so that we can benefit from the min/max record key index. So the problem is how to handle this logic correctly.

was: I took a look at the data preparation step for bulk insert, I found that current logic will create additional small files when performing bulk insert v2 which will hurt the performance. Current logic is to first sort the input dataframe and then do coalesce: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106] For example, the input df is: {code:java} val df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"), (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"), (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"), (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type
[jira] [Created] (HUDI-1461) Bulk insert v2 creates additional small files
Wenning Ding created HUDI-1461: -- Summary: Bulk insert v2 creates additional small files Key: HUDI-1461 URL: https://issues.apache.org/jira/browse/HUDI-1461 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding I took a look at the data preparation step for bulk insert and found that the current logic creates additional small files when performing bulk insert v2, which hurts performance. The current logic first sorts the input dataframe and then does a coalesce: [https://github.com/apache/hudi/blob/master/hudi-spark-datasource/hudi-spark/src/main/java/org/apache/hudi/HoodieDatasetBulkInsertHelper.java#L104-L106] For example, the input df is:
{code:java}
val df = Seq(
  (100, "event_name_16", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (108, "event_name_18", "2015-01-01T11:51:33.340396Z", "type1"),
  (109, "event_name_19", "2014-01-01T11:51:33.340396Z", "type3"),
  (110, "event_name_20", "2014-02-01T11:51:33.340396Z", "type3"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")
{code}
(Here I added a new column partitionID for better understanding.) Based on the current logic, after sort and coalesce the dataframe becomes:
{code:java}
val df2 = df.sort(functions.col("event_type"), functions.col("event_id")).coalesce(2)
df2.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |0          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |1          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
You can see the coalesce result is actually unrelated to the sort order, and each Spark partition id contains 3 types of Hudi partitions. So during the writing phase, each Spark executor gets its corresponding partition id and creates 3 files under 3 Hudi partitions. In the end we have two parquet files under each Hudi partition, but with such a small dataset we should ideally have a single file under each Hudi partition. If I change the sort to repartition:
{code:java}
val df3 = df.repartition(functions.col("event_type")).coalesce(2)
df3.withColumn("partitionID", spark_partition_id).show(false)

+--------+--------------+---------------------------+----------+-----------+
|event_id|event_name    |event_ts                   |event_type|partitionID|
+--------+--------------+---------------------------+----------+-----------+
|100     |event_name_16 |2015-01-01T13:51:39.340396Z|type1     |0          |
|104     |event_name_123|2015-01-01T12:15:00.512679Z|type1     |0          |
|108     |event_name_18 |2015-01-01T11:51:33.340396Z|type1     |0          |
|101     |event_name_546|2015-01-01T12:14:58.597216Z|type2     |1          |
|105     |event_name_678|2015-01-01T13:51:42.248818Z|type2     |1          |
|109     |event_name_19 |2014-01-01T11:51:33.340396Z|type3     |1          |
|110     |event_name_20 |2014-02-01T11:51:33.340396Z|type3     |1          |
+--------+--------------+---------------------------+----------+-----------+
{code}
In this case we get a single file under each Hudi partition. But as far as we understand, we still need the sort so that we can benefit from the min/max record key index. So the problem is how to correctly handle both requirements. -- This message was sent by Atlassian Jira (v8.3.4#803005)
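The layout difference above can be reproduced without Spark. The sketch below uses hypothetical Python helpers (not Spark APIs): sort+coalesce is simulated as contiguous chunking of the sorted rows, repartition as hash partitioning by the Hudi partition column, and each (Spark partition, Hudi partition) pair is counted as one data file.

```python
import zlib

# Hypothetical stand-ins for Spark's operators, simulating why
# sort+coalesce spreads a Hudi partition over several Spark partitions
# while repartitioning by the partition column does not.

def sort_then_coalesce(rows, sort_key, n):
    """Globally sort, then cut into n contiguous chunks (like coalesce)."""
    s = sorted(rows, key=sort_key)
    size = -(-len(s) // n)  # ceiling division
    return [s[i * size:(i + 1) * size] for i in range(n)]

def repartition_by_key(rows, part_key, n):
    """Hash-partition: every row with the same key lands in one partition."""
    parts = [[] for _ in range(n)]
    for r in rows:
        parts[zlib.crc32(part_key(r).encode()) % n].append(r)
    return parts

def files_written(partitions, hudi_key):
    """Each (Spark partition, Hudi partition) pair becomes one data file."""
    return sum(len({hudi_key(r) for r in part}) for part in partitions)

rows = [(100, "type1"), (101, "type2"), (104, "type1"), (108, "type1"),
        (109, "type3"), (110, "type3"), (105, "type2")]
event_type = lambda r: r[1]

# Chunk boundaries ignore event_type, so the chunks mix types: 4 files here.
print(files_written(sort_then_coalesce(rows, lambda r: (r[1], r[0]), 2), event_type))  # 4
# Each type stays in exactly one partition: 3 files, one per Hudi partition.
print(files_written(repartition_by_key(rows, event_type, 2), event_type))  # 3
```

The simulation is simplified (Spark's coalesce merges upstream partitions rather than cutting even chunks, which is why the actual run above writes 6 files), but it shows the same effect: chunking a sorted dataframe cuts across Hudi partition boundaries and produces more files than the ideal one per partition.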
[jira] [Created] (HUDI-1451) Support bulk insert v2 with Spark 3
Wenning Ding created HUDI-1451: -- Summary: Support bulk insert v2 with Spark 3 Key: HUDI-1451 URL: https://issues.apache.org/jira/browse/HUDI-1451 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Recently, we made Hudi support Spark 3.0.0: https://github.com/apache/hudi/pull/2208. However, the Spark datasource V2 API interfaces went through a large refactor from version 2.4.4 to 3.0.0, so our bulk insert v2 feature (https://issues.apache.org/jira/browse/HUDI-1013) is currently marked as unsupported with Spark 3. We need to redesign the bulk insert v2 part against the datasource v2 API in Spark 3.0.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1432) Better organize classes in hudi-spark-common & hudi-spark
Wenning Ding created HUDI-1432: -- Summary: Better organize classes in hudi-spark-common & hudi-spark Key: HUDI-1432 URL: https://issues.apache.org/jira/browse/HUDI-1432 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding In this PR: https://github.com/apache/hudi/pull/2208, we refactored the hudi-spark module and split it into four submodules: hudi-spark, hudi-spark-common, hudi-spark2 and hudi-spark3. So far we have only moved two classes to hudi-spark-common: DataSourceWriteOptions and DataSourceUtils. Ideally we should move more classes into hudi-spark-common. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1396) Bootstrap job via Hudi datasource hangs at the end of the spark-submit job
Wenning Ding created HUDI-1396: -- Summary: Bootstrap job via Hudi datasource hangs at the end of the spark-submit job Key: HUDI-1396 URL: https://issues.apache.org/jira/browse/HUDI-1396 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding Bootstrap job via Hudi datasource hangs at the end of the spark-submit job This issue is similar to https://issues.apache.org/jira/browse/HUDI-1230. Basically, {{HoodieWriteClient}} at [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255] will not be closed and as a result, the corresponding timeline server will not stop at the end. Therefore the job hangs and never exits. -- This message was sent by Atlassian Jira (v8.3.4#803005)
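The hang has the classic shape of a resource that owns a non-daemon background thread (here, the embedded timeline server) not being closed before the driver exits. A minimal language-agnostic sketch of the failure mode and the close-in-finally fix, using illustrative Python classes rather than Hudi code:

```python
import threading
import time

class TimelineServer:
    """Illustrative stand-in for the embedded timeline server: a non-daemon
    background thread that keeps the process alive until explicitly stopped."""
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run)  # non-daemon by default
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            time.sleep(0.01)  # pretend to serve timeline requests

    def stop(self):
        self._stop.set()
        self._thread.join()

class WriteClient:
    """Illustrative stand-in for HoodieWriteClient: starting it starts the
    server; forgetting close() leaves the server thread running forever."""
    def __init__(self):
        self.server = TimelineServer()

    def write(self, rows):
        return len(rows)

    def close(self):
        self.server.stop()

# The fix pattern: close in finally, so the server stops even if the
# write fails and the process can actually exit.
client = WriteClient()
try:
    written = client.write([1, 2, 3])
finally:
    client.close()
print(written)  # 3
```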
[jira] [Updated] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Description: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns are not used during the Spark datasource writing process, we can drop those columns in the beginning. was: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns are not used during the Spark datasource writing process, we can drop those columns in the beginning of Spark datasource. > Drop Hudi metadata columns before Spark datasource writing > --- > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. > Since metadata columns are not used during the Spark datasource writing > process, we can drop those columns in the beginning. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Description: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns are not used during the Spark datasource writing process, we can drop those columns in the beginning of Spark datasource. was: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns is not used during the Spark datasource processing, we can drop those columns in the beginning of Spark datasource. > Drop Hudi metadata columns before Spark datasource writing > --- > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. 
> Since metadata columns are not used during the Spark datasource writing > process, we can drop those columns in the beginning of Spark datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1376) Drop metadata columns before Spark datasource processing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Description: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since metadata columns is not used during the Spark datasource processing, we can drop those columns in the beginning of Spark datasource. was: When updating a Hudi table through Spark datasource, it will use the schema of the input dataframe as the schema stored in the commit files. Thus, when upserted with rows containing metadata columns, the upsert commit file will store the metadata columns schema in the commit file which is unnecessary for common cases. And also this will bring an issue for bootstrap table. Since the schema of metadata columns is always the same, we should remove the schema of metadata columns in the commit file for any insert/upsert/... action. > Drop metadata columns before Spark datasource processing > - > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. 
> Since metadata columns are not used during the Spark datasource processing, we > can drop those columns at the beginning of the Spark datasource path. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1376) Drop Hudi metadata columns before Spark datasource writing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Summary: Drop Hudi metadata columns before Spark datasource writing (was: Drop metadata columns before Spark datasource processing ) > Drop Hudi metadata columns before Spark datasource writing > --- > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. > Since metadata columns is not used during the Spark datasource processing, we > can drop those columns in the beginning of Spark datasource. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1376) Drop metadata columns before Spark datasource processing
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1376: --- Summary: Drop metadata columns before Spark datasource processing (was: Remove the schema of metadata columns in the commit files) > Drop metadata columns before Spark datasource processing > - > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. > Since the schema of metadata columns is always the same, we should remove the > schema of metadata columns in the commit file for any insert/upsert/... > action. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1376) Remove the schema of metadata columns in the commit files
Wenning Ding created HUDI-1376: -- Summary: Remove the schema of metadata columns in the commit files Key: HUDI-1376 URL: https://issues.apache.org/jira/browse/HUDI-1376 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding When updating a Hudi table through the Spark datasource, the schema of the input dataframe is used as the schema stored in the commit files. Thus, when upserting rows that contain metadata columns, the upsert commit file stores the metadata columns' schema, which is unnecessary for common cases and also causes an issue for bootstrap tables. Since the schema of the metadata columns is always the same, we should remove it from the commit file for any insert/upsert/... action. -- This message was sent by Atlassian Jira (v8.3.4#803005)
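Since the metadata columns are a fixed, well-known set, the handling this issue converged on (dropping them from the input before writing) can be sketched as filtering them out of the incoming schema. The column names below are Hudi's standard metadata columns; the helper itself is hypothetical, not Hudi code:

```python
# Hudi's standard metadata column names (a fixed set prepended to every table).
HOODIE_META_COLUMNS = [
    "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
    "_hoodie_partition_path", "_hoodie_file_name",
]

def drop_meta_columns(columns):
    """Return the input schema's columns with Hudi metadata columns removed,
    so the commit file only records the user schema."""
    return [c for c in columns if c not in HOODIE_META_COLUMNS]

# Rows read back from a Hudi table carry metadata columns; strip them
# before the upsert so they never reach the commit-file schema.
cols = ["_hoodie_commit_time", "_hoodie_record_key", "event_id", "event_name"]
print(drop_meta_columns(cols))  # ['event_id', 'event_name']
```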
[jira] [Assigned] (HUDI-1376) Remove the schema of metadata columns in the commit files
[ https://issues.apache.org/jira/browse/HUDI-1376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1376: -- Assignee: Wenning Ding > Remove the schema of metadata columns in the commit files > - > > Key: HUDI-1376 > URL: https://issues.apache.org/jira/browse/HUDI-1376 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > When updating a Hudi table through Spark datasource, it will use the schema > of the input dataframe as the schema stored in the commit files. Thus, when > upserted with rows containing metadata columns, the upsert commit file will > store the metadata columns schema in the commit file which is unnecessary for > common cases. And also this will bring an issue for bootstrap table. > Since the schema of metadata columns is always the same, we should remove the > schema of metadata columns in the commit file for any insert/upsert/... > action. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1375) Fix bug for HoodieAvroUtils.removeMetadataFields
Wenning Ding created HUDI-1375: -- Summary: Fix bug for HoodieAvroUtils.removeMetadataFields Key: HUDI-1375 URL: https://issues.apache.org/jira/browse/HUDI-1375 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding There's a bug in HoodieAvroUtils.removeMetadataFields(). Reproduction steps:
{code:java}
import org.apache.hudi.avro.HoodieAvroUtils
import org.apache.avro.Schema

val EXAMPLE_SCHEMA = """{"type": "record","name": "testrec","fields": [ {"name": "timestamp","type": "double"},{"name": "_row_key", "type": "string"},{"name": "non_pii_col", "type": "string"},{"name": "pii_col", "type": "string", "column_category": "user_profile"}]}"""
val schema = HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(EXAMPLE_SCHEMA))
HoodieAvroUtils.removeMetadataFields(schema)
{code}
It then throws an exception:
{code:java}
org.apache.avro.AvroRuntimeException: Field already used: timestamp type:DOUBLE pos:5
  at org.apache.avro.Schema$RecordSchema.setFields(Schema.java:647)
  at org.apache.hudi.avro.HoodieAvroUtils.removeMetadataFields(HoodieAvroUtils.java:209)
  ... 49 elided
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
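The "Field already used" error is characteristic of Avro's Schema.Field objects: a field records its position the first time it is added to a record schema and cannot be added to a second one, so the fix is to copy each field rather than reuse it. A toy Python simulation of that semantics (these are not Avro's actual classes):

```python
# Toy model of Avro's one-time field-position rule (not Avro's real classes).
class Field:
    def __init__(self, name):
        self.name = name
        self.pos = -1  # set exactly once, when added to a record

class Record:
    def __init__(self, fields):
        for i, f in enumerate(fields):
            if f.pos != -1:  # mirrors Avro's RecordSchema.setFields check
                raise RuntimeError(f"Field already used: {f.name} pos:{f.pos}")
            f.pos = i
        self.fields = fields

full = Record([Field("_hoodie_commit_time"), Field("timestamp")])
user_fields = [f for f in full.fields if not f.name.startswith("_hoodie_")]

# Buggy approach: reusing the same Field objects in a new record fails.
try:
    Record(user_fields)
except RuntimeError as e:
    print(e)  # Field already used: timestamp pos:1

# Fix: build fresh Field copies for the stripped schema.
stripped = Record([Field(f.name) for f in user_fields])
print([f.name for f in stripped.fields])  # ['timestamp']
```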
[jira] [Assigned] (HUDI-1375) Fix bug for HoodieAvroUtils.removeMetadataFields
[ https://issues.apache.org/jira/browse/HUDI-1375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1375: -- Assignee: Wenning Ding > Fix bug for HoodieAvroUtils.removeMetadataFields > > > Key: HUDI-1375 > URL: https://issues.apache.org/jira/browse/HUDI-1375 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > There's a bug in HoodieAvroUtils.removeMetadataFields(). > Reproduction steps: > {code:java} > import org.apache.hudi.avro.HoodieAvroUtils > import org.apache.avro.Schema > val EXAMPLE_SCHEMA = """{"type": "record","name": "testrec","fields": [ > {"name": "timestamp","type": "double"},{"name": "_row_key", "type": > "string"},{"name": "non_pii_col", "type": "string"},{"name": "pii_col", > "type": "string", "column_category": "user_profile"}]}""" > val schema = > HoodieAvroUtils.addMetadataFields(new Schema.Parser().parse(EXAMPLE_SCHEMA)) > HoodieAvroUtils.removeMetadataFields(schema) > {code} > It then throws an exception: > {code:java} > org.apache.avro.AvroRuntimeException: Field already used: timestamp > type:DOUBLE pos:5 at > org.apache.avro.Schema$RecordSchema.setFields(Schema.java:647) at > org.apache.hudi.avro.HoodieAvroUtils.removeMetadataFields(HoodieAvroUtils.java:209) > ... 49 elided {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit
[ https://issues.apache.org/jira/browse/HUDI-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1369: -- Assignee: Wenning Ding > Bootstrap datasource jobs from hanging via spark-submit > --- > > Key: HUDI-1369 > URL: https://issues.apache.org/jira/browse/HUDI-1369 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > > MOR table creation via Hudi datasource hangs at the end of the spark-submit > job. > Looks like {{HoodieWriteClient}} at > [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255] > not being closed which does not stop the timeline server at the end, and as > a result the job hangs and never exits. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1369) Bootstrap datasource jobs from hanging via spark-submit
Wenning Ding created HUDI-1369: -- Summary: Bootstrap datasource jobs from hanging via spark-submit Key: HUDI-1369 URL: https://issues.apache.org/jira/browse/HUDI-1369 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding MOR table creation via the Hudi datasource hangs at the end of the spark-submit job. It looks like the {{HoodieWriteClient}} at [https://github.com/apache/hudi/blob/release-0.6.0/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala#L255] is not being closed, so the timeline server is not stopped at the end; as a result the job hangs and never exits. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1021) [Bug] Unable to update bootstrapped table using rows from the written bootstrapped table
[ https://issues.apache.org/jira/browse/HUDI-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17187969#comment-17187969 ] Wenning Ding commented on HUDI-1021: Verified this issue still exists in 0.6.0 release. Will work on fixing it. > [Bug] Unable to update bootstrapped table using rows from the written > bootstrapped table > > > Key: HUDI-1021 > URL: https://issues.apache.org/jira/browse/HUDI-1021 > Project: Apache Hudi > Issue Type: Sub-task > Components: bootstrap >Reporter: Udit Mehrotra >Assignee: Wenning Ding >Priority: Blocker > Fix For: 0.6.1 > > > Reproduction Steps: > > {code:java} > import spark.implicits._ > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.hudi.HoodieDataSourceHelpers > import org.apache.hudi.common.model.HoodieTableType > import org.apache.spark.sql.SaveMode > val sourcePath = > "s3://uditme-iad/hudi/tables/events/events_data_partitioned_non_null" > val sourceDf = spark.read.parquet(sourcePath + "/*") > var tableName = "events_data_partitioned_non_null_00" > var tablePath = "s3://emr-users/uditme/hudi/tables/events/" + tableName > sourceDf.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .mode(SaveMode.Overwrite) > .save(tablePath) > val readDf = spark.read.format("org.apache.hudi").load(tablePath + "/*") > val updateDf = readDf.filter($"event_id" === "106") > .withColumn("event_name", lit("udit_event_106")) > > 
updateDf.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .mode(SaveMode.Append) > .save(tablePath) > {code} > > Full Stack trace: > {noformat} > Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting > bucketType UPDATE for partition :0 > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:276) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181) > at > 
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:308) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.scheduler.Result
[jira] [Created] (HUDI-1194) Reorganize HoodieHiveClient and make it fully support Hive Metastore API
Wenning Ding created HUDI-1194: -- Summary: Reorganize HoodieHiveClient and make it fully support Hive Metastore API Key: HUDI-1194 URL: https://issues.apache.org/jira/browse/HUDI-1194 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Currently there are three ways for HoodieHiveClient to perform Hive operations: through Hive JDBC, through the Hive Metastore API, and through the Hive Driver. There's a parameter called +{{hoodie.datasource.hive_sync.use_jdbc}}+ that controls whether to use Hive JDBC. However, this parameter does not accurately describe the actual behavior: when +*use_jdbc*+ is set to true, most of the methods in HoodieHiveClient use JDBC and a few use the Hive Metastore API; when it is set to false, most of the methods use the Hive Driver and a few use the Metastore API. Here is a table showing what is actually used when setting use_jdbc to true/false:
|Method|use_jdbc=true|use_jdbc=false|
|{{addPartitionsToTable}}|JDBC|Hive Driver|
|{{updatePartitionsToTable}}|JDBC|Hive Driver|
|{{scanTablePartitions}}|Metastore API|Metastore API|
|{{updateTableDefinition}}|JDBC|Hive Driver|
|{{createTable}}|JDBC|Hive Driver|
|{{getTableSchema}}|JDBC|Metastore API|
|{{doesTableExist}}|Metastore API|Metastore API|
|{{getLastCommitTimeSynced}}|Metastore API|Metastore API|
[~bschell] and I developed several Metastore API implementations for {{createTable}}, {{addPartitionsToTable}}, {{updatePartitionsToTable}} and {{updateTableDefinition}}, which will be helpful for several issues, e.g. resolving the null-partition Hive sync issue and supporting ALTER TABLE cascade with the AWS Glue catalog. But it seems hard to organize three implementations within the current config, so we plan to separate HoodieHiveClient into three classes:
# HoodieHiveClient, which implements all the APIs through the Metastore API.
# HoodieHiveJDBCClient, which extends HoodieHiveClient and overrides several of the APIs through Hive JDBC.
# HoodieHiveDriverClient, which extends HoodieHiveClient and overrides several of the APIs through the Hive Driver.
And we introduce a new parameter +*hoodie.datasource.hive_sync.hive_client_class*+ which lets you choose which Hive client class to use. -- This message was sent by Atlassian Jira (v8.3.4#803005)
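The proposed three-class split can be sketched as follows. The class names and the config key come from the ticket; the Python method bodies are placeholders standing in for real Metastore/JDBC/Driver calls:

```python
# Sketch of the proposed hierarchy: a metastore-backed base client, with
# JDBC and Driver subclasses overriding only the methods they implement
# differently, selected by a single config key instead of a boolean flag.
class HoodieHiveClient:
    """Base client: every API goes through the Hive Metastore API."""
    def create_table(self, name): return f"metastore:create:{name}"
    def get_table_schema(self, name): return f"metastore:schema:{name}"

class HoodieHiveJDBCClient(HoodieHiveClient):
    def create_table(self, name): return f"jdbc:create:{name}"  # DDL over JDBC

class HoodieHiveDriverClient(HoodieHiveClient):
    def create_table(self, name): return f"driver:create:{name}"  # DDL via Hive Driver

CLIENTS = {
    "metastore": HoodieHiveClient,
    "jdbc": HoodieHiveJDBCClient,
    "driver": HoodieHiveDriverClient,
}

def make_client(config):
    # Replaces the ambiguous boolean hoodie.datasource.hive_sync.use_jdbc:
    # one key picks the client class, and inheritance covers the methods
    # that always go through the Metastore API.
    key = config.get("hoodie.datasource.hive_sync.hive_client_class", "metastore")
    return CLIENTS[key]()

c = make_client({"hoodie.datasource.hive_sync.hive_client_class": "jdbc"})
print(c.create_table("t"))      # jdbc:create:t
print(c.get_table_schema("t"))  # metastore:schema:t (inherited from the base)
```

The design point is that the table above collapses into the class hierarchy: rows that say "Metastore API" in both columns live only in the base class, and only the rows that differ are overridden.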
[jira] [Assigned] (HUDI-1181) Decimal type display issue for record key field
[ https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1181: -- Assignee: Wenning Ding > Decimal type display issue for record key field > --- > > Key: HUDI-1181 > URL: https://issues.apache.org/jira/browse/HUDI-1181 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Assignee: Wenning Ding >Priority: Major > Labels: pull-request-available > > When using *fixed_len_byte_array* decimal type as Hudi record key, Hudi would > not correctly display the decimal value, instead, Hudi would display it as a > byte array. > During the Hudi writing phase, Hudi would save the parquet source data into > Avro Generic Record. For example, the source parquet data has a column with > decimal type: > > {code:java} > optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code} > > Then Hudi will convert it into the following avro decimal type: > {code:java} > { > "name" : "OBJ_ID", > "type" : [ { > "type" : "fixed", > "name" : "fixed", > "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID", > "size" : 16, > "logicalType" : "decimal", > "precision" : 38, > "scale" : 0 > }, "null" ] > } > {code} > This decimal field would be stored as a fixed length bytes array. And in the > reading phase, Hudi will convert this bytes array back to a readable decimal > value through this > [converter|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L58]. > However, the problem is, when setting decimal type as record keys, Hudi would > read the value from Avro Generic Record and then directly convert it into > String type(See > [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]). 
> As a result, what shows in the _hoodie_record_key field would be something > like: LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]. > So we need to handle this special case and convert the byte array back before > converting to String. -- This message was sent by Atlassian Jira (v8.3.4#803005)
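The needed conversion is to interpret the fixed-length byte array as a big-endian two's-complement unscaled integer and apply the decimal scale before stringifying, instead of printing the raw bytes. A sketch (the helper name is hypothetical; the byte values are the Java signed bytes from the ticket, with DECIMAL(38,0)):

```python
def fixed_bytes_to_decimal(raw, scale):
    """Decode an Avro 'fixed' decimal: big-endian two's-complement bytes,
    scaled by 10**scale. Accepts Java-style signed byte values."""
    unscaled = int.from_bytes(bytes(b & 0xFF for b in raw), "big", signed=True)
    if scale == 0:
        return str(unscaled)
    digits = str(abs(unscaled)).rjust(scale + 1, "0")
    sign = "-" if unscaled < 0 else ""
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"

# The byte array shown in the record key above, as Java signed bytes:
java_bytes = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71]
print(fixed_bytes_to_decimal(java_bytes, 0))  # 422076345
```

With this decoding step applied before the record key is stringified, the key would read LN_LQDN_OBJ_ID:422076345 rather than the raw byte array.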
[jira] [Updated] (HUDI-1181) Decimal type display issue for record key field
[ https://issues.apache.org/jira/browse/HUDI-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-1181: --- Description: When using *fixed_len_byte_array* decimal type as Hudi record key, Hudi would not correctly display the decimal value, instead, Hudi would display it as a byte array. During the Hudi writing phase, Hudi would save the parquet source data into Avro Generic Record. For example, the source parquet data has a column with decimal type: {code:java} optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code} Then Hudi will convert it into the following avro decimal type: {code:java} { "name" : "OBJ_ID", "type" : [ { "type" : "fixed", "name" : "fixed", "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID", "size" : 16, "logicalType" : "decimal", "precision" : 38, "scale" : 0 }, "null" ] } {code} This decimal field would be stored as a fixed length bytes array. And in the reading phase, Hudi will convert this bytes array back to a readable decimal value through this converter. However, the problem is, when setting decimal type as record keys, Hudi would read the value from Avro Generic Record and then directly convert it into String type(See [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]). As a result, what shows in the _hoodie_record_key field would be something like: LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71].So we need to handle this special case to convert bytes array back before converting to String. was: When using *fixed_len_byte_array* decimal type as Hudi record key, Hudi would not correctly display the decimal value, instead, Hudi would display it as a byte array. During the Hudi writing phase, Hudi would save the parquet source data into Avro Generic Record. 
For example, the source parquet data has a column with decimal type: {code:java} optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code} Then Hudi will convert it into the following avro decimal type: {code:java} { "name" : "LN_LQDN_OBJ_ID", "type" : [ { "type" : "fixed", "name" : "fixed", "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID", "size" : 16, "logicalType" : "decimal", "precision" : 38, "scale" : 0 }, "null" ] } {code} This decimal field would be stored as a fixed length bytes array. And in the reading phase, Hudi will convert this bytes array back to a readable decimal value through this converter. However, the problem is, when setting decimal type as record keys, Hudi would read the value from Avro Generic Record and then directly convert it into String type(See [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]). As a result, what shows in the _hoodie_record_key field would be something like: LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71].So we need to handle this special case to convert bytes array back before converting to String. > Decimal type display issue for record key field > --- > > Key: HUDI-1181 > URL: https://issues.apache.org/jira/browse/HUDI-1181 > Project: Apache Hudi > Issue Type: Bug >Reporter: Wenning Ding >Priority: Major > > When using *fixed_len_byte_array* decimal type as Hudi record key, Hudi would > not correctly display the decimal value, instead, Hudi would display it as a > byte array. > During the Hudi writing phase, Hudi would save the parquet source data into > Avro Generic Record. 
For example, the source parquet data has a column with > decimal type: > > {code:java} > optional fixed_len_byte_array(16) OBJ_ID (DECIMAL(38,0));{code} > > Then Hudi will convert it into the following avro decimal type: > {code:java} > { > "name" : "OBJ_ID", > "type" : [ { > "type" : "fixed", > "name" : "fixed", > "namespace" : "hoodie.hudi_ln.hudi_ln_record.OBJ_ID", > "size" : 16, > "logicalType" : "decimal", > "precision" : 38, > "scale" : 0 > }, "null" ] > } > {code} > This decimal field would be stored as a fixed length bytes array. And in the > reading phase, Hudi will convert this bytes array back to a readable decimal > value through this converter. > However, the problem is, when setting decimal type as record keys, Hudi would > read the value from Avro Generic Record and then directly convert it into > String type(See > [here|https://github.com/apache/hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L76]). > As a result, what shows in the _hoodie_record_key field would be something > like: LN_LQDN_OBJ_ID:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71
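The conversion the issue asks for can be sketched in plain Java: a fixed-length two's-complement byte array carrying an Avro decimal can be turned back into a readable value with BigInteger/BigDecimal before it is stringified into the record key. This is only an illustration of the encoding, not Hudi's actual code path; the class and method names below are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalKeySketch {
    // Interpret a fixed-length two's-complement byte array as a decimal
    // with the given scale, the layout Avro's decimal logical type uses
    // for "fixed" fields.
    static BigDecimal fromFixedBytes(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        // The byte array from the report; the column is DECIMAL(38,0), so scale = 0.
        byte[] raw = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 40, 95, -71};
        // Prints a readable decimal instead of the raw "[0, 0, ...]" array form.
        System.out.println(fromFixedBytes(raw, 0));
    }
}
```

Applying a conversion like this before the String conversion in DataSourceUtils would make _hoodie_record_key show the numeric value rather than the byte array.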
[jira] [Created] (HUDI-1181) Decimal type display issue for record key field
Wenning Ding created HUDI-1181: -- Summary: Decimal type display issue for record key field Key: HUDI-1181 URL: https://issues.apache.org/jira/browse/HUDI-1181 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding When using ```fixed_len_byte_array``` decimal type as Hudi record key, Hudi would not correctly display the decimal value, instead, Hudi would display it as a byte array. During the Hudi writing phase, Hudi would save the parquet source data into Avro Generic Record. For example, the source parquet data has a column with decimal type: ``` optional fixed_len_byte_array(16) LN_LQDN_OBJ_ID (DECIMAL(38,0)); ``` Then Hudi will convert it into the following avro decimal type: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1180) Upgrade HBase to 2.3.3
Wenning Ding created HUDI-1180: -- Summary: Upgrade HBase to 2.3.3 Key: HUDI-1180 URL: https://issues.apache.org/jira/browse/HUDI-1180 Project: Apache Hudi Issue Type: Improvement Reporter: Wenning Ding Trying to upgrade HBase to 2.3.3, but ran into several issues. According to the Hadoop version support matrix ([http://hbase.apache.org/book.html#hadoop]), we also need to upgrade Hadoop to 2.8.5+. There are several API conflicts between HBase 2.3.3 and HBase 1.2.3; we need to resolve these first. After resolving the conflicts I am able to compile, but then I ran into a tricky Jetty version issue during testing: {code:java} [ERROR] TestHBaseIndex.testDelete() Time elapsed: 4.705 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdate() Time elapsed: 0.174 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testSimpleTagLocationAndUpdateWithRollback() Time elapsed: 0.076 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testSmallBatchSize() Time elapsed: 0.122 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testTagLocationAndDuplicateUpdate() Time elapsed: 0.16 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testTotalGetsBatching() Time elapsed: 1.771 s <<< ERROR! java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] TestHBaseIndex.testTotalPutsBatching() Time elapsed: 0.082 s <<< ERROR! 
java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V 34206 [Thread-260] WARN org.apache.hadoop.hdfs.server.datanode.DirectoryScanner - DirectoryScanner: shutdown has been called 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to localhost/127.0.0.1:55924] WARN org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager - IncrementalBlockReportManager interrupted 34240 [BP-1058834949-10.0.0.2-1597189606506 heartbeating to localhost/127.0.0.1:55924] WARN org.apache.hadoop.hdfs.server.datanode.DataNode - Ending block pool service for: Block pool BP-1058834949-10.0.0.2-1597189606506 (Datanode Uuid cb7bd8aa-5d79-4955-b1ec-bdaf7f1b6431) service to localhost/127.0.0.1:55924 34246 [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data1/current/BP-1058834949-10.0.0.2-1597189606506] WARN org.apache.hadoop.fs.CachingGetSpaceUsed - Thread Interrupted waiting to refresh disk information: sleep interrupted 34247 [refreshUsed-/private/var/folders/98/mxq3vc_n6l5728rf1wmcwrqs52lpwg/T/temp1791820148926982977/dfs/data/data2/current/BP-1058834949-10.0.0.2-1597189606506] WARN org.apache.hadoop.fs.CachingGetSpaceUsed - Thread Interrupted waiting to refresh disk information: sleep interrupted 37192 [HBase-Metrics2-1] WARN org.apache.hadoop.metrics2.impl.MetricsConfig - Cannot locate configuration: tried hadoop-metrics2-datanode.properties,hadoop-metrics2.properties 43904 [master/iad1-ws-cor-r12:0:becomeActiveMaster-SendThread(localhost:58768)] WARN org.apache.zookeeper.ClientCnxn - Session 0x173dfeb0c8b0004 for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) at 
org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) [INFO] [INFO] Results: [INFO] [ERROR] Errors: [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [ERROR] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V [INFO] [ERROR] Tests run: 10, Failures: 0, Errors: 7, Skipped: 0 [INFO] {code} Basically, Hudi and its dependency Javalin currently depend on Jetty 9.4.x, while HBase depends on Jetty 9.3.x, and the two have incompatible APIs that cannot be easily reconciled. -- This message was sent by Atlassian Jira (v8.3.4#803005)
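A common way to resolve this kind of split is to exclude the conflicting Jetty artifacts from the HBase dependency and let a single pinned version win, or to shade/relocate Jetty entirely. A hedged pom.xml sketch of the exclusion approach follows; the choice of hbase-server as the artifact and the wildcard exclusion are illustrative assumptions, and the actual set of transitive Jetty modules would need to be confirmed with `mvn dependency:tree`:

```xml
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>2.3.3</version>
  <exclusions>
    <!-- Drop HBase's Jetty 9.3.x so the project-wide 9.4.x is used. -->
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Exclusion only works if HBase does not invoke the removed 9.3.x-specific APIs at runtime; if it does, shading/relocating one of the Jetty copies is the safer route.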
[jira] [Assigned] (HUDI-1021) [Bug] Unable to update bootstrapped table using rows from the written bootstrapped table
[ https://issues.apache.org/jira/browse/HUDI-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding reassigned HUDI-1021: -- Assignee: Wenning Ding (was: Balaji Varadarajan) > [Bug] Unable to update bootstrapped table using rows from the written > bootstrapped table > > > Key: HUDI-1021 > URL: https://issues.apache.org/jira/browse/HUDI-1021 > Project: Apache Hudi > Issue Type: Sub-task > Components: bootstrap >Reporter: Udit Mehrotra >Assignee: Wenning Ding >Priority: Blocker > Fix For: 0.6.0 > > > Reproduction Steps: > > {code:java} > import spark.implicits._ > import org.apache.hudi.DataSourceWriteOptions > import org.apache.hudi.DataSourceReadOptions > import org.apache.hudi.config.HoodieWriteConfig > import org.apache.hudi.HoodieDataSourceHelpers > import org.apache.hudi.common.model.HoodieTableType > import org.apache.spark.sql.SaveMode > val sourcePath = > "s3://uditme-iad/hudi/tables/events/events_data_partitioned_non_null" > val sourceDf = spark.read.parquet(sourcePath + "/*") > var tableName = "events_data_partitioned_non_null_00" > var tablePath = "s3://emr-users/uditme/hudi/tables/events/" + tableName > sourceDf.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .mode(SaveMode.Overwrite) > .save(tablePath) > val readDf = spark.read.format("org.apache.hudi").load(tablePath + "/*") > val updateDf = readDf.filter($"event_id" === "106") > .withColumn("event_name", lit("udit_event_106")) > > updateDf.write.format("org.apache.hudi") > .option(HoodieWriteConfig.TABLE_NAME, 
tableName) > .option(DataSourceWriteOptions.OPERATION_OPT_KEY, > DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) > .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, > DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL) > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") > .mode(SaveMode.Append) > .save(tablePath) > {code} > > Full Stack trace: > {noformat} > Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting > bucketType UPDATE for partition :0 > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:276) > at > org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359) > at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155) > at 
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:308) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:310) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.
[jira] [Created] (HUDI-1061) Hudi CLI savepoint command fail because of spark conf loading issue
Wenning Ding created HUDI-1061: -- Summary: Hudi CLI savepoint command fail because of spark conf loading issue Key: HUDI-1061 URL: https://issues.apache.org/jira/browse/HUDI-1061 Project: Apache Hudi Issue Type: Bug Components: CLI Reporter: Wenning Ding h3. Reproduce Open hudi-cli.sh and run these two commands: {code:java} connect --path s3://wenningd-emr-dev/hudi/tables/events/hudi_null01 savepoint create --commit 2019115109 {code} You would see this error: {code:java} java.io.FileNotFoundException: File file:/tmp/spark-events does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:640) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:866) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:630) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:452) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:97) at org.apache.spark.SparkContext.(SparkContext.scala:523) at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) at org.apache.hudi.cli.utils.SparkUtil.initJavaSparkConf(SparkUtil.java:85) at org.apache.hudi.cli.commands.SavepointsCommand.savepoint(SavepointsCommand.java:79) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216) at org.springframework.shell.core.SimpleExecutionStrategy.invoke(SimpleExecutionStrategy.java:68) at org.springframework.shell.core.SimpleExecutionStrategy.execute(SimpleExecutionStrategy.java:59) at org.springframework.shell.core.AbstractShell.executeCommand(AbstractShell.java:134) at 
org.springframework.shell.core.JLineShell.promptLoop(JLineShell.java:533) at org.springframework.shell.core.JLineShell.run(JLineShell.java:179) at java.lang.Thread.run(Thread.java:748){code} Although {{spark-defaults.conf}} configures {{spark.eventLog.dir hdfs:///var/log/spark/apps}}, the Hudi CLI still uses {{file:/tmp/spark-events}} as the event log dir, which means the SparkContext won't load the configs from {{spark-defaults.conf}}. We should make the initJavaSparkConf method able to read configs from the Spark config file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
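The parsing side of the proposed fix is straightforward: spark-defaults.conf is a list of whitespace-separated key/value lines with '#' comments. A minimal, Spark-free sketch (the class name is hypothetical; the real fix would feed these entries into the SparkConf before the JavaSparkContext is constructed):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SparkDefaultsSketch {
    // Parse spark-defaults.conf style lines ("key<whitespace>value"),
    // skipping blank lines and '#' comments.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
            // Split on the first run of whitespace; the rest is the value.
            String[] kv = trimmed.split("\\s+", 2);
            if (kv.length == 2) conf.put(kv[0], kv[1].trim());
        }
        return conf;
    }
}
```

With entries loaded this way, {{spark.eventLog.dir}} would resolve to {{hdfs:///var/log/spark/apps}} instead of the {{file:/tmp/spark-events}} default that triggers the FileNotFoundException above.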
[jira] [Resolved] (HUDI-949) Test MOR : Hive Realtime Query with metadata bootstrap
[ https://issues.apache.org/jira/browse/HUDI-949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding resolved HUDI-949. --- Resolution: Fixed > Test MOR : Hive Realtime Query with metadata bootstrap > -- > > Key: HUDI-949 > URL: https://issues.apache.org/jira/browse/HUDI-949 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Balaji Varadarajan >Assignee: Wenning Ding >Priority: Major > Time Spent: 72h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (HUDI-950) Test COW : Spark SQL Read Optimized Query with metadata bootstrap
[ https://issues.apache.org/jira/browse/HUDI-950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding resolved HUDI-950. --- Resolution: Fixed > Test COW : Spark SQL Read Optimized Query with metadata bootstrap > - > > Key: HUDI-950 > URL: https://issues.apache.org/jira/browse/HUDI-950 > Project: Apache Hudi > Issue Type: Sub-task > Components: Spark Integration >Reporter: Balaji Varadarajan >Assignee: Wenning Ding >Priority: Major > Time Spent: 72h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name
Wenning Ding created HUDI-971: - Summary: Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name Key: HUDI-971 URL: https://issues.apache.org/jira/browse/HUDI-971 Project: Apache Hudi Issue Type: Sub-task Reporter: Wenning Ding When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return unclean partition names because of [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768]. -- This message was sent by Atlassian Jira (v8.3.4#803005)
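The linked CellUtil code hands back the cell's shared backing array rather than just the key's bytes, so callers can see surrounding data from neighboring cells. The defensive fix is to copy exactly the (offset, length) slice before turning it into a partition name. A minimal sketch, assuming the offset and length are available from the cell (names are hypothetical):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CleanKeyCopySketch {
    // Copy only the [offset, offset + length) slice of the shared backing
    // array, so bytes from adjacent cells never leak into the name.
    static String cleanPartitionName(byte[] backing, int offset, int length) {
        byte[] slice = Arrays.copyOfRange(backing, offset, offset + length);
        return new String(slice, StandardCharsets.UTF_8);
    }
}
```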
[jira] [Created] (HUDI-934) Hive query does not work with realtime table which contain decimal type
Wenning Ding created HUDI-934: - Summary: Hive query does not work with realtime table which contain decimal type Key: HUDI-934 URL: https://issues.apache.org/jira/browse/HUDI-934 Project: Apache Hudi Issue Type: Bug Reporter: Wenning Ding h3. Issue After updating a MOR table that contains a decimal type, a Hive query would fail because of a type cast exception. The bug appears to be that the decimal type is not correctly handled here: https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/AbstractRealtimeRecordReader.java#L303 h3. Reproduction steps Create a MOR table with decimal type {code:java} import org.apache.spark.sql.types._ import spark.implicits._ import org.apache.hudi.DataSourceWriteOptions import org.apache.hudi.common.model.HoodieTableType import org.apache.hudi.config.HoodieWriteConfig import org.apache.spark.sql.SaveMode import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType} import org.apache.spark.sql.Row import org.apache.spark.sql.functions import java.util.Date import org.apache.spark.sql.DataFrame var df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", 1.32, "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", 2.57, "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", 3.45, "type1"), (105, "event_name_678", "2015-01-01T13:51:42.248818Z", 6.78, "type2") ).toDF("event_id", "event_name", "event_ts", "event_value", "event_type") df = df.withColumn("event_value", df.col("event_value").cast(DecimalType(4,2))) val tableType = HoodieTableType.MERGE_ON_READ.name val tableName = "test8" val tablePath = "s3://xxx/hudi/tables/" + tableName + "/" df.write.format("org.apache.hudi") .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, tableType) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option("hoodie.compact.inline", "false") .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default") .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor") .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator") .mode(SaveMode.Overwrite) .save(tablePath) {code} Update this table w/o inline compaction: {code:java} var update_df = Seq( (100, "event_name_16", "2015-01-01T13:51:39.340396Z", 9.00, "type1"), (101, "event_name_546", "2015-01-01T12:14:58.597216Z", 2.57, "type2"), (104, "event_name_123", "2015-01-01T12:15:00.512679Z", 8.00, "type1"), (105, "event_name_678", "2015-01-01T13:51:42.248818Z", 6.78, "type2") ).toDF("event_id", "event_name", "event_ts", "event_value", "event_type") update_df = update_df.withColumn("event_value", update_df.col("event_value").cast(DecimalType(4,2))) update_df.write.format("org.apache.hudi") .option(HoodieWriteConfig.TABLE_NAME, tableName) .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL) .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, tableType) .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id") .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts") .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") .option("hoodie.compact.inline", "false") .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true") 
.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true") .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName) .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, "default") .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type") .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor") .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, "org.apache.hudi.keygen.ComplexKeyGenerator") .mode(SaveMode.Append) .save(tablePath) {code} Query _rt table with hive {code:java} hive> select * from test8_rt; {code} Get this error {code:java} Failed with
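For context on what a correct reader must do at the linked line: an Avro decimal logical type carries the unscaled value as two's-complement big-endian bytes, while the scale lives only in the schema. A minimal, generic sketch of that reconstruction (not the actual Hudi/Hive conversion code; class and method names are illustrative):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class AvroDecimalSketch {
    // Rebuild a BigDecimal from an Avro decimal payload: the bytes are the
    // two's-complement big-endian unscaled value; the scale must be read
    // from the Avro schema, not from the bytes themselves.
    static BigDecimal decode(byte[] unscaledBytes, int scale) {
        return new BigDecimal(new BigInteger(unscaledBytes), scale);
    }
}
```

Dropping or misreading the schema scale at this step is exactly the kind of mismatch that can surface later as a Hive-side cast exception.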
[jira] [Updated] (HUDI-713) Datasource Writer throws error on resolving array of struct fields
[ https://issues.apache.org/jira/browse/HUDI-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-713:
--
Description:
Similar to [https://issues.apache.org/jira/browse/HUDI-530]. With the migration of Hudi to Spark 2.4.4 and its use of Spark's native spark-avro module, this issue now exists in Hudi master.

Reproduction steps: run the following script
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val sample = """
[{
  "partition": 0, "offset": 5, "timestamp": "1581508884",
  "value": { "prop1": "val1", "prop2": [{"withinProp1": "val2", "withinProp2": 1}] }
}, {
  "partition": 1, "offset": 10, "timestamp": "1581108884",
  "value": { "prop1": "val4", "prop2": [{"withinProp1": "val5", "withinProp2": 2}] }
}]
"""

val df = spark.read.option("dropFieldIfAllNull", "true").json(Seq(sample).toDS)
val dfcol1 = df.withColumn("op_ts", from_unixtime(col("timestamp")))
val dfcol2 = dfcol1.withColumn("year_partition", year(col("op_ts"))).withColumn("id", concat($"partition", lit("-"), $"offset"))
val dfcol3 = dfcol2.drop("timestamp")

val hudiOptions: Map[String, String] = Map[String, String](
  DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "test",
  DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL,
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "op_ts",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
  "hoodie.parquet.max.file.size" -> String.valueOf(1024 * 1024 * 1024),
  "hoodie.parquet.compression.ratio" -> String.valueOf(0.5),
  "hoodie.insert.shuffle.parallelism" -> String.valueOf(2)
)

dfcol3.write.format("org.apache.hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "year_partition")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year_partition")
  .option(HoodieWriteConfig.TABLE_NAME, "AWS_TEST")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "AWS_TEST")
  .mode(SaveMode.Append).save("s3://xxx/AWS_TEST/")
{code}

This throws a "not in union" exception:
{code:java}
Caused by: org.apache.avro.UnresolvedUnionException: Not in union [{"type":"record","name":"prop2","namespace":"hoodie.AWS_TEST.AWS_TEST_record.value","fields":[{"name":"withinProp1","type":["string","null"]},{"name":"withinProp2","type":["long","null"]}]},"null"]: {"withinProp1": "val2", "withinProp2": 1}
{code}
[jira] [Created] (HUDI-713) Datasource Writer throws error on resolving array of struct fields
Wenning Ding created HUDI-713:
-
Summary: Datasource Writer throws error on resolving array of struct fields
Key: HUDI-713
URL: https://issues.apache.org/jira/browse/HUDI-713
Project: Apache Hudi (incubating)
Issue Type: Bug
Reporter: Wenning Ding

Similar to [https://issues.apache.org/jira/browse/HUDI-530]. With the migration of Hudi to Spark 2.4.4 and its use of Spark's native spark-avro module, this issue now exists in Hudi master.

Reproduction steps: run the following script
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val sample = """
[{
  "partition": 0, "offset": 5, "timestamp": "1581508884",
  "value": { "prop1": "val1", "prop2": [{"withinProp1": "val2", "withinProp2": 1}] }
}, {
  "partition": 1, "offset": 10, "timestamp": "1581108884",
  "value": { "prop1": "val4", "prop2": [{"withinProp1": "val5", "withinProp2": 2}] }
}]
"""

val df = spark.read.option("dropFieldIfAllNull", "true").json(Seq(sample).toDS)
val dfcol1 = df.withColumn("op_ts", from_unixtime(col("timestamp")))
val dfcol2 = dfcol1.withColumn("year_partition", year(col("op_ts"))).withColumn("id", concat($"partition", lit("-"), $"offset"))
val dfcol3 = dfcol2.drop("timestamp")

val hudiOptions: Map[String, String] = Map[String, String](
  DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "test",
  DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL,
  DataSourceWriteOptions.OPERATION_OPT_KEY -> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "op_ts",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName,
  "hoodie.parquet.max.file.size" -> String.valueOf(1024 * 1024 * 1024),
  "hoodie.parquet.compression.ratio" -> String.valueOf(0.5),
  "hoodie.insert.shuffle.parallelism" -> String.valueOf(2)
)

dfcol3.write.format("org.apache.hudi")
  .options(hudiOptions)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "year_partition")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "year_partition")
  .option(HoodieWriteConfig.TABLE_NAME, "AWS_TEST")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "AWS_TEST")
  .mode(SaveMode.Append).save("s3://xxx/AWS_TEST/")
{code}

This throws a "not in union" exception:
{code:java}
Caused by: org.apache.avro.UnresolvedUnionException: Not in union [{"type":"record","name":"prop2","namespace":"hoodie.AWS_TEST.AWS_TEST_record.value","fields":[{"name":"withinProp1","type":["string","null"]},{"name":"withinProp2","type":["long","null"]}]},"null"]: {"withinProp1": "val2", "withinProp2": 1}
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
[ https://issues.apache.org/jira/browse/HUDI-515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17027742#comment-17027742 ]

Wenning Ding commented on HUDI-515:
---
[~nishith29] Thanks for your advice. I realized that just copying the code may not be a good option. Using Java reflection to make it compatible with Hive 2 & 3 looks like a better way.

> Resolve API conflict for Hive 2 and Hive 3
> --
>
> Key: HUDI-515
> URL: https://issues.apache.org/jira/browse/HUDI-515
> Project: Apache Hudi (incubating)
> Issue Type: Sub-task
> Reporter: Wenning Ding
> Priority: Major
>
> Currently I am working on supporting HIVE 3. There is an API issue.
> In *HoodieCombineHiveInputFormat.java*, it calls a Hive method:
> HiveFileFormatUtils.getPartitionDescFromPathRecursively(). But this method is
> removed in Hive 3 and replaced by
> HiveFileFormatUtils.getFromPathRecursively().
> Ideally, Hudi should support both Hive 2 & Hive 3 so that both
> {code:java}
> mvn clean install{code}
> and
> {code:java}
> mvn clean install -Dhive.version=3.x{code}
> could work.
> One solution is to directly copy source code from
> [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java]
> and put that code inside *HoodieCombineHiveInputFormat.java*.
> The other way is using java reflection to decide which method to use.
> (preferred)
[jira] [Updated] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
[ https://issues.apache.org/jira/browse/HUDI-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-515:
--
Description:
Currently I am working on supporting Hive 3. There is an API issue.

In *HoodieCombineHiveInputFormat.java*, it calls a Hive method, HiveFileFormatUtils.getPartitionDescFromPathRecursively(), but this method is removed in Hive 3 and replaced by HiveFileFormatUtils.getFromPathRecursively().

Ideally, Hudi should support both Hive 2 and Hive 3 so that both
{code:java}
mvn clean install{code}
and
{code:java}
mvn clean install -Dhive.version=3.x{code}
work.

One solution is to directly copy the source code of [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java] into *HoodieCombineHiveInputFormat.java*. The other way is to use Java reflection to decide which method to use (preferred).
[jira] [Updated] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
[ https://issues.apache.org/jira/browse/HUDI-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-515:
--
Description:
Currently I am working on supporting Hive 3. There is an API issue.

In *HoodieCombineHiveInputFormat.java*, it calls a Hive method, HiveFileFormatUtils.getPartitionDescFromPathRecursively(), but this method is removed in Hive 3 and replaced by HiveFileFormatUtils.getFromPathRecursively().

Ideally, Hudi should support both Hive 2 and Hive 3 so that both
{code:java}
mvn clean install{code}
and
{code:java}
mvn clean install -Dhive.version=3.x{code}
work.

One solution is to directly copy the source code of [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java] into *HoodieCombineHiveInputFormat.java*. The other way is to use Java reflection to decide which method to use (preferred).
[jira] [Updated] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
[ https://issues.apache.org/jira/browse/HUDI-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-515:
--
Description:
Currently I am working on supporting Hive 3. There is an API issue.

In *HoodieCombineHiveInputFormat.java*, it calls a Hive method, HiveFileFormatUtils.getPartitionDescFromPathRecursively(), but this method is removed in Hive 3 and replaced by HiveFileFormatUtils.getFromPathRecursively().

Ideally, Hudi should support both Hive 2 and Hive 3 so that both
{code:java}
mvn clean install{code}
and
{code:java}
mvn clean install -Dhive.version=3.x{code}
work.

One solution is to directly copy the source code of [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java] into *HoodieCombineHiveInputFormat.java*.
[jira] [Created] (HUDI-515) Resolve API conflict for Hive 2 and Hive 3
Wenning Ding created HUDI-515:
-
Summary: Resolve API conflict for Hive 2 and Hive 3
Key: HUDI-515
URL: https://issues.apache.org/jira/browse/HUDI-515
Project: Apache Hudi (incubating)
Issue Type: Sub-task
Reporter: Wenning Ding

Currently I am working on supporting Hive 3. There is an API issue.

In [HoodieCombineHiveInputFormat.java|https://code.amazon.com/packages/Aws157Hudi/blobs/a6517d4bfb2584a983342297d350bc9f6d0f38db/--/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java], it calls a Hive method, HiveFileFormatUtils.getPartitionDescFromPathRecursively(), but this method is removed in Hive 3 and replaced by HiveFileFormatUtils.getFromPathRecursively().

Ideally, Hudi should support both Hive 2 and Hive 3 so that both
{code:java}
mvn clean install{code}
and
{code:java}
mvn clean install -Dhive.version=3.x{code}
work.

One solution is to directly copy the source code of [HiveFileFormatUtils.getFromPathRecursively()|https://github.com/apache/hive/blob/release-3.1.2-rc0/ql/src/java/org/apache/hadoop/hive/ql/io/HiveFileFormatUtils.java] into [HoodieCombineHiveInputFormat.java|https://code.amazon.com/packages/Aws157Hudi/blobs/a6517d4bfb2584a983342297d350bc9f6d0f38db/--/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java].
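The preferred reflection approach can be sketched generically. The helper below is hypothetical (not actual Hudi code): it resolves the first method name that exists on a class, so the Hive 3 name can be tried before the Hive 2 one, and the same jar works against either Hive version.

```java
import java.lang.reflect.Method;

public class HiveCompatSketch {
    // Return the first public method on cls whose name matches one of the
    // candidates, e.g. "getFromPathRecursively" (Hive 3) before
    // "getPartitionDescFromPathRecursively" (Hive 2).
    static Method resolveFirst(Class<?> cls, String... candidates) {
        for (String name : candidates) {
            for (Method m : cls.getMethods()) {
                if (m.getName().equals(name)) {
                    return m;
                }
            }
        }
        throw new IllegalStateException("No compatible method on " + cls.getName());
    }
}
```

In practice the resolved Method would be cached once at initialization and reused via invoke(), so the reflection lookup cost is paid only once.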
[jira] [Created] (HUDI-495) Update deprecated HBase API
Wenning Ding created HUDI-495:
-
Summary: Update deprecated HBase API
Key: HUDI-495
URL: https://issues.apache.org/jira/browse/HUDI-495
Project: Apache Hudi (incubating)
Issue Type: Improvement
Reporter: Wenning Ding

Internally we are using HBase 2.x, and HBase 2.x no longer supports _*HTable.flushCommits()*_; it is replaced by _*BufferedMutator.flush()*_. Thus, for the put and delete operations we can use BufferedMutator instead.
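A sketch of the replacement, assuming the standard HBase 2.x client API; the connection setup, class name, and table/column arguments here are illustrative, not the actual Hudi index code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;

public class BufferedMutatorSketch {
    // HBase 2.x: buffer mutations through a BufferedMutator and call flush(),
    // which takes over the role of the removed HTable.flushCommits().
    static void putAndFlush(Configuration conf, String table, byte[] row,
                            byte[] family, byte[] qualifier, byte[] value) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator = conn.getBufferedMutator(TableName.valueOf(table))) {
            Put put = new Put(row);
            put.addColumn(family, qualifier, value);
            mutator.mutate(put); // Delete mutations go through the same mutate() call
            mutator.flush();     // replaces HTable.flushCommits()
        }
    }
}
```

Because BufferedMutator batches on the client side, explicit flush() calls are only needed where Hudi previously relied on flushCommits() for durability before reading back.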
[jira] [Commented] (HUDI-259) Hadoop 3 support for Hudi writing
[ https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998480#comment-16998480 ]

Wenning Ding commented on HUDI-259:
---
Hey [~Pratyaksh], I am also working on Hadoop 3 support for Hudi. After moving to Hadoop 3.x and Hive 3.x, the unit tests for the hudi-hive module fail when they try to start the Hive metastore and HiveServer2. Are you facing the same issue?

> Hadoop 3 support for Hudi writing
> -
>
> Key: HUDI-259
> URL: https://issues.apache.org/jira/browse/HUDI-259
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Usability
> Reporter: Vinoth Chandar
> Assignee: Pratyaksh Sharma
> Priority: Major
>
> Sample issues
>
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568]
> [https://github.com/apache/incubator-hudi/issues/898]
[jira] [Created] (HUDI-353) Add support for Hive style partitioning path
Wenning Ding created HUDI-353:
-
Summary: Add support for Hive style partitioning path
Key: HUDI-353
URL: https://issues.apache.org/jira/browse/HUDI-353
Project: Apache Hudi (incubating)
Issue Type: Improvement
Reporter: Wenning Ding

In Hive, a partition folder name follows the format {{<partition_column>=<partition_value>}}, but in Hudi the partition folder name is just {{<partition_value>}}.

E.g. a dataset is partitioned by three columns: year, month and day. In Hive, the data is saved in:
{{.../year=2019/month=05/day=01/xxx.parquet}}
In Hudi, the data is saved in:
{{.../2019/05/01/xxx.parquet}}

Basically I add a new option in the Spark datasource named {{HIVE_STYLE_PARTITIONING_OPT_KEY}} which indicates whether to use Hive-style partitioning. By default this option is false (disabled). Also, when Hive-style partitioning is used, instead of scanning the dataset and manually adding/updating all partitions, we can use "MSCK REPAIR TABLE <table_name>" to automatically sync all the partition info with the Hive MetaStore.
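The difference between the two layouts can be sketched with a small helper. This is purely illustrative (Hudi's actual key generators are more involved and handle encoding, defaults, and nested fields):

```java
public class PartitionPathSketch {
    // Build a partition path from column names and values: Hive-style emits
    // "column=value" segments, plain style emits just the values.
    static String partitionPath(boolean hiveStyle, String[] columns, String[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) sb.append('/');
            if (hiveStyle) sb.append(columns[i]).append('=');
            sb.append(values[i]);
        }
        return sb.toString();
    }
}
```

With the Hive-style layout, Hive can parse the partition columns straight out of the folder names, which is what makes "MSCK REPAIR TABLE" able to discover partitions automatically.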
[jira] [Updated] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
[ https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenning Ding updated HUDI-325:
--
Description:

h3. Description

While doing internal testing in EMR, we found that if the Hudi table path follows the format hdfs:///user/... or hdfs:/user/..., then the Hudi table cannot be queried by Hive after an update.

h3. Reproduction

{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_test"
var tablePath = "hdfs:///user/hadoop/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update hudi dataset
val df2 = Seq(
  (100, "event_name_1", "2015-01-01T13:51:39.340396Z", "type1"),
  (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
).toDF("event_id", "event_name", "event_ts", "event_type")

df2.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}

Then query in Hive:
{code:java}
select count(*) from hudi_test;
{code}

It returns:
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputI
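The mismatch behind this error is that a fully qualified split path (hdfs://host:port/...) is looked up against a pathToPartitionInfo key written as hdfs:/... . The two spellings name the same filesystem path but compare unequal as strings, which plain java.net.URI makes visible; this is only an illustration of the spellings, not Hive's or Hudi's actual lookup code:

```java
import java.net.URI;

public class HdfsPathSketch {
    // Both spellings resolve to the same scheme and path component, even
    // though their raw string forms differ.
    static boolean samePath(String a, String b) {
        URI ua = URI.create(a);
        URI ub = URI.create(b);
        return ua.getScheme().equals(ub.getScheme()) && ua.getPath().equals(ub.getPath());
    }
}
```

A fix therefore needs to normalize (fully qualify) table paths before they are used as lookup keys, rather than comparing raw strings.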
[jira] [Updated] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
[ https://issues.apache.org/jira/browse/HUDI-325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-325: -- Description: h3. Description While doing internal testing on EMR, we found that if the Hudi table path follows a format like hdfs:///user/... or hdfs:/user/..., the Hudi table can no longer be queried by Hive after an update. h3. Reproduction
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_test"
var tablePath = "hdfs:///user/hadoop/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update hudi dataset
val df2 = Seq(
  (100, "event_name_1", "2015-01-01T13:51:39.340396Z", "type1"),
  (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
).toDF("event_id", "event_name", "event_ts", "event_type")

df2.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}
Then query in Hive:
{code:java}
select count(*) from hudi_test;
{code}
It returns:
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
at org.apache.tez.dag.app.dag.RootInputI
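The exception above boils down to a string-keyed lookup: the pathToPartitionInfo map holds the partition directory in its scheme-only form (hdfs:/user/...), while the file split carries the fully qualified form (hdfs://host:8020/user/...). A minimal sketch of the mismatch, using plain java.net.URI and hypothetical names (this is not the actual Hive code):

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class PathKeyMismatch {
    public static void main(String[] args) {
        // Hypothetical stand-in for Hive's pathToPartitionInfo, keyed by the
        // scheme-only path string that the table was registered with.
        Map<String, String> pathToPartitionInfo = new HashMap<>();
        pathToPartitionInfo.put("hdfs:/user/hadoop/hudi_test/type1", "partition=type1");

        // The split reports a fully qualified path, so a plain string lookup misses.
        String splitDir = "hdfs://namenode:8020/user/hadoop/hudi_test/type1";
        System.out.println(pathToPartitionInfo.containsKey(splitDir)); // false

        // Comparing only the path component makes the two forms agree again.
        String a = URI.create(splitDir).getPath();
        String b = URI.create("hdfs:/user/hadoop/hudi_test/type1").getPath();
        System.out.println(a.equals(b)); // true
    }
}
```

The sketch only shows why the two raw strings disagree; a real fix would normalize both sides to the same fully qualified form before the lookup.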
[jira] [Created] (HUDI-325) Unable to query by Hive after updating HDFS Hudi table
Wenning Ding created HUDI-325: - Summary: Unable to query by Hive after updating HDFS Hudi table Key: HUDI-325 URL: https://issues.apache.org/jira/browse/HUDI-325 Project: Apache Hudi (incubating) Issue Type: Bug Reporter: Wenning Ding h3. Description While doing internal testing on EMR, we found that if the Hudi table path follows a format like hdfs:///user/... or hdfs:/user/..., the Hudi table can no longer be queried by Hive after an update. h3. Reproduction
{code:java}
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.spark.sql.SaveMode

val df = Seq(
  (100, "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
  (101, "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
  (104, "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
  (105, "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
).toDF("event_id", "event_name", "event_ts", "event_type")

var tableName = "hudi_test"
var tablePath = "hdfs:///user/hadoop/" + tableName

// write hudi dataset
df.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Overwrite)
  .save(tablePath)

// update hudi dataset
val df2 = Seq(
  (100, "event_name_1", "2015-01-01T13:51:39.340396Z", "type1"),
  (107, "event_name_578", "2015-01-01T13:51:42.248818Z", "type3")
).toDF("event_id", "event_name", "event_ts", "event_type")

df2.write.format("org.apache.hudi")
  .option("hoodie.upsert.shuffle.parallelism", "2")
  .option(HoodieWriteConfig.TABLE_NAME, tableName)
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
  .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, "org.apache.hudi.hive.MultiPartKeysValueExtractor")
  .mode(SaveMode.Append)
  .save(tablePath)
{code}
Then query in Hive:
{code:java}
select count(*) from hudi_test;
{code}
It returns:
{code:java}
java.io.IOException: cannot find dir = hdfs://ip-172-30-6-236.ec2.internal:8020/user/hadoop/elb_logs_hudi_cow_8/2015-01-01/cb7531ac-dadf-4118-b722-55cb34bc66f2-0_34-7-336_20191104223321.parquet in pathToPartitionInfo: [hdfs:/user/hadoop/elb_logs_hudi_cow_8/2015-01-01]
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:394)
at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getPartitionDescFromPathRecursively(HiveFileFormatUtils.java:357)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:284)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:184)
at org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:161)
at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:207)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
at org.apac
[jira] [Resolved] (HUDI-313) Unable to SELECT COUNT(*) from a MOR realtime table
[ https://issues.apache.org/jira/browse/HUDI-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding resolved HUDI-313. --- Resolution: Fixed
> Unable to SELECT COUNT(*) from a MOR realtime table
> ---
>
> Key: HUDI-313
> URL: https://issues.apache.org/jira/browse/HUDI-313
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Wenning Ding
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When I run a query like this in Hive:
> {code:java}
> SELECT COUNT(*) FROM hudi_test_rt;
> OR:
> SELECT COUNT(1) FROM hudi_test_rt;{code}
> It returns:
> {code:java}
> 2019-10-21 17:38:27,895 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.RuntimeException: java.io.IOException: java.lang.NumberFormatException: For input string: ""
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
> at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
> at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
> at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
> at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
> at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
> at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
> at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: java.lang.NumberFormatException: For input string: ""
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
> at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
> ... 19 more
> Caused by: java.lang.NumberFormatException: For input string: ""
> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:592)
> at java.lang.Integer.parseInt(Integer.java:615)
> at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.getReadColumnIDs(ColumnProjectionUtils.java:186)
> at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:377)
> at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
> at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
> at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
> at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:197)
> at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:222)
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
> ... 20 more
> {code}
> Some investigations:
> Basically, Hive tries to update the projection column IDs during each getRecordReader stage. But for CO
[jira] [Resolved] (HUDI-314) Unable to query a multi-partitions MOR realtime table
[ https://issues.apache.org/jira/browse/HUDI-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding resolved HUDI-314. --- Resolution: Fixed
> Unable to query a multi-partitions MOR realtime table
> -
>
> Key: HUDI-314
> URL: https://issues.apache.org/jira/browse/HUDI-314
> Project: Apache Hudi (incubating)
> Issue Type: Bug
> Reporter: Wenning Ding
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> h3. Description
> I created a Hudi MOR table with multiple partition keys. The partition keys are "year", "month" and "day".
> When I try to query its realtime table in Hive like this:
> {code:java}
> SELECT * FROM hudi_multi_partitions_test_rt;
> {code}
> It returns:
> {code:java}
> java.lang.Exception: java.io.IOException: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
> at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
> Caused by: java.io.IOException: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97) ~[hive-exec-2.3.3.jar:2.3.3]
> at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57) ~[hive-exec-2.3.3.jar:2.3.3]
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379) ~[hive-exec-2.3.3.jar:2.3.3]
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
> at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
> at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_212]
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
> Caused by: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
> at org.apache.avro.Schema.validateName(Schema.java:1083) ~[avro-1.7.7.jar:1.7.7]
> at org.apache.avro.Schema.access$200(Schema.java:79) ~[avro-1.7.7.jar:1.7.7]
> at org.apache.avro.Schema$Field.<init>(Schema.java:372) ~[avro-1.7.7.jar:1.7.7]
> at org.apache.avro.Schema$Field.<init>(Schema.java:367) ~[avro-1.7.7.jar:1.7.7]
> at org.apache.hudi.common.util.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:166) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.addPartitionFields(AbstractRealtimeRecordReader.java:305) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:328) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:103) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:233) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
> at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376) ~[hive-exec-2.3.3.jar:2.3.3]
> at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.j
[jira] [Created] (HUDI-314) Unable to query a multi-partitions MOR realtime table
Wenning Ding created HUDI-314: - Summary: Unable to query a multi-partitions MOR realtime table Key: HUDI-314 URL: https://issues.apache.org/jira/browse/HUDI-314 Project: Apache Hudi (incubating) Issue Type: Bug Reporter: Wenning Ding h3. Description I created a Hudi MOR table with multiple partition keys. The partition keys are "year", "month" and "day". When I try to query its realtime table in Hive like this:
{code:java}
SELECT * FROM hudi_multi_partitions_test_rt;
{code}
It returns:
{code:java}
java.lang.Exception: java.io.IOException: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:489) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:549) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
Caused by: java.io.IOException: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_212]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.apache.avro.SchemaParseException: Illegal character in: year/month/day
at org.apache.avro.Schema.validateName(Schema.java:1083) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema.access$200(Schema.java:79) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema$Field.<init>(Schema.java:372) ~[avro-1.7.7.jar:1.7.7]
at org.apache.avro.Schema$Field.<init>(Schema.java:367) ~[avro-1.7.7.jar:1.7.7]
at org.apache.hudi.common.util.HoodieAvroUtils.appendNullSchemaFields(HoodieAvroUtils.java:166) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.addPartitionFields(AbstractRealtimeRecordReader.java:305) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.init(AbstractRealtimeRecordReader.java:328) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.AbstractRealtimeRecordReader.<init>(AbstractRealtimeRecordReader.java:103) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.RealtimeCompactedRecordReader.<init>(RealtimeCompactedRecordReader.java:48) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.constructRecordReader(HoodieRealtimeRecordReader.java:67) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieRealtimeRecordReader.<init>(HoodieRealtimeRecordReader.java:45) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:233) ~[hudi-hadoop-mr-bundle-0.5.1-SNAPSHOT.jar:0.5.1-SNAPSHOT]
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376) ~[hive-exec-2.3.3.jar:2.3.3]
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:169) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:432) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) ~[hadoop-mapreduce-client-core-2.8.4.jar:?]
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:270) ~[hadoop-mapreduce-client-common-2.8.4.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:51
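The root cause visible in the trace is that the combined partition-field string "year/month/day" reaches Avro as a single field name, and Avro's name validation rejects the slashes. The sketch below illustrates that rule and why splitting the configured string first avoids the error; the validator is an approximation of Avro's [A-Za-z_][A-Za-z0-9_]* rule, not the Avro implementation itself:

```java
public class PartitionFieldNames {
    // Approximation of Avro's field-name rule: first char must be a letter
    // or '_', the remaining chars letters, digits, or '_'.
    static boolean isValidAvroName(String name) {
        if (name.isEmpty()) return false;
        char first = name.charAt(0);
        if (!(Character.isLetter(first) || first == '_')) return false;
        for (char c : name.toCharArray()) {
            if (!(Character.isLetterOrDigit(c) || c == '_')) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String configured = "year/month/day";
        // Passing the whole string as one field name fails, as in the stack trace.
        System.out.println(isValidAvroName(configured)); // false
        // Splitting on '/' yields three individually valid field names.
        for (String field : configured.split("/")) {
            System.out.println(field + " -> " + isValidAvroName(field)); // all true
        }
    }
}
```

This is why a multi-partition table trips the check while a single-partition table does not: one field name like "event_type" passes validation on its own.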
[jira] [Updated] (HUDI-313) Unable to SELECT COUNT(*) from a MOR realtime table
[ https://issues.apache.org/jira/browse/HUDI-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-313: -- Description: When I run a query like this in Hive:
{code:java}
SELECT COUNT(*) FROM hudi_test_rt;
OR:
SELECT COUNT(1) FROM hudi_test_rt;{code}
It returns:
{code:java}
2019-10-21 17:38:27,895 [ERROR] [TezChild] |tez.TezProcessor|: java.lang.RuntimeException: java.io.IOException: java.lang.NumberFormatException: For input string: ""
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:206)
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.next(TezGroupedSplitsInputFormat.java:152)
 at org.apache.tez.mapreduce.lib.MRReaderMapred.next(MRReaderMapred.java:116)
 at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.pushRecord(MapRecordSource.java:62)
 at org.apache.hadoop.hive.ql.exec.tez.MapRecordProcessor.run(MapRecordProcessor.java:419)
 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:185)
 at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:168)
 at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
 at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
 at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
 at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
 at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
 at java.util.concurrent.FutureTask.run(FutureTask.java:266)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.NumberFormatException: For input string: ""
 at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderCreationException(HiveIOExceptionHandlerChain.java:97)
 at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderCreationException(HiveIOExceptionHandlerUtil.java:57)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:379)
 at org.apache.hadoop.mapred.split.TezGroupedSplitsInputFormat$TezGroupedSplitsRecordReader.initNextRecordReader(TezGroupedSplitsInputFormat.java:203)
 ... 19 more
Caused by: java.lang.NumberFormatException: For input string: ""
 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Integer.parseInt(Integer.java:592)
 at java.lang.Integer.parseInt(Integer.java:615)
 at org.apache.hadoop.hive.serde2.ColumnProjectionUtils.getReadColumnIDs(ColumnProjectionUtils.java:186)
 at org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(DataWritableReadSupport.java:377)
 at org.apache.hadoop.hive.ql.io.parquet.ParquetRecordReaderBase.getSplit(ParquetRecordReaderBase.java:84)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:75)
 at org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.<init>(ParquetRecordReaderWrapper.java:60)
 at org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:75)
 at org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:197)
 at org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat.getRecordReader(HoodieParquetRealtimeInputFormat.java:222)
 at org.apache.hadoop.hive.ql.io.HiveInputFormat.getRecordReader(HiveInputFormat.java:376)
 ... 20 more
{code}
Some investigations: Hive tries to update the projection column IDs at each getRecordReader stage. For COUNT(*) or COUNT(1), no projection columns are needed, so the column-ID string is empty. Meanwhile, to support compaction on MOR tables, Hudi manually adds its three required columns to the projection column IDs, producing a string like "2,0,3". When Hive then updates the projection column IDs, it concatenates the empty string with Hudi's required column IDs and ends up with ",2,0,3". The leading comma causes the error during parsing. One possible solution is to add a method that checks whether the projection column IDs start with a comma and, if so, removes it.
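The fix idea described above can be sketched as a small standalone helper. This is only an illustration of the approach, not Hudi's actual patch; the class and method names (`ProjectionColumnIdsFix`, `stripLeadingComma`) are hypothetical:

```java
public class ProjectionColumnIdsFix {

    // COUNT(*) / COUNT(1) leave Hive's projection column-ID string empty.
    // Prepending Hudi's required columns ("2,0,3") to "" yields ",2,0,3",
    // and the empty token before the first comma makes
    // Integer.parseInt("") throw NumberFormatException.
    // Hypothetical helper: drop a leading comma before the IDs are parsed.
    static String stripLeadingComma(String columnIds) {
        if (columnIds != null && columnIds.startsWith(",")) {
            return columnIds.substring(1);
        }
        return columnIds;
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingComma(",2,0,3")); // the failing case
        System.out.println(stripLeadingComma("2,0,3"));  // already valid: unchanged
        System.out.println(stripLeadingComma(""));       // empty stays empty
    }
}
```

Such a check would run wherever Hudi merges its required column IDs into Hive's, before `ColumnProjectionUtils.getReadColumnIDs` parses the string.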
[jira] [Updated] (HUDI-313) Unable to SELECT COUNT(*) from a MOR realtime table
[jira] [Updated] (HUDI-313) Unable SELECT COUNT(*) from a MOR realtime table
[jira] [Updated] (HUDI-313) Unable to SELECT COUNT(*) from a MOR realtime table
[ https://issues.apache.org/jira/browse/HUDI-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenning Ding updated HUDI-313: -- Summary: Unable to SELECT COUNT(*) from a MOR realtime table (was: Unable SELECT COUNT(*) from a MOR realtime table) > Unable to SELECT COUNT(*) from a MOR realtime table > --- > > Key: HUDI-313 > URL: https://issues.apache.org/jira/browse/HUDI-313 > Project: Apache Hudi (incubating) > Issue Type: Bug > Reporter: Wenning Ding > Priority: Major
[jira] [Created] (HUDI-313) Unable SELECT COUNT(*) from a MOR realtime table
Wenning Ding created HUDI-313: - Summary: Unable SELECT COUNT(*) from a MOR realtime table Key: HUDI-313 URL: https://issues.apache.org/jira/browse/HUDI-313 Project: Apache Hudi (incubating) Issue Type: Bug Reporter: Wenning Ding
[jira] [Created] (HUDI-301) Failed to update a non-partition MOR table
Wenning Ding created HUDI-301: - Summary: Failed to update a non-partition MOR table Key: HUDI-301 URL: https://issues.apache.org/jira/browse/HUDI-301 Project: Apache Hudi (incubating) Issue Type: Bug Components: Common Core Reporter: Wenning Ding We hit this exception when trying to update a field of a non-partitioned MOR table:
{code:java}
org.apache.hudi.exception.HoodieUpsertException: Error upserting bucketType UPDATE for partition :0
 at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:273)
 at org.apache.hudi.HoodieWriteClient.lambda$upsertRecordsInternal$507693af$1(HoodieWriteClient.java:457)
 at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
 at org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
 at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:853)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
 at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
 at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1182)
 at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
 at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
 at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
 at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:123)
 at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Can not create a Path from an empty string
 at org.apache.hadoop.fs.Path.checkPathArg(Path.java:163)
 at org.apache.hadoop.fs.Path.<init>(Path.java:175)
 at org.apache.hadoop.fs.Path.<init>(Path.java:110)
 at org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:145)
 at org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:194)
 at org.apache.hudi.table.HoodieMergeOnReadTable.handleUpdate(HoodieMergeOnReadTable.java:116)
 at org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpsertPartition(HoodieCopyOnWriteTable.java:265)
 ... 30 more
{code}
I have created a PR to solve this issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
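The "Can not create a Path from an empty string" failure arises because a non-partitioned table has an empty relative partition path, and that empty string is eventually passed to Hadoop's `Path` constructor. A minimal sketch of the kind of guard such a fix needs, using plain strings instead of Hadoop classes; the helper name `partitionDir` is illustrative, not Hudi's actual code:

```java
public class PartitionPathJoin {

    // For a non-partitioned table the relative partition path is "",
    // and constructing a Path from "" throws IllegalArgumentException.
    // Illustrative guard: fall back to the base path when the
    // partition path is empty, instead of joining the empty string.
    static String partitionDir(String basePath, String partitionPath) {
        if (partitionPath == null || partitionPath.isEmpty()) {
            return basePath; // non-partitioned: write directly under the base path
        }
        return basePath + "/" + partitionPath;
    }

    public static void main(String[] args) {
        System.out.println(partitionDir("s3://bucket/table", ""));
        System.out.println(partitionDir("s3://bucket/table", "2019/10/21"));
    }
}
```

The real fix lives where `HoodieAppendHandle.init` builds the log-file directory; the point is only that empty partition paths must be special-cased before a `Path` is constructed.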