[GitHub] [hudi] vinothsiva1989 opened a new issue #2031: [SUPPORT]
vinothsiva1989 opened a new issue #2031: URL: https://github.com/apache/hudi/issues/2031

I am new to Hudi, please help.

**To Reproduce**

Steps to reproduce the behavior:

Step 1: launch spark-shell:

```
spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.12:0.6.0,'org.apache.spark:spark-avro_2.12:3.0.0' \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.hive.convertMetastoreParquet=false'
```

Step 2: imports:

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor
import org.apache.spark.sql._
```

Step 3: create a Hive table:

```sql
create external table hudi_parquet (op string, pk_id int, name string, value int, updated_at timestamp, created_at timestamp)
stored as parquet
location '/user/vinoth.siva/hudi_parquet';
```

Step 4: insert data into the Hive table:

```sql
insert into hudi_parquet values ('I',5,'htc',50,'2020-02-06 18:00:39','2020-02-06 18:00:39');
```

Step 5: load the data into a DataFrame:

```scala
val df = spark.read.parquet("/user/vinoth.siva/hudi_parquet/00_0")
df.show()
```

Step 6: Hudi options:

```scala
val hudiOptions = Map[String, String](
  HoodieWriteConfig.TABLE_NAME -> "my_hudi_table",
  DataSourceWriteOptions.TABLE_TYPE_OPT_KEY -> "COPY_ON_WRITE",
  DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "pk_id",
  DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "created_at",
  DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "updated_at",
  DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
  DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> "my_hudi_table",
  DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "created_at",
  DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> classOf[MultiPartKeysValueExtractor].getName
)
```

Step 7: write the Hudi data:
```scala
df.write.format("org.apache.hudi")
  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
  .options(hudiOptions)
  .mode(SaveMode.Overwrite)
  .save("/user/vinoth.siva/hudi_cow")
```

**Environment Description**

* Hudi version : 0.6.0
* Spark version : 3
* Hive version : 1.2
* Hadoop version : 2.7
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no

**Stacktrace**

```
20/08/25 10:01:21 WARN hudi.HoodieSparkSqlWriter$: hoodie table at /user/vinoth.siva/hudi_cow already exists. Deleting existing data & overwriting with new data.
20/08/25 10:01:22 ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.lang.NoSuchMethodError: 'java.lang.Object org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.fromRow(org.apache.spark.sql.catalyst.InternalRow)'
	at org.apache.hudi.AvroConversionUtils$.$anonfun$createRdd$1(AvroConversionUtils.scala:44)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at scala.collection.Iterator$SliceIterator.next(Iterator.scala:271)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at scala.collection.AbstractIterator.to(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
	at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1423)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2133)
```
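A note on the stack trace above: `NoSuchMethodError` on `ExpressionEncoder.fromRow` is the signature of a Spark-version mismatch — that method existed in Spark 2.4.x but was removed in Spark 3.0, and the Hudi 0.6.0 bundles were built against the Spark 2.4 line. A launch command matching the bundle's Spark line would look like the sketch below; this assumes a Spark 2.4.x installation (Scala 2.11 artifacts), and the exact `spark-avro` patch version is illustrative, not taken from this report.

```shell
# Sketch only: assumes Spark 2.4.x is installed (Spark 2.4 defaults to Scala 2.11,
# hence the _2.11 artifact suffixes). The spark-avro version shown is an assumption.
spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.hive.convertMetastoreParquet=false'
```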
[jira] [Resolved] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangxianghu resolved HUDI-1103.
Resolution: Fixed

> Improve the code format of Delete data demo in Quick-Start Guide
>
> Key: HUDI-1103
> URL: https://issues.apache.org/jira/browse/HUDI-1103
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Docs
> Reporter: wangxianghu
> Assignee: Trevorzhang
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.6.1
>
> Currently, the delete data demo code is not runnable in spark-shell:
> {code:java}
> scala> val df = spark
> df: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@74e7d97b
> scala> .read
> :1: error: illegal start of definition
> .read
> ^
> scala> .json(spark.sparkContext.parallelize(deletes, 2))
> :1: error: illegal start of definition
> .json(spark.sparkContext.parallelize(deletes, 2))
> ^
> {code}
> The dot should be at the end of the line, or a "\" should be added at the end.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183721#comment-17183721 ]

wangxianghu commented on HUDI-1103:
Fixed via asf-site branch: 1d30581ef8ddfa08a56b2a70f29370b927c649ca
[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide
[ https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangxianghu updated HUDI-1103:
Status: Open (was: New)
[jira] [Commented] (HUDI-1214) Need ability to set deltastreamer checkpoints when doing Spark datasource writes
[ https://issues.apache.org/jira/browse/HUDI-1214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183706#comment-17183706 ]

Balaji Varadarajan commented on HUDI-1214:
[~Trevorzhang]: Go for it

> Need ability to set deltastreamer checkpoints when doing Spark datasource writes
>
> Key: HUDI-1214
> URL: https://issues.apache.org/jira/browse/HUDI-1214
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: Balaji Varadarajan
> Priority: Major
> Fix For: 0.6.1
>
> Such support is needed for bootstrapping cases where users use a Spark write to do the initial bootstrap and then subsequently use DeltaStreamer. DeltaStreamer manages checkpoints inside hoodie commit files and expects checkpoints in previously committed metadata. Users are expected to pass a checkpoint or an initial checkpoint provider when bootstrapping through DeltaStreamer. Such support is not present when bootstrapping via the Spark datasource.
[jira] [Commented] (HUDI-1201) HoodieDeltaStreamer: Allow user overrides to read from earliest kafka offset when commit files do not have checkpoint
[ https://issues.apache.org/jira/browse/HUDI-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183707#comment-17183707 ]

Balaji Varadarajan commented on HUDI-1201:
[~Trevorzhang]: Go for it

> HoodieDeltaStreamer: Allow user overrides to read from earliest kafka offset when commit files do not have checkpoint
>
> Key: HUDI-1201
> URL: https://issues.apache.org/jira/browse/HUDI-1201
> Project: Apache Hudi
> Issue Type: Improvement
> Components: DeltaStreamer
> Reporter: Balaji Varadarajan
> Priority: Major
> Fix For: 0.6.1
>
> https://github.com/apache/hudi/issues/1985
>
> It would be easier for the user to simply tell DeltaStreamer to read from the earliest offset instead of implementing -initial-checkpoint-provider or passing raw Kafka checkpoints when the table was initially bootstrapped through spark.write().
[GitHub] [hudi] bvaradar commented on issue #1979: [SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?
bvaradar commented on issue #1979: URL: https://github.com/apache/hudi/issues/1979#issuecomment-679522323

@hughfdjackson: In general, getting an incremental read to discard duplicates is not possible for MOR table types, as we defer the merging of records to compaction. I was thinking about alternate ways to achieve your use case for COW tables using an application-level boolean flag. Let me know if this makes sense:

1. Introduce an additional boolean column "changed", with default value false.
2. Plug in your own implementation of HoodieRecordPayload.
3. (a) In HoodieRecordPayload.getInsertValue(), return an Avro record with changed = true. This function is called the first time the new record is inserted.
   (b) In HoodieRecordPayload.combineAndGetUpdateValue(), if you determine there is no material change, set changed = false; otherwise set it to true.
4. In your incremental query, add the filter changed = true to drop the records without material changes.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
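The flag-based approach sketched in the comment above can be modeled in plain Java. This is a simplified stand-in, not the real `HoodieRecordPayload` interface (which works with Avro records and `Option`-wrapped return values); the `MaterialChangePayload` class name, the map-based record representation, and the `"value"` field are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a custom payload: a record is a field->value map.
// The real HoodieRecordPayload API deals in Avro IndexedRecords.
class MaterialChangePayload {
    private final Map<String, Object> record; // the new incoming record

    MaterialChangePayload(Map<String, Object> record) {
        this.record = record;
    }

    // Called the first time a record is inserted: mark it as changed.
    Map<String, Object> getInsertValue() {
        Map<String, Object> out = new HashMap<>(record);
        out.put("changed", true);
        return out;
    }

    // Called on update against the stored record: only mark "changed"
    // when a business field actually differs (a "material" change).
    Map<String, Object> combineAndGetUpdateValue(Map<String, Object> current) {
        Map<String, Object> out = new HashMap<>(record);
        Object oldValue = current.get("value");
        Object newValue = record.get("value");
        boolean material = (oldValue == null) ? newValue != null : !oldValue.equals(newValue);
        out.put("changed", material);
        return out;
    }
}
```

An incremental query would then filter on `changed = true` to keep only materially changed rows.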
[GitHub] [hudi] bvaradar commented on issue #2007: [SUPPORT] Is Timeline metadata queryable ?
bvaradar commented on issue #2007: URL: https://github.com/apache/hudi/issues/2007#issuecomment-679497286

@ashishmgofficial: If you use DeltaStreamer, it comes with Kafka integration and manages checkpoints internally, so there is no need to query the timeline metadata separately. Do you have any specific requirement where you need to look at the timeline?
[GitHub] [hudi] Trevor-zhang commented on a change in pull request #2021: [HUDI-1218] Introduce BulkInsertSortMode as Independent class
Trevor-zhang commented on a change in pull request #2021: URL: https://github.com/apache/hudi/pull/2021#discussion_r476102080

## File path: hudi-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java

```java
@@ -39,10 +39,4 @@ public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
      throw new HoodieException("The bulk insert mode \"" + sortMode.name() + "\" is not supported.");
```

Review comment:
> `The bulk insert sort mode`? better?

I think so; I have changed it.
[GitHub] [hudi] yanghua commented on a change in pull request #2021: [HUDI-1218] Introduce BulkInsertSortMode as Independent class
yanghua commented on a change in pull request #2021: URL: https://github.com/apache/hudi/pull/2021#discussion_r476098066

## File path: hudi-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerFactory.java

```java
@@ -39,10 +39,4 @@ public static BulkInsertPartitioner get(BulkInsertSortMode sortMode) {
      throw new HoodieException("The bulk insert mode \"" + sortMode.name() + "\" is not supported.");
```

Review comment:
`The bulk insert sort mode`? better?
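For context on the wording change being reviewed, here is a minimal, self-contained sketch of the factory dispatch pattern in question. The `String` return values are a simplification (the real `BulkInsertInternalPartitionerFactory` returns `BulkInsertPartitioner` implementations), and only the error message wording is taken from the review.

```java
// Sort modes as they exist in Hudi's BulkInsertSortMode enum.
enum BulkInsertSortMode { NONE, GLOBAL_SORT, PARTITION_SORT }

class BulkInsertPartitionerFactorySketch {
    // Simplified dispatch: real code returns partitioner objects, not strings.
    static String get(BulkInsertSortMode sortMode) {
        switch (sortMode) {
            case GLOBAL_SORT:
                return "global-sort-partitioner";
            case PARTITION_SORT:
                return "partition-sort-partitioner";
            case NONE:
                return "no-sort-partitioner";
            default:
                // Wording agreed in the review: name the *sort mode* explicitly.
                throw new IllegalArgumentException(
                    "The bulk insert sort mode \"" + sortMode.name() + "\" is not supported.");
        }
    }
}
```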
[GitHub] [hudi] dm-tran commented on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"
dm-tran commented on issue #2020: URL: https://github.com/apache/hudi/issues/2020#issuecomment-679481632

Thank you for your answer @bvaradar!

> Can you please add the details of "commit showfiles --commit 20200821153748"

```
Partition Path    | File ID                                | Previous Commit | Total Records Updated | Total Records Written | Total Bytes Written | Total Errors | File Size
daas_date=2020-04 | 63bacea1-d6af-4ce0-8dc8-6ce9db8df332-0 | 20200821152906  | 212 | 534115 | 22998619 | 0 | 22998619
daas_date=2020-04 | 9c5e022c-feda-4059-84f6-752344cea4a9-0 | 20200821152906  | 89  | 460341 | 18755115 | 0 | 18755115
daas_date=2020-04 | 80be527b-eda7-42f3-8565-c15e9447d731-0 | 20200821152906  | 39  | 192455 | 9112346  | 0 | 9112346
daas_date=2020-04 | 569b2555-5cd6-416a-b7d7-11897603a1e3-0 | 20200821152906  | 3   | 483483 | 19114286 | 0 | 19114286
daas_date=2020-05 | 80be527b-eda7-42f3-8565-c15e9447d731-1 | 20200821152906  | 106 | 302728 | 13385764 | 0 | 13385764
daas_date=2020-05 | 27da8cb6-e4b7-4c29-904b-25d3ba321d0a-0 | 20200821152906  | 84  | 482538 | 19568311 | 0 | 19568311
daas_date=2020-05 | 0c376059-0279-4967-8002-70c3cd9c6b8e-0 | 20200821152906  | 84  | 498131 | 21751990 | 0 | 21751990
daas_date=2020-05 | 9730fe61-5584-4156-b25c-8c8ef41583f4-0 | 20200821152906  | 76  | 500352 | 19812831 | 0 | 19812831
daas_date=2020-05 | c2c3fb95-3e58-4021-80c4-7e48aace8dda-0 | 20200821152906  | 72  | 484533 | 21001957 | 0 | 21001957
daas_date=2020-05 | ec2ba5dc-7dd7-4cc7-93cd-1358476a124f-0 | 20200821152906  | 61  | 509569 | 21960018 | 0 | 21960018
daas_date=2020-05 | bd54d7bb-2fb7-475f-8ca2-47594a1c3206-0 | 20200821152906  | 46  | 342451 | 14678548 | 0 | 14678548
daas_date=2019-05 | a0d89a7a-0621-469a-8359-c4c4b8948ff5-1 | 20200821152906  | 3   | 445248 | 16992382 | 0 | 16992382
…
```
[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
n3nash commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r476026846

## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java

```java
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.timeline;
+
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.exception.HoodieIOException;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * TimelineUtils provides a common way to query incremental meta-data changes for a hoodie table.
+ *
+ * This is useful in multiple places including:
+ * 1) HiveSync - this can be used to query partitions that changed since previous sync.
+ * 2) Incremental reads - InputFormats can use this API to query
+ */
+public class TimelineUtils {
+
+  /**
+   * Returns partitions that have new data strictly after commitTime.
+   * Does not include internal operations such as clean in the timeline.
+   */
+  public static List<String> getPartitionsWritten(HoodieTimeline timeline) {
+    HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline();
+    return getPartitionsMutated(timelineToSync);
+  }
+
+  /**
+   * Returns partitions that have been modified including internal operations such as clean in the passed timeline.
+   */
+  public static List<String> getPartitionsMutated(HoodieTimeline timeline) {
```

Review comment:
Sure, getAffectedPartitions is fine, @satishkotha. I think it was initially getWrittenPartitions or something similar.
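The two-method API being reviewed can be sketched in plain Java without the Hudi types. `TimelineInstant` is a hypothetical stand-in for a completed `HoodieInstant` plus its deserialized metadata, and the action-name strings are simplifications of Hudi's action constants; only the written-vs-mutated distinction mirrors the PR.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in: an instant is an action name plus the partitions it touched.
class TimelineInstant {
    final String action;
    final List<String> partitions;
    TimelineInstant(String action, List<String> partitions) {
        this.action = action;
        this.partitions = partitions;
    }
}

class TimelineUtilsSketch {
    // Partitions with new data: consider only commit/deltacommit instants,
    // mirroring getPartitionsWritten narrowing the timeline before delegating.
    static List<String> getPartitionsWritten(List<TimelineInstant> timeline) {
        return getPartitionsMutated(timeline.stream()
            .filter(i -> i.action.equals("commit") || i.action.equals("deltacommit"))
            .collect(Collectors.toList()));
    }

    // Partitions touched by any action in the passed timeline, including clean.
    static List<String> getPartitionsMutated(List<TimelineInstant> timeline) {
        return timeline.stream()
            .flatMap(i -> i.partitions.stream())
            .distinct()
            .collect(Collectors.toList());
    }
}
```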
[hudi] branch master updated: [HUDI-1135] Make timeline server timeout settings configurable.
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 218d4a6  [HUDI-1135] Make timeline server timeout settings configurable.

218d4a6 is described below

commit 218d4a6836ce86216d289fa5fbb389871ad7ba0f
Author: Prashant Wason
AuthorDate: Mon Aug 24 14:10:48 2020 -0700

    [HUDI-1135] Make timeline server timeout settings configurable.
---
 .../hudi/common/table/view/FileSystemViewManager.java  |  5 +++--
 .../common/table/view/FileSystemViewStorageConfig.java | 15 ++-
 .../table/view/RemoteHoodieTableFileSystemView.java    |  8 +++-
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
index c4ab712..d310181 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
@@ -169,9 +169,10 @@ public class FileSystemViewManager {
   private static RemoteHoodieTableFileSystemView createRemoteFileSystemView(SerializableConfiguration conf,
       FileSystemViewStorageConfig viewConf, HoodieTableMetaClient metaClient) {
     LOG.info("Creating remote view for basePath " + metaClient.getBasePath() + ". Server="
-        + viewConf.getRemoteViewServerHost() + ":" + viewConf.getRemoteViewServerPort());
+        + viewConf.getRemoteViewServerHost() + ":" + viewConf.getRemoteViewServerPort() + ", Timeout="
+        + viewConf.getRemoteTimelineClientTimeoutSecs());
     return new RemoteHoodieTableFileSystemView(viewConf.getRemoteViewServerHost(), viewConf.getRemoteViewServerPort(),
-        metaClient);
+        metaClient, viewConf.getRemoteTimelineClientTimeoutSecs());
   }

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
index 434f873..603f88f 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java
@@ -44,6 +44,8 @@ public class FileSystemViewStorageConfig extends DefaultHoodieConfig {
   public static final String FILESYSTEM_VIEW_BOOTSTRAP_BASE_FILE_FRACTION = "hoodie.filesystem.view.spillable.bootstrap.base.file.mem.fraction";
   private static final String ROCKSDB_BASE_PATH_PROP = "hoodie.filesystem.view.rocksdb.base.path";
+  public static final String FILESTYSTEM_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS =
+      "hoodie.filesystem.view.remote.timeout.secs";

   public static final FileSystemViewStorageType DEFAULT_VIEW_STORAGE_TYPE = FileSystemViewStorageType.MEMORY;
   public static final FileSystemViewStorageType DEFAULT_SECONDARY_VIEW_STORAGE_TYPE = FileSystemViewStorageType.MEMORY;
@@ -52,7 +54,7 @@
   public static final String DEFAULT_FILESYSTEM_VIEW_INCREMENTAL_SYNC_MODE = "false";
   public static final String DEFUALT_REMOTE_VIEW_SERVER_HOST = "localhost";
   public static final Integer DEFAULT_REMOTE_VIEW_SERVER_PORT = 26754;
-
+  public static final Integer DEFAULT_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS = 5 * 60; // 5 min
   public static final String DEFAULT_VIEW_SPILLABLE_DIR = "/tmp/view_map/";
   private static final Double DEFAULT_MEM_FRACTION_FOR_PENDING_COMPACTION = 0.01;
   private static final Double DEFAULT_MEM_FRACTION_FOR_EXTERNAL_DATA_FILE = 0.05;
@@ -91,6 +93,10 @@
     return Integer.parseInt(props.getProperty(FILESYSTEM_VIEW_REMOTE_PORT));
   }

+  public Integer getRemoteTimelineClientTimeoutSecs() {
+    return Integer.parseInt(props.getProperty(FILESTYSTEM_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS));
+  }
+
   public long getMaxMemoryForFileGroupMap() {
     long totalMemory = Long.parseLong(props.getProperty(FILESYSTEM_VIEW_SPILLABLE_MEM));
     return totalMemory - getMaxMemoryForPendingCompaction() - getMaxMemoryForBootstrapBaseFile();
@@ -175,6 +181,11 @@
       return this;
     }

+    public Builder withRemoteTimelineClientTimeoutSecs(Long timelineClientTimeoutSecs) {
+      props.setProperty(FILESTYSTEM_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS, timelineClientTimeoutSecs.toString());
+      return this;
+    }
+
     public Builder withMemFractionForPendingCompaction(Double memFractionForPendingCompaction) {
       props.setProperty(FILESYSTEM_VIEW_PENDING_COMPACTION_MEM_FRACTION, memFractionForPendingComp
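The new timeout knob from the commit above can be sketched as a small properties-backed config class. The property name and default are copied from the commit (including the committed "FILESTYSTEM" spelling of the constant); folding the default fallback into the getter is a simplification, since the real `FileSystemViewStorageConfig` applies defaults when the `Builder` builds the config.

```java
import java.util.Properties;

// Sketch of the timeout setting added in HUDI-1135. The real class lives in
// org.apache.hudi.common.table.view.FileSystemViewStorageConfig; this version
// folds the default into the getter for brevity.
class ViewStorageConfigSketch {
    static final String FILESTYSTEM_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS =
        "hoodie.filesystem.view.remote.timeout.secs";
    static final int DEFAULT_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS = 5 * 60; // 5 min

    private final Properties props;

    ViewStorageConfigSketch(Properties props) {
        this.props = props;
    }

    // Returns the configured timeout, falling back to the 5-minute default.
    int getRemoteTimelineClientTimeoutSecs() {
        return Integer.parseInt(props.getProperty(
            FILESTYSTEM_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS,
            String.valueOf(DEFAULT_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS)));
    }
}
```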
[GitHub] [hudi] n3nash merged pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.
n3nash merged pull request #2026: URL: https://github.com/apache/hudi/pull/2026
[hudi] branch master updated: [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new 9b1f16b  [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.

9b1f16b is described below

commit 9b1f16b604143f5a6926db57173f921fbb6c
Author: Prashant Wason
AuthorDate: Mon Aug 24 14:24:50 2020 -0700

    [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.
---
 .../apache/hudi/common/table/timeline/HoodieDefaultTimeline.java   | 7 +++++++
 .../java/org/apache/hudi/common/table/timeline/HoodieTimeline.java | 5 +++++
 2 files changed, 12 insertions(+)

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
index c7a6230..678d056 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java
@@ -133,6 +133,13 @@ public class HoodieDefaultTimeline implements HoodieTimeline {
   }

   @Override
+  public HoodieDefaultTimeline findInstantsAfterOrEquals(String commitTime, int numCommits) {
+    return new HoodieDefaultTimeline(instants.stream()
+        .filter(s -> HoodieTimeline.compareTimestamps(s.getTimestamp(), GREATER_THAN_OR_EQUALS, commitTime))
+        .limit(numCommits), details);
+  }
+
+  @Override
   public HoodieDefaultTimeline findInstantsBefore(String instantTime) {
     return new HoodieDefaultTimeline(instants.stream()
         .filter(s -> HoodieTimeline.compareTimestamps(s.getTimestamp(), LESSER_THAN, instantTime)),

diff --git a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
index 45b9e34..b7c405e 100644
--- a/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
+++ b/hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
@@ -141,6 +141,11 @@ public interface HoodieTimeline extends Serializable {
   HoodieTimeline filterPendingCompactionTimeline();

   /**
+   * Create a new Timeline with all the instants after or equal to commitTime, limited to numCommits.
+   */
+  HoodieTimeline findInstantsAfterOrEquals(String commitTime, int numCommits);
+
+  /**
    * Create a new Timeline with instants after startTs and before or on endTs.
    */
   HoodieTimeline findInstantsInRange(String startTs, String endTs);
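The re-added method's filtering logic can be modeled on plain timestamp strings. Hudi instant timestamps are fixed-width numeric strings, so lexicographic `compareTo` agrees with chronological order; the `TimelineSketch` class name and the `List<String>` representation are illustrative stand-ins for `HoodieDefaultTimeline` and its instant stream.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of findInstantsAfterOrEquals: keep instants at or after commitTime,
// capped at numCommits, preserving timeline order.
class TimelineSketch {
    static List<String> findInstantsAfterOrEquals(
            List<String> instants, String commitTime, int numCommits) {
        return instants.stream()
            .filter(ts -> ts.compareTo(commitTime) >= 0) // "after or equals"
            .limit(numCommits)                           // cap the result size
            .collect(Collectors.toList());
    }
}
```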
[GitHub] [hudi] n3nash merged pull request #2027: [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.
n3nash merged pull request #2027: URL: https://github.com/apache/hudi/pull/2027
[GitHub] [hudi] n3nash commented on pull request #2030: [HUDI-1130] hudi-test-suite support for schema evolution (can be trig…
n3nash commented on pull request #2030: URL: https://github.com/apache/hudi/pull/2030#issuecomment-679429416

@vinothchandar Yes, that PR will follow later today from @modi95. We will merge this one after that.
[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
satishkotha commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475962559 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.exception.HoodieIOException; + +import java.io.IOException; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * TimelineUtils provides a common way to query incremental meta-data changes for a hoodie table. + * + * This is useful in multiple places including: + * 1) HiveSync - this can be used to query partitions that changed since previous sync. + * 2) Incremental reads - InputFormats can use this API to queryxw + */ +public class TimelineUtils { + + /** + * Returns partitions that have new data strictly after commitTime. + * Does not include internal operations such as clean in the timeline. 
+ */ + public static List getPartitionsWritten(HoodieTimeline timeline) { +HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline(); +return getPartitionsMutated(timelineToSync); + } + + /** + * Returns partitions that have been modified including internal operations such as clean in the passed timeline. + */ + public static List getPartitionsMutated(HoodieTimeline timeline) { +return timeline.filterCompletedInstants().getInstants().flatMap(s -> { + switch (s.getAction()) { +case HoodieTimeline.COMMIT_ACTION: +case HoodieTimeline.DELTA_COMMIT_ACTION: + try { +HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(s).get(), HoodieCommitMetadata.class); +return commitMetadata.getPartitionToWriteStats().keySet().stream(); + } catch (IOException e) { +throw new HoodieIOException("Failed to get partitions written between " + timeline.firstInstant() + " " + timeline.lastInstant(), e); + } +case HoodieTimeline.CLEAN_ACTION: + try { +HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(timeline.getInstantDetails(s).get()); +return cleanMetadata.getPartitionMetadata().keySet().stream(); + } catch (IOException e) { +throw new HoodieIOException("Failed to get partitions cleaned between " + timeline.firstInstant() + " " + timeline.lastInstant(), e); + } +case HoodieTimeline.COMPACTION_ACTION: + // compaction is not a completed instant. So no need to consider this action. +case HoodieTimeline.SAVEPOINT_ACTION: +case HoodieTimeline.ROLLBACK_ACTION: +case HoodieTimeline.RESTORE_ACTION: + return Stream.empty(); Review comment: compaction is not treated as completed instants. So this can be ignored. I implemented all other actions and added tests This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
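The `getPartitionsMutated` method quoted above streams over the timeline's completed instants and switches on each action type to pull partition names out of the corresponding metadata. A self-contained sketch of that stream-and-switch pattern (a toy `Instant` type with inlined partition sets stands in for the real `HoodieInstant` and metadata classes; uses Java 17 records and switch expressions):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TimelineSketch {
    // Toy stand-in for a completed HoodieInstant: an action name plus the
    // partitions its metadata touched. The real code deserializes commit or
    // clean metadata to obtain the partition names.
    record Instant(String action, Set<String> partitions) {}

    static List<String> partitionsMutated(List<Instant> completedInstants) {
        return completedInstants.stream()
            .flatMap(i -> switch (i.action()) {
                // commits, delta commits and cleans all record partition info
                case "commit", "deltacommit", "clean" -> i.partitions().stream();
                // savepoint/rollback/restore contribute nothing here
                default -> Stream.<String>empty();
            })
            .distinct()
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Instant> timeline = List.of(
            new Instant("commit", Set.of("2020/08/25")),
            new Instant("clean", Set.of("2020/08/24")),
            new Instant("rollback", Set.of("2020/08/23")));
        System.out.println(partitionsMutated(timeline)); // [2020/08/25, 2020/08/24]
    }
}
```

The real implementation deserializes `HoodieCommitMetadata` for commits/delta commits and `HoodieCleanMetadata` for cleans; here the partition set is attached directly to the toy instant.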
[GitHub] [hudi] bhasudha merged pull request #2028: Site update and release page for 0.6.0
bhasudha merged pull request #2028: URL: https://github.com/apache/hudi/pull/2028 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
satishkotha commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475961204 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.exception.HoodieIOException; + +import java.io.IOException; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * TimelineUtils provides a common way to query incremental meta-data changes for a hoodie table. + * + * This is useful in multiple places including: + * 1) HiveSync - this can be used to query partitions that changed since previous sync. + * 2) Incremental reads - InputFormats can use this API to queryxw + */ +public class TimelineUtils { + + /** + * Returns partitions that have new data strictly after commitTime. + * Does not include internal operations such as clean in the timeline. 
+ */ + public static List getPartitionsWritten(HoodieTimeline timeline) { +HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline(); +return getPartitionsMutated(timelineToSync); + } + + /** + * Returns partitions that have been modified including internal operations such as clean in the passed timeline. + */ + public static List getPartitionsMutated(HoodieTimeline timeline) { Review comment: Yes, this is for internal use. I have no strong opinion on name. I think it was initially getAffectedPartitions, changed to getPartitionsMutated because of @n3nash suggestion. i'm fine with going back if he agrees ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.exception.HoodieIOException; + +import java.io.IOException; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * TimelineUtils provides a common way to query incremental meta-data changes for a hoodie table. + * + * This is useful in multiple places including: + * 1) HiveSync - this can be used to query partitions that changed since previous sync. + * 2) Incremental reads - InputFormats can use this API to queryxw Review comment: thank you! fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #2030: [HUDI-1130] hudi-test-suite support for schema evolution (can be trig…
vinothchandar commented on pull request #2030: URL: https://github.com/apache/hudi/pull/2030#issuecomment-679411848 can we first make the test-suite tests work on master and run in CI, before we merge more features? cc @n3nash This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1130) Allow for schema evolution within DAG for hudi test suite
[ https://issues.apache.org/jira/browse/HUDI-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1130: Labels: pull-request-available (was: )
> Allow for schema evolution within DAG for hudi test suite
> Key: HUDI-1130
> URL: https://issues.apache.org/jira/browse/HUDI-1130
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Testing
> Reporter: Nishith Agarwal
> Assignee: Nishith Agarwal
> Priority: Major
> Labels: pull-request-available
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] nbalajee opened a new pull request #2030: [HUDI-1130] hudi-test-suite support for schema evolution (can be trig…
nbalajee opened a new pull request #2030: URL: https://github.com/apache/hudi/pull/2030 …gered on any insert/upsert DAG node).

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

Each insert/upsert DAG node in the hudi test-suite can specify a schema to be used for the node. The DAG execution context is reinitialized with the new/evolved schema. This allows verification of schema evolution scenarios.

## Brief change log

- Modify the test suite to accept a "reinitialize_context" flag and a new schema file.
- Reinitialize the writer context as part of DAG node execution (remaining nodes will use the updated schema).

## Verify this pull request

Verified using the hudi-test-suite. This change added tests and can be verified as follows:

- Launched hudi test suite with the following:

```
insert_1:
  config:
    record_size: 7000
    num_partitions_insert: 1
    repeat_count: 5
    num_records_insert: 10
    reinitialize_context: true
    hoodie.deltastreamer.schemaprovider.source.schema.file: "file:///tmp/evolved.avsc"
```

## Committer checklist

- [x] Has a corresponding JIRA in PR title & commit
- [x] Commit message is descriptive of the change
- [x] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475940809 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieHFileRealtimeInputFormat.java ## @@ -0,0 +1,110 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.hadoop.realtime; + +import java.io.IOException; +import java.util.Arrays; +import java.util.stream.Stream; + +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.hive.serde2.ColumnProjectionUtils; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.hadoop.mapred.InputSplit; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.RecordReader; +import org.apache.hadoop.mapred.Reporter; +import org.apache.hudi.common.table.timeline.HoodieDefaultTimeline; +import org.apache.hudi.common.util.ValidationUtils; +import org.apache.hudi.hadoop.HoodieHFileInputFormat; +import org.apache.hudi.hadoop.UseFileSplitsFromInputFormat; +import org.apache.hudi.hadoop.UseRecordReaderFromInputFormat; +import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils; +import org.apache.hudi.hadoop.utils.HoodieRealtimeInputFormatUtils; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; + +/** + * HoodieRealtimeInputFormat for HUDI datasets which store data in HFile base file format. + */ +@UseRecordReaderFromInputFormat +@UseFileSplitsFromInputFormat +public class HoodieHFileRealtimeInputFormat extends HoodieHFileInputFormat { + + private static final Logger LOG = LogManager.getLogger(HoodieHFileRealtimeInputFormat.class); + + @Override + public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException { +Stream fileSplits = Arrays.stream(super.getSplits(job, numSplits)).map(is -> (FileSplit) is); +return HoodieRealtimeInputFormatUtils.getRealtimeSplits(job, fileSplits); + } + + @Override + public FileStatus[] listStatus(JobConf job) throws IOException { +// Call the HoodieInputFormat::listStatus to obtain all latest hfiles, based on commit timeline. 
+return super.listStatus(job); + } + + @Override + protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) { +// no specific filtering for Realtime format +return timeline; + } + + @Override + public RecordReader getRecordReader(final InputSplit split, final JobConf jobConf, + final Reporter reporter) throws IOException { +// Hive on Spark invokes multiple getRecordReaders from different threads in the same spark task (and hence the Review comment: we need to share code somehow. This same large comment need not be in multiple places This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] umehrot2 commented on issue #1981: [SUPPORT] Huge performance Difference Between Hudi and Regular Parquet in Athena
umehrot2 commented on issue #1981: URL: https://github.com/apache/hudi/issues/1981#issuecomment-679408763 @rubenssoto yes, EMR Presto is currently on 0.232, but upcoming releases will include later Presto versions in which you will be able to use this patch. If you want to try it manually on the current EMR version, you can build Presto 0.233, replace the presto-hive jar (I believe on all nodes of the cluster), and restart presto-server on all nodes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475940393 ## File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieHFileInputFormat.java ## @@ -0,0 +1,163 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.hadoop; + +import org.apache.hadoop.conf.Configurable; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FileStatus; +import org.apache.hadoop.fs.FileSystem; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.ArrayWritable; +import org.apache.hadoop.io.NullWritable; +import org.apache.hadoop.mapred.FileInputFormat; +import org.apache.hadoop.mapred.InputSplit; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.RecordReader; +import org.apache.hadoop.mapred.Reporter; +import org.apache.hadoop.mapreduce.Job; +import org.apache.hudi.common.model.HoodieFileFormat; +import org.apache.hudi.common.table.HoodieTableMetaClient; +import org.apache.hudi.common.table.timeline.HoodieDefaultTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.hadoop.utils.HoodieHiveUtils; +import org.apache.hudi.hadoop.utils.HoodieInputFormatUtils; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; +import java.util.Map; + +/** + * HoodieInputFormat for HUDI datasets which store data in HFile base file format. + */ +@UseFileSplitsFromInputFormat +public class HoodieHFileInputFormat extends FileInputFormat implements Configurable { + + private static final Logger LOG = LogManager.getLogger(HoodieHFileInputFormat.class); + + protected Configuration conf; + + protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) { +return HoodieInputFormatUtils.filterInstantsTimeline(timeline); + } + + @Override + public FileStatus[] listStatus(JobConf job) throws IOException { Review comment: While this is true, we can use more helpers and avoid copying to a large degree? 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
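The `listStatus` discussion above concerns exposing only the latest base file per file group, restricted to completed commits, to Hive. A toy sketch of that latest-file filtering (file names of the form `fileId_commitTime` are an assumption for illustration; the real code walks the commit timeline via `HoodieInputFormatUtils` rather than parsing names):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class LatestFileSketch {
    // Toy base-file name: "<fileId>_<commitTime>". Keep, per file group, only
    // the file written by the most recent completed commit.
    static List<String> latestFiles(List<String> files, Set<String> completedCommits) {
        Map<String, String> latestByFileId = new HashMap<>();
        for (String file : files) {
            String[] parts = file.split("_");
            String fileId = parts[0];
            String commit = parts[1];
            if (!completedCommits.contains(commit)) {
                continue; // files from inflight commits must not be exposed
            }
            String prev = latestByFileId.get(fileId);
            if (prev == null || prev.split("_")[1].compareTo(commit) < 0) {
                latestByFileId.put(fileId, file);
            }
        }
        return latestByFileId.values().stream().sorted().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> out = latestFiles(
            List.of("f1_001", "f1_003", "f2_002", "f1_004"),
            Set.of("001", "002", "003")); // 004 is still inflight
        System.out.println(out); // [f1_003, f2_002]
    }
}
```

This is the part the reviewer suggests factoring into shared helpers, since the same selection logic is needed by both the Parquet and HFile input formats.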
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475939558 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java ## @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.table.log.block; + +import org.apache.hudi.avro.HoodieAvroUtils; +import org.apache.hudi.common.model.HoodieLogFile; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.io.storage.HoodieHFileReader; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; + +import org.apache.avro.Schema; +import org.apache.avro.Schema.Field; +import org.apache.avro.generic.IndexedRecord; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FSDataOutputStream; +import org.apache.hadoop.hbase.KeyValue; +import org.apache.hadoop.hbase.io.compress.Compression; +import org.apache.hadoop.hbase.io.hfile.CacheConfig; +import org.apache.hadoop.hbase.io.hfile.HFile; +import org.apache.hadoop.hbase.io.hfile.HFileContext; +import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder; +import org.apache.hadoop.hbase.util.Pair; + +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.TreeMap; +import java.util.stream.Collectors; + +import javax.annotation.Nonnull; + +/** + * HoodieHFileDataBlock contains a list of records stored inside an HFile format. It is used with the HFile + * base file format. 
+ */ +public class HoodieHFileDataBlock extends HoodieDataBlock { + private static final Logger LOG = LogManager.getLogger(HoodieHFileDataBlock.class); + private static Compression.Algorithm compressionAlgorithm = Compression.Algorithm.GZ; + private static int blockSize = 1 * 1024 * 1024; + + public HoodieHFileDataBlock(@Nonnull Map logBlockHeader, + @Nonnull Map logBlockFooter, + @Nonnull Option blockContentLocation, @Nonnull Option content, + FSDataInputStream inputStream, boolean readBlockLazily) { +super(logBlockHeader, logBlockFooter, blockContentLocation, content, inputStream, readBlockLazily); + } + + public HoodieHFileDataBlock(HoodieLogFile logFile, FSDataInputStream inputStream, Option content, + boolean readBlockLazily, long position, long blockSize, long blockEndpos, Schema readerSchema, + Map header, Map footer) { +super(content, inputStream, readBlockLazily, + Option.of(new HoodieLogBlockContentLocation(logFile, position, blockSize, blockEndpos)), readerSchema, header, + footer); + } + + public HoodieHFileDataBlock(@Nonnull List records, @Nonnull Map header) { +super(records, header, new HashMap<>()); + } + + @Override + public HoodieLogBlockType getBlockType() { +return HoodieLogBlockType.HFILE_DATA_BLOCK; + } + + @Override + protected byte[] serializeRecords() throws IOException { +HFileContext context = new HFileContextBuilder().withBlockSize(blockSize).withCompression(compressionAlgorithm) +.build(); +Configuration conf = new Configuration(); +CacheConfig cacheConfig = new CacheConfig(conf); +ByteArrayOutputStream baos = new ByteArrayOutputStream(); +FSDataOutputStream ostream = new FSDataOutputStream(baos, null); + +HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig) +.withOutputStream(ostream).withFileContext(context).create(); + +// Serialize records into bytes +Map recordMap = new TreeMap<>(); +Iterator itr = records.iterator(); +boolean useIntegerKey = false; +int key = 0; +int keySize = 0; +Field keyField = 
records.get(0).getSchema().getField(HoodieRecord.RECORD_KEY_METADATA_FIELD); +if (keyField == null) { + // Missing key metadata field so we should use an integer sequence key + useIntegerKey = true; + keySize = (int) Math.ceil(Math.log(records.size())) + 1; +} +while (itr.hasNext(
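The truncated `serializeRecords` logic above falls back to fixed-width integer sequence keys when the record-key metadata field is absent, because HFile (like the `TreeMap` used to stage records) requires entries in sorted key order, and only zero-padded integers sort correctly as strings. A minimal sketch of that padding idea (the quoted code derives the width via `Math.log`; this sketch uses the decimal digit count for clarity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class PaddedKeySketch {
    // Width of a zero-padded decimal key able to hold indices 0..count-1.
    static int keyWidth(int count) {
        return String.valueOf(Math.max(count - 1, 0)).length();
    }

    // Assign fixed-width keys so lexicographic order equals numeric order,
    // which is what a sorted structure like TreeMap (or HFile) requires.
    static List<String> sequenceKeys(int count) {
        int width = keyWidth(count);
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            keys.add(String.format("%0" + width + "d", i));
        }
        return keys;
    }

    public static void main(String[] args) {
        TreeMap<String, String> recordMap = new TreeMap<>();
        List<String> keys = sequenceKeys(12);
        for (int i = 0; i < keys.size(); i++) {
            recordMap.put(keys.get(i), "record-" + i);
        }
        // Without padding, "10" would sort before "2"; with padding it does not.
        System.out.println(recordMap.firstKey() + " .. " + recordMap.lastKey()); // 00 .. 11
    }
}
```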
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475939229 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java ## @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.table.log.block; + +import org.apache.hudi.avro.HoodieAvroUtils; +import org.apache.hudi.common.model.HoodieLogFile; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.util.Option; +import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.io.storage.HoodieHFileReader; +import org.apache.log4j.LogManager; +import org.apache.log4j.Logger; + +import org.apache.avro.Schema; +import org.apache.avro.Schema.Field; +import org.apache.avro.generic.IndexedRecord; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.fs.FSDataInputStream; +import org.apache.hadoop.fs.FSDataOutputStream; +import org.apache.hadoop.hbase.KeyValue; +import org.apache.hadoop.hbase.io.compress.Compression; +import org.apache.hadoop.hbase.io.hfile.CacheConfig; +import org.apache.hadoop.hbase.io.hfile.HFile; +import org.apache.hadoop.hbase.io.hfile.HFileContext; +import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder; +import org.apache.hadoop.hbase.util.Pair; + +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.TreeMap; +import java.util.stream.Collectors; + +import javax.annotation.Nonnull; + +/** + * HoodieHFileDataBlock contains a list of records stored inside an HFile format. It is used with the HFile + * base file format. 
+ */ +public class HoodieHFileDataBlock extends HoodieDataBlock { + private static final Logger LOG = LogManager.getLogger(HoodieHFileDataBlock.class); + private static Compression.Algorithm compressionAlgorithm = Compression.Algorithm.GZ; + private static int blockSize = 1 * 1024 * 1024; + + public HoodieHFileDataBlock(@Nonnull Map logBlockHeader, + @Nonnull Map logBlockFooter, + @Nonnull Option blockContentLocation, @Nonnull Option content, + FSDataInputStream inputStream, boolean readBlockLazily) { +super(logBlockHeader, logBlockFooter, blockContentLocation, content, inputStream, readBlockLazily); + } + + public HoodieHFileDataBlock(HoodieLogFile logFile, FSDataInputStream inputStream, Option content, + boolean readBlockLazily, long position, long blockSize, long blockEndpos, Schema readerSchema, + Map header, Map footer) { +super(content, inputStream, readBlockLazily, + Option.of(new HoodieLogBlockContentLocation(logFile, position, blockSize, blockEndpos)), readerSchema, header, + footer); + } + + public HoodieHFileDataBlock(@Nonnull List records, @Nonnull Map header) { +super(records, header, new HashMap<>()); + } + + @Override + public HoodieLogBlockType getBlockType() { +return HoodieLogBlockType.HFILE_DATA_BLOCK; + } + + @Override + protected byte[] serializeRecords() throws IOException { +HFileContext context = new HFileContextBuilder().withBlockSize(blockSize).withCompression(compressionAlgorithm) +.build(); +Configuration conf = new Configuration(); +CacheConfig cacheConfig = new CacheConfig(conf); +ByteArrayOutputStream baos = new ByteArrayOutputStream(); +FSDataOutputStream ostream = new FSDataOutputStream(baos, null); + +HFile.Writer writer = HFile.getWriterFactory(conf, cacheConfig) +.withOutputStream(ostream).withFileContext(context).create(); + +// Serialize records into bytes +Map recordMap = new TreeMap<>(); +Iterator itr = records.iterator(); +boolean useIntegerKey = false; +int key = 0; +int keySize = 0; +Field keyField = 
records.get(0).getSchema().getField(HoodieRecord.RECORD_KEY_METADATA_FIELD); +if (keyField == null) { + // Missing key metadata field so we should use an integer sequence key + useIntegerKey = true; + keySize = (int) Math.ceil(Math.log(records.size())) + 1; +} +while (itr.hasNext(
[GitHub] [hudi] s-sanjay commented on issue #1895: HUDI Dataset backed by Hive Metastore fails on Presto with Unknown converted type TIMESTAMP_MICROS
s-sanjay commented on issue #1895: URL: https://github.com/apache/hudi/issues/1895#issuecomment-679406205 @FelixKJose I have raised a [PR](https://github.com/prestodb/presto/pull/15074) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475937184 ## File path: hudi-client/src/main/java/org/apache/hudi/io/storage/HoodieHFileConfig.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.io.storage; + +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.hbase.io.compress.Compression; +import org.apache.hudi.common.bloom.BloomFilter; + +public class HoodieHFileConfig { + + private Compression.Algorithm compressionAlgorithm; + private int blockSize; + private long maxFileSize; + private Configuration hadoopConf; + private BloomFilter bloomFilter; + + public HoodieHFileConfig(Compression.Algorithm compressionAlgorithm, int blockSize, long maxFileSize, Review comment: I think its here. https://github.com/nsivabalan/hudi/commit/70ee9947335c1ed0c92a285225ab728adaa7c5ab ? @nsivabalan ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475936227 ## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.io; + +import org.apache.hudi.client.SparkTaskContextSupplier; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieUpsertException; +import org.apache.hudi.table.HoodieTable; + +import org.apache.avro.generic.GenericRecord; + +import java.io.IOException; +import java.util.Iterator; +import java.util.Map; +import java.util.PriorityQueue; +import java.util.Queue; + +/** + * Hoodie merge handle which writes records (new inserts or updates) sorted by their key. + * + * The implementation performs a merge-sort by comparing the key of the record being written to the list of + * keys in newRecordKeys (sorted in-memory). 
+ */ +public class HoodieSortedMergeHandle<T extends HoodieRecordPayload> extends HoodieMergeHandle<T> { + + private Queue<String> newRecordKeysSorted = new PriorityQueue<>(); + + public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable, + Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId, SparkTaskContextSupplier sparkTaskContextSupplier) { + super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, sparkTaskContextSupplier); + newRecordKeysSorted.addAll(keyToNewRecords.keySet()); + } + + /** + * Called by compactor code path. + */ + public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable, Review comment: So, no code calls this atm ? The following code in `HoodieMergeOnReadTableCompactor` ``` if (oldDataFileOpt.isPresent()) { result = hoodieCopyOnWriteTable.handleUpdate(instantTime, operation.getPartitionPath(), operation.getFileId(), scanner.getRecords(), oldDataFileOpt.get()); } ``` ends up calling the following. ``` protected HoodieMergeHandle getUpdateHandle(String instantTime, String partitionPath, String fileId, Map<String, HoodieRecord<T>> keyToNewRecords, HoodieBaseFile dataFileToBeMerged) { return new HoodieMergeHandle<>(config, instantTime, this, keyToNewRecords, partitionPath, fileId, dataFileToBeMerged, sparkTaskContextSupplier); } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475935700 ## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.io; + +import org.apache.hudi.client.SparkTaskContextSupplier; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieUpsertException; +import org.apache.hudi.table.HoodieTable; + +import org.apache.avro.generic.GenericRecord; + +import java.io.IOException; +import java.util.Iterator; +import java.util.Map; +import java.util.PriorityQueue; +import java.util.Queue; + +/** + * Hoodie merge handle which writes records (new inserts or updates) sorted by their key. + * + * The implementation performs a merge-sort by comparing the key of the record being written to the list of + * keys in newRecordKeys (sorted in-memory). 
+ */ +public class HoodieSortedMergeHandle<T extends HoodieRecordPayload> extends HoodieMergeHandle<T> { + + private Queue<String> newRecordKeysSorted = new PriorityQueue<>(); + + public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable, + Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId, SparkTaskContextSupplier sparkTaskContextSupplier) { + super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, sparkTaskContextSupplier); + newRecordKeysSorted.addAll(keyToNewRecords.keySet()); + } + + /** + * Called by compactor code path. + */ + public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable, + Map<String, HoodieRecord<T>> keyToNewRecordsOrig, String partitionPath, String fileId, + HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier sparkTaskContextSupplier) { + super(config, instantTime, hoodieTable, keyToNewRecordsOrig, partitionPath, fileId, dataFileToBeMerged, + sparkTaskContextSupplier); + + newRecordKeysSorted.addAll(keyToNewRecords.keySet()); + } + + /** + * Go through an old record. Here if we detect a newer version shows up, we write the new one to the file. + */ + @Override + public void write(GenericRecord oldRecord) { + String key = oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString(); + + // To maintain overall sorted order across updates and inserts, write any new inserts whose keys are less than + // the oldRecord's key. + while (!newRecordKeysSorted.isEmpty() && newRecordKeysSorted.peek().compareTo(key) <= 0) { + String keyToPreWrite = newRecordKeysSorted.remove(); Review comment: cc @prashantwason can you please chime in. I feel we can avoid the queue altogether and just sort-merge? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
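The control flow under review can be sketched in isolation. The names below (`SortedMergeSketch`, `mergeKeys`) are hypothetical; this is only a minimal model, assuming distinct keys (an equal key would be an update, which the real handle treats specially), of how flushing a min-heap of new insert keys against an already-sorted stream of old keys yields globally sorted output:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SortedMergeSketch {
  // Model of the handle's write() loop: old record keys arrive already sorted;
  // new insert keys wait in a min-heap and are flushed before any old key that
  // sorts after them, producing globally sorted output.
  public static List<String> mergeKeys(List<String> sortedOldKeys, List<String> newInsertKeys) {
    PriorityQueue<String> pending = new PriorityQueue<>(newInsertKeys);
    List<String> out = new ArrayList<>();
    for (String oldKey : sortedOldKeys) {
      while (!pending.isEmpty() && pending.peek().compareTo(oldKey) <= 0) {
        out.add(pending.remove()); // pre-write the smaller new key
      }
      out.add(oldKey);
    }
    while (!pending.isEmpty()) {
      out.add(pending.remove()); // drain inserts larger than every old key
    }
    return out;
  }
}
```

Whether the heap buys anything over sorting the new keys once up front is exactly the question raised in the review; either way the per-old-record flush loop is the same.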
[GitHub] [hudi] bvaradar commented on issue #1980: [SUPPORT] Small files (423KB) generated after running delete query
bvaradar commented on issue #1980: URL: https://github.com/apache/hudi/issues/1980#issuecomment-679403428 Yes, this is expected. We retain the penultimate version of the file to prevent a running query from failing. In this case, you might see only a single version for a file that was not touched by any deletes; that is also expected. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
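The retention rule described in this answer can be sketched as follows. `RetentionSketch` and `versionsToDelete` are hypothetical names, and file versions are assumed to be listed oldest-first; the point is only that the cleaner keeps the latest version plus the one before it, so an in-flight query reading the previous version does not fail:

```java
import java.util.List;

public class RetentionSketch {
  // Keep the latest file version plus the penultimate one; everything older
  // is eligible for deletion. With fewer than two versions, nothing is deleted.
  public static List<String> versionsToDelete(List<String> versionsOldestFirst) {
    int keep = Math.min(2, versionsOldestFirst.size()); // latest + penultimate
    return versionsOldestFirst.subList(0, versionsOldestFirst.size() - keep);
  }
}
```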
[GitHub] [hudi] bvaradar commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
bvaradar commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475931147 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.exception.HoodieIOException; + +import java.io.IOException; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * TimelineUtils provides a common way to query incremental meta-data changes for a hoodie table. + * + * This is useful in multiple places including: + * 1) HiveSync - this can be used to query partitions that changed since previous sync. + * 2) Incremental reads - InputFormats can use this API to queryxw Review comment: typo : queryxw -> query ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.exception.HoodieIOException; + +import java.io.IOException; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * TimelineUtils provides a common way to query incremental meta-data changes for a hoodie table. + * + * This is useful in multiple places including: + * 1) HiveSync - this can be used to query partitions that changed since previous sync. + * 2) Incremental reads - InputFormats can use this API to queryxw + */ +public class TimelineUtils { + + /** + * Returns partitions that have new data strictly after commitTime. + * Does not include internal operations such as clean in the timeline. + */ + public static List getPartitionsWritten(HoodieTimeline timeline) { +HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline(); +return getPartitionsMutated(timelineToSync); + } + + /** + * Returns partitions that have been modified including internal operations such as clean in the passed timeline. 
+ */ + public static List getPartitionsMutated(HoodieTimeline timeline) { Review comment: I am assuming you are planning to use this API internally. Right ? ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475932748 ## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java ## @@ -55,7 +56,7 @@ private long recordsWritten = 0; private long insertRecordsWritten = 0; private long recordsDeleted = 0; - private Iterator> recordIterator; + private Map> recordMap; Review comment: okay understood this better. looks good for now This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] prashantwason commented on pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.
prashantwason commented on pull request #2026: URL: https://github.com/apache/hudi/pull/2026#issuecomment-679401505 @n3nash corrected the errors. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] prashantwason commented on a change in pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.
prashantwason commented on a change in pull request #2026: URL: https://github.com/apache/hudi/pull/2026#discussion_r475932300 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java ## @@ -147,7 +153,7 @@ public RemoteHoodieTableFileSystemView(String server, int port, HoodieTableMetaC String url = builder.toString(); LOG.info("Sending request : (" + url + ")"); Response response; -int timeout = 1000 * 300; // 5 min timeout +int timeout = this.timeoutSecs * 1000; // msec Review comment: Yes, msec. ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java ## @@ -52,7 +54,7 @@ public static final String DEFAULT_FILESYSTEM_VIEW_INCREMENTAL_SYNC_MODE = "false"; public static final String DEFUALT_REMOTE_VIEW_SERVER_HOST = "localhost"; public static final Integer DEFAULT_REMOTE_VIEW_SERVER_PORT = 26754; - + public static final Integer DEFAULT_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS = 300 * 60; // 5 min Review comment: Corrected. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] modi95 commented on a change in pull request #1975: [HUDI-1194][WIP] Reorganize HoodieHiveClient based on the way to call Hive API
modi95 commented on a change in pull request #1975: URL: https://github.com/apache/hudi/pull/1975#discussion_r475932170 ## File path: pom.xml ## @@ -94,7 +94,7 @@ 2.9.9 2.7.3 org.apache.hive -2.3.1 +2.3.6 Review comment: Why are we updating Hive? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] modi95 commented on a change in pull request #1975: [HUDI-1194][WIP] Reorganize HoodieHiveClient based on the way to call Hive API
modi95 commented on a change in pull request #1975: URL: https://github.com/apache/hudi/pull/1975#discussion_r475930539 ## File path: hudi-spark/src/main/scala/org/apache/hudi/DataSourceOptions.scala ## @@ -303,8 +305,28 @@ object DataSourceWriteOptions { val DEFAULT_HIVE_PARTITION_EXTRACTOR_CLASS_OPT_VAL = classOf[SlashEncodedDayPartitionValueExtractor].getCanonicalName val DEFAULT_HIVE_ASSUME_DATE_PARTITION_OPT_VAL = "false" val DEFAULT_USE_PRE_APACHE_INPUT_FORMAT_OPT_VAL = "false" + val DEFAULT_HIVE_CLIENT_CLASS_OPT_VAL = classOf[HoodieHiveJDBCClient].getCanonicalName + + @deprecated + val HIVE_USE_JDBC_OPT_KEY = "hoodie.datasource.hive_sync.use_jdbc" + @deprecated val DEFAULT_HIVE_USE_JDBC_OPT_VAL = "true" + def translateUseJDBCToHiveClientClass(optParams: Map[String, String]) : Map[String, String] = { +if (optParams.contains(HIVE_USE_JDBC_OPT_KEY) && !optParams.contains(HIVE_CLIENT_CLASS_OPT_KEY)) { + log.warn(HIVE_USE_JDBC_OPT_KEY + " is deprecated and will be removed in a later release; Please use " + HIVE_CLIENT_CLASS_OPT_KEY) + if (optParams(HIVE_USE_JDBC_OPT_KEY).equals("true")) { +optParams ++ Map(HIVE_CLIENT_CLASS_OPT_KEY -> DEFAULT_HIVE_CLIENT_CLASS_OPT_VAL) + } else if (optParams(HIVE_USE_JDBC_OPT_KEY).equals("false")) { +optParams ++ Map(HIVE_CLIENT_CLASS_OPT_KEY -> classOf[HoodieHiveDriverClient].getCanonicalName) Review comment: Could you add a comment either here or in `HoodieHiveDriverClient` explaining why this is used when `HIVE_USE_JDBC_OPT_KEY` is `false`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
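The deprecation shim under review can be modeled in plain Java. The key string for the new option (`hoodie.datasource.hive_sync.client_class`) and the fully qualified client class names below are assumptions for illustration, not Hudi's actual constants; the shape of the translation (honor the deprecated boolean only when the new key is absent) is what the Scala code above does:

```java
import java.util.HashMap;
import java.util.Map;

public class HiveSyncOptionShim {
  static final String USE_JDBC_KEY = "hoodie.datasource.hive_sync.use_jdbc";
  static final String CLIENT_CLASS_KEY = "hoodie.datasource.hive_sync.client_class"; // assumed key name
  static final String JDBC_CLIENT = "org.apache.hudi.hive.HoodieHiveJDBCClient";     // assumed FQCN
  static final String DRIVER_CLIENT = "org.apache.hudi.hive.HoodieHiveDriverClient"; // assumed FQCN

  // Translate the deprecated boolean option into the new client-class option,
  // but never override an explicitly supplied client class.
  public static Map<String, String> translate(Map<String, String> opts) {
    Map<String, String> out = new HashMap<>(opts);
    if (opts.containsKey(USE_JDBC_KEY) && !opts.containsKey(CLIENT_CLASS_KEY)) {
      boolean useJdbc = Boolean.parseBoolean(opts.get(USE_JDBC_KEY));
      out.put(CLIENT_CLASS_KEY, useJdbc ? JDBC_CLIENT : DRIVER_CLIENT);
    }
    return out;
  }
}
```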
[GitHub] [hudi] jiegzhan commented on issue #1980: [SUPPORT] Small files (423KB) generated after running delete query
jiegzhan commented on issue #1980: URL: https://github.com/apache/hudi/issues/1980#issuecomment-679398095 @bvaradar, before re-clustering is available, I tested [hoodie.cleaner.commits.retained](https://hudi.apache.org/docs/configurations.html#retainCommits). I set option("hoodie.cleaner.commits.retained", 1) and then issued a few delete queries. For each parquet file in S3, the latest version was kept, and sometimes (but not always) one older version as well; all other versions were removed from S3. Is this how it is supposed to work? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
n3nash commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475925314 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -0,0 +1,81 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.table.timeline; + +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.exception.HoodieIOException; + +import java.io.IOException; +import java.util.List; +import java.util.stream.Collectors; +import java.util.stream.Stream; + +/** + * TimelineUtils provides a common way to query incremental meta-data changes for a hoodie table. + * + * This is useful in multiple places including: + * 1) HiveSync - this can be used to query partitions that changed since previous sync. + * 2) Incremental reads - InputFormats can use this API to queryxw + */ +public class TimelineUtils { + + /** + * Returns partitions that have new data strictly after commitTime. + * Does not include internal operations such as clean in the timeline. 
+ */ + public static List getPartitionsWritten(HoodieTimeline timeline) { +HoodieTimeline timelineToSync = timeline.getCommitsAndCompactionTimeline(); +return getPartitionsMutated(timelineToSync); + } + + /** + * Returns partitions that have been modified including internal operations such as clean in the passed timeline. + */ + public static List getPartitionsMutated(HoodieTimeline timeline) { +return timeline.filterCompletedInstants().getInstants().flatMap(s -> { + switch (s.getAction()) { +case HoodieTimeline.COMMIT_ACTION: +case HoodieTimeline.DELTA_COMMIT_ACTION: + try { +HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(s).get(), HoodieCommitMetadata.class); +return commitMetadata.getPartitionToWriteStats().keySet().stream(); + } catch (IOException e) { +throw new HoodieIOException("Failed to get partitions written between " + timeline.firstInstant() + " " + timeline.lastInstant(), e); + } +case HoodieTimeline.CLEAN_ACTION: + try { +HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(timeline.getInstantDetails(s).get()); +return cleanMetadata.getPartitionMetadata().keySet().stream(); + } catch (IOException e) { +throw new HoodieIOException("Failed to get partitions cleaned between " + timeline.firstInstant() + " " + timeline.lastInstant(), e); + } +case HoodieTimeline.COMPACTION_ACTION: + // compaction is not a completed instant. So no need to consider this action. +case HoodieTimeline.SAVEPOINT_ACTION: +case HoodieTimeline.ROLLBACK_ACTION: +case HoodieTimeline.RESTORE_ACTION: + return Stream.empty(); Review comment: Do you want to throw an exception here for now so it's not treated incorrectly ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
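The action-based filtering in the switch above can be reduced to a small sketch. `PartitionSketch` is a hypothetical name, and completed instants are modeled as simple (action, partition) pairs instead of timeline instants with serialized metadata; the essence is that only commit, deltacommit, and clean actions contribute partitions, while compaction, savepoint, rollback, and restore contribute none:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PartitionSketch {
  private static final Set<String> MUTATING_ACTIONS =
      new HashSet<>(Arrays.asList("commit", "deltacommit", "clean"));

  // Collect the distinct partitions touched by mutating actions, sorted for
  // deterministic output (mirroring the flatMap over completed instants above).
  public static List<String> partitionsMutated(List<String[]> actionAndPartition) {
    return actionAndPartition.stream()
        .filter(ap -> MUTATING_ACTIONS.contains(ap[0]))
        .map(ap -> ap[1])
        .distinct()
        .sorted()
        .collect(Collectors.toList());
  }
}
```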
[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
n3nash commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475925115 ## File path: hudi-common/src/test/java/org/apache/hudi/common/table/TestTimelineUtils.java ## @@ -0,0 +1,160 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.common.table; + +import org.apache.hudi.avro.model.HoodieCleanMetadata; +import org.apache.hudi.avro.model.HoodieCleanPartitionMetadata; +import org.apache.hudi.common.model.HoodieCleaningPolicy; +import org.apache.hudi.common.model.HoodieCommitMetadata; +import org.apache.hudi.common.model.HoodieWriteStat; +import org.apache.hudi.common.table.timeline.HoodieActiveTimeline; +import org.apache.hudi.common.table.timeline.HoodieInstant; +import org.apache.hudi.common.table.timeline.HoodieTimeline; +import org.apache.hudi.common.table.timeline.TimelineMetadataUtils; +import org.apache.hudi.common.table.timeline.TimelineUtils; +import org.apache.hudi.common.testutils.HoodieCommonTestHarness; +import org.apache.hudi.common.util.Option; +import org.junit.jupiter.api.BeforeEach; +import org.junit.jupiter.api.Test; + +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.nio.file.Paths; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertTrue; + +public class TestTimelineUtils extends HoodieCommonTestHarness { + + @BeforeEach + public void setUp() throws Exception { +initMetaClient(); + } + + @Test + public void testGetPartitions() throws IOException { +HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); +HoodieTimeline activeCommitTimeline = activeTimeline.getCommitTimeline(); +assertTrue(activeCommitTimeline.empty()); + +String olderPartition = "0"; // older partitions that is modified by all cleans +for (int i = 1; i <= 5; i++) { + String ts = i + ""; + HoodieInstant instant = new HoodieInstant(true, HoodieTimeline.COMMIT_ACTION, ts); + activeTimeline.createNewInstant(instant); + activeTimeline.saveAsComplete(instant, Option.of(getCommitMeta(basePath, ts, ts, 2))); + + 
HoodieInstant cleanInstant = new HoodieInstant(true, HoodieTimeline.CLEAN_ACTION, ts); + activeTimeline.createNewInstant(cleanInstant); + activeTimeline.saveAsComplete(cleanInstant, getCleanMeta(olderPartition, ts)); +} + +metaClient.reloadActiveTimeline(); + +// verify modified partitions included cleaned data +List partitions = TimelineUtils.getPartitionsMutated(metaClient.getActiveTimeline().findInstantsAfter("1", 10)); +assertEquals(5, partitions.size()); +assertEquals(partitions, Arrays.asList(new String[]{"0", "2", "3", "4", "5"})); + +partitions = TimelineUtils.getPartitionsMutated(metaClient.getActiveTimeline().findInstantsInRange("1", "4")); +assertEquals(4, partitions.size()); +assertEquals(partitions, Arrays.asList(new String[]{"0", "2", "3", "4"})); + +// verify only commit actions +partitions = TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().findInstantsAfter("1", 10)); +assertEquals(4, partitions.size()); +assertEquals(partitions, Arrays.asList(new String[]{"2", "3", "4", "5"})); + +partitions = TimelineUtils.getPartitionsWritten(metaClient.getActiveTimeline().findInstantsInRange("1", "4")); +assertEquals(3, partitions.size()); +assertEquals(partitions, Arrays.asList(new String[]{"2", "3", "4"})); + } + + @Test + public void testGetPartitionsUnpartitioned() throws IOException { +HoodieActiveTimeline activeTimeline = metaClient.getActiveTimeline(); +HoodieTimeline activeCommitTimeline = activeTimeline.getCommitTimeline(); +assertTrue(activeCommitTimeline.empty()); + +String partitionPath = ""; +for (int i = 1; i <= 5; i++) { + String ts = i + ""; + HoodieInstant instant = new HoodieInstant(true, HoodieTimeline.COMMIT_ACTION, ts); + activeTimeline.createNewInstant(instant); + active
[GitHub] [hudi] prashanthvg89 opened a new issue #2029: Records seen with _hoodie_is_deleted set to true on non-existing record
prashanthvg89 opened a new issue #2029: URL: https://github.com/apache/hudi/issues/2029 If a Hudi table has zero rows, say, and I issue an upsert with `_hoodie_is_deleted = true`, the record is still visible when I read the table back. Deletion works if the record already exists, but as a user I expect the same behavior for a non-existent record too. So far the only workarounds seem to be a soft delete or querying with `_hoodie_is_deleted = false`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] n3nash commented on a change in pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.
n3nash commented on a change in pull request #2026: URL: https://github.com/apache/hudi/pull/2026#discussion_r475920521 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/RemoteHoodieTableFileSystemView.java ## @@ -147,7 +153,7 @@ public RemoteHoodieTableFileSystemView(String server, int port, HoodieTableMetaC String url = builder.toString(); LOG.info("Sending request : (" + url + ")"); Response response; -int timeout = 1000 * 300; // 5 min timeout +int timeout = this.timeoutSecs * 1000; // msec Review comment: Does the Request.connect expect timeout in millisecs ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
satishkotha commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475920279 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java ## @@ -296,6 +300,42 @@ public boolean isBeforeTimelineStarts(String instant) { return details.apply(instant); } + /** + * Returns partitions that have been modified in the timeline. This includes internal operations such as clean. + * Note that this only returns data for completed instants. + */ + public List getPartitionsMutated() { +return filterCompletedInstants().getInstants().flatMap(s -> { + switch (s.getAction()) { +case HoodieTimeline.COMMIT_ACTION: +case HoodieTimeline.DELTA_COMMIT_ACTION: + try { +HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(getInstantDetails(s).get(), HoodieCommitMetadata.class); +return commitMetadata.getPartitionToWriteStats().keySet().stream(); + } catch (IOException e) { +throw new HoodieIOException("Failed to get partitions written between " + firstInstant() + " " + lastInstant(), e); + } +case HoodieTimeline.CLEAN_ACTION: Review comment: The method takes in a timeline. So hive sync only passes "commit" timeline to this method and gets only partitions modified by commit instants. I added another method just for clarity that only looks at commits. Let me know if you have any suggestions.
[GitHub] [hudi] n3nash commented on a change in pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.
n3nash commented on a change in pull request #2026: URL: https://github.com/apache/hudi/pull/2026#discussion_r475920074 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewStorageConfig.java ## @@ -52,7 +54,7 @@ public static final String DEFAULT_FILESYSTEM_VIEW_INCREMENTAL_SYNC_MODE = "false"; public static final String DEFUALT_REMOTE_VIEW_SERVER_HOST = "localhost"; public static final Integer DEFAULT_REMOTE_VIEW_SERVER_PORT = 26754; - + public static final Integer DEFAULT_REMOTE_TIMELINE_CLIENT_TIMEOUT_SECS = 300 * 60; // 5 min Review comment: @prashantwason How is this 5 mins? If it's secs, it should just be 300?
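The unit mix-up flagged in the two review comments above can be sketched numerically (constant names below are illustrative, not Hudi's actual fields): a seconds-based default meant to be 5 minutes must be 300, whereas `300 * 60` is 18000 seconds, i.e. 5 hours, and the conversion to milliseconds belongs at the call site that builds the HTTP request.

```java
// Sketch of seconds-vs-milliseconds handling for a timeout default.
class TimeoutUnits {
    static final int FIVE_MINUTES_SECS = 5 * 60;     // 300 — correct "5 min" in seconds
    static final int SUSPICIOUS_DEFAULT = 300 * 60;  // 18000 seconds = 5 hours, not 5 minutes

    // Convert a seconds-based config value to the milliseconds that
    // HTTP client timeout parameters typically expect.
    static int toMillis(int timeoutSecs) {
        return timeoutSecs * 1000;
    }
}
```

With a correct `FIVE_MINUTES_SECS = 300`, `toMillis` yields the 300000 msec the old hard-coded `1000 * 300` expressed.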
[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
satishkotha commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475919445 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java ## @@ -296,6 +300,42 @@ public boolean isBeforeTimelineStarts(String instant) { return details.apply(instant); } + /** Review comment: Thank you. Moved it
[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
satishkotha commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475919254 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java ## @@ -232,6 +233,12 @@ */ Option getInstantDetails(HoodieInstant instant); + /** + * Returns partitions that have been modified in the timeline. This includes internal operations such as clean. + * Note that this only returns data for completed instants. + */ + List getPartitionsMutated(); Review comment: created TimelineUtils
[hudi] branch master updated (ea983ff -> f7e02aa)
This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git. from ea983ff [HUDI-1137] Add option to configure different path selector add f7e02aa [MINOR] Update DOAP with 0.6.0 Release (#2024) No new revisions were added by this update. Summary of changes: doap_HUDI.rdf | 5 + 1 file changed, 5 insertions(+)
[GitHub] [hudi] bhasudha merged pull request #2024: [MINOR] Update DOAP with 0.6.0 Release
bhasudha merged pull request #2024: URL: https://github.com/apache/hudi/pull/2024
[GitHub] [hudi] bhasudha commented on pull request #2016: [WIP] Add release page doc for 0.6.0
bhasudha commented on pull request #2016: URL: https://github.com/apache/hudi/pull/2016#issuecomment-679383433 closing this in favor of https://github.com/apache/hudi/pull/2028 . Capture the comment there.
[GitHub] [hudi] bhasudha closed pull request #2016: [WIP] Add release page doc for 0.6.0
bhasudha closed pull request #2016: URL: https://github.com/apache/hudi/pull/2016
[GitHub] [hudi] bhasudha opened a new pull request #2028: Site update and release page for 0.6.0
bhasudha opened a new pull request #2028: URL: https://github.com/apache/hudi/pull/2028 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Updated] (HUDI-1136) Add back findInstantsAfterOrEquals to the HoodieTimeline class
[ https://issues.apache.org/jira/browse/HUDI-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1136: - Labels: pull-request-available (was: ) > Add back findInstantsAfterOrEquals to the HoodieTimeline class > -- > > Key: HUDI-1136 > URL: https://issues.apache.org/jira/browse/HUDI-1136 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Prashant Wason >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475907308 ## File path: hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java ## @@ -45,22 +45,39 @@ import org.apache.avro.generic.GenericRecord; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hbase.Cell; +import org.apache.hadoop.hbase.io.hfile.CacheConfig; +import org.apache.hadoop.hbase.io.hfile.HFile; +import org.apache.hadoop.hbase.io.hfile.HFileScanner; import org.apache.log4j.LogManager; import org.apache.log4j.Logger; import org.apache.parquet.avro.AvroSchemaConverter; import org.apache.parquet.hadoop.ParquetWriter; import org.apache.parquet.hadoop.metadata.CompressionCodecName; import org.apache.spark.SparkConf; +import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.DataType; +import org.apache.spark.sql.types.StructType; + +import com.databricks.spark.avro.SchemaConverters; + +import scala.Function1; Review comment: Figured out a way to fix it for now. tests seem happy
[GitHub] [hudi] prashantwason opened a new pull request #2027: [HUDI-1136] Add back findInstantsAfterOrEquals to the HoodieTimeline class.
prashantwason opened a new pull request #2027: URL: https://github.com/apache/hudi/pull/2027 ## What is the purpose of the pull request Add an API findInstantsAfterOrEquals to HoodieTimeline. ## Verify this pull request This pull request is a trivial rework / code cleanup without any test coverage. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] bvaradar commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
bvaradar commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475895369 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java ## @@ -296,6 +300,42 @@ public boolean isBeforeTimelineStarts(String instant) { return details.apply(instant); } + /** Review comment: Timeline APIs are only about instants in general. I think adding partitions here is breaking that abstraction. Can you move this to some helper class ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java ## @@ -232,6 +233,12 @@ */ Option getInstantDetails(HoodieInstant instant); + /** + * Returns partitions that have been modified in the timeline. This includes internal operations such as clean. + * Note that this only returns data for completed instants. + */ + List getPartitionsMutated(); Review comment: +1 on abstraction point. I think having a separate helper class would be better. ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieDefaultTimeline.java ## @@ -296,6 +300,42 @@ public boolean isBeforeTimelineStarts(String instant) { return details.apply(instant); } + /** + * Returns partitions that have been modified in the timeline. This includes internal operations such as clean. + * Note that this only returns data for completed instants. 
+ */ + public List getPartitionsMutated() { +return filterCompletedInstants().getInstants().flatMap(s -> { + switch (s.getAction()) { +case HoodieTimeline.COMMIT_ACTION: +case HoodieTimeline.DELTA_COMMIT_ACTION: + try { +HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(getInstantDetails(s).get(), HoodieCommitMetadata.class); +return commitMetadata.getPartitionToWriteStats().keySet().stream(); + } catch (IOException e) { +throw new HoodieIOException("Failed to get partitions written between " + firstInstant() + " " + lastInstant(), e); + } +case HoodieTimeline.CLEAN_ACTION: Review comment: We don't need to look at clean (and even compaction) for figuring out changed partitions for the case of hive syncing.
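The split the reviewers converge on — "mutated" partitions covering every completed action including clean, versus "written" partitions from commit/deltacommit only, which is all hive sync needs — can be sketched as below. All names are illustrative stand-ins, not Hudi's actual TimelineUtils API; each completed instant is modeled as an action plus the partitions it touched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class TimelineUtilsSketch {
    // Minimal stand-in for a completed HoodieInstant plus its commit metadata.
    static class Instant {
        final String action;
        final List<String> partitions;
        Instant(String action, List<String> partitions) {
            this.action = action;
            this.partitions = partitions;
        }
    }

    // Partitions changed by any completed action, including clean.
    static List<String> getPartitionsMutated(List<Instant> instants) {
        Set<String> out = new TreeSet<>();
        for (Instant i : instants) {
            out.addAll(i.partitions);
        }
        return new ArrayList<>(out);
    }

    // Partitions changed by commit/deltacommit only — the subset hive sync cares about.
    static List<String> getPartitionsWritten(List<Instant> instants) {
        Set<String> out = new TreeSet<>();
        for (Instant i : instants) {
            if (i.action.equals("commit") || i.action.equals("deltacommit")) {
                out.addAll(i.partitions);
            }
        }
        return new ArrayList<>(out);
    }
}
```

Keeping this in a helper class (rather than on the timeline itself) matches the reviewers' point that timeline APIs talk about instants, not underlying data.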
[GitHub] [hudi] satishkotha commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
satishkotha commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475901534 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java ## @@ -232,6 +233,12 @@ */ Option getInstantDetails(HoodieInstant instant); + /** + * Returns partitions that have been modified in the timeline. This includes internal operations such as clean. + * Note that this only returns data for completed instants. + */ + List getPartitionsMutated(); Review comment: @n3nash I want this to be in hudi-common, so this can be reused in hadoop-mr and hive-sync. Do you want me to create TimelineUtils in common?
[jira] [Updated] (HUDI-1135) Make timeline server timeout settings configurable
[ https://issues.apache.org/jira/browse/HUDI-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1135: - Labels: pull-request-available (was: ) > Make timeline server timeout settings configurable > -- > > Key: HUDI-1135 > URL: https://issues.apache.org/jira/browse/HUDI-1135 > Project: Apache Hudi > Issue Type: Improvement > Components: Writer Core >Reporter: Nishith Agarwal >Assignee: Nishith Agarwal >Priority: Major > Labels: pull-request-available > --
[GitHub] [hudi] prashantwason opened a new pull request #2026: [HUDI-1135] Make timeline server timeout settings configurable.
prashantwason opened a new pull request #2026: URL: https://github.com/apache/hudi/pull/2026 ## *Tips* ## What is the purpose of the pull request Make timeline server timeout settings configurable. ## Brief change log Add timeout config settings for timeline server. ## Verify this pull request This pull request is a trivial rework / code cleanup without any test coverage. This pull request is already covered by existing tests. ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475893386 ## File path: hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestUtils.java ## @@ -45,22 +45,39 @@ import org.apache.avro.generic.GenericRecord; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; +import org.apache.hadoop.hbase.Cell; +import org.apache.hadoop.hbase.io.hfile.CacheConfig; +import org.apache.hadoop.hbase.io.hfile.HFile; +import org.apache.hadoop.hbase.io.hfile.HFileScanner; import org.apache.log4j.LogManager; import org.apache.log4j.Logger; import org.apache.parquet.avro.AvroSchemaConverter; import org.apache.parquet.hadoop.ParquetWriter; import org.apache.parquet.hadoop.metadata.CompressionCodecName; import org.apache.spark.SparkConf; +import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.sql.Dataset; import org.apache.spark.sql.Row; import org.apache.spark.sql.SQLContext; +import org.apache.spark.sql.types.DataType; +import org.apache.spark.sql.types.StructType; + +import com.databricks.spark.avro.SchemaConverters; + +import scala.Function1; Review comment: @prashantwason it's best we don't import scala methods into java. com.databricks/avro is simply deprecated. We have equivalent functionality in spark-avro itself. In general we have to fix this test code that reads HFile as a DataSet in a different way.
[GitHub] [hudi] n3nash commented on a change in pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
n3nash commented on a change in pull request #1964: URL: https://github.com/apache/hudi/pull/1964#discussion_r475892688 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java ## @@ -232,6 +233,12 @@ */ Option getInstantDetails(HoodieInstant instant); + /** + * Returns partitions that have been modified in the timeline. This includes internal operations such as clean. + * Note that this only returns data for completed instants. + */ + List getPartitionsMutated(); Review comment: I don't think this makes sense in the timeline as of now. If you take a look at the timeline API's, they only talk about the metadata that has changed. `getPartitionsMutated` conceptually is providing what has changed in the underlying data as opposed to what has changed in the timeline per se. Generally, all of this information should come from the timeline but that requires a full redesign on the timeline. Should we add this API here -> https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieReadClient.java#L195 ? And you can wrap this functionality in a TimelineUtils ? When we have clearer design on timeline, we can merge back the TimelineUtils to the real timeline...
[GitHub] [hudi] tooptoop4 edited a comment on issue #1955: [SUPPORT] DMS partition treated as part of pk
tooptoop4 edited a comment on issue #1955: URL: https://github.com/apache/hudi/issues/1955#issuecomment-679356547 @nsivabalan perfect, that is how I expect it. Perhaps the default should be a global index? Or the documentation should be updated? Coming from an RDBMS background, the PK is unique at the table level, not at the partition level, but reading the configs below it is not clear that the hudi default is different, and I'm sure it will trip up many newcomers to hudi: "RECORDKEY_FIELD_OPT_KEY (Required): **Primary key** field(s). Nested fields can be specified using the dot notation eg: a.b.c. When using multiple columns as primary key use comma separated notation, eg: "col1,col2,col3,etc". Single or multiple columns as primary key specified by KEYGENERATOR_CLASS_OPT_KEY property. Default value: "uuid" PARTITIONPATH_FIELD_OPT_KEY (Required): Columns to be used for **partitioning** the table. To prevent partitioning, provide empty string as value eg: "". Specify partitioning/no partitioning using KEYGENERATOR_CLASS_OPT_KEY. If synchronizing to hive, also specify using HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY. Default value: "partitionpath""
[GitHub] [hudi] tooptoop4 commented on issue #1955: [SUPPORT] DMS partition treated as part of pk
tooptoop4 commented on issue #1955: URL: https://github.com/apache/hudi/issues/1955#issuecomment-679356547 perfect, that is how I expect it. Perhaps the default should be a global index? Or the documentation should be updated? Coming from an RDBMS background, the PK is unique at the table level, not at the partition level, but reading the configs below it is not clear that the hudi default is different, and I'm sure it will trip up many newcomers to hudi: "RECORDKEY_FIELD_OPT_KEY (Required): **Primary key** field(s). Nested fields can be specified using the dot notation eg: a.b.c. When using multiple columns as primary key use comma separated notation, eg: "col1,col2,col3,etc". Single or multiple columns as primary key specified by KEYGENERATOR_CLASS_OPT_KEY property. Default value: "uuid" PARTITIONPATH_FIELD_OPT_KEY (Required): Columns to be used for **partitioning** the table. To prevent partitioning, provide empty string as value eg: "". Specify partitioning/no partitioning using KEYGENERATOR_CLASS_OPT_KEY. If synchronizing to hive, also specify using HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY. Default value: "partitionpath""
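The surprise described above — that Hudi's default (non-global) index makes the record key unique only within a partition, unlike an RDBMS primary key — can be sketched with a toy table whose row identity is the (recordKey, partitionPath) pair. This is illustrative only; the names are not Hudi API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the default (partition-scoped) index: the map key is the
// (recordKey, partitionPath) pair, so the same recordKey inserted under
// two partitions produces two live rows instead of one.
class PartitionLevelKeySketch {
    final Map<List<String>, String> rows = new HashMap<>();

    void upsert(String recordKey, String partitionPath, String value) {
        rows.put(List.of(recordKey, partitionPath), value);
    }
}
```

A global index would instead key the map by recordKey alone, giving the table-level uniqueness an RDBMS user expects.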
[GitHub] [hudi] tooptoop4 commented on issue #1954: [SUPPORT] DMS Caused by: java.lang.IllegalArgumentException: Partition key parts [] does not match with partition values
tooptoop4 commented on issue #1954: URL: https://github.com/apache/hudi/issues/1954#issuecomment-679353671 @bvaradar in each comment I am trying brand new tables with different spark submits. So not changing an existing table. try to reproduce with /home/ec2-user/spark_home/bin/spark-submit --conf "spark.hadoop.fs.s3a.proxy.host=redact" --conf "spark.hadoop.fs.s3a.proxy.port=redact" --conf "spark.driver.extraClassPath=/home/ec2-user/json-20090211.jar" --conf "spark.executor.extraClassPath=/home/ec2-user/json-20090211.jar" --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer --jars "/home/ec2-user/spark-avro_2.11-2.4.6.jar" --master spark://redact:7077 --deploy-mode client /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar --table-type COPY_ON_WRITE --source-ordering-field TimeCreated --source-class org.apache.hudi.utilities.sources.ParquetDFSSource --enable-hive-sync --hoodie-conf hoodie.datasource.hive_sync.database=redact --hoodie-conf hoodie.datasource.hive_sync.table=dmstest_multpk7 --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false --target-base-path s3a://redact/my2/multpk7 --target-table dmstest_multpk7 --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer --payload-class org.apache.hudi.payload.AWSDmsAvroPayload --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator --hoodie-conf hoodie.datasource.write.recordkey.field=version_no,group_company --hoodie-conf "hoodie.datasource.write.partitionpath.field=" --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/tbl
[GitHub] [hudi] nsivabalan commented on issue #1955: [SUPPORT] DMS partition treated as part of pk
nsivabalan commented on issue #1955: URL: https://github.com/apache/hudi/issues/1955#issuecomment-679352339 @tooptoop4: can you clarify what you mean by this?
```
ie for each version_no,group_company combo, i want to get the latest row by TimeCreated (ie the source-ordering-field) and then partition on whatever sys_user that latest row has.
```
But in general, yes, if you use a global index with the update-partition-path config set, you should not see any duplicates in your entire hoodie dataset. Let me illustrate with an example. Let's say each row consists of only 4 values: v_no (version_no), cmp (group_company), time_cr, sys_user.

In case of a regular index, the combination of record key and partition path forms the unique key. If you are using a regular index and ingest
v_1, c_1, t_1, u_1
v_2, c_1, t_1, u_1
v_1, c_1, t_1, u_2
v_1, c_1, t_1, u_3
this will result in 2 rows going to partition u_1, 1 row to partition u_2, and one row to u_3.

In the 2nd batch of updates, let's say you ingest a few more rows:
v_1, c_1, t_2, u_1
v_3, c_1, t_2, u_1
v_1, c_2, t_2, u_2
v_1, c_3, t_2, u_3

Here is the result:
u_1:
v_1, c_1, t_2, u_1 (updated with latest value)
v_2, c_1, t_1, u_1
v_3, c_1, t_2, u_1 (insert from 2nd batch)
u_2:
v_1, c_2, t_2, u_2 (updated with latest value)
u_3:
v_1, c_1, t_1, u_3
v_1, c_3, t_2, u_3 (insert from 2nd batch)

In case of a global index, only record keys are unique. Let's see an example with global bloom, but with the update-partition-path config not set. If the 1st batch of ingest contains
v_1, c_1, t_1, u_1
v_1, c_2, t_1, u_1
v_2, c_1, t_1, u_2
v_3, c_1, t_1, u_3
the result will be:
v_1, c_1, t_1, u_1
v_1, c_2, t_1, u_1
v_2, c_1, t_1, u_2
v_3, c_1, t_1, u_3

And the 2nd batch of ingest contains:
v_1, c_1, t_2, u_1 (updating with latest time)
v_1, c_2, t_2, u_2 (moving v1,c2 from u_1 to u_2; the expectation is that this will update u_1 only, since the config is not set, and hence the new partition path, i.e. u_2, will be ignored)
v_2, c_2, t_2, u_2 (new insert)
v_1, c_3, t_2, u_3 (new insert)

So the result will be:
v_1, c_1, t_2, u_1 (updated with latest time)
v_1, c_2, t_2, u_1 (updated with latest time even though the incoming record was sent to u_2)
v_2, c_1, t_1, u_2
v_2, c_2, t_2, u_2 (new insert)
v_3, c_1, t_1, u_3
v_1, c_3, t_2, u_3 (new insert)

We can do the same with the config value set. Result from the first batch:
v_1, c_1, t_1, u_1
v_1, c_2, t_1, u_1
v_2, c_1, t_1, u_2
v_3, c_1, t_1, u_3

And the 2nd batch of ingest contains:
v_1, c_1, t_2, u_1 (updating with latest time)
v_1, c_2, t_2, u_2 (moving v1,c2 from u_1 to u_2; the expectation is that this will insert a new record into u_2 and delete the corresponding record from u_1, since the config is set)
v_2, c_2, t_2, u_2 (new insert)
v_1, c_3, t_2, u_3 (new insert)

So the result will be:
v_1, c_1, t_2, u_1 (updated with latest time)
v_1, c_2, t_2, u_2 (updated with latest time and the old record is deleted)
v_2, c_1, t_1, u_2
v_2, c_2, t_2, u_2 (new insert)
v_3, c_1, t_1, u_3
v_1, c_3, t_2, u_3 (new insert)
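The two global-index behaviors walked through above can be sketched as a toy upsert loop (illustrative only, not Hudi code): with a global index the record key alone is unique, and the update-partition-path flag decides whether an update arriving under a new partition moves the record or is applied in place in the old partition.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of global-index upserts keyed by record key alone.
class GlobalIndexSketch {
    static class Rec {
        String partition;
        String value;
        Rec(String partition, String value) { this.partition = partition; this.value = value; }
    }

    final Map<String, Rec> byKey = new HashMap<>();
    final boolean updatePartitionPath;

    GlobalIndexSketch(boolean updatePartitionPath) { this.updatePartitionPath = updatePartitionPath; }

    void upsert(String recordKey, String partition, String value) {
        Rec existing = byKey.get(recordKey);
        if (existing == null) {
            byKey.put(recordKey, new Rec(partition, value));  // plain insert
        } else if (updatePartitionPath) {
            byKey.put(recordKey, new Rec(partition, value));  // delete from old partition, insert into new
        } else {
            existing.value = value;                           // update applied in the old partition
        }
    }
}
```

Either way there is exactly one row per record key, which is why no duplicates appear across the dataset.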
[GitHub] [hudi] n3nash merged pull request #2023: [HUDI-1137] Add option to configure different path selector
n3nash merged pull request #2023: URL: https://github.com/apache/hudi/pull/2023
[hudi] branch master updated: [HUDI-1137] Add option to configure different path selector
This is an automated email from the ASF dual-hosted git repository. nagarwal pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new ea983ff [HUDI-1137] Add option to configure different path selector ea983ff is described below commit ea983ff912dab8604eec3085bc1c041cb6e60bc8 Author: Satish Kotha AuthorDate: Mon Aug 24 11:11:10 2020 -0700 [HUDI-1137] Add option to configure different path selector --- .../integ/testsuite/helpers/DFSTestSuitePathSelector.java | 14 -- .../main/java/org/apache/hudi/utilities/UtilHelpers.java | 9 +++-- .../org/apache/hudi/utilities/sources/AvroDFSSource.java | 3 +-- .../hudi/utilities/sources/helpers/DFSPathSelector.java| 1 + 4 files changed, 21 insertions(+), 6 deletions(-) diff --git a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/helpers/DFSTestSuitePathSelector.java b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/helpers/DFSTestSuitePathSelector.java index b67e21f..bfc8368 100644 --- a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/helpers/DFSTestSuitePathSelector.java +++ b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/helpers/DFSTestSuitePathSelector.java @@ -32,12 +32,17 @@ import org.apache.hudi.common.util.Option; import org.apache.hudi.common.util.collection.ImmutablePair; import org.apache.hudi.common.util.collection.Pair; import org.apache.hudi.exception.HoodieIOException; +import org.apache.hudi.integ.testsuite.HoodieTestSuiteJob; import org.apache.hudi.utilities.sources.helpers.DFSPathSelector; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + /** * A custom dfs path selector used only for the hudi test suite. To be used only if workload is not run inline. 
 */
public class DFSTestSuitePathSelector extends DFSPathSelector {

+  private static volatile Logger log = LoggerFactory.getLogger(HoodieTestSuiteJob.class);

   public DFSTestSuitePathSelector(TypedProperties props, Configuration hadoopConf) {
     super(props, hadoopConf);
@@ -54,9 +59,12 @@ public class DFSTestSuitePathSelector extends DFSPathSelector {
       lastBatchId = Integer.parseInt(lastCheckpointStr.get());
       nextBatchId = lastBatchId + 1;
     } else {
-      lastBatchId = -1;
-      nextBatchId = 0;
+      lastBatchId = 0;
+      nextBatchId = 1;
     }
+
+    log.info("Using DFSTestSuitePathSelector, checkpoint: " + lastCheckpointStr + " sourceLimit: " + sourceLimit
+        + " lastBatchId: " + lastBatchId + " nextBatchId: " + nextBatchId);
     // obtain all eligible files for the batch
     List eligibleFiles = new ArrayList<>();
     FileStatus[] fileStatuses = fs.globStatus(
@@ -73,6 +81,8 @@ public class DFSTestSuitePathSelector extends DFSPathSelector {
         }
       }
     }
+
+    log.info("Reading " + eligibleFiles.size() + " files. ");
     // no data to readAvro
     if (eligibleFiles.size() == 0) {
       return new ImmutablePair<>(Option.empty(),
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
index 0531196..14e16ab 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
@@ -352,12 +352,17 @@ public class UtilHelpers {
     }
   }

-  public static DFSPathSelector createSourceSelector(String sourceSelectorClass, TypedProperties props,
+  public static DFSPathSelector createSourceSelector(TypedProperties props,
       Configuration conf) throws IOException {
+    String sourceSelectorClass =
+        props.getString(DFSPathSelector.Config.SOURCE_INPUT_SELECTOR, DFSPathSelector.class.getName());
     try {
-      return (DFSPathSelector) ReflectionUtils.loadClass(sourceSelectorClass,
+      DFSPathSelector selector = (DFSPathSelector) ReflectionUtils.loadClass(sourceSelectorClass,
           new Class[]{TypedProperties.class, Configuration.class}, props, conf);
+
+      LOG.info("Using path selector " + selector.getClass().getName());
+      return selector;
     } catch (Throwable e) {
       throw new IOException("Could not load source selector class " + sourceSelectorClass, e);
     }
diff --git a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
index b5ce96f..b8f29e8 100644
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroDFSSource.java
@@ -47,8 +47,7 @@ public class AvroDFSSource extends AvroSource {
       SchemaProvider schemaProvider) throws IOException {
     super(props, sparkContext, sparkSession, schemaProvider)
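The checkpoint-to-batch-id bookkeeping in the DFSTestSuitePathSelector diff above can be sketched standalone. This is not Hudi's actual class — `BatchIdSketch` and `nextBatchIds` are hypothetical names — but the logic mirrors the patched else-branch, which starts from batch 1 when no checkpoint exists:

```java
import java.util.Optional;

// Minimal sketch (not the real Hudi class) of the batch-id bookkeeping from the
// DFSTestSuitePathSelector diff above: resume from the checkpointed batch if one
// exists, otherwise start fresh at batch 1 (the patched defaults).
public class BatchIdSketch {

  // Returns {lastBatchId, nextBatchId} for a given optional checkpoint string.
  static int[] nextBatchIds(Optional<String> lastCheckpointStr) {
    int lastBatchId;
    int nextBatchId;
    if (lastCheckpointStr.isPresent()) {
      lastBatchId = Integer.parseInt(lastCheckpointStr.get());
      nextBatchId = lastBatchId + 1;
    } else {
      lastBatchId = 0;
      nextBatchId = 1;
    }
    return new int[]{lastBatchId, nextBatchId};
  }

  public static void main(String[] args) {
    int[] resumed = nextBatchIds(Optional.of("7"));
    int[] fresh = nextBatchIds(Optional.empty());
    System.out.println(resumed[0] + "," + resumed[1]); // 7,8
    System.out.println(fresh[0] + "," + fresh[1]);     // 0,1
  }
}
```

Note the pre-patch defaults (-1/0) and the patched ones (0/1) only differ in what a fresh run reports; once a checkpoint exists, both resume at `lastBatchId + 1`.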
[GitHub] [hudi] nsivabalan commented on a change in pull request #2016: [WIP] Add release page doc for 0.6.0
nsivabalan commented on a change in pull request #2016: URL: https://github.com/apache/hudi/pull/2016#discussion_r475863568

## File path: docs/_pages/releases.md

@@ -5,6 +5,72 @@
 layout: releases
 toc: true
 last_modified_at: 2020-05-28T08:40:00-07:00
 ---
+## [Release 0.6.0](https://github.com/apache/hudi/releases/tag/release-0.6.0) ([docs](/docs/0.6.0-quick-start-guide.html))
+
+### Download Information
+ * Source Release : [Apache Hudi 0.6.0 Source Release](https://downloads.apache.org/hudi/0.6.0/hudi-0.6.0.src.tgz) ([asc](https://downloads.apache.org/hudi/0.6.0/hudi-0.6.0.src.tgz.asc), [sha512](https://downloads.apache.org/hudi/0.6.0/hudi-0.6.0.src.tgz.sha512))
+ * Apache Hudi jars corresponding to this release are available [here](https://repository.apache.org/#nexus-search;quick~hudi)
+
+### Migration Guide for this release
+ - With 0.6.0, Hudi moves from list-based rollbacks to marker-based rollbacks. To aid this transition, a
+   new property called `hoodie.table.version` is added to the hoodie.properties file. Whenever Hudi is launched with
+   the newer table version, i.e. 1 (or when moving from pre-0.6.0 to 0.6.0), an upgrade step is executed automatically
+   to adhere to marker-based rollbacks. This automatic upgrade step happens just once per dataset, as
+   `hoodie.table.version` is updated in the property file after the upgrade completes.
+ - Similarly, a command line tool for downgrading is added, in case some users want to downgrade Hudi from
+   table version 1 to 0, i.e. move from Hudi 0.6.0 to a pre-0.6.0 version.
+
+### Release Highlights
+
+ Ingestion side improvements:
+ - Hudi now supports `Azure Data Lake Storage V2`, `Alluxio` and `Tencent Cloud Object Storage` storages.
+ - Added support for "bulk_insert" without converting to RDD. This has better performance compared to the existing "bulk_insert".
+   This implementation uses the Datasource for writing to storage, with support for key generators that operate on Row
+   (rather than HoodieRecords as in the previous "bulk_insert").
+ - # TODO Add more about bulk insert modes.
+ - # TODO Add more on bootstrap.
+ - In previous versions, auto clean runs synchronously after ingestion. Starting 0.6.0, Hudi does cleaning and ingestion in parallel.
+ - Added support for async compaction for Spark streaming writes to Hudi tables. Previous versions supported only inline compaction.
+ - Implemented rollbacks using marker files instead of relying on commit metadata. Please check the migration guide for more details on this.
+ - A new InlineFileSystem has been added to support embedding any file format as an inline format within a regular file.
+
+ Query side improvements:
+ - Starting 0.6.0, snapshot queries are feasible via the Spark datasource.
+ - In prior versions we only supported HoodieCombineHiveInputFormat for CopyOnWrite tables, to ensure that there is a limit on the number of mappers spawned for
+   any query. Hudi now supports Merge on Read tables as well, using HoodieCombineInputFormat.
+ - Sped up Spark read queries by caching the metaclient in HoodieROPathFilter. This helps reduce listing-related overheads in S3 when filtering files for read-optimized queries.
+
+ DeltaStreamer improvements:
+ - HoodieMultiDeltaStreamer: adds support for ingesting multiple Kafka streams in a single DeltaStreamer deployment.
+ - Added a new tool, InitialCheckPointProvider, to set checkpoints when migrating to DeltaStreamer after an initial load of the table is complete.
+ - Added CSV source support.
+ - Added a chained transformer that can chain multiple transformers.
+
+ Indexing improvements:
+ - Added a new index, `HoodieSimpleIndex`, which joins incoming records with base files to index records.
+ - Added the ability to configure user-defined indexes.
+
+ Key generation improvements:
+ - Introduced `CustomTimestampBasedKeyGenerator` to support complex keys as record key and custom partition paths.
+ - Support more time units and date/time formats in `TimestampBasedKeyGenerator`.
+
+ Developer productivity and monitoring improvements:
+ - Spark DAGs are named to aid better debuggability.
+ - Console, JMX, Prometheus and DataDog metric reporters have been added.
+ - Support pluggable metrics reporting by introducing a proper abstraction for user-defined metrics.
+
+ CLI related features:
+ - Added support for deleting savepoints via CLI.
+ - Added a new command, `export instants`, to export metadata of instants.

Review comment: sure. you can add something like this. Feel free to edit as per convenience.
```
A command line tool is added to hudi-cli, to assist in upgrading or downgrading the hoodie dataset.
"UPGRADE" or "DOWNGRADE" is the command to use. DOWNGRADE has to be done using hudi-cli if someone
prefers to downgrade their hoodie dataset from 0.6.0 to any pre-0.6.0 version.
```
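The migration guide in the notes above hinges on a single key in `hoodie.properties`. As an illustration only — the table name and type below are hypothetical, and `hoodie.table.version` is the one key the guide actually describes — a post-upgrade file could look roughly like:

```properties
# .hoodie/hoodie.properties after the automatic 0.6.0 upgrade step (sketch)
hoodie.table.name=my_hudi_table
hoodie.table.type=COPY_ON_WRITE
# 1 => marker-based rollbacks (0.6.0+); 0 => pre-0.6.0 list-based rollbacks
hoodie.table.version=1
```

Because the version is persisted here, the upgrade runs once per dataset; the hudi-cli DOWNGRADE path discussed in the review comment moves this marker back for readers on pre-0.6.0 code.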
[jira] [Updated] (HUDI-1056) Ensure validate_staged_release.sh also runs against released version in release repo
[ https://issues.apache.org/jira/browse/HUDI-1056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1056:
---------------------------------
    Labels: pull-request-available (was: )

> Ensure validate_staged_release.sh also runs against released version in
> release repo
>
> Key: HUDI-1056
> URL: https://issues.apache.org/jira/browse/HUDI-1056
> Project: Apache Hudi
> Issue Type: Task
> Components: Release & Administrative
> Reporter: Balaji Varadarajan
> Assignee: sivabalan narayanan
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.6.1
>
> Currently, validate_staged_release.sh expects rc_num to be set. Once we
> release the version, we need this script to download the released source tarball
> and run checks.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] nsivabalan opened a new pull request #2025: [HUDI-1056] Fix release validate script for rc_num and release_type
nsivabalan opened a new pull request #2025: URL: https://github.com/apache/hudi/pull/2025

## What is the purpose of the pull request

Fixes the release validate script, making rc_num optional and introducing release_type (dev/release).

## Brief change log

Fixes the release validate script, making rc_num optional and introducing release_type (dev/release).

## Verify this pull request

This change added tests and can be verified as follows:
- *Manually verified the change by running a job locally.*

## Committer checklist

- [x] Has a corresponding JIRA in PR title & commit
- [x] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] bvaradar commented on issue #1962: [SUPPORT] Unable to filter hudi table in hive on partition column
bvaradar commented on issue #1962: URL: https://github.com/apache/hudi/issues/1962#issuecomment-679316776 For the second case, the Hive Metastore would be filtering out partitions and only returning specific paths. I think there is some inconsistency between the path used in the filesystem and the one that is present in the metastore. @sassai : Sorry for the delay. Can you recursively list your hoodie dataset and attach the output? Also please add the file contents of the latest .commit or .deltacommit file. Also, add the output for one of the partitions, with location: describe formatted table_name partition (year=xxx,month=xxx,day=xxx,hour=xxx,minute=xxx);
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r475815871

## File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java

@@ -81,6 +87,7 @@ public Builder fromProperties(Properties props) {
   public Builder limitFileSize(long maxFileSize) {
     props.setProperty(PARQUET_FILE_MAX_BYTES, String.valueOf(maxFileSize));
+    props.setProperty(HFILE_FILE_MAX_BYTES, String.valueOf(maxFileSize));

Review comment: not sure if it's a good idea to overload two configs like this. we may need to break this builder method up separately.

## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java

@@ -55,7 +56,7 @@
   private long recordsWritten = 0;
   private long insertRecordsWritten = 0;
   private long recordsDeleted = 0;
-  private Iterator> recordIterator;
+  private Map> recordMap;

Review comment: need to ensure that having this be a map won't affect normal inserts.

## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java

@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+/**
+ * Hoodie merge handle which writes records (new inserts or updates) sorted by their key.
+ *
+ * The implementation performs a merge-sort by comparing the key of the record being written to the list of
+ * keys in newRecordKeys (sorted in-memory).
+ */
+public class HoodieSortedMergeHandle extends HoodieMergeHandle {
+
+  private Queue newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable hoodieTable,
+      Iterator> recordItr, String partitionPath, String fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, sparkTaskContextSupplier);
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable hoodieTable,
+      Map> keyToNewRecordsOrig, String partitionPath, String fileId,
+      HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, keyToNewRecordsOrig, partitionPath, fileId, dataFileToBeMerged,
+        sparkTaskContextSupplier);
+
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Go through an old record. Here if we detect a newer version shows up, we write the new one to the file.
+   */
+  @Override
+  public void write(GenericRecord oldRecord) {
+    String key = oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+
+    // To maintain overall sorted order across updates and inserts, write any new inserts whose keys are less than
+    // the oldRecord's key.
+    while (!newRecordKeysSorted.isEmpty() && newRecordKeysSorted.peek().compareTo(key) <= 0) {
+      String keyToPreWrite = newRecordKeysSorted.remove();

Review comment: instead, we can just do a streaming sort-merge?

## File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestHoodieDemo.java

@@ -115,6 +115,35 @@ public void testParquetDemo() throws Exception {
     testIncrementalHiveQueryAfterCompaction();
   }

+  @Test
+  public void testHFileDemo() throws Exception {

Review comment: this is good. but will add to our integration test runtim
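The review thread above discusses the core idea of HoodieSortedMergeHandle: old records arrive already sorted, new insert keys sit in a PriorityQueue, and any queued key that sorts before the current old key is flushed first, so the output file stays globally sorted. A stripped-down sketch of that interleaving — plain strings instead of HoodieRecords, and not the actual Hudi class — might look like:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the sorted-merge idea from the HoodieSortedMergeHandle review above:
// old keys arrive already sorted; new insert keys wait in a min-heap and are
// written out whenever they sort before the current old key, keeping the output
// globally sorted. Equal keys (updates) are merged in the real handle; that
// path is omitted here for brevity.
public class SortedMergeSketch {

  static List<String> merge(List<String> sortedOldKeys, List<String> newInsertKeys) {
    PriorityQueue<String> pending = new PriorityQueue<>(newInsertKeys);
    List<String> out = new ArrayList<>();
    for (String oldKey : sortedOldKeys) {
      // flush queued inserts whose keys sort before this old record
      while (!pending.isEmpty() && pending.peek().compareTo(oldKey) < 0) {
        out.add(pending.remove());
      }
      out.add(oldKey); // equal keys would be merged as updates in the real handle
    }
    while (!pending.isEmpty()) {
      out.add(pending.remove()); // trailing inserts after the last old record
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> merged = merge(Arrays.asList("b", "d", "f"), Arrays.asList("a", "c", "g"));
    System.out.println(merged); // [a, b, c, d, f, g]
  }
}
```

The reviewer's "streaming sort-merge" suggestion would replace the heap with a pre-sorted iterator over the new keys, avoiding holding all of them in memory at once.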
[hudi] branch asf-site updated: Travis CI build asf-site
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/asf-site by this push: new e995bd2 Travis CI build asf-site e995bd2 is described below commit e995bd2819c7df640826bada28f9dd87c0110a62 Author: CI AuthorDate: Mon Aug 24 19:11:45 2020 + Travis CI build asf-site --- content/assets/js/lunr/lunr-store.js | 2 +- content/cn/docs/querying_data.html | 12 ++--- content/docs/comparison.html | 4 +- content/docs/configurations.html | 88 content/docs/deployment.html | 6 +-- content/docs/powered_by.html | 4 ++ content/docs/querying_data.html | 49 content/docs/structure.html | 2 +- 8 files changed, 135 insertions(+), 32 deletions(-) diff --git a/content/assets/js/lunr/lunr-store.js b/content/assets/js/lunr/lunr-store.js index 5e4b619..c07019e 100644 --- a/content/assets/js/lunr/lunr-store.js +++ b/content/assets/js/lunr/lunr-store.js @@ -825,7 +825,7 @@ var store = [{ "url": "https://hudi.apache.org/docs/writing_data.html";, "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{ "title": "查询 Hudi 数据集", -"excerpt":"从概念上讲,Hudi物理存储一次数据到DFS上,同时在其上提供三个逻辑视图,如之前所述。 数据集同步到Hive Metastore后,它将提供由Hudi的自定义输入格式支持的Hive外部表。一旦提供了适当的Hudi捆绑包, 就可以通过Hive、Spark和Presto之类的常用查询引擎来查询数据集。 具体来说,在写入过程中传递了两个由table name命名的Hive表。 例如,如果table name = hudi_tbl,我们得到 hudi_tbl 实现了由 HoodieParquetInputFormat 支持的数据集的读优化视图,从而提供了纯列式数据。 hudi_tbl_rt 实现了由 HoodieParquetRealtimeInputFormat 支持的数据集的实时视图,从而提供了基础数据和日志数据的合并视图。 如概念部分所述,增量处理所需要的 一个关键原语是增量拉取(以从数据集中获取更改流/日志)。您可以增量提取Hudi数据集,这意味着自指定的即时时间起, 您可� �只获得全部更新和新行。 这与插入更新一起使用,对于构建某 [...] 
+"excerpt":"从概念上讲,Hudi物理存储一次数据到DFS上,同时在其上提供三个逻辑视图,如之前所述。 数据集同步到Hive Metastore后,它将提供由Hudi的自定义输入格式支持的Hive外部表。一旦提供了适当的Hudi捆绑包, 就可以通过Hive、Spark和Presto之类的常用查询引擎来查询数据集。 具体来说,在写入过程中传递了两个由table name命名的Hive表。 例如,如果table name = hudi_tbl,我们得到 hudi_tbl 实现了由 HoodieParquetInputFormat 支持的数据集的读优化视图,从而提供了纯列式数据。 hudi_tbl_rt 实现了由 HoodieParquetRealtimeInputFormat 支持的数据集的实时视图,从而提供了基础数据和日志数据的合并视图。 如概念部分所述,增量处理所需要的 一个关键原语是增量拉取(以从数据集中获取更改流/日志)。您可以增量提取Hudi数据集,这意味着自指定的即时时间起, 您可� �只获得全部更新和新行。 这与插入更新一起使用,对于构建某 [...] "tags": [], "url": "https://hudi.apache.org/cn/docs/querying_data.html";, "teaser":"https://hudi.apache.org/assets/images/500x300.png"},{ diff --git a/content/cn/docs/querying_data.html b/content/cn/docs/querying_data.html index e808682..3129f6d 100644 --- a/content/cn/docs/querying_data.html +++ b/content/cn/docs/querying_data.html @@ -375,7 +375,7 @@ 增量拉取 - Presto + PrestoDB Impala (3.4 or later) 读优化表 @@ -434,7 +434,7 @@ Y - Presto + PrestoDB Y N @@ -477,8 +477,8 @@ Y - Presto - N + PrestoDB + Y N Y @@ -703,9 +703,9 @@ Upsert实用程序(HoodieDeltaStreamer -Presto +PrestoDB -Presto是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在Presto中无缝查询。 +PrestoDB是一种常用的查询引擎,可提供交互式查询性能。 Hudi RO表可以在Presto中无缝查询。 这需要在整个安装过程中将hudi-presto-bundle jar放入/plugin/hive-hadoop2/中。 Impala (3.4 or later) diff --git a/content/docs/comparison.html b/content/docs/comparison.html index 3fd1343..2a50a60 100644 --- a/content/docs/comparison.html +++ b/content/docs/comparison.html @@ -392,7 +392,7 @@ we expect Hudi to positioned at something that ingests parquet with superior per Hive transactions does not offer the read-optimized storage option or the incremental pulling, that Hudi does. In terms of implementation choices, Hudi leverages the full power of a processing framework like Spark, while Hive transactions feature is implemented underneath by Hive tasks/queries kicked off by user or the Hive metastore. 
Based on our production experience, embedding Hudi as a library into existing Spark pipelines was much easier and less operationally heavy, compared with the other approach. -Hudi is also designed to work with non-hive enginers like Presto/Spark and will incorporate file formats other than parquet over time. +Hudi is also designed to work with non-hive engines like PrestoDB/Spark and will incorporate file formats other than parquet over time. HBase @@ -410,7 +410,7 @@ integration of Hudi library with Spark/Spark streaming DAGs. In case of Non-Spar and later sent into a Hudi table via a Kafka topic/DFS intermediate file. In more conceptual level, data processing pipelines just consist of three components : source, processing, sink, with users ultimately running queries against the sink to use the results of the pipeline. Hudi can act as either a source or sink, that stores data on DFS. Applicability of Hudi to a given stream processing pipeline ultimately boils down to suitability -of Presto/SparkSQL/Hive for your qu
[GitHub] [hudi] bvaradar commented on issue #1954: [SUPPORT] DMS Caused by: java.lang.IllegalArgumentException: Partition key parts [] does not match with partition values
bvaradar commented on issue #1954: URL: https://github.com/apache/hudi/issues/1954#issuecomment-679305798 @tooptoop4 : IIUC, are you effectively changing a table from non-partitioned to partitioned? The exception you added to the last comment was about a missing file, which does not tie up with your comments. Can you elaborate on the steps to reproduce this?
[GitHub] [hudi] jpugliesi edited a comment on issue #2002: [SUPPORT] Inconsistent Commits between CLI and Incremental Query
jpugliesi edited a comment on issue #2002: URL: https://github.com/apache/hudi/issues/2002#issuecomment-679298548 @bvaradar I suspected this may have been the case, but I was not able to find any documentation anywhere that states that a commit tracks the timestamp of when a _specific subset of records is changed_, as opposed to the timestamp of when a write operation was executed. Does such documentation exist, and if so, can you please point me to it? Since the incremental query does not necessarily contain the full set of table commits, is there an alternative way to get the full table commit history via Spark or some other API _besides the CLI_?
[GitHub] [hudi] bhasudha opened a new pull request #2024: [MINOR] Update DOAP with 0.6.0 Release
bhasudha opened a new pull request #2024: URL: https://github.com/apache/hudi/pull/2024 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] jpugliesi commented on issue #2002: [SUPPORT] Inconsistent Commits between CLI and Incremental Query
jpugliesi commented on issue #2002: URL: https://github.com/apache/hudi/issues/2002#issuecomment-679298548 @bvaradar I suspected this may have been the case, but I was not able to find any documentation anywhere that states that a commit tracks the timestamp of when a _specific subset of records is changed_, as opposed to the timestamp of when a write operation was executed. Does such documentation exist, and if so, can you please point me to it?
[GitHub] [hudi] n3nash commented on pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed
n3nash commented on pull request #1964: URL: https://github.com/apache/hudi/pull/1964#issuecomment-679298263 Could we structure this as a "BaseTableMetaClient", "HoodieTableMetaClient" and "HoodieTableIncrementalMetaClient" ?
[jira] [Updated] (HUDI-424) Implement Hive Query Side Integration for querying tables containing bootstrap file slices
[ https://issues.apache.org/jira/browse/HUDI-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-424:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Implement Hive Query Side Integration for querying tables containing
> bootstrap file slices
>
> Key: HUDI-424
> URL: https://issues.apache.org/jira/browse/HUDI-424
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Hive Integration
> Reporter: Balaji Varadarajan
> Assignee: Balaji Varadarajan
> Priority: Blocker
> Fix For: 0.6.1
>
> Time Spent: 336h
> Remaining Estimate: 0h
>
> Support for Hive read-optimized and realtime queries
[jira] [Updated] (HUDI-769) Write blog about HoodieMultiTableDeltaStreamer in cwiki
[ https://issues.apache.org/jira/browse/HUDI-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-769:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Write blog about HoodieMultiTableDeltaStreamer in cwiki
>
> Key: HUDI-769
> URL: https://issues.apache.org/jira/browse/HUDI-769
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Docs, docs-chinese
> Reporter: Balaji Varadarajan
> Assignee: Pratyaksh Sharma
> Priority: Major
> Fix For: 0.6.1
[jira] [Updated] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table
[ https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-575:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Support Async Compaction for spark streaming writes to hudi table
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Spark Integration
> Reporter: Balaji Varadarajan
> Assignee: Balaji Varadarajan
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.6.1
>
> Currently, only inline compaction is supported for Structured streaming
> writes.
>
> We need to
> * Enable configuring async compaction for streaming writes
> * Implement a parallel compaction process like we did for delta streamer
[jira] [Updated] (HUDI-806) Implement support for bootstrapping via Spark datasource API
[ https://issues.apache.org/jira/browse/HUDI-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-806:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Implement support for bootstrapping via Spark datasource API
>
> Key: HUDI-806
> URL: https://issues.apache.org/jira/browse/HUDI-806
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Spark Integration
> Reporter: Udit Mehrotra
> Assignee: Udit Mehrotra
> Priority: Blocker
> Fix For: 0.6.1
>
> Time Spent: 336h
> Remaining Estimate: 0h
>
> This Jira tracks the work required to perform bootstrapping through Spark
> data source API.
[jira] [Updated] (HUDI-421) Cleanup bootstrap code and create PR for FileStystemView changes
[ https://issues.apache.org/jira/browse/HUDI-421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-421:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Cleanup bootstrap code and create PR for FileStystemView changes
>
> Key: HUDI-421
> URL: https://issues.apache.org/jira/browse/HUDI-421
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Common Core
> Reporter: Balaji Varadarajan
> Assignee: Balaji Varadarajan
> Priority: Blocker
> Fix For: 0.6.1
>
> Time Spent: 240h
> Remaining Estimate: 0h
>
> FileSystemView needs changes to identify and handle bootstrap file slices.
> Code changes are present in
> [https://github.com/bvaradar/hudi/tree/vb_bootstrap] Needs cleanup before
> they are ready to become a PR.
[jira] [Updated] (HUDI-425) Implement support for bootstrapping in HoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-425:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Implement support for bootstrapping in HoodieDeltaStreamer
>
> Key: HUDI-425
> URL: https://issues.apache.org/jira/browse/HUDI-425
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: DeltaStreamer
> Reporter: Balaji Varadarajan
> Assignee: Balaji Varadarajan
> Priority: Blocker
> Labels: help-wanted
> Fix For: 0.6.1
>
> Time Spent: 168h
> Remaining Estimate: 0h
[jira] [Updated] (HUDI-1031) Document how to set job scheduling configs for Async compaction
[ https://issues.apache.org/jira/browse/HUDI-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-1031:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Document how to set job scheduling configs for Async compaction
>
> Key: HUDI-1031
> URL: https://issues.apache.org/jira/browse/HUDI-1031
> Project: Apache Hudi
> Issue Type: Task
> Components: Docs
> Reporter: Balaji Varadarajan
> Assignee: Balaji Varadarajan
> Priority: Major
> Fix For: 0.6.1
>
> In case of deltastreamer, Spark job scheduling configs are automatically set.
> As the configs need to be set before the spark context is initiated, it is not
> fully automated for Structured Streaming
> [https://spark.apache.org/docs/latest/job-scheduling.html]
> We need to document how to set job scheduling configs for Spark Structured
> streaming.
[jira] [Updated] (HUDI-422) Cleanup bootstrap code and create write APIs for supporting bootstrap
[ https://issues.apache.org/jira/browse/HUDI-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-422:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Cleanup bootstrap code and create write APIs for supporting bootstrap
>
> Key: HUDI-422
> URL: https://issues.apache.org/jira/browse/HUDI-422
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Writer Core
> Reporter: Balaji Varadarajan
> Assignee: Balaji Varadarajan
> Priority: Blocker
> Fix For: 0.6.1
>
> Time Spent: 96h
> Remaining Estimate: 0h
>
> Once refactor for HoodieWriteClient is done, we can cleanup and introduce
> HoodieBootstrapClient as a separate PR.
[jira] [Updated] (HUDI-428) Web documentation for explaining how to bootstrap
[ https://issues.apache.org/jira/browse/HUDI-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-428:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Web documentation for explaining how to bootstrap
>
> Key: HUDI-428
> URL: https://issues.apache.org/jira/browse/HUDI-428
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Docs
> Reporter: Balaji Varadarajan
> Priority: Major
> Fix For: 0.6.1
>
> Need to provide examples (demo) to document bootstrapping
[jira] [Updated] (HUDI-807) Spark DS Support for incremental queries for bootstrapped tables
[ https://issues.apache.org/jira/browse/HUDI-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-807:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Spark DS Support for incremental queries for bootstrapped tables
>
> Key: HUDI-807
> URL: https://issues.apache.org/jira/browse/HUDI-807
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Udit Mehrotra
> Assignee: Udit Mehrotra
> Priority: Blocker
> Fix For: 0.6.1
>
> Time Spent: 120h
> Remaining Estimate: 0h
>
> Investigate and figure out the changes required in Spark integration code to
> make incremental queries work seamlessly for bootstrapped tables.
[jira] [Updated] (HUDI-423) Implement upsert functionality for handling updates to these bootstrap file slices
[ https://issues.apache.org/jira/browse/HUDI-423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-423:
---
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Implement upsert functionality for handling updates to these bootstrap file
> slices
>
> Key: HUDI-423
> URL: https://issues.apache.org/jira/browse/HUDI-423
> Project: Apache Hudi
> Issue Type: Sub-task
> Components: Common Core, Writer Core
> Reporter: Balaji Varadarajan
> Assignee: Balaji Varadarajan
> Priority: Blocker
> Fix For: 0.6.1
>
> Time Spent: 168h
> Remaining Estimate: 0h
>
> Needs support to handle upsert of these file-slices. For MOR tables, also
> need compaction support.
[jira] [Updated] (HUDI-420) Automated end to end Integration Test
[ https://issues.apache.org/jira/browse/HUDI-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-420:
-------------------------------
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Automated end to end Integration Test
> -------------------------------------
>
>                 Key: HUDI-420
>                 URL: https://issues.apache.org/jira/browse/HUDI-420
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: Balaji Varadarajan
>            Assignee: Balaji Varadarajan
>            Priority: Blocker
>             Fix For: 0.6.1
>
>          Time Spent: 72h
>  Remaining Estimate: 0h
>
> We need end to end test as part of ITTestHoodieDemo to also include bootstrap
> table cases.
> We can have a new table bootstrapped from the Hoodie table built in the demo
> and ensure queries work and return the same responses.
[jira] [Updated] (HUDI-418) Bootstrap Index - Implementation
[ https://issues.apache.org/jira/browse/HUDI-418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha updated HUDI-418:
-------------------------------
    Fix Version/s:     (was: 0.6.0)
                       0.6.1

> Bootstrap Index - Implementation
> --------------------------------
>
>                 Key: HUDI-418
>                 URL: https://issues.apache.org/jira/browse/HUDI-418
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Common Core
>            Reporter: Balaji Varadarajan
>            Assignee: Balaji Varadarajan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.6.1
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> An implementation for
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+12+:+Efficient+Migration+of+Large+Parquet+Tables+to+Apache+Hudi#RFC-12:EfficientMigrationofLargeParquetTablestoApacheHudi-BootstrapIndex:]
> is present in
> [https://github.com/bvaradar/hudi/blob/vb_bootstrap/hudi-common/src/main/java/org/apache/hudi/common/consolidated/CompositeMapFile.java]
>
> We need to make it solid with unit-tests and cleanup.
[jira] [Closed] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties
[ https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha closed HUDI-114.
------------------------------

> Allow for clients to overwrite the payload implementation in hoodie.properties
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-114
>                 URL: https://issues.apache.org/jira/browse/HUDI-114
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: newbie, Writer Core
>            Reporter: Nishith Agarwal
>            Assignee: Pratyaksh Sharma
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, once the payload class is set once in hoodie.properties, it cannot
> be changed. In some cases, if a code refactor is done and the jar updated,
> one may need to pass the new payload class name.
> Also, fix picking up the payload name for the datasource API. By default
> HoodieAvroPayload is written, whereas for the datasource API the default is
> OverwriteLatestAvroPayload.
[jira] [Reopened] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties
[ https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha reopened HUDI-114:
--------------------------------

> Allow for clients to overwrite the payload implementation in hoodie.properties
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-114
>                 URL: https://issues.apache.org/jira/browse/HUDI-114
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: newbie, Writer Core
>            Reporter: Nishith Agarwal
>            Assignee: Pratyaksh Sharma
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, once the payload class is set once in hoodie.properties, it cannot
> be changed. In some cases, if a code refactor is done and the jar updated,
> one may need to pass the new payload class name.
> Also, fix picking up the payload name for the datasource API. By default
> HoodieAvroPayload is written, whereas for the datasource API the default is
> OverwriteLatestAvroPayload.
[jira] [Resolved] (HUDI-114) Allow for clients to overwrite the payload implementation in hoodie.properties
[ https://issues.apache.org/jira/browse/HUDI-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha resolved HUDI-114.
--------------------------------
    Resolution: Fixed

> Allow for clients to overwrite the payload implementation in hoodie.properties
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-114
>                 URL: https://issues.apache.org/jira/browse/HUDI-114
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: newbie, Writer Core
>            Reporter: Nishith Agarwal
>            Assignee: Pratyaksh Sharma
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Right now, once the payload class is set once in hoodie.properties, it cannot
> be changed. In some cases, if a code refactor is done and the jar updated,
> one may need to pass the new payload class name.
> Also, fix picking up the payload name for the datasource API. By default
> HoodieAvroPayload is written, whereas for the datasource API the default is
> OverwriteLatestAvroPayload.
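For readers following HUDI-114: the payload class is what Hudi uses to merge an incoming record with the stored one, and it is recorded in hoodie.properties at table creation. A minimal sketch of overriding it explicitly at write time through the Spark datasource, assuming the 0.6.x `PAYLOAD_CLASS_OPT_KEY` option name (extending the `hudiOptions` map from the reporter's session above):

```scala
import org.apache.hudi.DataSourceWriteOptions

// Pin the payload class explicitly rather than relying on the differing
// defaults this ticket calls out (HoodieAvroPayload vs. the datasource
// default, OverwriteWithLatestAvroPayload).
val payloadOptions = hudiOptions ++ Map(
  DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY ->
    "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload"
)
```

Passing `payloadOptions` to `df.write.format("org.apache.hudi")` keeps the merge behavior unambiguous across writers; the fix in this ticket is what allows a later write to update the value already stored in hoodie.properties.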
[jira] [Reopened] (HUDI-590) Cut a new Doc version 0.5.1 explicitly
[ https://issues.apache.org/jira/browse/HUDI-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha reopened HUDI-590:
--------------------------------

> Cut a new Doc version 0.5.1 explicitly
> --------------------------------------
>
>                 Key: HUDI-590
>                 URL: https://issues.apache.org/jira/browse/HUDI-590
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Docs, Release & Administrative
>            Reporter: Bhavani Sudha
>            Assignee: Bhavani Sudha
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The latest version of docs needs to be tagged as 0.5.1 explicitly in the
> site. Follow instructions in
> [https://github.com/apache/incubator-hudi/blob/asf-site/README.md#updating-site]
> to create a new dir 0.5.1 under docs/_docs/
[jira] [Resolved] (HUDI-590) Cut a new Doc version 0.5.1 explicitly
[ https://issues.apache.org/jira/browse/HUDI-590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha resolved HUDI-590.
--------------------------------
    Resolution: Fixed

> Cut a new Doc version 0.5.1 explicitly
> --------------------------------------
>
>                 Key: HUDI-590
>                 URL: https://issues.apache.org/jira/browse/HUDI-590
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Docs, Release & Administrative
>            Reporter: Bhavani Sudha
>            Assignee: Bhavani Sudha
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> The latest version of docs needs to be tagged as 0.5.1 explicitly in the
> site. Follow instructions in
> [https://github.com/apache/incubator-hudi/blob/asf-site/README.md#updating-site]
> to create a new dir 0.5.1 under docs/_docs/
[jira] [Closed] (HUDI-624) Split some of the code from PR for HUDI-479
[ https://issues.apache.org/jira/browse/HUDI-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bhavani Sudha closed HUDI-624.
------------------------------

> Split some of the code from PR for HUDI-479
> -------------------------------------------
>
>                 Key: HUDI-624
>                 URL: https://issues.apache.org/jira/browse/HUDI-624
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Code Cleanup
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>            Priority: Major
>              Labels: patch, pull-request-available
>             Fix For: 0.6.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> This Jira is to reduce the size of the code base in PR# 1159 for HUDI-479,
> making it easier for review.