[GitHub] [hudi] wangxianghu commented on a change in pull request #3006: [HUDI-1943] Lose properties when hoodieWriteConfig initialization
wangxianghu commented on a change in pull request #3006: URL: https://github.com/apache/hudi/pull/3006#discussion_r641884136 ## File path: hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java ## @@ -242,7 +242,9 @@ public static TypedProperties flinkConf2TypedProperties(Configuration conf) { properties.put(option.key(), option.defaultValue()); } } -return new TypedProperties(properties); +TypedProperties typeProps = new TypedProperties(); +typeProps.putAll(typeProps); Review comment: hi, @hk-lrzy thanks for your contribution. There might be a mistake here: do you mean `typeProps.putAll(properties)`? BTW, I tried this method and it seems OK: if you call `org.apache.hudi.common.config.TypedProperties#getString(java.lang.String)`, the value you set does exist. The problem might be that the `toString()` method shows no values, right? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
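The behavior described above (lookups succeed while `toString()` shows nothing) matches how `java.util.Properties` treats a constructor argument: the argument is stored as a fallback *defaults* table, not copied into the entry set. A minimal stand-alone sketch of that gotcha (the property key here is just an example):

```java
import java.util.Properties;

public class PropertiesDefaultsGotcha {
    public static void main(String[] args) {
        Properties source = new Properties();
        source.put("hoodie.table.name", "demo");

        // The one-arg constructor keeps `source` as a fallback table,
        // NOT as entries of the new Properties object.
        Properties wrapped = new Properties(source);

        // getProperty falls through to the defaults, so lookups still work...
        System.out.println(wrapped.getProperty("hoodie.table.name")); // demo

        // ...but the entry table itself is empty, which is what toString(),
        // keySet(), and size() report.
        System.out.println(wrapped.size()); // 0
    }
}
```

This is why `getString(...)` on the returned `TypedProperties` still finds the value while printing the object shows nothing.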
[GitHub] [hudi] wangxianghu commented on a change in pull request #3006: [HUDI-1943] Lose properties when hoodieWriteConfig initialization
wangxianghu commented on a change in pull request #3006: URL: https://github.com/apache/hudi/pull/3006#discussion_r641884136 ## File path: hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java ## @@ -242,7 +242,9 @@ public static TypedProperties flinkConf2TypedProperties(Configuration conf) { properties.put(option.key(), option.defaultValue()); } } -return new TypedProperties(properties); +TypedProperties typeProps = new TypedProperties(); +typeProps.putAll(typeProps); Review comment: hi, @hk-lrzy thanks for your contribution. There might be a mistake here: do you mean `typeProps.putAll(properties)`? BTW, I think it's better to fix this problem in the constructor of `TypedProperties`; that's where the problem really exists. WDYT?
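A minimal sketch of the constructor-level fix suggested above, copying the entries with `putAll` instead of passing them to `Properties` as defaults. This assumes `TypedProperties` extends `java.util.Properties`; the class name here is an illustrative stand-in, not the actual Hudi class:

```java
import java.util.Properties;

// Illustrative stand-in for TypedProperties; not the real Hudi implementation.
public class TypedPropertiesSketch extends Properties {

    public TypedPropertiesSketch(Properties props) {
        if (props != null) {
            // Copy entries into this table so that toString(), keySet(),
            // and iteration all see them, not just getProperty() fallback.
            putAll(props);
        }
    }

    public static void main(String[] args) {
        Properties source = new Properties();
        source.put("hoodie.table.name", "demo");
        TypedPropertiesSketch props = new TypedPropertiesSketch(source);
        System.out.println(props.size()); // 1
        System.out.println(props.getProperty("hoodie.table.name")); // demo
    }
}
```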
[jira] [Updated] (HUDI-1943) Lose properties when hoodieWriteConfig initialization
[ https://issues.apache.org/jira/browse/HUDI-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianghu Wang updated HUDI-1943: --- Summary: Lose properties when hoodieWriteConfig initialization (was: [hudi-flink]Lose properties when hoodieWriteConfig initialization) > Lose properties when hoodieWriteConfig initialization > > > Key: HUDI-1943 > URL: https://issues.apache.org/jira/browse/HUDI-1943 > Project: Apache Hudi > Issue Type: Bug > Components: Flink Integration >Reporter: hk__lrzy >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1947) Hudi Commit Callback and commit in a single transaction
Pranoti Shanbhag created HUDI-1947: -- Summary: Hudi Commit Callback and commit in a single transaction Key: HUDI-1947 URL: https://issues.apache.org/jira/browse/HUDI-1947 Project: Apache Hudi Issue Type: New Feature Reporter: Pranoti Shanbhag Hello, I am using Hudi commit callbacks to call an internal service. As per my understanding, the service is called after the commit on the dataset, and if there is a failure in the callback service we would not roll back the commit. The service which we call saves the commit time in a database which is accessed by multiple pipelines to get the incremental delta. For example, when there are 4 commits in the Hudi dataset, we register 4 commit timestamps in the database. The pipelines that need the incremental delta run at different frequencies and use this database to fetch new data after their respective runs. For this to work well, we need the Hudi commit and callback to be atomic in a single transaction. Otherwise, on callback failures, there may be data in the Hudi dataset which is not registered in the DB. Could you please let me know if this can be supported and if there is a way to achieve this with the current implementation? We do have retries set up and are not expecting failures, but we want to keep the Hudi commits in sync with what we register in the DB. Thanks.
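Since the commit and the callback cannot currently be made one transaction, one read-side workaround (purely an illustrative sketch, not a Hudi feature: `listHudiCommits()` and the registry set are placeholders) is to reconcile before serving incremental deltas: compare the commit times registered in the database against the commits actually present on the Hudi timeline and backfill any that a failed callback missed.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical reconciliation sketch: the timeline listing and the registry
// contents are stubbed placeholders, not real Hudi or database APIs.
public class CommitReconciler {

    // Commit times actually present on the Hudi timeline (stubbed here).
    static List<String> listHudiCommits() {
        return List.of("20210528125210", "20210528130000", "20210528131500");
    }

    // Returns commits that are on the timeline but missing from the registry,
    // i.e. the ones whose callback failed and should be backfilled.
    static Set<String> missingFromRegistry(Set<String> registered) {
        Set<String> missing = new TreeSet<>(listHudiCommits());
        missing.removeAll(registered);
        return missing;
    }

    public static void main(String[] args) {
        Set<String> registered = new TreeSet<>(Set.of("20210528125210"));
        System.out.println(missingFromRegistry(registered));
        // -> [20210528130000, 20210528131500]
    }
}
```

Run periodically (or at pipeline start), this keeps the DB eventually consistent with the timeline even when a callback fails after the commit succeeded.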
[GitHub] [hudi] pranotishanbhag opened a new issue #3008: [SUPPORT] Hive Sync issues on deletes and non partitioned table
pranotishanbhag opened a new issue #3008: URL: https://github.com/apache/hudi/issues/3008 **_Tips before filing an issue_** - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? YES - Join the mailing list to engage in conversations and get faster support at dev-subscr...@hudi.apache.org. Part of the Slack groups. Did not find resolution there. - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly. I am not sure this is a bug but after the analysis we can check. **Describe the problem you faced** I am working on unit tests for some of the operations we would perform. I see my tests failing for the below 2 scenarios: 1. Hive Table is not updated when DELETE operation is called on the dataset. 2. Hive Table is not updated (getting empty table) when no partitions are specified. ### Details on Issue 1: I am trying to sync a hive table on upsert (works fine) and on delete (does not work) in my unit tests. 
As per the doc [Hudi_Writing-Data](https://hudi.apache.org/docs/writing_data.html#key-generation), we need to use the GlobalDeleteKeyGenerator class for deletes: > classOf[GlobalDeleteKeyGenerator].getCanonicalName (to be used when OPERATION_OPT_KEY is set to DELETE_OPERATION_OPT_VAL) However, when I use this class I get the error: > "Cause: java.lang.InstantiationException: org.apache.hudi.keygen.GlobalDeleteKeyGenerator" These are the hudi properties set on save:
```
(hoodie.datasource.hive_sync.database,default)
(hoodie.combine.before.insert,true)
(hoodie.insert.shuffle.parallelism,2)
(hoodie.datasource.write.precombine.field,timestamp)
(hoodie.datasource.hive_sync.partition_fields,partition)
(hoodie.datasource.hive_sync.use_jdbc,false)
(hoodie.datasource.hive_sync.partition_extractor_class,org.apache.hudi.keygen.GlobalDeleteKeyGenerator)
(hoodie.delete.shuffle.parallelism,2)
(hoodie.datasource.hive_sync.table,TestHudiTable)
(hoodie.index.type,GLOBAL_BLOOM)
(hoodie.datasource.write.operation,DELETE)
(hoodie.datasource.hive_sync.enable,true)
(hoodie.datasource.write.recordkey.field,id)
(hoodie.table.name,TestHudiTable)
(hoodie.datasource.write.table.type,COPY_ON_WRITE)
(hoodie.datasource.write.hive_style_partitioning,true)
(hoodie.upsert.shuffle.parallelism,2)
(hoodie.cleaner.commits.retained,15)
(hoodie.datasource.write.partitionpath.field,partition)
(hoodie.datasource.hive_sync.database,default)
(hoodie.combine.before.insert,true)
(hoodie.embed.timeline.server,false)
(hoodie.insert.shuffle.parallelism,2)
(hoodie.datasource.write.precombine.field,timestamp)
(hoodie.datasource.hive_sync.partition_fields,partition)
(hoodie.datasource.hive_sync.use_jdbc,false)
(hoodie.datasource.hive_sync.partition_extractor_class,org.apache.hudi.keygen.GlobalDeleteKeyGenerator)
(hoodie.delete.shuffle.parallelism,2)
(hoodie.datasource.hive_sync.table,TestHudiTable)
(hoodie.index.type,GLOBAL_BLOOM)
(hoodie.datasource.write.operation,DELETE)
(hoodie.datasource.hive_sync.enable,true)
(hoodie.datasource.write.recordkey.field,id)
(hoodie.table.name,TestHudiTable)
(hoodie.datasource.write.table.type,COPY_ON_WRITE)
(hoodie.datasource.write.hive_style_partitioning,true)
(hoodie.upsert.shuffle.parallelism,2)
(hoodie.cleaner.commits.retained,15)
(hoodie.datasource.write.partitionpath.field,partition)
```
If I switch to the MultiPartKeysValueExtractor class, the deletes are not propagated to the Hive table. The Hudi read, spark.read.format("hudi").load(BasePath), has the right data where the id is deleted, but spark.table("TableName") is not consistent and still has the id which was supposed to be deleted. For example, id1 is deleted in Hudi but still in the Hive table:
```
spark.read.format("hudi").load()
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                    |id |timestamp|partition|data |
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
|20210528125210     |20210528125210_0_3  |id2               |partition=p1          |9b3aae86-e4ed-4bfa-b6cd-ee41db11ac15-0_0-23-26_20210528125210.parquet|id2|1        |p1       |data2|
|20210528125210     |20210528125210_1_1  |id3               |partition=p2          |a53dba79-90a2-4370-8637-e39301b4c10c-0_1-23-27_20210528125210.parquet|id3|3        |p2       |data3|
+-------------------+--------------------+------------------+----------------------+---------------------------------------------------------------------+---+---------+---------+-----+
```
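For reference, a hedged sketch of how two of the options in the config dump above differ (assuming the standard option names): `hoodie.datasource.hive_sync.partition_extractor_class` expects a `PartitionValueExtractor` implementation such as `MultiPartKeysValueExtractor`, while a key generator class like `GlobalDeleteKeyGenerator` is configured via `hoodie.datasource.write.keygenerator.class`. The option map below is illustrative, not a verified fix for this report:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DeleteOptionsSketch {

    // Hypothetical option map: the two settings take different class types,
    // so a KeyGenerator class cannot serve as a partition extractor.
    static Map<String, String> correctedOptions() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.write.keygenerator.class",
                 "org.apache.hudi.keygen.GlobalDeleteKeyGenerator");
        opts.put("hoodie.datasource.hive_sync.partition_extractor_class",
                 "org.apache.hudi.hive.MultiPartKeysValueExtractor");
        return opts;
    }

    public static void main(String[] args) {
        correctedOptions().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```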
[GitHub] [hudi] jtmzheng commented on issue #2983: [SUPPORT] Is hoodie.consistency.check.enabled still relevant?
jtmzheng commented on issue #2983: URL: https://github.com/apache/hudi/issues/2983#issuecomment-850619877 @fanaticjo what kind of issues did you see?
[GitHub] [hudi] zhedoubushishi commented on a change in pull request #2833: [HUDI-89] Add configOption & refactor Hudi configuration framework
zhedoubushishi commented on a change in pull request #2833: URL: https://github.com/apache/hudi/pull/2833#discussion_r641742577 ## File path: hudi-common/src/test/java/org/apache/hudi/common/config/TestConfigOption.java ## @@ -0,0 +1,61 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.common.config; + +import org.junit.jupiter.api.Test; + +import java.util.HashMap; +import java.util.Map; +import java.util.Properties; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNull; + +public class TestConfigOption { Review comment: Good point. 
Done ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java ## @@ -50,52 +52,95 @@ * @see HoodieTableMetaClient * @since 0.3.0 */ -public class HoodieTableConfig implements Serializable { +public class HoodieTableConfig extends HoodieConfig implements Serializable { private static final Logger LOG = LogManager.getLogger(HoodieTableConfig.class); public static final String HOODIE_PROPERTIES_FILE = "hoodie.properties"; - public static final String HOODIE_TABLE_NAME_PROP_NAME = "hoodie.table.name"; - public static final String HOODIE_TABLE_TYPE_PROP_NAME = "hoodie.table.type"; - public static final String HOODIE_TABLE_VERSION_PROP_NAME = "hoodie.table.version"; - public static final String HOODIE_TABLE_PRECOMBINE_FIELD = "hoodie.table.precombine.field"; - public static final String HOODIE_TABLE_PARTITION_COLUMNS = "hoodie.table.partition.columns"; - - @Deprecated - public static final String HOODIE_RO_FILE_FORMAT_PROP_NAME = "hoodie.table.ro.file.format"; - @Deprecated - public static final String HOODIE_RT_FILE_FORMAT_PROP_NAME = "hoodie.table.rt.file.format"; - public static final String HOODIE_BASE_FILE_FORMAT_PROP_NAME = "hoodie.table.base.file.format"; - public static final String HOODIE_LOG_FILE_FORMAT_PROP_NAME = "hoodie.table.log.file.format"; - public static final String HOODIE_TIMELINE_LAYOUT_VERSION = "hoodie.timeline.layout.version"; - public static final String HOODIE_PAYLOAD_CLASS_PROP_NAME = "hoodie.compaction.payload.class"; - public static final String HOODIE_ARCHIVELOG_FOLDER_PROP_NAME = "hoodie.archivelog.folder"; - public static final String HOODIE_BOOTSTRAP_INDEX_ENABLE = "hoodie.bootstrap.index.enable"; - public static final String HOODIE_BOOTSTRAP_INDEX_CLASS_PROP_NAME = "hoodie.bootstrap.index.class"; - public static final String HOODIE_BOOTSTRAP_BASE_PATH = "hoodie.bootstrap.base.path"; - - public static final HoodieTableType DEFAULT_TABLE_TYPE = HoodieTableType.COPY_ON_WRITE; - public static final 
HoodieTableVersion DEFAULT_TABLE_VERSION = HoodieTableVersion.ZERO; - public static final HoodieFileFormat DEFAULT_BASE_FILE_FORMAT = HoodieFileFormat.PARQUET; - public static final HoodieFileFormat DEFAULT_LOG_FILE_FORMAT = HoodieFileFormat.HOODIE_LOG; - public static final String DEFAULT_PAYLOAD_CLASS = OverwriteWithLatestAvroPayload.class.getName(); - public static final String NO_OP_BOOTSTRAP_INDEX_CLASS = NoOpBootstrapIndex.class.getName(); - public static final String DEFAULT_BOOTSTRAP_INDEX_CLASS = HFileBootstrapIndex.class.getName(); - public static final String DEFAULT_ARCHIVELOG_FOLDER = ""; - private Properties props; + public static final ConfigOption<String> HOODIE_TABLE_NAME_PROP_NAME = ConfigOption + .key("hoodie.table.name") + .noDefaultValue() + .withDescription("Table name that will be used for registering with Hive. Needs to be same across runs."); + + public static final ConfigOption<HoodieTableType> HOODIE_TABLE_TYPE_PROP_NAME = ConfigOption + .key("hoodie.table.type") + .defaultValue(HoodieTableType.COPY_ON_WRITE) + .withDescription("The table type for the underlying data, for this write. This can't change between writes."); + + public static final ConfigOption<HoodieTableVersion> HOODIE_TABLE_VERSION_PROP_NAME = ConfigOption + .key("hoodie.table.version") + .defaultValue(HoodieTableVersion.ZERO) + .withDescription(""); + + public static final ConfigOption<String> HOODIE_TABLE_PRECOMBINE_FIELD = ConfigOption + .key("hoodie.table.
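The diff above replaces bare `String` constants with typed, self-describing options built through a fluent `key()/defaultValue()/withDescription()` chain. A pared-down sketch of that builder pattern (names and structure are illustrative, not the exact Hudi API):

```java
// Minimal generic config-option holder built via a fluent chain, mirroring
// the pattern in the diff above. Illustrative only.
public class ConfigOptionDemo {

    static final class ConfigOption<T> {
        final String key;
        final T defaultValue;
        final String description;

        private ConfigOption(String key, T defaultValue, String description) {
            this.key = key;
            this.defaultValue = defaultValue;
            this.description = description;
        }

        // Entry point of the chain: fixes the key, defers the value type.
        static Builder key(String key) {
            return new Builder(key);
        }

        // Returns a copy with the description attached (immutable style).
        ConfigOption<T> withDescription(String description) {
            return new ConfigOption<>(key, defaultValue, description);
        }

        static final class Builder {
            private final String key;
            Builder(String key) { this.key = key; }

            // The default value's type parameterizes the resulting option.
            <T> ConfigOption<T> defaultValue(T value) {
                return new ConfigOption<>(key, value, "");
            }
        }
    }

    static final ConfigOption<String> TABLE_NAME = ConfigOption
        .key("hoodie.table.name")
        .defaultValue("unset")
        .withDescription("Table name used when registering with Hive.");

    public static void main(String[] args) {
        System.out.println(TABLE_NAME.key + " -> " + TABLE_NAME.defaultValue);
    }
}
```

Typing the option lets callers read defaults without string parsing, and keeps the key, default, and documentation in one declaration.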
[GitHub] [hudi] fanaticjo commented on issue #2983: [SUPPORT] Is hoodie.consistency.check.enabled still relevant?
fanaticjo commented on issue #2983: URL: https://github.com/apache/hudi/issues/2983#issuecomment-850545801 I think this is still required, as I have seen issues still persisting in EMR versions below 5.33.
[GitHub] [hudi] fanaticjo commented on issue #2975: [SUPPORT] Read record using index
fanaticjo commented on issue #2975: URL: https://github.com/apache/hudi/issues/2975#issuecomment-850544149 Hello @calleo, I am trying to make this a generic release in version 0.9.0. Until then, you can clone this GitHub repo - https://github.com/fanaticjo/HudiJavaCustomUpsert/tree/master/src/main/java/com - create a jar out of it, and load it as a dependent jar along with the hudi and avro jars for your PySpark job. In your PySpark code you can add these 2 options:

    "hoodie.update.keys": "admission_date,name",  # columns you want to update
    "hoodie.datasource.write.payload.class": "com.hudiUpsert.hudiCustomUpsert",  # custom jar

A sample which is working for me:

    hudi_update_options_with_key = {
        'hoodie.table.name': "test",
        'hoodie.datasource.write.recordkey.field': 'id',
        'hoodie.datasource.write.partitionpath.field': 'dept',
        'hoodie.datasource.write.table.name': "test",
        "hoodie.index.type": "SIMPLE",
        "hoodie.update.keys": "admission_date,name",
        "hoodie.datasource.write.payload.class": "com.hudiUpsert.hudiCustomUpsert",
        'hoodie.datasource.write.operation': 'upsert',
        'hoodie.datasource.write.precombine.field': 'date',
        'hoodie.upsert.shuffle.parallelism': 2,
        'hoodie.insert.shuffle.parallelism': 2
    }

    df.write.format("org.apache.hudi").options(**hudi_update_options_with_key).mode("append").save(
        "/Users/biswajit/PycharmProjects/hudicustomupsert/output/")
[GitHub] [hudi] nsivabalan commented on pull request #2967: Added blog for Hudi cleaner service
nsivabalan commented on pull request #2967: URL: https://github.com/apache/hudi/pull/2967#issuecomment-850502987 Can you build the site locally, take a screenshot, and attach it here? It would be nice to review that as well. For example: https://github.com/apache/hudi/pull/2969
[GitHub] [hudi] wangxianghu commented on a change in pull request #2993: [HUDI-1929] Make HoodieDeltaStreamer support configure KeyGenerator by type
wangxianghu commented on a change in pull request #2993: URL: https://github.com/apache/hudi/pull/2993#discussion_r641622004 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java ## @@ -72,8 +71,11 @@ public static final String PRECOMBINE_FIELD_PROP = "hoodie.datasource.write.precombine.field"; public static final String WRITE_PAYLOAD_CLASS = "hoodie.datasource.write.payload.class"; public static final String DEFAULT_WRITE_PAYLOAD_CLASS = OverwriteWithLatestAvroPayload.class.getName(); + public static final String KEYGENERATOR_CLASS_PROP = "hoodie.datasource.write.keygenerator.class"; - public static final String DEFAULT_KEYGENERATOR_CLASS = SimpleAvroKeyGenerator.class.getName(); + public static final String KEYGENERATOR_TYPE_PROP = "hoodie.datasource.write.keygenerator.type"; + public static final String DEFAULT_KEYGENERATOR_TYPE = "simple"; Review comment: > One question, if a user configured both `KEYGENERATOR_CLASS_PROP` and `KEYGENERATOR_TYPE_PROP`, which one is the higher priority. `KEYGENERATOR_CLASS_PROP` has the higher priority. `KEYGENERATOR_CLASS_PROP` is more complex, so I assume the user prefers the type way; if they still configured `KEYGENERATOR_CLASS_PROP`, that means the user really wants it.
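The precedence described above can be sketched as a small resolver. The property names are taken from the diff; the type-to-class mapping and the `com.example` class are illustrative assumptions, not the actual Hudi resolution code:

```java
import java.util.Properties;

public class KeyGeneratorResolver {

    // Resolves which key generator class to instantiate: an explicitly
    // configured class always wins over the type alias (sketch only).
    static String resolve(Properties props) {
        String clazz = props.getProperty("hoodie.datasource.write.keygenerator.class");
        if (clazz != null && !clazz.isEmpty()) {
            return clazz; // explicit class: higher priority
        }
        String type = props.getProperty("hoodie.datasource.write.keygenerator.type", "simple");
        switch (type.toLowerCase()) {
            case "simple":
                return "org.apache.hudi.keygen.SimpleAvroKeyGenerator";
            default:
                throw new IllegalArgumentException("Unknown key generator type: " + type);
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("hoodie.datasource.write.keygenerator.type", "simple");
        props.put("hoodie.datasource.write.keygenerator.class", "com.example.MyKeyGenerator");
        // Both are set, so the explicit class wins over the type alias.
        System.out.println(resolve(props));
    }
}
```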
[jira] [Closed] (HUDI-1946) Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all the columns
[ https://issues.apache.org/jira/browse/HUDI-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianghu Wang closed HUDI-1946. -- Resolution: Not A Problem > Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all > the columns > > > Key: HUDI-1946 > URL: https://issues.apache.org/jira/browse/HUDI-1946 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Xianghu Wang >Assignee: Xianghu Wang >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > When the user wants to derive one or more columns from the existing columns > and the > existing columns are all needed, the user needs to spell all the columns they > need in the SQL. > This will be very troublesome and time-consuming when we have dozens or > hundreds of columns. we can save trouble by using a wildcard in the SQL to > represent all the columns. > that means if we have "id", "name", "age", "ts" these four columns already > and we want to add a new column dt derived from ts, we can use > "select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name" > to represent > "select id, name, age, ts, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from > table_name"
[GitHub] [hudi] wangxianghu closed pull request #3007: [HUDI-1946] Enhance SqlQueryBasedTransformer to allow user use wildca…
wangxianghu closed pull request #3007: URL: https://github.com/apache/hudi/pull/3007
[GitHub] [hudi] codecov-commenter commented on pull request #3007: [HUDI-1946] Enhance SqlQueryBasedTransformer to allow user use wildca…
codecov-commenter commented on pull request #3007: URL: https://github.com/apache/hudi/pull/3007#issuecomment-850460794 # [Codecov](https://codecov.io/gh/apache/hudi/pull/3007?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#3007](https://codecov.io/gh/apache/hudi/pull/3007?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0f4aba1) into [master](https://codecov.io/gh/apache/hudi/commit/bc18c39835d6775b063ae072aea4ba43177d66b1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bc18c39) will **decrease** coverage by `45.77%`. > The diff coverage is `0.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/3007/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/3007?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
```diff
@@             Coverage Diff              @@
##             master    #3007       +/-  ##
============================================
- Coverage     55.00%    9.23%    -45.78%
+ Complexity     3850       48      -3802
============================================
  Files           485       54       -431
  Lines         23467     2025     -21442
  Branches       2497      243      -2254
============================================
- Hits          12909      187     -12722
+ Misses         9406     1825      -7581
+ Partials       1152       13      -1139
```
| Flag | Coverage Δ | |
|---|---|---|
| hudicli | `?` | |
| hudiclient | `?` | |
| hudicommon | `?` | |
| hudiflink | `?` | |
| hudihadoopmr | `?` | |
| hudisparkdatasource | `?` | |
| hudisync | `?` | |
| huditimelineservice | `?` | |
| hudiutilities | `9.23% <0.00%> (-61.60%)` | :arrow_down: |

Flags with carried forward coverage won't be shown.
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/3007?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [.../utilities/transform/SqlQueryBasedTransformer.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9TcWxRdWVyeUJhc2VkVHJhbnNmb3JtZXIuamF2YQ==) | `0.00% <0.00%> (-75.00%)` | :arrow_down: | | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi
[GitHub] [hudi] leesf merged pull request #3004: [HUDI-1940] Add SqlQueryBasedTransformer unit test
leesf merged pull request #3004: URL: https://github.com/apache/hudi/pull/3004
[hudi] branch master updated: [HUDI-1940] Add SqlQueryBasedTransformer unit test (#3004)
This is an automated email from the ASF dual-hosted git repository. leesf pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git The following commit(s) were added to refs/heads/master by this push: new 974b476 [HUDI-1940] Add SqlQueryBasedTransformer unit test (#3004) 974b476 is described below commit 974b476180e61fac58cd87e78699428d4108a482 Author: wangxianghu AuthorDate: Fri May 28 22:30:30 2021 +0800 [HUDI-1940] Add SqlQueryBasedTransformer unit test (#3004) --- .../transform/TestSqlQueryBasedTransformer.java| 94 ++ 1 file changed, 94 insertions(+) diff --git a/hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java b/hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java new file mode 100644 index 000..b6fdc25 --- /dev/null +++ b/hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.hudi.utilities.transform; + +import org.apache.hudi.common.config.TypedProperties; + +import org.apache.spark.api.java.JavaRDD; +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SparkSession; +import org.junit.jupiter.api.Test; + +import java.util.Collections; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertNotNull; + +public class TestSqlQueryBasedTransformer { + + @Test + public void testSqlQuery() { + +SparkSession spark = SparkSession +.builder() +.master("local[2]") +.appName(TestSqlQueryBasedTransformer.class.getName()) +.getOrCreate(); + +JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext()); + +// prepare test data +String testData = "{\n" ++ " \"ts\": 1622126968000,\n" ++ " \"uuid\": \"c978e157-72ee-4819-8f04-8e46e1bb357a\",\n" ++ " \"rider\": \"rider-213\",\n" ++ " \"driver\": \"driver-213\",\n" ++ " \"begin_lat\": 0.4726905879569653,\n" ++ " \"begin_lon\": 0.46157858450465483,\n" ++ " \"end_lat\": 0.754803407008858,\n" ++ " \"end_lon\": 0.9671159942018241,\n" ++ " \"fare\": 34.158284716382845,\n" ++ " \"partitionpath\": \"americas/brazil/sao_paulo\"\n" ++ "}"; + +JavaRDD<String> testRdd = jsc.parallelize(Collections.singletonList(testData), 2); +Dataset<Row> ds = spark.read().json(testRdd); + +// create a new column dt, whose value is transformed from ts, format is yyyyMMdd +String transSql = "select\n" ++ "\tuuid,\n" ++ "\tbegin_lat,\n" ++ "\tbegin_lon,\n" ++ "\tdriver,\n" ++ "\tend_lat,\n" ++ "\tend_lon,\n" ++ "\tfare,\n" ++ "\tpartitionpath,\n" ++ "\trider,\n" ++ "\tts,\n" ++ "\tFROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt\n" ++ "from\n" ++ "\t<SRC>"; +TypedProperties props = new TypedProperties(); +props.put("hoodie.deltastreamer.transformer.sql", transSql); + +// transform +SqlQueryBasedTransformer transformer = new SqlQueryBasedTransformer(); +Dataset<Row> result = transformer.apply(jsc, spark, ds, props); + +// check result +assertEquals(11, result.columns().length); +assertNotNull(result.col("dt")); +assertEquals("20210527", result.first().get(10).toString()); + +spark.close(); + } +}
[GitHub] [hudi] wangxianghu commented on pull request #3007: [HUDI-1946] Enhance SqlQueryBasedTransformer to allow user use wildca…
wangxianghu commented on pull request #3007: URL: https://github.com/apache/hudi/pull/3007#issuecomment-850456430 will add the unit test after https://github.com/apache/hudi/pull/3004 is merged
[jira] [Updated] (HUDI-1946) Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all the columns
[ https://issues.apache.org/jira/browse/HUDI-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1946:
---------------------------------
    Labels: pull-request-available  (was: )

> Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all the columns
> ----------------------------------------------------------------------------------------
>
>                 Key: HUDI-1946
>                 URL: https://issues.apache.org/jira/browse/HUDI-1946
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Xianghu Wang
>            Assignee: Xianghu Wang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.9.0
>
> When the user wants to derive one or more columns from the existing columns, and the existing columns are all needed, the user has to spell out every column in the SQL. This is tedious and time-consuming with dozens or hundreds of columns; a wildcard in the SQL can represent all the columns instead.
> That means if we already have the four columns "id", "name", "age", "ts" and want to add a new column dt derived from ts, we can use
> "select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"
> to represent
> "select id, name, age, ts, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] wangxianghu opened a new pull request #3007: [HUDI-1946] Enhance SqlQueryBasedTransformer to allow user use wildca…
wangxianghu opened a new pull request #3007: URL: https://github.com/apache/hudi/pull/3007

…rd to represent all the columns

## What is the purpose of the pull request

When the user wants to derive one or more columns from the existing columns, and the existing columns are all needed, the user has to spell out every column in the SQL. This is tedious and time-consuming with dozens or hundreds of columns; a wildcard in the SQL can represent all the columns instead.

That means if we already have the four columns "id", "name", "age", "ts" and want to add a new column dt derived from ts, we can use

"select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"

to represent

"select id, name, age, ts, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"

## Brief change log

This pull request is already covered by existing tests, such as *(please describe tests)*.

## Committer checklist

- [ ] Has a corresponding JIRA in PR title & commit
- [ ] Commit message is descriptive of the change
- [ ] CI is green
- [ ] Necessary doc changes done or have another open PR
- [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Created] (HUDI-1946) Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all the columns
Xianghu Wang created HUDI-1946:
----------------------------------

             Summary: Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all the columns
                 Key: HUDI-1946
                 URL: https://issues.apache.org/jira/browse/HUDI-1946
             Project: Apache Hudi
          Issue Type: New Feature
            Reporter: Xianghu Wang
            Assignee: Xianghu Wang
             Fix For: 0.9.0

When the user wants to derive one or more columns from the existing columns, and the existing columns are all needed, the user has to spell out every column in the SQL. This is tedious and time-consuming with dozens or hundreds of columns; a wildcard in the SQL can represent all the columns instead.

That means if we already have the four columns "id", "name", "age", "ts" and want to add a new column dt derived from ts, we can use

"select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"

to represent

"select id, name, age, ts, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"

-- This message was sent by Atlassian Jira (v8.3.4#803005)
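The proposed shorthand works because Spark expands `*` against the source Dataset's schema once it is registered as a temp view; the transformer only has to substitute the view name into the user's SQL. A minimal string-level sketch of that substitution step (the `<SRC>` placeholder, class, and method names here are simplified assumptions, not the transformer's actual code):

```java
public class WildcardSqlSketch {
    // Hypothetical placeholder the user writes in place of the table name.
    static final String SRC_PATTERN = "<SRC>";

    // Substitute the registered temp-view name into the user's SQL template.
    static String resolve(String sqlTemplate, String tempViewName) {
        return sqlTemplate.replaceAll(SRC_PATTERN, tempViewName);
    }

    public static void main(String[] args) {
        // With columns (id, name, age, ts) already present, the wildcard saves
        // spelling them all out; Spark expands * after the view is registered.
        String userSql = "select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from <SRC>";
        System.out.println(resolve(userSql, "HOODIE_SRC_TMP_TABLE_1"));
    }
}
```

Once resolved, the query is handed to `spark.sql(...)`, which appends derived columns such as `dt` after the wildcard-expanded ones.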
[GitHub] [hudi] wangxianghu commented on a change in pull request #3004: [HUDI-1940] Add SqlQueryBasedTransformer unit test
wangxianghu commented on a change in pull request #3004: URL: https://github.com/apache/hudi/pull/3004#discussion_r641543572

## File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hudi.utilities.transform;

import org.apache.hudi.common.config.TypedProperties;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.Test;

import java.util.Collections;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

public class TestSqlQueryBasedTransformer {

  @Test
  public void testSqlQuery() {

    SparkSession spark = SparkSession
        .builder()
        .master("local[2]")
        .appName(TestSqlQueryBasedTransformer.class.getName())
        .getOrCreate();

    JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

    // prepare test data
    String testData = "{\n"
        + "  \"ts\": 1622126968000,\n"
        + "  \"uuid\": \"c978e157-72ee-4819-8f04-8e46e1bb357a\",\n"
        + "  \"rider\": \"rider-213\",\n"
        + "  \"driver\": \"driver-213\",\n"
        + "  \"begin_lat\": 0.4726905879569653,\n"
        + "  \"begin_lon\": 0.46157858450465483,\n"
        + "  \"end_lat\": 0.754803407008858,\n"
        + "  \"end_lon\": 0.9671159942018241,\n"
        + "  \"fare\": 34.158284716382845,\n"
        + "  \"partitionpath\": \"americas/brazil/sao_paulo\"\n"
        + "}";

    JavaRDD<String> testRdd = jsc.parallelize(Collections.singletonList(testData), 2);
    Dataset<Row> ds = spark.read().json(testRdd);

    // create a new column dt, whose value is transformed from ts, format is yyyyMMdd
    String transSql = "select\n"
        + "\tuuid,\n"
        + "\tbegin_lat,\n"
        + "\tbegin_lon,\n"
        + "\tdriver,\n"
        + "\tend_lat,\n"
        + "\tend_lon,\n"
        + "\tfare,\n"
        + "\tpartitionpath,\n"
        + "\trider,\n"
        + "\tts,\n"
        + "\tFROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt\n"
        + "from\n"
        + "\t<SRC>";
    TypedProperties props = new TypedProperties();
    props.put("hoodie.deltastreamer.transformer.sql", transSql);

    // transform
    SqlQueryBasedTransformer transformer = new SqlQueryBasedTransformer();
    Dataset<Row> result = transformer.apply(jsc, spark, ds, props);

    // check result
    assertEquals(result.columns().length, 11);
    assertNotNull(result.col("dt"));
    assertEquals("20210527", result.first().get(10).toString());

    spark.close();
  }
}
```

Review comment:
> pls use assertEquals(expected, result.columns), so please change to `assertEquals(11, result.columns().length)`

@leesf thanks, done

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (HUDI-1910) Supporting Kafka based checkpointing for HoodieDeltaStreamer
[ https://issues.apache.org/jira/browse/HUDI-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353335#comment-17353335 ] Vinay commented on HUDI-1910: - [~nishith29] can you pls delete one of the sub-tasks, got added twice. I added it as a sub task as it is related to consumer group offset. > Supporting Kafka based checkpointing for HoodieDeltaStreamer > > > Key: HUDI-1910 > URL: https://issues.apache.org/jira/browse/HUDI-1910 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: Nishith Agarwal >Assignee: Vinay >Priority: Major > Labels: sev:normal, triaged > > HoodieDeltaStreamer currently supports commit metadata based checkpoint. Some > users have requested support for Kafka based checkpoints for freshness > auditing purposes. This ticket tracks any implementation for that. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1945) Support Hudi to read from Kafka Consumer Group Offset
Vinay created HUDI-1945: --- Summary: Support Hudi to read from Kafka Consumer Group Offset Key: HUDI-1945 URL: https://issues.apache.org/jira/browse/HUDI-1945 Project: Apache Hudi Issue Type: Sub-task Components: DeltaStreamer Reporter: Vinay Assignee: Vinay Currently, Hudi provides options to read from latest or earliest. We should even provide users an option to read from group offset as well. This change will be in `KafkaOffsetGen` where we can add a method to support this functionality -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (HUDI-1945) Support Hudi to read from Kafka Consumer Group Offset
[ https://issues.apache.org/jira/browse/HUDI-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinay updated HUDI-1945: Status: New (was: Open) > Support Hudi to read from Kafka Consumer Group Offset > - > > Key: HUDI-1945 > URL: https://issues.apache.org/jira/browse/HUDI-1945 > Project: Apache Hudi > Issue Type: Sub-task > Components: DeltaStreamer >Reporter: Vinay >Assignee: Vinay >Priority: Major > > Currently, Hudi provides options to read from latest or earliest. We should > even provide users an option to read from group offset as well. > This change will be in `KafkaOffsetGen` where we can add a method to support > this functionality -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1944) Support Hudi to read from Kafka Consumer Group Offset
Vinay created HUDI-1944: --- Summary: Support Hudi to read from Kafka Consumer Group Offset Key: HUDI-1944 URL: https://issues.apache.org/jira/browse/HUDI-1944 Project: Apache Hudi Issue Type: Sub-task Components: DeltaStreamer Reporter: Vinay Assignee: Vinay Currently, Hudi provides options to read from latest or earliest. We should even provide users an option to read from group offset as well. This change will be in `KafkaOffsetGen` where we can add a method to support this functionality -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] veenaypatil commented on pull request #2998: [HUDI-1426] Rename classname in camel case format
veenaypatil commented on pull request #2998: URL: https://github.com/apache/hudi/pull/2998#issuecomment-850393465 @leesf I agree, this will be breaking change once users upgrade to new version, I think if we don't want to do this change, we should at least update the document to correct class name to avoid confusion -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
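A common way to correct a class name without the breaking change discussed above is to keep the old name as a deprecated subclass of the corrected one, so existing user configs keep resolving. This is a sketch with illustrative names and a stand-in method, not what the PR actually does:

```java
// Correctly-cased replacement class (all names here are illustrative).
class NonPartitionedKeyGenerator {
    String getKey(String record) {
        return record; // stand-in for real key-generation logic
    }
}

// Old misspelled name retained as a deprecated alias so existing
// user configurations that reference it keep working across the upgrade.
@Deprecated
class NonpartitionedKeyGenerator extends NonPartitionedKeyGenerator {
}

public class DeprecatedAliasDemo {
    public static void main(String[] args) {
        // The old class name still instantiates and behaves like the new one.
        NonPartitionedKeyGenerator gen = new NonpartitionedKeyGenerator();
        System.out.println(gen.getKey("r1"));
    }
}
```

The alias can then be removed in a later major release, after the documentation has pointed users at the corrected name for a full release cycle.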
[GitHub] [hudi] codecov-commenter edited a comment on pull request #3006: [HUDI-1943]fix lose properties problem
codecov-commenter edited a comment on pull request #3006: URL: https://github.com/apache/hudi/pull/3006#issuecomment-850365971 # [Codecov](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#3006](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f03396d) into [master](https://codecov.io/gh/apache/hudi/commit/bc18c39835d6775b063ae072aea4ba43177d66b1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bc18c39) will **increase** coverage by `15.82%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/3006/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) ```diff @@ Coverage Diff @@ ## master#3006 +/- ## = + Coverage 55.00% 70.83% +15.82% + Complexity 3850 385 -3465 = Files 485 54 -431 Lines 23467 2016-21451 Branches 2497 241 -2256 = - Hits 12909 1428-11481 + Misses 9406 454 -8952 + Partials 1152 134 -1018 ``` | Flag | Coverage Δ | | |---|---|---| | hudicli | `?` | | | hudiclient | `∅ <ø> (∅)` | | | hudicommon | `?` | | | hudiflink | `?` | | | hudihadoopmr | `?` | | | hudisparkdatasource | `?` | | | hudisync | `?` | | | huditimelineservice | `?` | | | hudiutilities | `70.83% <ø> (ø)` | | Flags with carried forward coverage won't be shown. 
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [...in/java/org/apache/hudi/cli/HoodiePrintHelper.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZVByaW50SGVscGVyLmphdmE=) | | | | [...in/java/org/apache/hudi/schema/SchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zY2hlbWEvU2NoZW1hUHJvdmlkZXIuamF2YQ==) | | | | [...g/apache/hudi/common/model/WriteOperationType.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1dyaXRlT3BlcmF0aW9uVHlwZS5qYXZh) | | | | [...able/timeline/versioning/AbstractMigratorBase.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvQWJzdHJhY3RNaWdyYXRvckJhc2UuamF2YQ==) | | | | 
[...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh) | | | | [...udi/timeline/service/handlers/BaseFileHandler.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9o
[GitHub] [hudi] codecov-commenter edited a comment on pull request #3006: [HUDI-1943]fix lose properties problem
codecov-commenter edited a comment on pull request #3006: URL: https://github.com/apache/hudi/pull/3006#issuecomment-850365971 # [Codecov](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#3006](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f03396d) into [master](https://codecov.io/gh/apache/hudi/commit/bc18c39835d6775b063ae072aea4ba43177d66b1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bc18c39) will **increase** coverage by `15.82%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/3006/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) ```diff @@ Coverage Diff @@ ## master#3006 +/- ## = + Coverage 55.00% 70.83% +15.82% + Complexity 3850 385 -3465 = Files 485 54 -431 Lines 23467 2016-21451 Branches 2497 241 -2256 = - Hits 12909 1428-11481 + Misses 9406 454 -8952 + Partials 1152 134 -1018 ``` | Flag | Coverage Δ | | |---|---|---| | hudicli | `?` | | | hudiclient | `?` | | | hudicommon | `?` | | | hudiflink | `?` | | | hudihadoopmr | `?` | | | hudisparkdatasource | `?` | | | hudisync | `?` | | | huditimelineservice | `?` | | | hudiutilities | `70.83% <ø> (ø)` | | Flags with carried forward coverage won't be shown. 
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [...ava/org/apache/hudi/common/util/BaseFileUtils.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQmFzZUZpbGVVdGlscy5qYXZh) | | | | [...main/java/org/apache/hudi/hive/HiveSyncConfig.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZVN5bmNDb25maWcuamF2YQ==) | | | | [.../common/bloom/HoodieDynamicBoundedBloomFilter.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0hvb2RpZUR5bmFtaWNCb3VuZGVkQmxvb21GaWx0ZXIuamF2YQ==) | | | | [...e/hudi/exception/HoodieDeltaStreamerException.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZURlbHRhU3RyZWFtZXJFeGNlcHRpb24uamF2YQ==) | | | | 
[.../apache/hudi/common/model/HoodieRecordPayload.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlY29yZFBheWxvYWQuamF2YQ==) | | | | [...che/hudi/metadata/TimelineMergedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1j
[GitHub] [hudi] codecov-commenter commented on pull request #3006: [HUDI-1943]fix lose properties problem
codecov-commenter commented on pull request #3006: URL: https://github.com/apache/hudi/pull/3006#issuecomment-850365971 # [Codecov](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#3006](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (f03396d) into [master](https://codecov.io/gh/apache/hudi/commit/bc18c39835d6775b063ae072aea4ba43177d66b1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (bc18c39) will **decrease** coverage by `45.73%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/3006/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) ```diff @@ Coverage Diff @@ ## master #3006 +/- ## - Coverage 55.00% 9.27% -45.74% + Complexity 3850 48 -3802 Files 485 54 -431 Lines 234672016-21451 Branches 2497 241 -2256 - Hits 12909 187-12722 + Misses 94061816 -7590 + Partials 1152 13 -1139 ``` | Flag | Coverage Δ | | |---|---|---| | hudicli | `?` | | | hudiclient | `?` | | | hudicommon | `?` | | | hudiflink | `?` | | | hudihadoopmr | `?` | | | hudisparkdatasource | `?` | | | hudisync | `?` | | | huditimelineservice | `?` | | | hudiutilities | `9.27% <ø> (-61.56%)` | :arrow_down: | Flags with carried forward coverage won't be shown. 
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tre
[GitHub] [hudi] codecov-commenter edited a comment on pull request #2923: [HUDI-1864] Added support for Date, Timestamp, LocalDate and LocalDateTime in TimestampBasedAvroKeyGenerator
codecov-commenter edited a comment on pull request #2923: URL: https://github.com/apache/hudi/pull/2923#issuecomment-846613183 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2923?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#2923](https://codecov.io/gh/apache/hudi/pull/2923?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (1c8fe55) into [master](https://codecov.io/gh/apache/hudi/commit/112732db815660dfdf4b62aa443127d7884b6e63?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (112732d) will **decrease** coverage by `45.69%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2923/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2923?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) ```diff @@ Coverage Diff @@ ## master #2923 +/- ## - Coverage 54.97% 9.27% -45.70% + Complexity 3845 48 -3797 Files 485 54 -431 Lines 234362016-21420 Branches 2494 241 -2253 - Hits 12884 187-12697 + Misses 94001816 -7584 + Partials 1152 13 -1139 ``` | Flag | Coverage Δ | | |---|---|---| | hudicli | `?` | | | hudiclient | `∅ <ø> (∅)` | | | hudicommon | `?` | | | hudiflink | `?` | | | hudihadoopmr | `?` | | | hudisparkdatasource | `?` | | | hudisync | `?` | | | huditimelineservice | `?` | | | hudiutilities | `9.27% <ø> (-61.61%)` | :arrow_down: | Flags with carried forward coverage won't be shown. 
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2923?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | :arrow_down: | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2923/dif
[GitHub] [hudi] leesf commented on a change in pull request #3004: [HUDI-1940] Add SqlQueryBasedTransformer unit test
leesf commented on a change in pull request #3004: URL: https://github.com/apache/hudi/pull/3004#discussion_r641474949

## File path: hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java

```java
// transform
SqlQueryBasedTransformer transformer = new SqlQueryBasedTransformer();
Dataset<Row> result = transformer.apply(jsc, spark, ds, props);

// check result
assertEquals(result.columns().length, 11);
```

Review comment: pls use assertEquals(expected, result.columns), so please change to `assertEquals(11, result.columns().length)`

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] leesf commented on pull request #2998: [HUDI-1426] Rename classname in camel case format
leesf commented on pull request #2998: URL: https://github.com/apache/hudi/pull/2998#issuecomment-850349161 @veenaypatil Thanks for your contribution. Changing the class name would cause compatibility issues, since many users already use `NonpartitionedKeyGenerator` in their codebases.
[jira] [Updated] (HUDI-1943) [hudi-flink]Lose properties when hoodieWriteConfig initializtion
[ https://issues.apache.org/jira/browse/HUDI-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1943: - Labels: pull-request-available (was: ) > [hudi-flink]Lose properties when hoodieWriteConfig initializtion > > > Key: HUDI-1943 > URL: https://issues.apache.org/jira/browse/HUDI-1943 > Project: Apache Hudi > Issue Type: Bug > Components: Flink Integration >Reporter: hk__lrzy >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] hk-lrzy opened a new pull request #3006: [HUDI-1943]fix lose properties problem
hk-lrzy opened a new pull request #3006: URL: https://github.com/apache/hudi/pull/3006 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request *(For example: This pull request adds quick-start document.)* ## Brief change log *(for example:)* - *Modify AnnotationLocation checkstyle rule in checkstyle.xml* ## Verify this pull request *(Please pick either of the following options)* This pull request is a trivial rework / code cleanup without any test coverage. *(or)* This pull request is already covered by existing tests, such as *(please describe tests)*. (or) This change added tests and can be verified as follows: *(example:)* - *Added integration tests for end-to-end.* - *Added HoodieClientWriteTest to verify the change.* - *Manually verified the change by running a job locally.* ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[jira] [Created] (HUDI-1943) [hudi-flink]Lose properties when hoodieWriteConfig initializtion
hk__lrzy created HUDI-1943: -- Summary: [hudi-flink]Lose properties when hoodieWriteConfig initializtion Key: HUDI-1943 URL: https://issues.apache.org/jira/browse/HUDI-1943 Project: Apache Hudi Issue Type: Bug Components: Flink Integration Reporter: hk__lrzy -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] t0il3ts0ap commented on issue #2934: [SUPPORT] Parquet file does not exist when trying to read hudi table incrementally
t0il3ts0ap commented on issue #2934: URL: https://github.com/apache/hudi/issues/2934#issuecomment-850312591 @n3nash Sorry for the late reply, I was away on vacation. 1. We run one instance of the deltastreamer job every 2 hrs on the source table. Each run sources at most 6 million records using the --source-limit parameter. 2. ~12 runs of deltastreamer. Not sure if that results in 12 commits. (Please share some knowledge or documentation here; I apologize that I don't know this.) 3. Please assume defaults for this. I have not modified the cleaner policy for deltastreamer. 4. This is failing in the first run itself. I checked manually; there are no files for the oldest commits. A sample case: let's say I open the .hoodie directory for a COW table with no partition: https://user-images.githubusercontent.com/8509512/119967062-9be8f780-bfc9-11eb-9bca-87cb8fc4fd07.png Now, I open this commit file `20210524175014.commit` to find the following data: ``` { "partitionToWriteStats" : { "default" : [ { "fileId" : "890de7c3-7f0d-4586-8e66-83a9f8d9c106-0", "path" : "default/890de7c3-7f0d-4586-8e66-83a9f8d9c106-0_0-23-13521_20210524175014.parquet", "prevCommit" : "20210524160842", "numWrites" : 325368, "numDeletes" : 0, "numUpdateWrites" : 0, "numInserts" : 115, "totalWriteBytes" : 32817083, "totalWriteErrors" : 0, "tempPath" : null, "partitionPath" : "default", "totalLogRecords" : 0, "totalLogFilesCompacted" : 0, "totalLogSizeCompacted" : 0, "totalUpdatedRecordsCompacted" : 0, "totalLogBlocks" : 0, "totalCorruptLogBlock" : 0, "totalRollbackBlocks" : 0, "fileSizeInBytes" : 32817083 } ] }, "compacted" : false, "extraMetadata" : { "schema" : 
"{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"created_at\",\"type\":[\"string\",\"null\"]},{\"name\":\"created_by\",\"type\":[\"string\",\"null\"]},{\"name\":\"last_modified_by\",\"type\":[\"string\",\"null\"]},{\"name\":\"version\",\"type\":[\"long\",\"null\"]},{\"name\":\"address_reference_id\",\"type\":[\"string\",\"null\"]},{\"name\":\"city\",\"type\":[\"string\",\"null\"]},{\"name\":\"current\",\"type\":[\"boolean\",\"null\"]},{\"name\":\"line_one\",\"type\":[\"string\",\"null\"]},{\"name\":\"line_two\",\"type\":[\"string\",\"null\"]},{\"name\":\"permanent\",\"type\":[\"boolean\",\"null\"]},{\"name\":\"pin_code\",\"type\":[\"string\",\"null\"]},{\"name\":\"source\",\"type\":[\"string\",\"null\"]},{\"name\":\"state\",\"type\":[\"string\",\"null\"]},{\"name\":\"type\",\"type\":[\"string\",\"null\"]},{\"name\":\"email_address\",\"type\":[\"string\",\"null\"]},{\"name\":\ "collection_case_id\",\"type\":[\"long\",\"null\"]},{\"name\":\"updated_date\",\"type\":[\"string\",\"null\"]},{\"name\":\"updated_at\",\"type\":[\"string\",\"null\"]},{\"name\":\"latitude\",\"type\":[\"double\",\"null\"]},{\"name\":\"longitude\",\"type\":[\"double\",\"null\"]},{\"name\":\"__lsn\",\"type\":[\"long\",\"null\"]},{\"name\":\"_hoodie_is_deleted\",\"type\":[\"boolean\",\"null\"]}]}", "deltastreamer.checkpoint.key" : "collection_service.public.addresses,0:1066373,1:1064984,2:1065562,3:1070617,4:1067374,5:1066110" }, "operationType" : "UPSERT", "fileIdAndRelativePaths" : { "890de7c3-7f0d-4586-8e66-83a9f8d9c106-0" : "default/890de7c3-7f0d-4586-8e66-83a9f8d9c106-0_0-23-13521_20210524175014.parquet" }, "totalRecordsDeleted" : 0, "totalLogRecordsCompacted" : 0, "totalLogFilesCompacted" : 0, "totalCompactedRecordsUpdated" : 0, "totalLogFilesSize" : 0, "totalScanTime" : 0, "totalCreateTime" : 0, "totalUpsertTime" : 5089 } ``` Now , this 
`default/890de7c3-7f0d-4586-8e66-83a9f8d9c106-0_0-23-13521_20210524175014.parquet` mentioned in the commit is missing. This error is raised only while doing an [incremental query](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java). But if I just search the prefix `890de7c3-7f0d-4586-8e66-83a9f8d9c106`, there are many parquet files which I can see. Right now, I have overridden `HoodieIncrSource.java` when running deltastreamer for the target table to make it work. I added logic like the following: if no checkpoint is found (basically this is the first run), do a snapshot query and save the last commit as the checkpoint; if a checkpoint is found, carry on with the incremental query as usual. The benefit is that we are always doing incremental queries on newer commits rather than older ones, and the newer commits' parquet files are present. It also solves the problem of taking a snapshot automatically when reading from a Hudi table for the first time.
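The snapshot-then-incremental fallback described in the comment above can be sketched as follows. This is a minimal illustration of the decision logic only, using hypothetical class and method names — it is not Hudi's actual `HoodieIncrSource` API:

```java
import java.util.Optional;

// Sketch of the commenter's fallback: on the first run (no saved
// checkpoint) do a snapshot query and save the latest commit time as
// the checkpoint; on later runs, do an incremental query from the
// saved checkpoint. Names here are illustrative, not Hudi's API.
public class IncrSourceFallback {

    enum QueryMode { SNAPSHOT, INCREMENTAL }

    // Decide how to read based on whether a checkpoint exists.
    static QueryMode chooseMode(Optional<String> lastCheckpoint) {
        return lastCheckpoint.isPresent() ? QueryMode.INCREMENTAL : QueryMode.SNAPSHOT;
    }

    public static void main(String[] args) {
        // First run: no checkpoint yet -> snapshot query.
        System.out.println(chooseMode(Optional.empty()));              // SNAPSHOT
        // Subsequent runs: checkpoint present -> incremental query.
        System.out.println(chooseMode(Optional.of("20210524175014"))); // INCREMENTAL
    }
}
```

The point of the design is that the incremental path only ever starts from a commit the job itself has seen, so it never references file versions old enough to have been reclaimed by the cleaner.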
[GitHub] [hudi] codecov-commenter edited a comment on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
codecov-commenter edited a comment on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-850284847 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#2438](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (739a252) into [master](https://codecov.io/gh/apache/hudi/commit/7fed7352bd506e20e5316bb0b3ed9e5c1e9c76df?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7fed735) will **decrease** coverage by `1.49%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2438/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)

```diff
@@             Coverage Diff              @@
##             master    #2438      +/-   ##
============================================
- Coverage     54.96%   53.47%   -1.50%
+ Complexity     3844     3459     -385
  Files           485      431      -54
  Lines         23437    21421    -2016
  Branches       2494     2253     -241
- Hits          12882    11454    -1428
+ Misses         9401     8947     -454
+ Partials       1154     1020     -134
```

| Flag | Coverage Δ | |
|---|---|---|
| hudicli | `39.55% <ø> (ø)` | |
| hudiclient | `∅ <ø> (∅)` | |
| hudicommon | `50.29% <ø> (ø)` | |
| hudiflink | `63.41% <ø> (ø)` | |
| hudihadoopmr | `51.54% <ø> (ø)` | |
| hudisparkdatasource | `73.33% <ø> (ø)` | |
| hudisync | `46.44% <ø> (ø)` | |
| huditimelineservice | `64.36% <ø> (ø)` | |
| hudiutilities | `?` | |

Flags with carried forward coverage won't be shown. 
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [...di/utilities/sources/helpers/IncrSourceHelper.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvaGVscGVycy9JbmNyU291cmNlSGVscGVyLmphdmE=) | | | | [...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=) | | | | [...i/utilities/deltastreamer/SourceFormatAdapter.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvU291cmNlRm9ybWF0QWRhcHRlci5qYXZh) | | | | [...s/exception/HoodieIncrementalPullSQLException.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVJbmNyZW1lbnRhbFB1bGxTUUxFeGNlcHRpb24uamF2YQ==) | | | 
| [...alCheckpointFromAnotherHoodieTimelineProvider.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NoZWNrcG9pbnRpbmcvSW5pdGlhbENoZWNrcG9pbnRGcm9tQW5vdGhlckhvb2RpZVRpbWVsaW5lUHJvdmlkZXIuamF2YQ==) | | | | [...g/apache/hudi/utilities/schema/SchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&
[GitHub] [hudi] codecov-commenter commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp
codecov-commenter commented on pull request #2438: URL: https://github.com/apache/hudi/pull/2438#issuecomment-850284847 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report > Merging [#2438](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (739a252) into [master](https://codecov.io/gh/apache/hudi/commit/7fed7352bd506e20e5316bb0b3ed9e5c1e9c76df?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (7fed735) will **decrease** coverage by `1.31%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2438/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)

```diff
@@             Coverage Diff              @@
##             master    #2438      +/-   ##
============================================
- Coverage     54.96%   53.65%   -1.32%
+ Complexity     3844     3253     -591
  Files           485      407      -78
  Lines         23437    19762    -3675
  Branches       2494     2085     -409
- Hits          12882    10603    -2279
+ Misses         9401     8221    -1180
+ Partials       1154      938     -216
```

| Flag | Coverage Δ | |
|---|---|---|
| hudicli | `39.55% <ø> (ø)` | |
| hudiclient | `∅ <ø> (∅)` | |
| hudicommon | `50.29% <ø> (ø)` | |
| hudiflink | `63.41% <ø> (ø)` | |
| hudihadoopmr | `51.54% <ø> (ø)` | |
| hudisparkdatasource | `73.33% <ø> (ø)` | |
| hudisync | `?` | |
| huditimelineservice | `?` | |
| hudiutilities | `?` | |

Flags with carried forward coverage won't be shown. 
[Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment) to find out more. | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | | |---|---|---| | [...java/org/apache/hudi/utilities/sources/Source.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvU291cmNlLmphdmE=) | | | | [...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=) | | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=) | | | | [.../apache/hudi/timeline/service/TimelineService.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvVGltZWxpbmVTZXJ2aWNlLmphdmE=) | 
| | | [...di/utilities/sources/helpers/IncrSourceHelper.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvaGVscGVycy9JbmNyU291cmNlSGVscGVyLmphdmE=) | | | | [.../apache/hudi/hive/MultiPartKeysValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_cam
[GitHub] [hudi] veenaypatil commented on pull request #2998: [HUDI-1426] Rename classname in camel case format
veenaypatil commented on pull request #2998: URL: https://github.com/apache/hudi/pull/2998#issuecomment-850271396 @yanghua can you please look into this PR
[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service
pratyakshsharma commented on pull request #2967: URL: https://github.com/apache/hudi/pull/2967#issuecomment-850210814 @n3nash @nsivabalan Please take a look. All the comments are addressed.
[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service
pratyakshsharma commented on a change in pull request #2967: URL: https://github.com/apache/hudi/pull/2967#discussion_r641327528 ## File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md ## @@ -0,0 +1,77 @@ +--- +title: "Employing correct configurations for Hudi's cleaner table service" +excerpt: "Achieving isolation between Hudi writer and readers using `HoodieCleaner.java`" +author: pratyakshsharma +category: blog +--- + +Apache Hudi provides snapshot isolation between writers and readers. This is made possible by Hudi’s MVCC concurrency model. In this blog, we will explain how to employ the right configurations to manage multiple file versions. Furthermore, we will discuss mechanisms available to users generating Hudi tables on how to maintain just the required number of old file versions so that long running readers do not fail. + +### Reclaiming space and bounding your data lake growth + +Hudi provides different table management services to be able to manage your tables on the data lake. One of these services is called the **Cleaner**. As you write more data to your table, for every batch of updates received, Hudi can either generate a new version of the data file with updates applied to records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding rewriting newer version of an existing file (MERGE_ON_READ). In such situations, depending on the frequency of your updates, the number of file versions of log files can grow indefinitely. If your use-cases do not require keeping an infinite history of these versions, it is imperative to have a process that reclaims older versions of the data. This is Hudi’s cleaner service. + +### Problem Statement + +In a data lake architecture, it is a very common scenario to have readers and writers concurrently accessing the same table. 
As the Hudi cleaner service periodically reclaims older file versions, scenarios arise where a long running query might be accessing a file version that is deemed to be reclaimed by the cleaner. Here, we need to employ the correct configs to ensure readers (aka queries) don’t fail. + +### Deeper dive into Hudi Cleaner + +To deal with the mentioned scenario, let's understand the different cleaning policies that Hudi offers and the corresponding properties that need to be configured. Options are available to schedule cleaning asynchronously or synchronously. Before going into more details, we would like to explain a few underlying concepts: + + - **Hudi base file**: Columnar file which consists of final data after compaction. A base file’s name follows the following naming convention: `__.parquet`. In subsequent writes of this file, the file id remains the same and the commit time gets updated to show the latest version. This also implies any particular version of a record, given its partition path, can be uniquely located using the file id and instant time. + - **File slice**: A file slice consists of the base file and any log files consisting of the delta, in case of MERGE_ON_READ table type. + - **Hudi File Group**: Any file group in Hudi is uniquely identified by the partition path and the file id that the files in this group have as part of their name. A file group consists of all the file slices in a particular partition path. Also any partition path can have multiple file groups. + +### Cleaning Policies + +Hudi cleaner currently supports the below cleaning policies: + + - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal cleaning policy that ensures the effect of having lookback into all the changes that happened in the last X commits. Suppose a writer is ingesting data into a Hudi dataset every 30 minutes, and the longest running query can take 5 hours to finish; then the user should retain at least the last 10 commits. 
With such a configuration, we ensure that the oldest version of a file is kept on disk for at least 5 hours, thereby preventing the longest running query from failing at any point in time. Incremental cleaning is also possible using this policy. + - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the effect of keeping N file versions irrespective of time. This policy is useful when one knows the maximum number of file versions to keep at any given time. To achieve the same behaviour as before of preventing long running queries from failing, one should do their calculations based on data patterns. Alternatively, this policy is also useful if a user just wants to maintain the single latest version of the file. + +### Examples + +Suppose a user uses the below configs for cleaning: + +```java +hoodie.cleaner.policy=KEEP_LATEST_COMMITS +hoodie.cleaner.commits.retained=10 +``` + +Cleaner selects the versions of files to be cleaned by taking care of the following: + + - Latest version of a file should not be cleaned. + - The commit times of the last 10 (config
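The retention sizing described in the quoted blog draft (commit interval vs. longest running query) reduces to simple arithmetic. A minimal sketch, using the draft's own example numbers (30-minute commits, a 5-hour query); the helper name is illustrative, not part of Hudi:

```java
// Sketch of sizing hoodie.cleaner.commits.retained under
// KEEP_LATEST_COMMITS: retain enough commits to cover the longest
// running query, so its file versions are never reclaimed mid-query.
public class CleanerRetention {

    // ceil(longestQueryMinutes / commitIntervalMinutes), in integer math
    static long commitsToRetain(long commitIntervalMinutes, long longestQueryMinutes) {
        return (longestQueryMinutes + commitIntervalMinutes - 1) / commitIntervalMinutes;
    }

    public static void main(String[] args) {
        // Writer commits every 30 minutes; longest query runs 5 hours.
        System.out.println(commitsToRetain(30, 5 * 60)); // 10
    }
}
```

This reproduces the draft's "retain at least the last 10 commits" figure; with a different ingestion cadence or query duration, the same formula gives the appropriate setting.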
[GitHub] [hudi] hushenmin opened a new issue #3005: [SUPPORT]
hushenmin opened a new issue #3005: URL: https://github.com/apache/hudi/issues/3005 How to query a history snapshot given one history partition? At present I can query the historical snapshot of a Hudi partition through the following method, but for me this method is very unfriendly, because at query time I don't know how many recent commits the partition I want to query has. ``` add jar ${hudi.hadoop.bundle}; set hoodie.stock_ticks_cow.consume.mode=INCREMENTAL; set hoodie.stock_ticks_cow.consume.max.commits=3; set hoodie.stock_ticks_cow.consume.start.timestamp='${min.commit.time}'; ``` If the Hudi community can provide a way to query the full history snapshots of a partition in Hudi, I think this would be a great feature, because it solves the pain point of querying full history snapshots. In this way, I could directly switch the current full data warehouse `Hive on HDFS` model to `Hive on Hudi`.
[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service
pratyakshsharma commented on a change in pull request #2967: URL: https://github.com/apache/hudi/pull/2967#discussion_r641319609 ## File path: docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md (quotes the same file excerpt as the previous comment)
[jira] [Closed] (HUDI-1923) Add state in StreamWriteFunction to restore
[ https://issues.apache.org/jira/browse/HUDI-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vinoyang closed HUDI-1923. -- Fix Version/s: 0.9.0 Resolution: Done bc18c39835d6775b063ae072aea4ba43177d66b1 > Add state in StreamWriteFunction to restore > --- > > Key: HUDI-1923 > URL: https://issues.apache.org/jira/browse/HUDI-1923 > Project: Apache Hudi > Issue Type: Improvement > Components: Flink Integration >Reporter: yuzhaojing >Assignee: yuzhaojing >Priority: Major > Labels: pull-request-available > Fix For: 0.9.0 > > > In Flink, notifyCheckpointComplete is not part of the checkpoint life cycle. If a > checkpoint succeeds but the commit action in notifyCheckpointComplete fails, > then when we restore from the latest checkpoint, the elements belonging to that > instant will be discarded. > So we should store the commit state and restore it when Flink restarts. -- This message was sent by Atlassian Jira (v8.3.4#803005)