[GitHub] [hudi] wangxianghu commented on a change in pull request #3006: [HUDI-1943] Lose properties when hoodieWriteConfig initializtion

2021-05-28 Thread GitBox


wangxianghu commented on a change in pull request #3006:
URL: https://github.com/apache/hudi/pull/3006#discussion_r641884136



##
File path: hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java
##
@@ -242,7 +242,9 @@ public static TypedProperties 
flinkConf2TypedProperties(Configuration conf) {
 properties.put(option.key(), option.defaultValue());
   }
 }
-return new TypedProperties(properties);
+TypedProperties typeProps = new TypedProperties();
+typeProps.putAll(typeProps);

Review comment:
   Hi @hk-lrzy, thanks for your contribution.
   There might be a mistake here; do you mean `typeProps.putAll(properties)`?
   
   BTW, I tried this method and it seems OK. If you call 
`org.apache.hudi.common.config.TypedProperties#getString(java.lang.String)`, 
the value you set does exist.
   The problem might be that the `toString()` method shows no values, right?
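   For reference, a minimal sketch of how I read the intended fix (my assumption, not the code in this PR):
   ```java
   // Assumed fix (illustrative only): copy from the populated `properties`,
   // not from the empty `typeProps` itself, then return the new instance.
   TypedProperties typeProps = new TypedProperties();
   typeProps.putAll(properties);
   return typeProps;
   ```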




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #3006: [HUDI-1943] Lose properties when hoodieWriteConfig initializtion

2021-05-28 Thread GitBox


wangxianghu commented on a change in pull request #3006:
URL: https://github.com/apache/hudi/pull/3006#discussion_r641884136



##
File path: hudi-flink/src/main/java/org/apache/hudi/util/StreamerUtil.java
##
@@ -242,7 +242,9 @@ public static TypedProperties 
flinkConf2TypedProperties(Configuration conf) {
 properties.put(option.key(), option.defaultValue());
   }
 }
-return new TypedProperties(properties);
+TypedProperties typeProps = new TypedProperties();
+typeProps.putAll(typeProps);

Review comment:
   Hi @hk-lrzy, thanks for your contribution.
   There might be a mistake here; do you mean `typeProps.putAll(properties)`?
   
   BTW, I think it's better to fix this problem in the constructor of 
`TypedProperties`, since that's where the problem really lies.
   WDYT?
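   Something along these lines, as a rough sketch of the constructor-level fix (an assumption, not an actual patch):
   ```java
   // Rough sketch (assumption): copy the entries eagerly in the constructor instead
   // of relying on java.util.Properties "defaults", so the values are visible on
   // the new TypedProperties instance itself.
   public TypedProperties(java.util.Properties props) {
     if (props != null) {
       for (java.util.Map.Entry<Object, Object> entry : props.entrySet()) {
         put(entry.getKey(), entry.getValue());
       }
     }
   }
   ```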




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1943) Lose properties when hoodieWriteConfig initializtion

2021-05-28 Thread Xianghu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghu Wang updated HUDI-1943:
---
Summary: Lose properties when hoodieWriteConfig initializtion  (was: 
[hudi-flink]Lose properties when hoodieWriteConfig initializtion)

> Lose properties when hoodieWriteConfig initializtion
> 
>
> Key: HUDI-1943
> URL: https://issues.apache.org/jira/browse/HUDI-1943
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Flink Integration
>Reporter: hk__lrzy
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1947) Hudi Commit Callback and commit in a single transaction

2021-05-28 Thread Pranoti Shanbhag (Jira)
Pranoti Shanbhag created HUDI-1947:
--

 Summary: Hudi Commit Callback and commit in a single transaction
 Key: HUDI-1947
 URL: https://issues.apache.org/jira/browse/HUDI-1947
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Pranoti Shanbhag


Hello,

I am using Hudi commit callbacks to call an internal service. As per my 
understanding, the service is called after the commit on the dataset, and if 
there is a failure in the callback service the commit is not rolled back.

The service we call saves the commit time in a database which is accessed 
by multiple pipelines to get the incremental delta. For example, when there are 
4 commits in the Hudi dataset, we register 4 commit timestamps in the database. The 
pipelines that need the incremental delta run at different frequencies and use 
this database to fetch new data on their respective runs. 

For this to work well, we need the Hudi commit and callback to be atomic in a 
single transaction. Otherwise, on callback failures, there may be data in the 
Hudi dataset which is not registered in the DB.

Can you please let me know whether this can be supported and whether there is a 
way to achieve this with the current implementation? We do have retries set up and are 
not expecting failures, but we want to keep the Hudi commits in sync with what 
we register in the DB.

 

Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] pranotishanbhag opened a new issue #3008: [SUPPORT] Hive Sync issues on deletes and non partitioned table

2021-05-28 Thread GitBox


pranotishanbhag opened a new issue #3008:
URL: https://github.com/apache/hudi/issues/3008


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? 
   YES
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   Part of the Slack groups. Did not find resolution there.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   I am not sure this is a bug, but we can check after the analysis.
   
   **Describe the problem you faced**
   
   I am working on unit tests for some of the operations we would perform. I 
see my tests failing for the below 2 scenarios:
   1. Hive Table is not updated when DELETE operation is called on the dataset.
   2. Hive Table is not updated (getting empty table) when no partitions are 
specified.
   
   ### Details on Issue 1:
   I am trying to sync a hive table on upsert (works fine) and on delete (does 
not work) in my unit tests. As per the doc 
[Hudi_Writing-Data](https://hudi.apache.org/docs/writing_data.html#key-generation),
 we need to use GlobalDeleteKeyGenerator class for delete: 
   
   > classOf[GlobalDeleteKeyGenerator].getCanonicalName (to be used when 
OPERATION_OPT_KEY is set to DELETE_OPERATION_OPT_VAL)
   
   However, when I use this class I get the error: 
   
   > “Cause: java.lang.InstantiationException: 
org.apache.hudi.keygen.GlobalDeleteKeyGenerator” 
   
   These are the hudi properties set on save:
   ```
   (hoodie.datasource.hive_sync.database,default)
   (hoodie.combine.before.insert,true)
   (hoodie.insert.shuffle.parallelism,2)
   (hoodie.datasource.write.precombine.field,timestamp)
   (hoodie.datasource.hive_sync.partition_fields,partition)
   (hoodie.datasource.hive_sync.use_jdbc,false)
   (hoodie.datasource.hive_sync.partition_extractor_class,org.apache.hudi.keygen.GlobalDeleteKeyGenerator)
   (hoodie.delete.shuffle.parallelism,2)
   (hoodie.datasource.hive_sync.table,TestHudiTable)
   (hoodie.index.type,GLOBAL_BLOOM)
   (hoodie.datasource.write.operation,DELETE)
   (hoodie.datasource.hive_sync.enable,true)
   (hoodie.datasource.write.recordkey.field,id)
   (hoodie.table.name,TestHudiTable)
   (hoodie.datasource.write.table.type,COPY_ON_WRITE)
   (hoodie.datasource.write.hive_style_partitioning,true)
   (hoodie.upsert.shuffle.parallelism,2)
   (hoodie.cleaner.commits.retained,15)
   (hoodie.datasource.write.partitionpath.field,partition)
   (hoodie.datasource.hive_sync.database,default)
   (hoodie.combine.before.insert,true)
   (hoodie.embed.timeline.server,false)
   (hoodie.insert.shuffle.parallelism,2)
   (hoodie.datasource.write.precombine.field,timestamp)
   (hoodie.datasource.hive_sync.partition_fields,partition)
   (hoodie.datasource.hive_sync.use_jdbc,false)
   (hoodie.datasource.hive_sync.partition_extractor_class,org.apache.hudi.keygen.GlobalDeleteKeyGenerator)
   (hoodie.delete.shuffle.parallelism,2)
   (hoodie.datasource.hive_sync.table,TestHudiTable)
   (hoodie.index.type,GLOBAL_BLOOM)
   (hoodie.datasource.write.operation,DELETE)
   (hoodie.datasource.hive_sync.enable,true)
   (hoodie.datasource.write.recordkey.field,id)
   (hoodie.table.name,TestHudiTable)
   (hoodie.datasource.write.table.type,COPY_ON_WRITE)
   (hoodie.datasource.write.hive_style_partitioning,true)
   (hoodie.upsert.shuffle.parallelism,2)
   (hoodie.cleaner.commits.retained,15)
   (hoodie.datasource.write.partitionpath.field,partition)
   ```
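   For context, a minimal sketch (illustrative only, not my actual test code; class and method names are made up) of how the options above are passed to the writer:
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   // Illustrative sketch of the delete write that uses the properties listed above.
   public class HudiDeleteWriteExample {
     public static void deleteRecords(Dataset<Row> df, String basePath) {
       df.write()
           .format("hudi")
           .option("hoodie.table.name", "TestHudiTable")
           .option("hoodie.datasource.write.operation", "DELETE")
           .option("hoodie.datasource.write.recordkey.field", "id")
           .option("hoodie.datasource.write.partitionpath.field", "partition")
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```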
   If I switch to the MultiPartKeysValueExtractor class, the deletes are not 
propagated to the Hive table. The Hudi read - 
spark.read.format("hudi").load(BasePath) - has the right data, where the id is 
deleted, but spark.table("TableName") is not consistent and still has the id 
which was supposed to be deleted. For example, id1 is deleted in Hudi but still 
in the Hive table:
   ```
   spark.read.format("hudi").load()
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name|id |timestamp|partition|dat|
   |20210528125210|20210528125210_0_3|id2|partition=p1|9b3aae86-e4ed-4bfa-b6cd-ee41db11ac15-0_0-23-26_20210528125210.parquet|id2|1|p1|data2|
   |20210528125210|20210528125210_1_1|id3|partition=p2|a53dba79-90a2-4370-8637-e39301b4c10c-0_1-23-27_20210528125210.parquet|id3|3|p2|data3|
   ```

[GitHub] [hudi] jtmzheng commented on issue #2983: [SUPPORT] Is hoodie.consistency.check.enabled still relevant?

2021-05-28 Thread GitBox


jtmzheng commented on issue #2983:
URL: https://github.com/apache/hudi/issues/2983#issuecomment-850619877


   @fanaticjo what kind of issues did you see?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhedoubushishi commented on a change in pull request #2833: [HUDI-89] Add configOption & refactor Hudi configuration framework

2021-05-28 Thread GitBox


zhedoubushishi commented on a change in pull request #2833:
URL: https://github.com/apache/hudi/pull/2833#discussion_r641742577



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/config/TestConfigOption.java
##
@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.config;
+
+import org.junit.jupiter.api.Test;
+
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Properties;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNull;
+
+public class TestConfigOption {

Review comment:
   Good point. Done

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java
##
@@ -50,52 +52,95 @@
  * @see HoodieTableMetaClient
  * @since 0.3.0
  */
-public class HoodieTableConfig implements Serializable {
+public class HoodieTableConfig extends HoodieConfig implements Serializable {
 
   private static final Logger LOG = 
LogManager.getLogger(HoodieTableConfig.class);
 
   public static final String HOODIE_PROPERTIES_FILE = "hoodie.properties";
-  public static final String HOODIE_TABLE_NAME_PROP_NAME = "hoodie.table.name";
-  public static final String HOODIE_TABLE_TYPE_PROP_NAME = "hoodie.table.type";
-  public static final String HOODIE_TABLE_VERSION_PROP_NAME = 
"hoodie.table.version";
-  public static final String HOODIE_TABLE_PRECOMBINE_FIELD = 
"hoodie.table.precombine.field";
-  public static final String HOODIE_TABLE_PARTITION_COLUMNS = 
"hoodie.table.partition.columns";
-
-  @Deprecated
-  public static final String HOODIE_RO_FILE_FORMAT_PROP_NAME = 
"hoodie.table.ro.file.format";
-  @Deprecated
-  public static final String HOODIE_RT_FILE_FORMAT_PROP_NAME = 
"hoodie.table.rt.file.format";
-  public static final String HOODIE_BASE_FILE_FORMAT_PROP_NAME = 
"hoodie.table.base.file.format";
-  public static final String HOODIE_LOG_FILE_FORMAT_PROP_NAME = 
"hoodie.table.log.file.format";
-  public static final String HOODIE_TIMELINE_LAYOUT_VERSION = 
"hoodie.timeline.layout.version";
-  public static final String HOODIE_PAYLOAD_CLASS_PROP_NAME = 
"hoodie.compaction.payload.class";
-  public static final String HOODIE_ARCHIVELOG_FOLDER_PROP_NAME = 
"hoodie.archivelog.folder";
-  public static final String HOODIE_BOOTSTRAP_INDEX_ENABLE = 
"hoodie.bootstrap.index.enable";
-  public static final String HOODIE_BOOTSTRAP_INDEX_CLASS_PROP_NAME = 
"hoodie.bootstrap.index.class";
-  public static final String HOODIE_BOOTSTRAP_BASE_PATH = 
"hoodie.bootstrap.base.path";
-
-  public static final HoodieTableType DEFAULT_TABLE_TYPE = 
HoodieTableType.COPY_ON_WRITE;
-  public static final HoodieTableVersion DEFAULT_TABLE_VERSION = 
HoodieTableVersion.ZERO;
-  public static final HoodieFileFormat DEFAULT_BASE_FILE_FORMAT = 
HoodieFileFormat.PARQUET;
-  public static final HoodieFileFormat DEFAULT_LOG_FILE_FORMAT = 
HoodieFileFormat.HOODIE_LOG;
-  public static final String DEFAULT_PAYLOAD_CLASS = 
OverwriteWithLatestAvroPayload.class.getName();
-  public static final String NO_OP_BOOTSTRAP_INDEX_CLASS = 
NoOpBootstrapIndex.class.getName();
-  public static final String DEFAULT_BOOTSTRAP_INDEX_CLASS = 
HFileBootstrapIndex.class.getName();
-  public static final String DEFAULT_ARCHIVELOG_FOLDER = "";
 
-  private Properties props;
+  public static final ConfigOption<String> HOODIE_TABLE_NAME_PROP_NAME = ConfigOption
+  .key("hoodie.table.name")
+  .noDefaultValue()
+  .withDescription("Table name that will be used for registering with Hive. Needs to be same across runs.");
+
+  public static final ConfigOption<HoodieTableType> HOODIE_TABLE_TYPE_PROP_NAME = ConfigOption
+  .key("hoodie.table.type")
+  .defaultValue(HoodieTableType.COPY_ON_WRITE)
+  .withDescription("The table type for the underlying data, for this write. This can’t change between writes.");
+
+  public static final ConfigOption<HoodieTableVersion> HOODIE_TABLE_VERSION_PROP_NAME = ConfigOption
+  .key("hoodie.table.version")
+  .defaultValue(HoodieTableVersion.ZERO)
+  .withDescription("");
+
+  public static final ConfigOption<String> HOODIE_TABLE_PRECOMBINE_FIELD = ConfigOption
+  .key("hoodie.table.

[GitHub] [hudi] fanaticjo commented on issue #2983: [SUPPORT] Is hoodie.consistency.check.enabled still relevant?

2021-05-28 Thread GitBox


fanaticjo commented on issue #2983:
URL: https://github.com/apache/hudi/issues/2983#issuecomment-850545801


   I think this is still required, as I have seen issues still persisting in EMR 
versions below 5.33.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] fanaticjo commented on issue #2975: [SUPPORT] Read record using index

2021-05-28 Thread GitBox


fanaticjo commented on issue #2975:
URL: https://github.com/apache/hudi/issues/2975#issuecomment-850544149


   Hello @calleo, I am trying to make this a generic release in version 
0.9.0. Until then, you can clone this GitHub repo - 
https://github.com/fanaticjo/HudiJavaCustomUpsert/tree/master/src/main/java/com - 
create a jar out of it, and load it as a dependent jar together with the hudi and 
avro jars for your pyspark job.
   
   In your pyspark code you can add these 2 options:
   "hoodie.update.keys": "admission_date,name",   # columns you want to update
   "hoodie.datasource.write.payload.class": "com.hudiUpsert.hudiCustomUpsert",  # custom jar
   
   Sample piece which is working for me:
   hudi_update_options_with_key = {
       'hoodie.table.name': "test",
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.datasource.write.partitionpath.field': 'dept',
       'hoodie.datasource.write.table.name': "test",
       "hoodie.index.type": "SIMPLE",
       "hoodie.update.keys": "admission_date,name",
       "hoodie.datasource.write.payload.class": "com.hudiUpsert.hudiCustomUpsert",
       'hoodie.datasource.write.operation': 'upsert',
       'hoodie.datasource.write.precombine.field': 'date',
       'hoodie.upsert.shuffle.parallelism': 2,
       'hoodie.insert.shuffle.parallelism': 2
   }
   
   df.write.format("org.apache.hudi").options(**hudi_update_options_with_key).mode("append").save(
       "/Users/biswajit/PycharmProjects/hudicustomupsert/output/")
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #2967: Added blog for Hudi cleaner service

2021-05-28 Thread GitBox


nsivabalan commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-850502987


   Can you build the site locally, take a screenshot, and attach it here? It would 
be nice to review that as well. 
   For example: https://github.com/apache/hudi/pull/2969


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #2993: [HUDI-1929] Make HoodieDeltaStreamer support configure KeyGenerator by type

2021-05-28 Thread GitBox


wangxianghu commented on a change in pull request #2993:
URL: https://github.com/apache/hudi/pull/2993#discussion_r641622004



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -72,8 +71,11 @@
   public static final String PRECOMBINE_FIELD_PROP = 
"hoodie.datasource.write.precombine.field";
   public static final String WRITE_PAYLOAD_CLASS = 
"hoodie.datasource.write.payload.class";
   public static final String DEFAULT_WRITE_PAYLOAD_CLASS = 
OverwriteWithLatestAvroPayload.class.getName();
+
   public static final String KEYGENERATOR_CLASS_PROP = 
"hoodie.datasource.write.keygenerator.class";
-  public static final String DEFAULT_KEYGENERATOR_CLASS = 
SimpleAvroKeyGenerator.class.getName();
+  public static final String KEYGENERATOR_TYPE_PROP = 
"hoodie.datasource.write.keygenerator.type";
+  public static final String DEFAULT_KEYGENERATOR_TYPE = "simple";

Review comment:
   > One question, if a user configured both `KEYGENERATOR_CLASS_PROP ` and 
`KEYGENERATOR_TYPE_PROP `, which one is the higher priority.
   
   `KEYGENERATOR_CLASS_PROP` has the higher priority. 
   `KEYGENERATOR_CLASS_PROP` is more complex to configure, so I assume the user prefers the 
type way; if they still configured `KEYGENERATOR_CLASS_PROP`, that means the 
user really wants it.
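   For illustration, a minimal sketch of that priority rule (my simplified assumption, not the actual Hudi code):
   ```java
   import java.util.Properties;
   
   // Simplified sketch of the resolution order described above: an explicitly
   // configured key generator class always wins over the type-based shortcut.
   class KeyGeneratorResolver {
     static final String KEYGENERATOR_CLASS_PROP = "hoodie.datasource.write.keygenerator.class";
     static final String KEYGENERATOR_TYPE_PROP = "hoodie.datasource.write.keygenerator.type";
     static final String DEFAULT_KEYGENERATOR_TYPE = "simple";
   
     static String resolveKeyGeneratorClass(Properties props) {
       String clazz = props.getProperty(KEYGENERATOR_CLASS_PROP);
       if (clazz != null && !clazz.isEmpty()) {
         return clazz; // KEYGENERATOR_CLASS_PROP has the higher priority
       }
       String type = props.getProperty(KEYGENERATOR_TYPE_PROP, DEFAULT_KEYGENERATOR_TYPE);
       // only the default "simple" type is handled in this sketch
       if ("simple".equalsIgnoreCase(type)) {
         return "org.apache.hudi.keygen.SimpleAvroKeyGenerator";
       }
       throw new IllegalArgumentException("Unsupported key generator type: " + type);
     }
   }
   ```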




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #2993: [HUDI-1929] Make HoodieDeltaStreamer support configure KeyGenerator by type

2021-05-28 Thread GitBox


wangxianghu commented on a change in pull request #2993:
URL: https://github.com/apache/hudi/pull/2993#discussion_r641622004



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
##
@@ -72,8 +71,11 @@
   public static final String PRECOMBINE_FIELD_PROP = 
"hoodie.datasource.write.precombine.field";
   public static final String WRITE_PAYLOAD_CLASS = 
"hoodie.datasource.write.payload.class";
   public static final String DEFAULT_WRITE_PAYLOAD_CLASS = 
OverwriteWithLatestAvroPayload.class.getName();
+
   public static final String KEYGENERATOR_CLASS_PROP = 
"hoodie.datasource.write.keygenerator.class";
-  public static final String DEFAULT_KEYGENERATOR_CLASS = 
SimpleAvroKeyGenerator.class.getName();
+  public static final String KEYGENERATOR_TYPE_PROP = 
"hoodie.datasource.write.keygenerator.type";
+  public static final String DEFAULT_KEYGENERATOR_TYPE = "simple";

Review comment:
   > One question, if a user configured both `KEYGENERATOR_CLASS_PROP ` and 
`KEYGENERATOR_TYPE_PROP `, which one is the higher priority.
   
   `KEYGENERATOR_CLASS_PROP`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-1946) Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all the columns

2021-05-28 Thread Xianghu Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghu Wang closed HUDI-1946.
--
Resolution: Not A Problem

> Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all 
> the columns
> 
>
> Key: HUDI-1946
> URL: https://issues.apache.org/jira/browse/HUDI-1946
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When the user wants to derive one or more columns from the existing columns, 
> and all the existing columns are needed, the user has to spell out all the 
> columns they need in the SQL.
> This is very troublesome and time-consuming when we have dozens or 
> hundreds of columns. We can save that trouble by using a wildcard in the SQL to 
> represent all the columns.
> That means if we already have the four columns "id", "name", "age", "ts" 
> and we want to add a new column dt derived from ts, we can use
> "select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"
> to represent
> "select id, name, age, ts, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from 
> table_name"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] wangxianghu closed pull request #3007: [HUDI-1946] Enhance SqlQueryBasedTransformer to allow user use wildca…

2021-05-28 Thread GitBox


wangxianghu closed pull request #3007:
URL: https://github.com/apache/hudi/pull/3007


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter commented on pull request #3007: [HUDI-1946] Enhance SqlQueryBasedTransformer to allow user use wildca…

2021-05-28 Thread GitBox


codecov-commenter commented on pull request #3007:
URL: https://github.com/apache/hudi/pull/3007#issuecomment-850460794


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/3007?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#3007](https://codecov.io/gh/apache/hudi/pull/3007?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (0f4aba1) into 
[master](https://codecov.io/gh/apache/hudi/commit/bc18c39835d6775b063ae072aea4ba43177d66b1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (bc18c39) will **decrease** coverage by `45.77%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/3007/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/3007?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #3007        +/-   ##
   =============================================
   - Coverage     55.00%    9.23%     -45.78%
   + Complexity     3850       48       -3802
   =============================================
     Files           485       54        -431
     Lines         23467     2025      -21442
     Branches       2497      243       -2254
   =============================================
   - Hits          12909      187      -12722
   + Misses         9406     1825       -7581
   + Partials       1152       13       -1139
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | hudicli | `?` | |
   | hudiclient | `?` | |
   | hudicommon | `?` | |
   | hudiflink | `?` | |
   | hudihadoopmr | `?` | |
   | hudisparkdatasource | `?` | |
   | hudisync | `?` | |
   | huditimelineservice | `?` | |
   | hudiutilities | `9.23% <0.00%> (-61.60%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/3007?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[.../utilities/transform/SqlQueryBasedTransformer.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3RyYW5zZm9ybS9TcWxRdWVyeUJhc2VkVHJhbnNmb3JtZXIuamF2YQ==)
 | `0.00% <0.00%> (-75.00%)` | :arrow_down: |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/3007/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi

[GitHub] [hudi] leesf merged pull request #3004: [HUDI-1940] Add SqlQueryBasedTransformer unit test

2021-05-28 Thread GitBox


leesf merged pull request #3004:
URL: https://github.com/apache/hudi/pull/3004


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-1940] Add SqlQueryBasedTransformer unit test (#3004)

2021-05-28 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 974b476  [HUDI-1940] Add SqlQueryBasedTransformer unit test (#3004)
974b476 is described below

commit 974b476180e61fac58cd87e78699428d4108a482
Author: wangxianghu 
AuthorDate: Fri May 28 22:30:30 2021 +0800

[HUDI-1940] Add SqlQueryBasedTransformer unit test (#3004)
---
 .../transform/TestSqlQueryBasedTransformer.java| 94 ++
 1 file changed, 94 insertions(+)

diff --git 
a/hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java
 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java
new file mode 100644
index 000..b6fdc25
--- /dev/null
+++ 
b/hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.transform;
+
+import org.apache.hudi.common.config.TypedProperties;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.Test;
+
+import java.util.Collections;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+
+public class TestSqlQueryBasedTransformer {
+
+  @Test
+  public void testSqlQuery() {
+
+SparkSession spark = SparkSession
+.builder()
+.master("local[2]")
+.appName(TestSqlQueryBasedTransformer.class.getName())
+.getOrCreate();
+
+JavaSparkContext jsc = 
JavaSparkContext.fromSparkContext(spark.sparkContext());
+
+// prepare test data
+String testData = "{\n"
++ "  \"ts\": 1622126968000,\n"
++ "  \"uuid\": \"c978e157-72ee-4819-8f04-8e46e1bb357a\",\n"
++ "  \"rider\": \"rider-213\",\n"
++ "  \"driver\": \"driver-213\",\n"
++ "  \"begin_lat\": 0.4726905879569653,\n"
++ "  \"begin_lon\": 0.46157858450465483,\n"
++ "  \"end_lat\": 0.754803407008858,\n"
++ "  \"end_lon\": 0.9671159942018241,\n"
++ "  \"fare\": 34.158284716382845,\n"
++ "  \"partitionpath\": \"americas/brazil/sao_paulo\"\n"
++ "}";
+
+JavaRDD<String> testRdd = jsc.parallelize(Collections.singletonList(testData), 2);
+Dataset<Row> ds = spark.read().json(testRdd);
+
+// create a new column dt, whose value is transformed from ts, format is yyyyMMdd
+String transSql = "select\n"
++ "\tuuid,\n"
++ "\tbegin_lat,\n"
++ "\tbegin_lon,\n"
++ "\tdriver,\n"
++ "\tend_lat,\n"
++ "\tend_lon,\n"
++ "\tfare,\n"
++ "\tpartitionpath,\n"
++ "\trider,\n"
++ "\tts,\n"
++ "\tFROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt\n"
++ "from\n"
++ "\t";
+TypedProperties props = new TypedProperties();
+props.put("hoodie.deltastreamer.transformer.sql", transSql);
+
+// transform
+SqlQueryBasedTransformer transformer = new SqlQueryBasedTransformer();
+Dataset<Row> result = transformer.apply(jsc, spark, ds, props);
+
+// check result
+assertEquals(11, result.columns().length);
+assertNotNull(result.col("dt"));
+assertEquals("20210527", result.first().get(10).toString());
+
+spark.close();
+  }
+}


[GitHub] [hudi] wangxianghu commented on pull request #3007: [HUDI-1946] Enhance SqlQueryBasedTransformer to allow user use wildca…

2021-05-28 Thread GitBox


wangxianghu commented on pull request #3007:
URL: https://github.com/apache/hudi/pull/3007#issuecomment-850456430


   Will add the unit test after https://github.com/apache/hudi/pull/3004 is merged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1946) Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all the columns

2021-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1946:
-
Labels: pull-request-available  (was: )

> Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all 
> the columns
> 
>
> Key: HUDI-1946
> URL: https://issues.apache.org/jira/browse/HUDI-1946
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Xianghu Wang
>Assignee: Xianghu Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> When the user wants to derive one or more columns from the existing columns, 
> and all the existing columns are needed, the user has to spell out all the 
> columns they need in the SQL.
> This is very troublesome and time-consuming when we have dozens or 
> hundreds of columns. We can save that trouble by using a wildcard in the SQL to 
> represent all the columns.
> That means if we already have the four columns "id", "name", "age", "ts" 
> and we want to add a new column dt derived from ts, we can use
> "select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"
> to represent
> "select id, name, age, ts, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from 
> table_name"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] wangxianghu opened a new pull request #3007: [HUDI-1946] Enhance SqlQueryBasedTransformer to allow user use wildca…

2021-05-28 Thread GitBox


wangxianghu opened a new pull request #3007:
URL: https://github.com/apache/hudi/pull/3007


   …rd to represent all the columns
   
   ## What is the purpose of the pull request
   
   When the user wants to derive one or more columns from the existing columns, 
and all the existing columns are needed, the user has to spell out all the 
columns they need in the SQL.
   
   This is very troublesome and time-consuming when we have dozens or 
hundreds of columns. We can save that trouble by using a wildcard in the SQL to 
represent all the columns.
   
   That means if we already have the four columns "id", "name", "age", "ts" 
and we want to add a new column dt derived from ts, we can use
   
   "select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"
   
   to represent
   
   "select id, name, age, ts, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from 
table_name"
   
   ## Brief change log
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1946) Enhance SqlQueryBasedTransformer to allow user use wildcard to represent all the columns

2021-05-28 Thread Xianghu Wang (Jira)
Xianghu Wang created HUDI-1946:
--

 Summary: Enhance SqlQueryBasedTransformer to allow user use 
wildcard to represent all the columns
 Key: HUDI-1946
 URL: https://issues.apache.org/jira/browse/HUDI-1946
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Xianghu Wang
Assignee: Xianghu Wang
 Fix For: 0.9.0


When the user wants to derive one or more columns from the existing columns, and 
all the existing columns are needed, the user has to spell out all the columns they 
need in the SQL.

This is very troublesome and time-consuming when we have dozens or 
hundreds of columns. We can save that trouble by using a wildcard in the SQL to 
represent all the columns.

That means if we already have the four columns "id", "name", "age", "ts" and 
we want to add a new column dt derived from ts, we can use

"select *, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from table_name"

to represent

"select id, name, age, ts, FROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt from 
table_name"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] wangxianghu commented on a change in pull request #3004: [HUDI-1940] Add SqlQueryBasedTransformer unit test

2021-05-28 Thread GitBox


wangxianghu commented on a change in pull request #3004:
URL: https://github.com/apache/hudi/pull/3004#discussion_r641543572



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.transform;
+
+import org.apache.hudi.common.config.TypedProperties;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.Test;
+
+import java.util.Collections;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+
+public class TestSqlQueryBasedTransformer {
+
+  @Test
+  public void testSqlQuery() {
+
+SparkSession spark = SparkSession
+.builder()
+.master("local[2]")
+.appName(TestSqlQueryBasedTransformer.class.getName())
+.getOrCreate();
+
+JavaSparkContext jsc = 
JavaSparkContext.fromSparkContext(spark.sparkContext());
+
+// prepare test data
+String testData = "{\n"
++ "  \"ts\": 1622126968000,\n"
++ "  \"uuid\": \"c978e157-72ee-4819-8f04-8e46e1bb357a\",\n"
++ "  \"rider\": \"rider-213\",\n"
++ "  \"driver\": \"driver-213\",\n"
++ "  \"begin_lat\": 0.4726905879569653,\n"
++ "  \"begin_lon\": 0.46157858450465483,\n"
++ "  \"end_lat\": 0.754803407008858,\n"
++ "  \"end_lon\": 0.9671159942018241,\n"
++ "  \"fare\": 34.158284716382845,\n"
++ "  \"partitionpath\": \"americas/brazil/sao_paulo\"\n"
++ "}";
+
+JavaRDD<String> testRdd = jsc.parallelize(Collections.singletonList(testData), 2);
+Dataset<Row> ds = spark.read().json(testRdd);
+
+// create a new column dt, whose value is transformed from ts, format is yyyyMMdd
+String transSql = "select\n"
++ "\tuuid,\n"
++ "\tbegin_lat,\n"
++ "\tbegin_lon,\n"
++ "\tdriver,\n"
++ "\tend_lat,\n"
++ "\tend_lon,\n"
++ "\tfare,\n"
++ "\tpartitionpath,\n"
++ "\trider,\n"
++ "\tts,\n"
++ "\tFROM_UNIXTIME(ts / 1000, 'yyyyMMdd') as dt\n"
++ "from\n"
++ "\t";
+TypedProperties props = new TypedProperties();
+props.put("hoodie.deltastreamer.transformer.sql", transSql);
+
+// transform
+SqlQueryBasedTransformer transformer = new SqlQueryBasedTransformer();
+Dataset<Row> result = transformer.apply(jsc, spark, ds, props);
+
+// check result
+assertEquals(result.columns().length, 11);

Review comment:
   > pls use assertEquals(expected, result.columns), so please change to 
`assertEquals(11, result.columns().length)`
   
   @leesf thanks, done
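   For reference, a tiny self-contained sketch of the argument order being discussed (illustrative names only, not the test above):
   ```java
   import static org.junit.jupiter.api.Assertions.assertEquals;
   
   import org.junit.jupiter.api.Test;
   
   class AssertOrderExample {
     @Test
     void expectedValueComesFirst() {
       int[] columns = new int[11];       // stand-in for result.columns()
       assertEquals(11, columns.length);  // expected value first, actual value second
     }
   }
   ```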




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-1910) Supporting Kafka based checkpointing for HoodieDeltaStreamer

2021-05-28 Thread Vinay (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17353335#comment-17353335
 ] 

Vinay commented on HUDI-1910:
-

[~nishith29] can you please delete one of the sub-tasks? It got added twice. I added 
it as a sub-task since it is related to the consumer group offset.

> Supporting Kafka based checkpointing for HoodieDeltaStreamer
> 
>
> Key: HUDI-1910
> URL: https://issues.apache.org/jira/browse/HUDI-1910
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Nishith Agarwal
>Assignee: Vinay
>Priority: Major
>  Labels: sev:normal, triaged
>
> HoodieDeltaStreamer currently supports commit metadata based checkpoint. Some 
> users have requested support for Kafka based checkpoints for freshness 
> auditing purposes. This ticket tracks any implementation for that. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1945) Support Hudi to read from Kafka Consumer Group Offset

2021-05-28 Thread Vinay (Jira)
Vinay created HUDI-1945:
---

 Summary: Support Hudi to read from Kafka Consumer Group Offset
 Key: HUDI-1945
 URL: https://issues.apache.org/jira/browse/HUDI-1945
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: DeltaStreamer
Reporter: Vinay
Assignee: Vinay


Currently, Hudi provides options to read from the latest or the earliest offset. We should 
also provide users an option to read from the consumer group offset.

This change will be in `KafkaOffsetGen`, where we can add a method to support 
this functionality.
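
For illustration, a rough sketch (my assumption of how such a helper could look, using the plain kafka-clients 2.4+ API; this is not the actual `KafkaOffsetGen` change):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch: read the committed offset of a consumer group for one partition,
// falling back to the earliest offset when the group has committed nothing.
class GroupOffsetExample {
  static long committedOffsetOrEarliest(String bootstrapServers, String groupId,
                                        String topic, int partition) {
    Properties props = new Properties();
    props.put("bootstrap.servers", bootstrapServers);
    props.put("group.id", groupId);
    props.put("key.deserializer", StringDeserializer.class.getName());
    props.put("value.deserializer", StringDeserializer.class.getName());
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition(topic, partition);
      Map<TopicPartition, OffsetAndMetadata> committed =
          consumer.committed(Collections.singleton(tp));
      OffsetAndMetadata meta = committed.get(tp);
      return meta != null ? meta.offset()
          : consumer.beginningOffsets(Collections.singleton(tp)).get(tp);
    }
  }
}
```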



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1945) Support Hudi to read from Kafka Consumer Group Offset

2021-05-28 Thread Vinay (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinay updated HUDI-1945:

Status: New  (was: Open)

> Support Hudi to read from Kafka Consumer Group Offset
> -
>
> Key: HUDI-1945
> URL: https://issues.apache.org/jira/browse/HUDI-1945
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer
>Reporter: Vinay
>Assignee: Vinay
>Priority: Major
>
> Currently, Hudi provides options to read from the latest or the earliest offset. We should 
> also provide users an option to read from the consumer group offset.
> This change will be in `KafkaOffsetGen`, where we can add a method to support 
> this functionality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1944) Support Hudi to read from Kafka Consumer Group Offset

2021-05-28 Thread Vinay (Jira)
Vinay created HUDI-1944:
---

 Summary: Support Hudi to read from Kafka Consumer Group Offset
 Key: HUDI-1944
 URL: https://issues.apache.org/jira/browse/HUDI-1944
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: DeltaStreamer
Reporter: Vinay
Assignee: Vinay


Currently, Hudi provides options to read from the latest or the earliest offset. We should 
also provide users an option to read from the consumer group offset.

This change will be in `KafkaOffsetGen`, where we can add a method to support 
this functionality.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] veenaypatil commented on pull request #2998: [HUDI-1426] Rename classname in camel case format

2021-05-28 Thread GitBox


veenaypatil commented on pull request #2998:
URL: https://github.com/apache/hudi/pull/2998#issuecomment-850393465


   @leesf  I agree, this will be a breaking change once users upgrade to the new 
version. If we don't want to make this change, I think we should at least update 
the documentation to the correct class name to avoid confusion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #3006: [HUDI-1943]fix lose properties problem

2021-05-28 Thread GitBox


codecov-commenter edited a comment on pull request #3006:
URL: https://github.com/apache/hudi/pull/3006#issuecomment-850365971


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#3006](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (f03396d) into 
[master](https://codecov.io/gh/apache/hudi/commit/bc18c39835d6775b063ae072aea4ba43177d66b1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (bc18c39) will **increase** coverage by `15.82%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/3006/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #3006        +/-   ##
   ==============================================
   + Coverage     55.00%   70.83%     +15.82%
   + Complexity     3850      385       -3465
   ==============================================
     Files           485       54        -431
     Lines         23467     2016      -21451
     Branches       2497      241       -2256
   ==============================================
   - Hits          12909     1428      -11481
   + Misses         9406      454       -8952
   + Partials       1152      134       -1018
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | hudicli | `?` | |
   | hudiclient | `∅ <ø> (∅)` | |
   | hudicommon | `?` | |
   | hudiflink | `?` | |
   | hudihadoopmr | `?` | |
   | hudisparkdatasource | `?` | |
   | hudisync | `?` | |
   | huditimelineservice | `?` | |
   | hudiutilities | `70.83% <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[...in/java/org/apache/hudi/cli/HoodiePrintHelper.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL0hvb2RpZVByaW50SGVscGVyLmphdmE=)
 | | |
   | 
[...in/java/org/apache/hudi/schema/SchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zY2hlbWEvU2NoZW1hUHJvdmlkZXIuamF2YQ==)
 | | |
   | 
[...g/apache/hudi/common/model/WriteOperationType.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL1dyaXRlT3BlcmF0aW9uVHlwZS5qYXZh)
 | | |
   | 
[...able/timeline/versioning/AbstractMigratorBase.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL3ZlcnNpb25pbmcvQWJzdHJhY3RNaWdyYXRvckJhc2UuamF2YQ==)
 | | |
   | 
[...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh)
 | | |
   | 
[...udi/timeline/service/handlers/BaseFileHandler.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9o

[GitHub] [hudi] codecov-commenter edited a comment on pull request #3006: [HUDI-1943]fix lose properties problem

2021-05-28 Thread GitBox


codecov-commenter edited a comment on pull request #3006:
URL: https://github.com/apache/hudi/pull/3006#issuecomment-850365971


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#3006](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (f03396d) into 
[master](https://codecov.io/gh/apache/hudi/commit/bc18c39835d6775b063ae072aea4ba43177d66b1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (bc18c39) will **increase** coverage by `15.82%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/3006/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #3006        +/-   ##
   ==============================================
   + Coverage     55.00%   70.83%     +15.82%
   + Complexity     3850      385       -3465
   ==============================================
     Files           485       54        -431
     Lines         23467     2016      -21451
     Branches       2497      241       -2256
   ==============================================
   - Hits          12909     1428      -11481
   + Misses         9406      454       -8952
   + Partials       1152      134       -1018
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | hudicli | `?` | |
   | hudiclient | `?` | |
   | hudicommon | `?` | |
   | hudiflink | `?` | |
   | hudihadoopmr | `?` | |
   | hudisparkdatasource | `?` | |
   | hudisync | `?` | |
   | huditimelineservice | `?` | |
   | hudiutilities | `70.83% <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[...ava/org/apache/hudi/common/util/BaseFileUtils.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQmFzZUZpbGVVdGlscy5qYXZh)
 | | |
   | 
[...main/java/org/apache/hudi/hive/HiveSyncConfig.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zeW5jL2h1ZGktaGl2ZS1zeW5jL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL2hpdmUvSGl2ZVN5bmNDb25maWcuamF2YQ==)
 | | |
   | 
[.../common/bloom/HoodieDynamicBoundedBloomFilter.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2Jsb29tL0hvb2RpZUR5bmFtaWNCb3VuZGVkQmxvb21GaWx0ZXIuamF2YQ==)
 | | |
   | 
[...e/hudi/exception/HoodieDeltaStreamerException.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1zcGFyay1kYXRhc291cmNlL2h1ZGktc3Bhcmsvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZURlbHRhU3RyZWFtZXJFeGNlcHRpb24uamF2YQ==)
 | | |
   | 
[.../apache/hudi/common/model/HoodieRecordPayload.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVJlY29yZFBheWxvYWQuamF2YQ==)
 | | |
   | 
[...che/hudi/metadata/TimelineMergedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS1j

[GitHub] [hudi] codecov-commenter commented on pull request #3006: [HUDI-1943]fix lose properties problem

2021-05-28 Thread GitBox


codecov-commenter commented on pull request #3006:
URL: https://github.com/apache/hudi/pull/3006#issuecomment-850365971


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#3006](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (f03396d) into 
[master](https://codecov.io/gh/apache/hudi/commit/bc18c39835d6775b063ae072aea4ba43177d66b1?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (bc18c39) will **decrease** coverage by `45.73%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/3006/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@ Coverage Diff  @@
   ## master   #3006   +/-   ##
   
   - Coverage 55.00%   9.27%   -45.74% 
   + Complexity 3850  48 -3802 
   
 Files   485  54  -431 
  Lines  23467  2016  -21451 
 Branches   2497 241 -2256 
   
   - Hits  12909   187  -12722 
   + Misses  9406  1816   -7590 
   + Partials   1152  13 -1139 
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | hudicli | `?` | |
   | hudiclient | `?` | |
   | hudicommon | `?` | |
   | hudiflink | `?` | |
   | hudihadoopmr | `?` | |
   | hudisparkdatasource | `?` | |
   | hudisync | `?` | |
   | huditimelineservice | `?` | |
   | hudiutilities | `9.27% <ø> (-61.56%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/3006?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/3006/diff?src=pr&el=tre

[GitHub] [hudi] codecov-commenter edited a comment on pull request #2923: [HUDI-1864] Added support for Date, Timestamp, LocalDate and LocalDateTime in TimestampBasedAvroKeyGenerator

2021-05-28 Thread GitBox


codecov-commenter edited a comment on pull request #2923:
URL: https://github.com/apache/hudi/pull/2923#issuecomment-846613183


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2923?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2923](https://codecov.io/gh/apache/hudi/pull/2923?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (1c8fe55) into 
[master](https://codecov.io/gh/apache/hudi/commit/112732db815660dfdf4b62aa443127d7884b6e63?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (112732d) will **decrease** coverage by `45.69%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2923/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2923?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@ Coverage Diff  @@
   ## master   #2923   +/-   ##
   
   - Coverage 54.97%   9.27%   -45.70% 
   + Complexity 3845  48 -3797 
   
 Files   485  54  -431 
  Lines  23436  2016  -21420 
 Branches   2494 241 -2253 
   
    - Hits  12884   187  -12697 
    + Misses  9400  1816   -7584 
   + Partials   1152  13 -1139 
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | hudicli | `?` | |
   | hudiclient | `∅ <ø> (∅)` | |
   | hudicommon | `?` | |
   | hudiflink | `?` | |
   | hudihadoopmr | `?` | |
   | hudisparkdatasource | `?` | |
   | hudisync | `?` | |
   | huditimelineservice | `?` | |
   | hudiutilities | `9.27% <ø> (-61.61%)` | :arrow_down: |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2923?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2923/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | :arrow_down: |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2923/dif

[GitHub] [hudi] leesf commented on a change in pull request #3004: [HUDI-1940] Add SqlQueryBasedTransformer unit test

2021-05-28 Thread GitBox


leesf commented on a change in pull request #3004:
URL: https://github.com/apache/hudi/pull/3004#discussion_r641474949



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/transform/TestSqlQueryBasedTransformer.java
##
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.transform;
+
+import org.apache.hudi.common.config.TypedProperties;
+
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.Test;
+
+import java.util.Collections;
+
+import static org.junit.jupiter.api.Assertions.assertEquals;
+import static org.junit.jupiter.api.Assertions.assertNotNull;
+
+public class TestSqlQueryBasedTransformer {
+
+  @Test
+  public void testSqlQuery() {
+
+SparkSession spark = SparkSession
+.builder()
+.master("local[2]")
+.appName(TestSqlQueryBasedTransformer.class.getName())
+.getOrCreate();
+
+JavaSparkContext jsc = 
JavaSparkContext.fromSparkContext(spark.sparkContext());
+
+// prepare test data
+String testData = "{\n"
++ "  \"ts\": 1622126968000,\n"
++ "  \"uuid\": \"c978e157-72ee-4819-8f04-8e46e1bb357a\",\n"
++ "  \"rider\": \"rider-213\",\n"
++ "  \"driver\": \"driver-213\",\n"
++ "  \"begin_lat\": 0.4726905879569653,\n"
++ "  \"begin_lon\": 0.46157858450465483,\n"
++ "  \"end_lat\": 0.754803407008858,\n"
++ "  \"end_lon\": 0.9671159942018241,\n"
++ "  \"fare\": 34.158284716382845,\n"
++ "  \"partitionpath\": \"americas/brazil/sao_paulo\"\n"
++ "}";
+
+JavaRDD<String> testRdd = 
jsc.parallelize(Collections.singletonList(testData), 2);
+Dataset<Row> ds = spark.read().json(testRdd);
+
+// create a new column dt, whose value is transformed from ts, format is 
MMdd
+String transSql = "select\n"
++ "\tuuid,\n"
++ "\tbegin_lat,\n"
++ "\tbegin_lon,\n"
++ "\tdriver,\n"
++ "\tend_lat,\n"
++ "\tend_lon,\n"
++ "\tfare,\n"
++ "\tpartitionpath,\n"
++ "\trider,\n"
++ "\tts,\n"
++ "\tFROM_UNIXTIME(ts / 1000, 'MMdd') as dt\n"
++ "from\n"
++ "\t<SRC>";
+TypedProperties props = new TypedProperties();
+props.put("hoodie.deltastreamer.transformer.sql", transSql);
+
+// transform
+SqlQueryBasedTransformer transformer = new SqlQueryBasedTransformer();
+Dataset<Row> result = transformer.apply(jsc, spark, ds, props);
+
+// check result
+assertEquals(result.columns().length, 11);

Review comment:
   Please use assertEquals(expected, actual), so change this to 
`assertEquals(11, result.columns().length)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #2998: [HUDI-1426] Rename classname in camel case format

2021-05-28 Thread GitBox


leesf commented on pull request #2998:
URL: https://github.com/apache/hudi/pull/2998#issuecomment-850349161


   @veenaypatil Thanks for your contribution. Changing the classname would cause 
compatibility issues, since many users already use `NonpartitionedKeyGenerator` 
in their codebase.
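
   For example (an illustrative config only, not taken from any particular user), 
jobs typically pin the key generator by its fully qualified class name, which is 
exactly what a rename would break:

   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   // Illustrative only: existing jobs commonly reference the key generator by its
   // fully qualified class name, so renaming the class breaks configs like this one.
   public class KeyGeneratorConfigExample {

     public static void write(Dataset<Row> df, String basePath) {
       df.write()
           .format("org.apache.hudi")
           .option("hoodie.table.name", "example_table")
           .option("hoodie.datasource.write.recordkey.field", "uuid")
           .option("hoodie.datasource.write.keygenerator.class",
               "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```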


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1943) [hudi-flink]Lose properties when hoodieWriteConfig initializtion

2021-05-28 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1943:
-
Labels: pull-request-available  (was: )

> [hudi-flink]Lose properties when hoodieWriteConfig initializtion
> 
>
> Key: HUDI-1943
> URL: https://issues.apache.org/jira/browse/HUDI-1943
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Flink Integration
>Reporter: hk__lrzy
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hk-lrzy opened a new pull request #3006: [HUDI-1943]fix lose properties problem

2021-05-28 Thread GitBox


hk-lrzy opened a new pull request #3006:
URL: https://github.com/apache/hudi/pull/3006


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1943) [hudi-flink]Lose properties when hoodieWriteConfig initializtion

2021-05-28 Thread hk__lrzy (Jira)
hk__lrzy created HUDI-1943:
--

 Summary: [hudi-flink]Lose properties when hoodieWriteConfig 
initializtion
 Key: HUDI-1943
 URL: https://issues.apache.org/jira/browse/HUDI-1943
 Project: Apache Hudi
  Issue Type: Bug
  Components: Flink Integration
Reporter: hk__lrzy






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] t0il3ts0ap commented on issue #2934: [SUPPORT] Parquet file does not exist when trying to read hudi table incrementally

2021-05-28 Thread GitBox


t0il3ts0ap commented on issue #2934:
URL: https://github.com/apache/hudi/issues/2934#issuecomment-850312591


   @n3nash Sorry for the late reply, I was away on vacation.
   1. We run one instance of the deltastreamer job every 2 hrs on the source 
table. Each run sources at most 6 million records using the --source-limit 
parameter.
   2. ~12 runs of deltastreamer. I am not sure whether that results in 12 commits. 
(Please share some knowledge or documentation here; I apologize that I don't 
know this.)
   3. Please assume defaults for this. I have not modified the cleaner policy for 
deltastreamer.
   4. This is failing in the first run itself. I checked manually; there are no 
files for the oldest commits.
   
   A sample case.
   Let's say I open the .hoodie directory for a COW table with no partition:
   https://user-images.githubusercontent.com/8509512/119967062-9be8f780-bfc9-11eb-9bca-87cb8fc4fd07.png
   Now, I open the commit file `20210524175014.commit` and find the following data:
   ```
   {
 "partitionToWriteStats" : {
   "default" : [ {
 "fileId" : "890de7c3-7f0d-4586-8e66-83a9f8d9c106-0",
 "path" : 
"default/890de7c3-7f0d-4586-8e66-83a9f8d9c106-0_0-23-13521_20210524175014.parquet",
 "prevCommit" : "20210524160842",
 "numWrites" : 325368,
 "numDeletes" : 0,
 "numUpdateWrites" : 0,
 "numInserts" : 115,
 "totalWriteBytes" : 32817083,
 "totalWriteErrors" : 0,
 "tempPath" : null,
 "partitionPath" : "default",
 "totalLogRecords" : 0,
 "totalLogFilesCompacted" : 0,
 "totalLogSizeCompacted" : 0,
 "totalUpdatedRecordsCompacted" : 0,
 "totalLogBlocks" : 0,
 "totalCorruptLogBlock" : 0,
 "totalRollbackBlocks" : 0,
 "fileSizeInBytes" : 32817083
   } ]
 },
 "compacted" : false,
 "extraMetadata" : {
   "schema" : 
"{\"type\":\"record\",\"name\":\"hoodie_source\",\"namespace\":\"hoodie.source\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"created_at\",\"type\":[\"string\",\"null\"]},{\"name\":\"created_by\",\"type\":[\"string\",\"null\"]},{\"name\":\"last_modified_by\",\"type\":[\"string\",\"null\"]},{\"name\":\"version\",\"type\":[\"long\",\"null\"]},{\"name\":\"address_reference_id\",\"type\":[\"string\",\"null\"]},{\"name\":\"city\",\"type\":[\"string\",\"null\"]},{\"name\":\"current\",\"type\":[\"boolean\",\"null\"]},{\"name\":\"line_one\",\"type\":[\"string\",\"null\"]},{\"name\":\"line_two\",\"type\":[\"string\",\"null\"]},{\"name\":\"permanent\",\"type\":[\"boolean\",\"null\"]},{\"name\":\"pin_code\",\"type\":[\"string\",\"null\"]},{\"name\":\"source\",\"type\":[\"string\",\"null\"]},{\"name\":\"state\",\"type\":[\"string\",\"null\"]},{\"name\":\"type\",\"type\":[\"string\",\"null\"]},{\"name\":\"email_address\",\"type\":[\"string\",\"null\"]},{\"name\":\
 
"collection_case_id\",\"type\":[\"long\",\"null\"]},{\"name\":\"updated_date\",\"type\":[\"string\",\"null\"]},{\"name\":\"updated_at\",\"type\":[\"string\",\"null\"]},{\"name\":\"latitude\",\"type\":[\"double\",\"null\"]},{\"name\":\"longitude\",\"type\":[\"double\",\"null\"]},{\"name\":\"__lsn\",\"type\":[\"long\",\"null\"]},{\"name\":\"_hoodie_is_deleted\",\"type\":[\"boolean\",\"null\"]}]}",
   "deltastreamer.checkpoint.key" : 
"collection_service.public.addresses,0:1066373,1:1064984,2:1065562,3:1070617,4:1067374,5:1066110"
 },
 "operationType" : "UPSERT",
 "fileIdAndRelativePaths" : {
   "890de7c3-7f0d-4586-8e66-83a9f8d9c106-0" : 
"default/890de7c3-7f0d-4586-8e66-83a9f8d9c106-0_0-23-13521_20210524175014.parquet"
 },
 "totalRecordsDeleted" : 0,
 "totalLogRecordsCompacted" : 0,
 "totalLogFilesCompacted" : 0,
 "totalCompactedRecordsUpdated" : 0,
 "totalLogFilesSize" : 0,
 "totalScanTime" : 0,
 "totalCreateTime" : 0,
 "totalUpsertTime" : 5089
   }
   ```
   
   Now, this file 
`default/890de7c3-7f0d-4586-8e66-83a9f8d9c106-0_0-23-13521_20210524175014.parquet`
 mentioned in the commit is missing. This error is raised only while doing an 
[incremental query](https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HoodieIncrSource.java).
   
   But if I just search by the prefix `890de7c3-7f0d-4586-8e66-83a9f8d9c106`, there 
are many parquet files that I can see.
   
   
   Right now, I have overridden `HoodieIncrSource.java` when running 
deltastreamer for the target table to make it work.
   I added some logic like below:
   If no checkpoint is found (basically this is the first run), do a snapshot query 
and save the last commit as the checkpoint.
   If a checkpoint is found, carry on with the incremental query as usual.
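
   Roughly, the added logic looks like the sketch below (illustrative only — the 
class, method, and variable names are placeholders, not the exact code of my 
override):

   ```java
   import org.apache.hudi.common.util.Option;

   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SparkSession;

   // Illustrative sketch: fetch the next batch from a Hudi source table, falling
   // back to a snapshot query when no checkpoint has been saved yet.
   public class SnapshotFallbackSketch {

     public static Dataset<Row> fetchNext(SparkSession spark, String srcPath,
                                          Option<String> lastCheckpoint, String latestCommit) {
       if (!lastCheckpoint.isPresent()) {
         // First run: no checkpoint yet, so read a full snapshot of the source table.
         // The caller then saves latestCommit as the checkpoint for the next run.
         return spark.read()
             .format("org.apache.hudi")
             .load(srcPath + "/*"); // the path glob depends on the table's partitioning
       }
       // Subsequent runs: a normal incremental query bounded by the saved checkpoint
       // and the latest completed commit on the source timeline.
       return spark.read()
           .format("org.apache.hudi")
           .option("hoodie.datasource.query.type", "incremental")
           .option("hoodie.datasource.read.begin.instanttime", lastCheckpoint.get())
           .option("hoodie.datasource.read.end.instanttime", latestCommit)
           .load(srcPath);
     }
   }
   ```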
   
   The benefit is that we are always doing the incremental query on newer 
commits rather than older ones, and the parquet files of the newer commits are 
present. It also solves the problem of taking the snapshot automatically when 
reading from a Hudi table for the first time.
   


-- 
This is an automat

[GitHub] [hudi] codecov-commenter edited a comment on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp

2021-05-28 Thread GitBox


codecov-commenter edited a comment on pull request #2438:
URL: https://github.com/apache/hudi/pull/2438#issuecomment-850284847


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2438](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (739a252) into 
[master](https://codecov.io/gh/apache/hudi/commit/7fed7352bd506e20e5316bb0b3ed9e5c1e9c76df?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (7fed735) will **decrease** coverage by `1.49%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2438/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2438  +/-   ##
   
   - Coverage 54.96%   53.47%   -1.50% 
   + Complexity 3844 3459 -385 
   
 Files   485  431  -54 
  Lines  23437  21421  -2016 
 Branches   2494 2253 -241 
   
    - Hits  12882  11454  -1428 
   + Misses 9401 8947 -454 
   + Partials   1154 1020 -134 
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | hudicli | `39.55% <ø> (ø)` | |
   | hudiclient | `∅ <ø> (∅)` | |
   | hudicommon | `50.29% <ø> (ø)` | |
   | hudiflink | `63.41% <ø> (ø)` | |
   | hudihadoopmr | `51.54% <ø> (ø)` | |
   | hudisparkdatasource | `73.33% <ø> (ø)` | |
   | hudisync | `46.44% <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | |
   | hudiutilities | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[...di/utilities/sources/helpers/IncrSourceHelper.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvaGVscGVycy9JbmNyU291cmNlSGVscGVyLmphdmE=)
 | | |
   | 
[...in/java/org/apache/hudi/utilities/UtilHelpers.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL1V0aWxIZWxwZXJzLmphdmE=)
 | | |
   | 
[...i/utilities/deltastreamer/SourceFormatAdapter.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvU291cmNlRm9ybWF0QWRhcHRlci5qYXZh)
 | | |
   | 
[...s/exception/HoodieIncrementalPullSQLException.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVJbmNyZW1lbnRhbFB1bGxTUUxFeGNlcHRpb24uamF2YQ==)
 | | |
   | 
[...alCheckpointFromAnotherHoodieTimelineProvider.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2NoZWNrcG9pbnRpbmcvSW5pdGlhbENoZWNrcG9pbnRGcm9tQW5vdGhlckhvb2RpZVRpbWVsaW5lUHJvdmlkZXIuamF2YQ==)
 | | |
   | 
[...g/apache/hudi/utilities/schema/SchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&

[GitHub] [hudi] codecov-commenter commented on pull request #2438: [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp

2021-05-28 Thread GitBox


codecov-commenter commented on pull request #2438:
URL: https://github.com/apache/hudi/pull/2438#issuecomment-850284847


   # 
[Codecov](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 Report
   > Merging 
[#2438](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (739a252) into 
[master](https://codecov.io/gh/apache/hudi/commit/7fed7352bd506e20e5316bb0b3ed9e5c1e9c76df?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 (7fed735) will **decrease** coverage by `1.31%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2438/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#2438  +/-   ##
   
   - Coverage 54.96%   53.65%   -1.32% 
   + Complexity 3844 3253 -591 
   
 Files   485  407  -78 
  Lines  23437  19762  -3675 
 Branches   2494 2085 -409 
   
    - Hits  12882  10603  -2279 
    + Misses  9401   8221  -1180 
   + Partials   1154  938 -216 
   ```
   
   | Flag | Coverage Δ | |
   |---|---|---|
   | hudicli | `39.55% <ø> (ø)` | |
   | hudiclient | `∅ <ø> (∅)` | |
   | hudicommon | `50.29% <ø> (ø)` | |
   | hudiflink | `63.41% <ø> (ø)` | |
   | hudihadoopmr | `51.54% <ø> (ø)` | |
   | hudisparkdatasource | `73.33% <ø> (ø)` | |
   | hudisync | `?` | |
   | huditimelineservice | `?` | |
   | hudiutilities | `?` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2438?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
 | Coverage Δ | |
   |---|---|---|
   | 
[...java/org/apache/hudi/utilities/sources/Source.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvU291cmNlLmphdmE=)
 | | |
   | 
[...ies/exception/HoodieSnapshotExporterException.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2V4Y2VwdGlvbi9Ib29kaWVTbmFwc2hvdEV4cG9ydGVyRXhjZXB0aW9uLmphdmE=)
 | | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlcldpdGhQb3N0UHJvY2Vzc29yLmphdmE=)
 | | |
   | 
[.../apache/hudi/timeline/service/TimelineService.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS10aW1lbGluZS1zZXJ2aWNlL3NyYy9tYWluL2phdmEvb3JnL2FwYWNoZS9odWRpL3RpbWVsaW5lL3NlcnZpY2UvVGltZWxpbmVTZXJ2aWNlLmphdmE=)
 | | |
   | 
[...di/utilities/sources/helpers/IncrSourceHelper.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvaGVscGVycy9JbmNyU291cmNlSGVscGVyLmphdmE=)
 | | |
   | 
[.../apache/hudi/hive/MultiPartKeysValueExtractor.java](https://codecov.io/gh/apache/hudi/pull/2438/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_cam

[GitHub] [hudi] veenaypatil commented on pull request #2998: [HUDI-1426] Rename classname in camel case format

2021-05-28 Thread GitBox


veenaypatil commented on pull request #2998:
URL: https://github.com/apache/hudi/pull/2998#issuecomment-850271396


   @yanghua can you please look into this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on pull request #2967: Added blog for Hudi cleaner service

2021-05-28 Thread GitBox


pratyakshsharma commented on pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#issuecomment-850210814


   @n3nash @nsivabalan Please take a look. All the comments are addressed. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

2021-05-28 Thread GitBox


pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r641327528



##
File path: 
docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using 
`HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is 
made possible by Hudi’s MVCC concurrency model. In this blog, we will explain 
how to employ the right configurations to manage multiple file versions. 
Furthermore, we will discuss mechanisms available to users generating Hudi 
tables on how to maintain just the required number of old file versions so that 
long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your 
tables on the data lake. One of these services is called the **Cleaner**. As 
you write more data to your table, for every batch of updates received, Hudi 
can either generate a new version of the data file with updates applied to 
records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding 
rewriting newer version of an existing file (MERGE_ON_READ). In such 
situations, depending on the frequency of your updates, the number of file 
versions of log files can grow indefinitely. If your use-cases do not require 
keeping an infinite history of these versions, it is imperative to have a 
process that reclaims older versions of the data. This is Hudi’s cleaner 
service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and 
writers concurrently accessing the same table. As the Hudi cleaner service 
periodically reclaims older file versions, scenarios arise where a long running 
query might be accessing a file version that is deemed to be reclaimed by the 
cleaner. Here, we need to employ the correct configs to ensure readers (aka 
queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, let's understand the different cleaning 
policies that Hudi offers and the corresponding properties that need to be 
configured. Options are available to schedule cleaning asynchronously or 
synchronously. Before going into more details, we would like to explain a few 
underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after 
compaction. A base file’s name follows the following naming convention: 
`<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this 
file, file id remains the same and commit time gets updated to show the latest 
version. This also implies any particular version of a record, given its 
partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files 
consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the 
partition path and the  file id that the files in this group have as part of 
their name. A file group consists of all the file slices in a particular 
partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal 
cleaning policy that ensures the effect of having lookback into all the changes 
that happened in the last X commits. Suppose a writer is ingesting data into a 
Hudi dataset every 30 minutes and the longest running query can take 5 hours to 
finish; then the user should retain at least the last 10 commits. With such a 
configuration, we ensure that the oldest version of a file is kept on disk for 
at least 5 hours, thereby preventing the longest running query from failing at 
any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the 
effect of keeping N number of file versions irrespective of time. This policy 
is useful when it is known how many MAX versions of the file one wants to 
keep at any given time. To achieve the same behaviour as before of preventing 
long running queries from failing, one should do their calculations based on 
data patterns. Alternatively, this policy is also useful if a user just wants 
to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the 
following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (config

[GitHub] [hudi] hushenmin opened a new issue #3005: [SUPPORT]

2021-05-28 Thread GitBox


hushenmin opened a new issue #3005:
URL: https://github.com/apache/hudi/issues/3005


   How to query a history snapshot given one history partition?
   At present, through the following method I can query the historical snapshot 
of a Hudi partition, but to me this method is very unfriendly, because at query 
time I don't know how many recent commits there have been in the partition I 
want to query.
   ```
   add jar ${hudi.hadoop.bundle};
   
   set hoodie.stock_ticks_cow.consume.mode=INCREMENTAL;
   set hoodie.stock_ticks_cow.consume.max.commits=3;
   set hoodie.stock_ticks_cow.consume.start.timestamp='${min.commit.time}';
   ```
   If the Hudi community can provide a way to query the full history snapshots 
of a partition in Hudi, I think that would be a great feature, because it would 
solve the pain point of querying full history snapshots. 
   In this way, I could directly switch the current full data warehouse `Hive on 
HDFS` model to `Hive on Hudi`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on a change in pull request #2967: Added blog for Hudi cleaner service

2021-05-28 Thread GitBox


pratyakshsharma commented on a change in pull request #2967:
URL: https://github.com/apache/hudi/pull/2967#discussion_r641319609



##
File path: 
docs/_posts/2021-05-19-employing-right-configurations-for-hudi-cleaner.md
##
@@ -0,0 +1,77 @@
+---
+title: "Employing correct configurations for Hudi's cleaner table service"
+excerpt: "Achieving isolation between Hudi writer and readers using 
`HoodieCleaner.java`"
+author: pratyakshsharma
+category: blog
+---
+
+Apache Hudi provides snapshot isolation between writers and readers. This is 
made possible by Hudi’s MVCC concurrency model. In this blog, we will explain 
how to employ the right configurations to manage multiple file versions. 
Furthermore, we will discuss mechanisms available to users generating Hudi 
tables on how to maintain just the required number of old file versions so that 
long running readers do not fail. 
+
+### Reclaiming space and bounding your data lake growth
+
+Hudi provides different table management services to be able to manage your 
tables on the data lake. One of these services is called the **Cleaner**. As 
you write more data to your table, for every batch of updates received, Hudi 
can either generate a new version of the data file with updates applied to 
records (COPY_ON_WRITE) or write these delta updates to a log file, avoiding 
rewriting newer version of an existing file (MERGE_ON_READ). In such 
situations, depending on the frequency of your updates, the number of file 
versions of log files can grow indefinitely. If your use-cases do not require 
keeping an infinite history of these versions, it is imperative to have a 
process that reclaims older versions of the data. This is Hudi’s cleaner 
service.
+
+### Problem Statement
+
+In a data lake architecture, it is a very common scenario to have readers and 
writers concurrently accessing the same table. As the Hudi cleaner service 
periodically reclaims older file versions, scenarios arise where a long running 
query might be accessing a file version that is deemed to be reclaimed by the 
cleaner. Here, we need to employ the correct configs to ensure readers (aka 
queries) don’t fail.
+
+### Deeper dive into Hudi Cleaner
+
+To deal with the mentioned scenario, let's understand the different cleaning 
policies that Hudi offers and the corresponding properties that need to be 
configured. Options are available to schedule cleaning asynchronously or 
synchronously. Before going into more details, we would like to explain a few 
underlying concepts:
+
+ - **Hudi base file**: Columnar file which consists of final data after 
compaction. A base file’s name follows the following naming convention: 
`<fileId>_<writeToken>_<instantTime>.parquet`. In subsequent writes of this 
file, file id remains the same and commit time gets updated to show the latest 
version. This also implies any particular version of a record, given its 
partition path, can be uniquely located using the file id and instant time. 
+ - **File slice**: A file slice consists of the base file and any log files 
consisting of the delta, in case of MERGE_ON_READ table type.
+ - **Hudi File Group**: Any file group in Hudi is uniquely identified by the 
partition path and the  file id that the files in this group have as part of 
their name. A file group consists of all the file slices in a particular 
partition path. Also any partition path can have multiple file groups.
+
+### Cleaning Policies
+
+Hudi cleaner currently supports below cleaning policies:
+
+ - **KEEP_LATEST_COMMITS**: This is the default policy. This is a temporal 
cleaning policy that ensures the effect of having lookback into all the changes 
that happened in the last X commits. Suppose a writer is ingesting data into a 
Hudi dataset every 30 minutes and the longest running query can take 5 hours to 
finish; then the user should retain at least the last 10 commits. With such a 
configuration, we ensure that the oldest version of a file is kept on disk for 
at least 5 hours, thereby preventing the longest running query from failing at 
any point in time. Incremental cleaning is also possible using this policy.
+ - **KEEP_LATEST_FILE_VERSIONS**: This is a static numeric policy that has the 
effect of keeping N number of file versions irrespective of time. This policy 
is useful when it is known how many MAX versions of the file one wants to 
keep at any given time. To achieve the same behaviour as before of preventing 
long running queries from failing, one should do their calculations based on 
data patterns. Alternatively, this policy is also useful if a user just wants 
to maintain 1 latest version of the file.
+
+### Examples
+
+Suppose a user uses the below configs for cleaning:
+
+```java
+hoodie.cleaner.policy=KEEP_LATEST_COMMITS
+hoodie.cleaner.commits.retained=10
+```
+
+Cleaner selects the versions of files to be cleaned by taking care of the 
following:
+
+ - Latest version of a file should not be cleaned.
+ - The commit times of the last 10 (config

[jira] [Closed] (HUDI-1923) Add state in StreamWriteFunction to restore

2021-05-28 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-1923.
--
Fix Version/s: 0.9.0
   Resolution: Done

bc18c39835d6775b063ae072aea4ba43177d66b1

> Add state in StreamWriteFunction to restore
> ---
>
> Key: HUDI-1923
> URL: https://issues.apache.org/jira/browse/HUDI-1923
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Flink Integration
>Reporter: yuzhaojing
>Assignee: yuzhaojing
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>
> In Flink, notifyCheckpointComplete is not part of the checkpoint life cycle. If a 
> checkpoint succeeds but the commit action in notifyCheckpointComplete fails, then 
> when we restore from the latest checkpoint, the elements belonging to this 
> instant will be discarded.
> So, we should store the commit state and restore it when Flink restarts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)