[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-17 Thread GitBox


codecov-io edited a comment on pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633#issuecomment-629848790


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=h1) 
Report
   > Merging 
[#1633](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/25e0b75b3d03b6d460dc18d1a5fce7b881b0e019&el=desc)
 will **decrease** coverage by `54.59%`.
   > The diff coverage is `38.88%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1633/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #1633       +/-   ##
   ==============================================
   - Coverage     71.81%   17.22%   -54.60%
   + Complexity     1092      827     -265
   ==============================================
     Files           386      344      -42
     Lines         16608    15481    -1127
     Branches       1667     1582      -85
   ==============================================
   - Hits          11927     2666    -9261
   - Misses         3955    12465    +8510
   + Partials        726      350     -376
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...che/hudi/table/action/commit/BulkInsertHelper.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9CdWxrSW5zZXJ0SGVscGVyLmphdmE=)
 | `0.00% <0.00%> (-85.00%)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `49.15% <20.00%> (-35.70%)` | `53.00 <1.00> (+6.00)` | :arrow_down: |
   | 
[...di/common/table/timeline/HoodieActiveTimeline.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZUFjdGl2ZVRpbWVsaW5lLmphdmE=)
 | `28.49% <44.44%> (-54.40%)` | `17.00 <1.00> (+1.00)` | :arrow_down: |
   | 
[.../table/action/commit/BaseCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9CYXNlQ29tbWl0QWN0aW9uRXhlY3V0b3IuamF2YQ==)
 | `46.01% <100.00%> (-38.81%)` | `14.00 <0.00> (ø)` | |
   | 
[...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [314 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `

[jira] [Updated] (HUDI-907) Test Presto mor query support changes in HDFS Env

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-907:
---
Status: Open  (was: New)

> Test Presto mor query support changes in HDFS Env
> -
>
> Key: HUDI-907
> URL: https://issues.apache.org/jira/browse/HUDI-907
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.5.3
>
>
> Test presto integration for HDFS environment as well in addition to S3.
>  
> Blockers faced so far
> [~bdscheller] I tried to apply your presto patch to test mor queries on 
> Presto. The way I set it up was to create a docker image from your presto patch 
> and use that image in the hudi local docker environment. I observed a couple of 
> issues there:
>  * I got NoClassDefFoundError for these classes:
>  ** org/apache/parquet/avro/AvroSchemaConverter
>  ** org/apache/parquet/hadoop/ParquetFileReader
>  ** org/apache/parquet/io/InputFile
>  ** org/apache/parquet/format/TypeDefinedOrder
> I was able to get around the first three errors by shading org.apache.parquet 
> inside hudi-presto-bundle and changing presto-hive to depend on the 
> hudi-presto-bundle. However, for the last one shading didn't help because it is 
> already a Thrift-generated class. I am wondering whether you also ran into similar 
> issues while testing S3.
> Could you please elaborate on your test setup so we can do a similar thing for 
> HDFS as well? If we need to add more changes to hudi-presto-bundle, we would 
> need to prioritize that for the 0.5.3 release asap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-907) Test Presto mor query support changes in HDFS Env

2020-05-17 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-907:
--

 Summary: Test Presto mor query support changes in HDFS Env
 Key: HUDI-907
 URL: https://issues.apache.org/jira/browse/HUDI-907
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Presto Integration
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha
 Fix For: 0.5.3


Test presto integration for HDFS environment as well in addition to S3.

 

Blockers faced so far

[~bdscheller] I tried to apply your presto patch to test mor queries on Presto. 
The way I set it up was to create a docker image from your presto patch and use 
that image in the hudi local docker environment. I observed a couple of issues there:
 * I got NoClassDefFoundError for these classes:
 ** org/apache/parquet/avro/AvroSchemaConverter
 ** org/apache/parquet/hadoop/ParquetFileReader
 ** org/apache/parquet/io/InputFile
 ** org/apache/parquet/format/TypeDefinedOrder

I was able to get around the first three errors by shading org.apache.parquet 
inside hudi-presto-bundle and changing presto-hive to depend on the 
hudi-presto-bundle. However, for the last one shading didn't help because it is 
already a Thrift-generated class. I am wondering whether you also ran into similar 
issues while testing S3.

Could you please elaborate on your test setup so we can do a similar thing for HDFS 
as well? If we need to add more changes to hudi-presto-bundle, we would need to 
prioritize that for the 0.5.3 release asap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-305:
---
Fix Version/s: 0.5.3
   0.6.0

> Presto MOR "_rt" queries only reads base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> Our current understanding of this issue is that while the 
> HoodieParquetRealtimeInputFormat correctly generates the splits, the 
> RealtimeCompactedRecordReader record reader is not used, so the query does not 
> read the log file and only reads the base parquet file.
>  
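
For context, the Hive-side realtime path returns the merged view because the realtime input format wraps the plain base-file reader with a log-merging record reader. The sketch below is illustrative only (the class name RealtimeInputFormatSketch and its two hook methods are hypothetical; only the Hadoop mapred types and the general two-step shape are real) and is meant to show the step that appears to be skipped on the Presto side:

{code:java}
// Illustrative sketch, not Hudi source: the two-step shape of a "realtime" mapred
// input format. Step (2) is what merges delta-log records over the base parquet
// records; the report above suggests the Presto path stops after step (1).
import java.io.IOException;

import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public abstract class RealtimeInputFormatSketch extends FileInputFormat<NullWritable, ArrayWritable> {

  @Override
  public RecordReader<NullWritable, ArrayWritable> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    // (1) plain reader over the base parquet file only
    RecordReader<NullWritable, ArrayWritable> baseReader = baseParquetReader(split, job, reporter);
    // (2) wrap it so updates from the log files are merged in at read time
    //     (the role RealtimeCompactedRecordReader plays in the Hive integration)
    return logMergingReader(split, job, baseReader);
  }

  /** Hypothetical hook: would delegate to a parquet input format. */
  protected abstract RecordReader<NullWritable, ArrayWritable> baseParquetReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException;

  /** Hypothetical hook: would return a reader that merges log records over base records. */
  protected abstract RecordReader<NullWritable, ArrayWritable> logMergingReader(
      InputSplit split, JobConf job, RecordReader<NullWritable, ArrayWritable> baseReader) throws IOException;
}
{code}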



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-705) Add unit test for RollbacksCommand

2020-05-17 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-705.
-
Resolution: Done

Done via master branch: 57132f79bb2dad6cfb215480b435a778714a442d

> Add unit test for RollbacksCommand
> --
>
> Key: HUDI-705
> URL: https://issues.apache.org/jira/browse/HUDI-705
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-705) Add unit test for RollbacksCommand

2020-05-17 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-705:
--
Status: Open  (was: New)

> Add unit test for RollbacksCommand
> --
>
> Key: HUDI-705
> URL: https://issues.apache.org/jira/browse/HUDI-705
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: CLI, Testing
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua merged pull request #1611: [HUDI-705]Add unit test for RollbacksCommand

2020-05-17 Thread GitBox


yanghua merged pull request #1611:
URL: https://github.com/apache/incubator-hudi/pull/1611


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch master updated: [HUDI-705] Add unit test for RollbacksCommand (#1611)

2020-05-17 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 57132f7  [HUDI-705] Add unit test for RollbacksCommand (#1611)
57132f7 is described below

commit 57132f79bb2dad6cfb215480b435a778714a442d
Author: hongdd 
AuthorDate: Mon May 18 14:04:06 2020 +0800

[HUDI-705] Add unit test for RollbacksCommand (#1611)
---
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  10 ++
 .../apache/hudi/cli/commands/RollbacksCommand.java |  19 ++-
 .../hudi/cli/commands/TestRollbacksCommand.java| 182 +
 3 files changed, 204 insertions(+), 7 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
index 5e31e5c..4fc41a1 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
@@ -23,6 +23,7 @@ package org.apache.hudi.cli;
  */
 public class HoodieTableHeaderFields {
   public static final String HEADER_PARTITION = "Partition";
+  public static final String HEADER_INSTANT = "Instant";
   public static final String HEADER_PARTITION_PATH = HEADER_PARTITION + " 
Path";
   public static final String HEADER_FILE_ID = "FileId";
   public static final String HEADER_BASE_INSTANT = "Base-Instant";
@@ -81,4 +82,13 @@ public class HoodieTableHeaderFields {
   public static final String HEADER_HOODIE_PROPERTY = "Property";
   public static final String HEADER_OLD_VALUE = "Old Value";
   public static final String HEADER_NEW_VALUE = "New Value";
+
+  /**
+   * Fields of Rollback.
+   */
+  public static final String HEADER_ROLLBACK_INSTANT = "Rolledback " + 
HEADER_INSTANT;
+  public static final String HEADER_TIME_TOKEN_MILLIS = "Time taken in millis";
+  public static final String HEADER_TOTAL_PARTITIONS = "Total Partitions";
+  public static final String HEADER_DELETED_FILE = "Deleted File";
+  public static final String HEADER_SUCCEEDED = "Succeeded";
 }
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RollbacksCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RollbacksCommand.java
index 70b34bc..4feb4c1 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RollbacksCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/RollbacksCommand.java
@@ -21,6 +21,7 @@ package org.apache.hudi.cli.commands;
 import org.apache.hudi.avro.model.HoodieRollbackMetadata;
 import org.apache.hudi.cli.HoodieCLI;
 import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.HoodieTableHeaderFields;
 import org.apache.hudi.cli.TableHeader;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
@@ -56,8 +57,7 @@ public class RollbacksCommand implements CommandMarker {
   @CliOption(key = {"sortBy"}, help = "Sorting Field", 
unspecifiedDefaultValue = "") final String sortByField,
   @CliOption(key = {"desc"}, help = "Ordering", unspecifiedDefaultValue = 
"false") final boolean descending,
   @CliOption(key = {"headeronly"}, help = "Print Header Only",
-  unspecifiedDefaultValue = "false") final boolean headerOnly)
-  throws IOException {
+  unspecifiedDefaultValue = "false") final boolean headerOnly) {
 HoodieActiveTimeline activeTimeline = new 
RollbackTimeline(HoodieCLI.getTableMetaClient());
 HoodieTimeline rollback = 
activeTimeline.getRollbackTimeline().filterCompletedInstants();
 
@@ -79,9 +79,11 @@ public class RollbacksCommand implements CommandMarker {
 e.printStackTrace();
   }
 });
-TableHeader header = new 
TableHeader().addTableHeaderField("Instant").addTableHeaderField("Rolledback 
Instant")
-.addTableHeaderField("Total Files Deleted").addTableHeaderField("Time 
taken in millis")
-.addTableHeaderField("Total Partitions");
+TableHeader header = new 
TableHeader().addTableHeaderField(HoodieTableHeaderFields.HEADER_INSTANT)
+.addTableHeaderField(HoodieTableHeaderFields.HEADER_ROLLBACK_INSTANT)
+
.addTableHeaderField(HoodieTableHeaderFields.HEADER_TOTAL_FILES_DELETED)
+.addTableHeaderField(HoodieTableHeaderFields.HEADER_TIME_TOKEN_MILLIS)
+.addTableHeaderField(HoodieTableHeaderFields.HEADER_TOTAL_PARTITIONS);
 return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, 
descending, limit, headerOnly, rows);
   }
 
@@ -112,8 +114,11 @@ public class RollbacksCommand implements CommandMarker {
   rows.add(row);
 }));
 
-TableHeader header = new 
TableHeader().addTableHeaderField("Instant").addTableHeaderField("Rolledback 
Instants")
-.addTableHeaderField("Partition").addTableHeaderField("Deleted 
F

[GitHub] [incubator-hudi] yanghua commented on pull request #1572: [HUDI-836] Implement datadog metrics reporter

2020-05-17 Thread GitBox


yanghua commented on pull request #1572:
URL: https://github.com/apache/incubator-hudi/pull/1572#issuecomment-629960658


   Will do a final check.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-906) Sudha: Create gpg key and add to KEYS file

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-906:
---
Fix Version/s: (was: 0.5.0)

> Sudha: Create gpg key and add to KEYS file
> --
>
> Key: HUDI-906
> URL: https://issues.apache.org/jira/browse/HUDI-906
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
>
> Steps:
>  # Install [https://gpgtools.org/]
>  # Create gpg key with your apache emailId and publish to key server
>  # Run the following command "gpg --list-sigs vbal...@apache.org  && gpg 
> --armor --export vbal...@apache.org" and append the output to KEYS file.
>  # Create a github PR against master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-906) Sudha: Create gpg key and add to KEYS file

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-906:
---
Parent: (was: HUDI-121)
Issue Type: Task  (was: Sub-task)

> Sudha: Create gpg key and add to KEYS file
> --
>
> Key: HUDI-906
> URL: https://issues.apache.org/jira/browse/HUDI-906
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>
> Steps:
>  # Install [https://gpgtools.org/]
>  # Create gpg key with your apache emailId and publish to key server
>  # Run the following command "gpg --list-sigs vbal...@apache.org  && gpg 
> --armor --export vbal...@apache.org" and append the output to KEYS file.
>  # Create a github PR against master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-906) Sudha: Create gpg key and add to KEYS file

2020-05-17 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-906:
--

 Summary: Sudha: Create gpg key and add to KEYS file
 Key: HUDI-906
 URL: https://issues.apache.org/jira/browse/HUDI-906
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Release & Administrative
Reporter: Bhavani Sudha
Assignee: Nishith Agarwal
 Fix For: 0.5.0


Steps:
 # Install [https://gpgtools.org/]
 # Create gpg key with your apache emailId and publish to key server
 # Run the following command "gpg --list-sigs vbal...@apache.org  && gpg 
--armor --export vbal...@apache.org" and append the output to KEYS file.

 # Create a github PR against master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-906) Sudha: Create gpg key and add to KEYS file

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-906:
--

Assignee: Bhavani Sudha  (was: Nishith Agarwal)

> Sudha: Create gpg key and add to KEYS file
> --
>
> Key: HUDI-906
> URL: https://issues.apache.org/jira/browse/HUDI-906
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Release & Administrative
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>
> Steps:
>  # Install [https://gpgtools.org/]
>  # Create gpg key with your apache emailId and publish to key server
>  # Run the following command "gpg --list-sigs vbal...@apache.org  && gpg 
> --armor --export vbal...@apache.org" and append the output to KEYS file.
>  # Create a github PR against master



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] bvaradar commented on pull request #1639: [MINOR] Fix apache-rat violations

2020-05-17 Thread GitBox


bvaradar commented on pull request #1639:
URL: https://github.com/apache/incubator-hudi/pull/1639#issuecomment-629958724


   @jfrazee : Thanks a lot for your help in identifying and providing the fix. 
There is one more related change as part of this, and I took the liberty of adding 
the patch to this PR. If the tests go fine, I will merge the combined 
changes. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on pull request #1639: [MINOR] Fix apache-rat violations

2020-05-17 Thread GitBox


bvaradar commented on pull request #1639:
URL: https://github.com/apache/incubator-hudi/pull/1639#issuecomment-629957097


   For the other classes, the underlying problem is that apache-rat was not 
enabled for the hudi-utilities bundle



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] n3nash commented on pull request #1638: HUDI-515 Resolve API conflict for Hive 2 & Hive 3

2020-05-17 Thread GitBox


n3nash commented on pull request #1638:
URL: https://github.com/apache/incubator-hudi/pull/1638#issuecomment-629942871


   @zhedoubushishi what does it take to support Hive 3.x for Hudi with a mvn 
flag? If we cannot support Hive 3.x, what is the intention of this PR? I'm 
not very inclined to use reflection with try-catch here since it's not a clear 
indication of Hive 3.x support.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-890) Prepare for 0.5.3 patch release

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-890:
---
Description: 
The following commits are included in this release.
 * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break 
the inheritance chain
 * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient
 * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in 
NumericUtils.java
 * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
filterDupes is enabled on UPSERT mode.
 * #1517 HUDI-799 Use appropriate FS when loading configs
 * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema
 * #1394 HUDI-656 [Performance] Return a dummy Spark relation after writing the 
DataFrame
 * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode
 * #1421 HUDI-724 Parallelize getSmallFiles for partitions
 * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by 
Date type columns
 * #1413 Add constructor to HoodieROTablePathFilter
 * #1415 HUDI-539 Make ROPathFilter conf member serializable
 * #1578 Add changes for presto mor queries
 * #1506 HUDI-782 Add support of Aliyun object storage service.
 * #1432 HUDI-716 Exception: Not an Avro data file when running 
HoodieCleanClient.runClean
 * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction
 * #1448 [MINOR] Update DOAP with 0.5.2 Release
 * #1466 HUDI-742 Fix Java Math Exception
 * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements.
 * #1427 HUDI-727: Copy default values of fields if not present when rewriting 
incoming record with new schema
 * #1515 HUDI-795 Handle auto-deleted empty aux folder
 * #1547 [MINOR]: Fix cli docs for DeltaStreamer
 * #1580 HUDI-852 adding check for table name for Append Save mode
 * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
HoodieGlobalBloomIndex class
 * #1434 HUDI-616 Fixed parquet files getting created on local FS
 * #1633 HUDI-858 Allow multiple operations to be executed within a single 
commit

 * #1634 HUDI-846 Enable Incremental cleaning and embedded timeline-server by 
default

 * #1596 HUDI-863 get decimal properties from derived spark DataType

 * #1636 HUDI-895 Remove unnecessary listing .hoodie folder when using timeline 
server

 * #1584 HUDI-902 Avoid exception when getSchemaProvider

 * #1612 HUDI-528 Handle empty commit in incremental pulling

 * #1511 HUDI-789 Adjust logic of upsert in HDFSParquetImporter

 * #1627 HUDI-889 Writer supports useJdbc configuration when hive 
synchronization is enabled

  was:
The following commits are included in this release.
 * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break 
the inheritance chain
 * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient
 * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in 
NumericUtils.java
 * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
filterDupes is enabled on UPSERT mode.
 * #1517 HUDI-799 Use appropriate FS when loading configs
 * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema
 * #1394 HUDI-656[Performance] Return a dummy Spark relation after writing the 
DataFrame
 * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode
 * #1421 HUDI-724 Parallelize getSmallFiles for partitions
 * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by 
Date type columns
 * #1413 Add constructor to HoodieROTablePathFilter
 * #1415 HUDI-539 Make ROPathFilter conf member serializable
 * #1578 Add changes for presto mor queries
 * #1506 HUDI-782 Add support of Aliyun object storage service.
 * #1432 HUDI-716 Exception: Not an Avro data file when running 
HoodieCleanClient.runClean
 * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction
 * #1448 [MINOR] Update DOAP with 0.5.2 Release
 * #1466 HUDI-742 Fix Java Math Exception
 * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements.
 * #1427 HUDI-727: Copy default values of fields if not present when rewriting 
incoming record with new schema
 * #1515 HUDI-795 Handle auto-deleted empty aux folder
 * #1547 [MINOR]: Fix cli docs for DeltaStreamer
 * #1580 HUDI-852 adding check for table name for Append Save mode
 * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
HoodieGlobalBloomIndex class
 * #1434 HUDI-616 Fixed parquet files getting created on local FS
 * #1633 HUDI-858 Allow multiple operations to be executed within a single 
commit

 * #1634 HUDI-846Enable Incremental cleaning and embedded timeline-server by 
default

 * #1596 HUDI-863 get decimal properties from derived spark DataType

 * #1602 HUDI-494 fix incorrect record size estimation

 * #1636 HUDI-895 Remove unnecessary listing .hoodie folder when using timeline 
server

 * #1584 HUDI-902 Avoid exception when getSchemaPr

[GitHub] [incubator-hudi] leesf commented on pull request #1095: [HUDI-210] Implement prometheus metrics reporter

2020-05-17 Thread GitBox


leesf commented on pull request #1095:
URL: https://github.com/apache/incubator-hudi/pull/1095#issuecomment-629941260


   > What is the status of this PR? Is it ready to merge?
   
   Hi @piyushrl, thanks for the interest. @xushiyan is now taking over the PR.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-905) Support native filter pushdown for Spark Datasource

2020-05-17 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-905:
---

 Summary: Support native filter pushdown for Spark Datasource
 Key: HUDI-905
 URL: https://issues.apache.org/jira/browse/HUDI-905
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-890) Prepare for 0.5.3 patch release

2020-05-17 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109805#comment-17109805
 ] 

Yanjia Gary Li commented on HUDI-890:
-

Hi [~bhavanisudha] , #1602 HUDI-494 fix incorrect record size estimation was 
pushed to 0.6.0. Thanks

> Prepare for 0.5.3 patch release
> ---
>
> Key: HUDI-890
> URL: https://issues.apache.org/jira/browse/HUDI-890
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.5.3
>
>
> The following commits are included in this release.
>  * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break 
> the inheritance chain
>  * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient
>  * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in 
> NumericUtils.java
>  * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
> filterDupes is enabled on UPSERT mode.
>  * #1517 HUDI-799 Use appropriate FS when loading configs
>  * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema
>  * #1394 HUDI-656[Performance] Return a dummy Spark relation after writing 
> the DataFrame
>  * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode
>  * #1421 HUDI-724 Parallelize getSmallFiles for partitions
>  * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by 
> Date type columns
>  * #1413 Add constructor to HoodieROTablePathFilter
>  * #1415 HUDI-539 Make ROPathFilter conf member serializable
>  * #1578 Add changes for presto mor queries
>  * #1506 HUDI-782 Add support of Aliyun object storage service.
>  * #1432 HUDI-716 Exception: Not an Avro data file when running 
> HoodieCleanClient.runClean
>  * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction
>  * #1448 [MINOR] Update DOAP with 0.5.2 Release
>  * #1466 HUDI-742 Fix Java Math Exception
>  * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements.
>  * #1427 HUDI-727: Copy default values of fields if not present when 
> rewriting incoming record with new schema
>  * #1515 HUDI-795 Handle auto-deleted empty aux folder
>  * #1547 [MINOR]: Fix cli docs for DeltaStreamer
>  * #1580 HUDI-852 adding check for table name for Append Save mode
>  * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
> HoodieGlobalBloomIndex class
>  * #1434 HUDI-616 Fixed parquet files getting created on local FS
>  * #1633 HUDI-858 Allow multiple operations to be executed within a single 
> commit
>  * #1634 HUDI-846Enable Incremental cleaning and embedded timeline-server by 
> default
>  * #1596 HUDI-863 get decimal properties from derived spark DataType
>  * #1602 HUDI-494 fix incorrect record size estimation
>  * #1636 HUDI-895 Remove unnecessary listing .hoodie folder when using 
> timeline server
>  * #1584 HUDI-902 Avoid exception when getSchemaProvider
>  * #1612 HUDI-528 Handle empty commit in incremental pulling
>  * #1511 HUDI-789Adjust logic of upsert in HDFSParquetImporter
>  * #1627 HUDI-889 Writer supports useJdbc configuration when hive 
> synchronization is enabled



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #281

2020-05-17 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.35 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
or

[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-494:

Fix Version/s: (was: 0.5.3)

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: lamber-ken
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems to be related to the input size. With 7.7 GB input it was 3.2 
> million tasks, with 9 GB input it was 3.7 million, both with a parallelism of 10. 
> I am seeing a huge number of 0-byte files being written into the .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task only writes fewer than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] piyushrl commented on pull request #1095: [HUDI-210] Implement prometheus metrics reporter

2020-05-17 Thread GitBox


piyushrl commented on pull request #1095:
URL: https://github.com/apache/incubator-hudi/pull/1095#issuecomment-629926575


What is the status of this PR? Is it ready to merge?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] hddong commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

2020-05-17 Thread GitBox


hddong commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629921612


   @yanghua : Sure, I'll discuss with @pratyakshsharma to make it a success.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] yanghua commented on pull request #1558: [HUDI-796]: added deduping logic for upserts case

2020-05-17 Thread GitBox


yanghua commented on pull request #1558:
URL: https://github.com/apache/incubator-hudi/pull/1558#issuecomment-629915198


   > @yanghua I am unable to run integration tests defined in the hudi-cli package 
on my local machine. One of the tests from ITTestRepairsCommand is continuously failing 
in the Travis build. Need help here.
   
   @hddong Can you help to verify it?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch hudi_test_suite_refactor updated (e9ee88c -> c9f6aa6)

2020-05-17 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard e9ee88c  [HUDI-394] Provide a basic implementation of test suite
 add c9f6aa6  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (e9ee88c)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (c9f6aa6)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../hudi/testsuite/TestFileDeltaInputWriter.java  | 19 +--
 .../hudi/testsuite/job/TestHoodieTestSuiteJob.java|  2 +-
 .../testsuite/reader/TestDFSAvroDeltaInputReader.java |  2 +-
 .../reader/TestDFSHoodieDatasetInputReader.java   |  2 +-
 4 files changed, 12 insertions(+), 13 deletions(-)



[incubator-hudi] branch master updated: [HUDI-407] Adding Simple Index to Hoodie. (#1402)

2020-05-17 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 29edf4b  [HUDI-407] Adding Simple Index to Hoodie. (#1402)
29edf4b is described below

commit 29edf4b3b8ade64ec7822d6b7b2a125d5ca781c4
Author: Sivabalan Narayanan 
AuthorDate: Sun May 17 21:32:24 2020 -0400

[HUDI-407] Adding Simple Index to Hoodie. (#1402)

This index finds the location by joining incoming records with records from 
base files.
---
 .../apache/hudi/client/utils/SparkConfigUtils.java |   4 +
 .../org/apache/hudi/config/HoodieIndexConfig.java  |  46 ++
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  16 +
 .../java/org/apache/hudi/index/HoodieIndex.java|  15 +-
 .../org/apache/hudi/index/HoodieIndexUtils.java|  90 
 .../apache/hudi/index/bloom/HoodieBloomIndex.java  |  35 +-
 .../hudi/index/bloom/HoodieGlobalBloomIndex.java   |   7 +-
 .../hudi/index/simple/HoodieGlobalSimpleIndex.java | 169 +++
 .../hudi/index/simple/HoodieSimpleIndex.java   | 181 
 .../hudi/io/HoodieKeyLocationFetchHandle.java  |  57 +++
 .../org/apache/hudi/index/TestHoodieIndex.java | 510 +++--
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   1 -
 .../hudi/io/TestHoodieKeyLocationFetchHandle.java  | 210 +
 .../java/org/apache/hudi/avro/HoodieAvroUtils.java |  18 +
 .../org/apache/hudi/common/util/ParquetUtils.java  |  52 ++-
 .../apache/hudi/common/util/TestParquetUtils.java  |  35 +-
 16 files changed, 1381 insertions(+), 65 deletions(-)

diff --git 
a/hudi-client/src/main/java/org/apache/hudi/client/utils/SparkConfigUtils.java 
b/hudi-client/src/main/java/org/apache/hudi/client/utils/SparkConfigUtils.java
index 604be01..0a6b608 100644
--- 
a/hudi-client/src/main/java/org/apache/hudi/client/utils/SparkConfigUtils.java
+++ 
b/hudi-client/src/main/java/org/apache/hudi/client/utils/SparkConfigUtils.java
@@ -99,4 +99,8 @@ public class SparkConfigUtils {
 String fraction = 
properties.getProperty(MAX_MEMORY_FRACTION_FOR_COMPACTION_PROP, 
DEFAULT_MAX_MEMORY_FRACTION_FOR_COMPACTION);
 return getMaxMemoryAllowedForMerge(fraction);
   }
+
+  public static StorageLevel getSimpleIndexInputStorageLevel(Properties 
properties) {
+return 
StorageLevel.fromString(properties.getProperty(HoodieIndexConfig.SIMPLE_INDEX_INPUT_STORAGE_LEVEL));
+  }
 }
diff --git 
a/hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java 
b/hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
index df2177e..4e974af 100644
--- a/hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
+++ b/hudi-client/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java
@@ -62,6 +62,12 @@ public class HoodieIndexConfig extends DefaultHoodieConfig {
   public static final String DEFAULT_BLOOM_INDEX_FILTER_TYPE = 
BloomFilterTypeCode.SIMPLE.name();
   public static final String HOODIE_BLOOM_INDEX_FILTER_DYNAMIC_MAX_ENTRIES = 
"hoodie.bloom.index.filter.dynamic.max.entries";
   public static final String 
DEFAULT_HOODIE_BLOOM_INDEX_FILTER_DYNAMIC_MAX_ENTRIES = "10";
+  public static final String SIMPLE_INDEX_USE_CACHING_PROP = 
"hoodie.simple.index.use.caching";
+  public static final String DEFAULT_SIMPLE_INDEX_USE_CACHING = "true";
+  public static final String SIMPLE_INDEX_PARALLELISM_PROP = 
"hoodie.simple.index.parallelism";
+  public static final String DEFAULT_SIMPLE_INDEX_PARALLELISM = "0";
+  public static final String GLOBAL_SIMPLE_INDEX_PARALLELISM_PROP = 
"hoodie.global.simple.index.parallelism";
+  public static final String DEFAULT_GLOBAL_SIMPLE_INDEX_PARALLELISM = "0";
 
   // 1B bloom filter checks happen in 250 seconds. 500ms to read a bloom 
filter.
   // 10M checks in 2500ms, thus amortizing the cost of reading bloom filter 
across partitions.
@@ -80,6 +86,8 @@ public class HoodieIndexConfig extends DefaultHoodieConfig {
 
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = 
"hoodie.bloom.index.input.storage.level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = 
"MEMORY_AND_DISK_SER";
+  public static final String SIMPLE_INDEX_INPUT_STORAGE_LEVEL = 
"hoodie.simple.index.input.storage.level";
+  public static final String DEFAULT_SIMPLE_INDEX_INPUT_STORAGE_LEVEL = 
"MEMORY_AND_DISK_SER";
 
   /**
* Only applies if index type is GLOBAL_BLOOM.
@@ -92,6 +100,9 @@ public class HoodieIndexConfig extends DefaultHoodieConfig {
   public static final String BLOOM_INDEX_UPDATE_PARTITION_PATH = 
"hoodie.bloom.index.update.partition.path";
   public static final String DEFAULT_BLOOM_INDEX_UPDATE_PARTITION_PATH = 
"false";
 
+  public static final String SIMPLE_INDEX_UPDATE_PARTITION_PATH = 
"hoodie.simple.index.update.partition.path";
+  public static final String DEFAULT_SIMPLE_INDEX_UPDATE_PARTITION_PATH = 
"false";

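For readers who want to try the new index, here is a minimal, hedged sketch of wiring it up through the write config. The class and method names SimpleIndexExample/createClient are invented for illustration, and HoodieIndex.IndexType.SIMPLE is assumed to be the enum value added by this commit; the builder calls themselves (newBuilder, withPath, withIndexConfig, withIndexType) are the existing config APIs.

```java
// Hedged sketch: enable the newly added simple index on a write client.
// Assumption: this commit adds a SIMPLE value to HoodieIndex.IndexType, alongside the
// hoodie.simple.index.* properties shown in the HoodieIndexConfig diff above.
import org.apache.hudi.client.HoodieWriteClient;
import org.apache.hudi.config.HoodieIndexConfig;
import org.apache.hudi.config.HoodieWriteConfig;
import org.apache.hudi.index.HoodieIndex;
import org.apache.spark.api.java.JavaSparkContext;

public class SimpleIndexExample {
  public static HoodieWriteClient createClient(JavaSparkContext jsc, String basePath) {
    HoodieWriteConfig config = HoodieWriteConfig.newBuilder()
        .withPath(basePath)
        .withIndexConfig(HoodieIndexConfig.newBuilder()
            .withIndexType(HoodieIndex.IndexType.SIMPLE) // assumed new enum value
            .build())
        .build();
    return new HoodieWriteClient(jsc, config);
  }
}
```

The simple index then locates incoming records by joining them against keys read from the existing base files, as described in the commit message above.
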
[GitHub] [incubator-hudi] vinothchandar merged pull request #1402: [HUDI-407] Adding Simple Index

2020-05-17 Thread GitBox


vinothchandar merged pull request #1402:
URL: https://github.com/apache/incubator-hudi/pull/1402


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-17 Thread GitBox


vinothchandar commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-629898227


   Let’s push this down the line for 0.6.0
   
   @garyli1019 we probably need an alternative strategy here that is more 
aggressive. But I see the bloom filter as part of the per-record overhead... let me 
review your latest change.
   
   We need to introduce some sizing strategy abstractions in the code 
ultimately.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] leesf commented on pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-17 Thread GitBox


leesf commented on pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633#issuecomment-629897867


   > Merging #1633 into master will decrease coverage by 55.07%.
   The diff coverage is 41.17%.
   Please take a look @bvaradar 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] vinothchandar merged pull request #1636: [HUDI-895] Remove unnecessary listing .hoodie folder when using timeline server

2020-05-17 Thread GitBox


vinothchandar merged pull request #1636:
URL: https://github.com/apache/incubator-hudi/pull/1636


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch master updated: [HUDI-895] Remove unnecessary listing .hoodie folder when using timeline server (#1636)

2020-05-17 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 3c9da2e  [HUDI-895] Remove unnecessary listing .hoodie folder when 
using timeline server (#1636)
3c9da2e is described below

commit 3c9da2e5f038ea90a823334a7b07bb4d13f90996
Author: Balaji Varadarajan 
AuthorDate: Sun May 17 18:18:53 2020 -0700

[HUDI-895] Remove unnecessary listing .hoodie folder when using timeline 
server (#1636)
---
 .../org/apache/hudi/client/HoodieWriteClient.java  |  6 +--
 .../java/org/apache/hudi/table/HoodieTable.java|  6 +--
 .../common/table/view/FileSystemViewManager.java   | 59 +-
 3 files changed, 40 insertions(+), 31 deletions(-)

diff --git 
a/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java 
b/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
index c781c69..fa0b15c 100644
--- a/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
+++ b/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
@@ -540,10 +540,8 @@ public class HoodieWriteClient extends AbstractHo
 HoodieTimeline.compareTimestamps(latestPending.getTimestamp(), 
HoodieTimeline.LESSER_THAN, instantTime),
 "Latest pending compaction instant time must be earlier than this 
instant time. Latest Compaction :"
 + latestPending + ",  Ingesting at " + instantTime));
-HoodieTable table = HoodieTable.create(metaClient, config, hadoopConf);
-HoodieActiveTimeline activeTimeline = table.getActiveTimeline();
-String commitActionType = table.getMetaClient().getCommitActionType();
-activeTimeline.createNewInstant(new HoodieInstant(State.REQUESTED, 
commitActionType, instantTime));
+metaClient.getActiveTimeline().createNewInstant(new 
HoodieInstant(State.REQUESTED, metaClient.getCommitActionType(),
+instantTime));
   }
 
   /**
diff --git a/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java 
b/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java
index 8acd351..0584dbf 100644
--- a/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java
+++ b/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java
@@ -231,21 +231,21 @@ public abstract class HoodieTable implements Seri
* Get the base file only view of the file system for this table.
*/
   public BaseFileOnlyView getBaseFileOnlyView() {
-return getViewManager().getFileSystemView(metaClient.getBasePath());
+return getViewManager().getFileSystemView(metaClient);
   }
 
   /**
* Get the full view of the file system for this table.
*/
   public SliceView getSliceView() {
-return getViewManager().getFileSystemView(metaClient.getBasePath());
+return getViewManager().getFileSystemView(metaClient);
   }
 
   /**
* Get complete view of the file system for this table with ability to force 
sync.
*/
   public SyncableFileSystemView getHoodieView() {
-return getViewManager().getFileSystemView(metaClient.getBasePath());
+return getViewManager().getFileSystemView(metaClient);
   }
 
   /**
diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
index c5e7764..c4ab712 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/table/view/FileSystemViewManager.java
@@ -58,10 +58,10 @@ public class FileSystemViewManager {
   // Map from Base-Path to View
  private final ConcurrentHashMap<String, SyncableFileSystemView> globalViewMap;
  // Factory Map to create file-system views
-  private final Function2<String, FileSystemViewStorageConfig, SyncableFileSystemView> viewCreator;
+  private final Function2<HoodieTableMetaClient, FileSystemViewStorageConfig, SyncableFileSystemView> viewCreator;
 
   public FileSystemViewManager(SerializableConfiguration conf, FileSystemViewStorageConfig viewStorageConfig,
-  Function2<String, FileSystemViewStorageConfig, SyncableFileSystemView> viewCreator) {
+  Function2<HoodieTableMetaClient, FileSystemViewStorageConfig, SyncableFileSystemView> viewCreator) {
 this.conf = new SerializableConfiguration(conf);
 this.viewStorageConfig = viewStorageConfig;
 this.globalViewMap = new ConcurrentHashMap<>();
@@ -87,7 +87,21 @@ public class FileSystemViewManager {
* @return
*/
   public SyncableFileSystemView getFileSystemView(String basePath) {
-return globalViewMap.computeIfAbsent(basePath, (path) -> 
viewCreator.apply(path, viewStorageConfig));
+return globalViewMap.computeIfAbsent(basePath, (path) -> {
+  HoodieTableMetaClient metaClient = new 
HoodieTableMetaClient(conf.newCopy(), path);
+  return viewCreator.apply(metaClient, viewStorageConfig);
+});
+  }
+
+  /**
+   * Main API to get the file-system view for the base-path.
+   *
+   * @param metaClient HoodieTableMetaClient
+   * @return
+   */
+  public SyncableFileSystemView getFileSystemView(HoodieTableMetaClient 
metaClien
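The archived diff is cut off above. A minimal sketch of the new overload, assuming the committed method simply keys the cache by the meta client's base path and reuses the already-built HoodieTableMetaClient (so the .hoodie folder is not listed again just to rebuild it); the exact committed body may differ:

{code:java}
// Sketch only, not necessarily the exact committed code.
public SyncableFileSystemView getFileSystemView(HoodieTableMetaClient metaClient) {
  return globalViewMap.computeIfAbsent(metaClient.getBasePath(),
      (path) -> viewCreator.apply(metaClient, viewStorageConfig));
}
{code}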

[GitHub] [incubator-hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-17 Thread GitBox


garyli1019 commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-629887094


   We can push this to 0.6.0 if you guys prefer to have more discussion. If 
there is anything I can help with, please let me know.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nsivabalan commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-05-17 Thread GitBox


nsivabalan commented on pull request #1602:
URL: https://github.com/apache/incubator-hudi/pull/1602#issuecomment-629884272


   @vinothchandar: are we proceeding with the patch? I haven't started looking 
at it yet, but if we plan to get it into 0.5.3, we need to get this resolved 
ASAP. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] n3nash edited a comment on pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

2020-05-17 Thread GitBox


n3nash edited a comment on pull request #1484:
URL: https://github.com/apache/incubator-hudi/pull/1484#issuecomment-629876523


   @v3nkatesh The rate limiter looks good to me, but it's still inspired by 
Guava. I'll let @vinothchandar comment since he felt strongly about 
implementing our own.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] n3nash commented on pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

2020-05-17 Thread GitBox


n3nash commented on pull request #1484:
URL: https://github.com/apache/incubator-hudi/pull/1484#issuecomment-629876523


   @v3nkatesh The rate limiter looks good to me, but it's still inspired by 
Guava. I'll let @vinothchandar comment since he felt strongly about implementing our 
own.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1433: [HUDI-728]: Implement custom key generator

2020-05-17 Thread GitBox


nsivabalan commented on a change in pull request #1433:
URL: https://github.com/apache/incubator-hudi/pull/1433#discussion_r426313249



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/CustomKeyGenerator.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.config.TypedProperties;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.exception.HoodieDeltaStreamerException;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * This is a generic implementation of KeyGenerator where users can configure 
record key as a single field or a combination of fields.
+ * Similarly partition path can be configured to have multiple fields or only 
one field. This class expects value for prop
+ * "hoodie.datasource.write.partitionpath.field" in a specific format. For 
example:
+ *
+ * properties.put("hoodie.datasource.write.partitionpath.field", 
"field1:PartitionKeyType1,field2:PartitionKeyType2").
+ *
+ * The complete partition path is created as <value for field1>/<value for field2> and so on.
+ *
+ * Few points to consider:
+ * 1. If you want to customize some partition path field on a timestamp basis, 
you can use field1:timestampBased
+ * 2. If you simply want to have the value of your configured field in the 
partition path, use field1:simple
+ * 3. If you want your table to be non partitioned, simply leave it as blank.
+ *
+ * RecordKey is internally generated using either SimpleKeyGenerator or 
ComplexKeyGenerator.
+ */
+public class CustomKeyGenerator extends KeyGenerator {
+
+  protected final List<String> recordKeyFields;
+  protected final List<String> partitionPathFields;
+  protected final TypedProperties properties;
+  private static final String DEFAULT_PARTITION_PATH_SEPARATOR = "/";
+  private static final String SPLIT_REGEX = ":";
+
+  /**
+   * Used as a part of config in CustomKeyGenerator.java.
+   */
+  public enum PartitionKeyType {
+SIMPLE, TIMESTAMP
+  }
+
+  public CustomKeyGenerator(TypedProperties props) {
+super(props);
+this.properties = props;
+this.recordKeyFields = 
Arrays.stream(props.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(",")).map(String::trim).collect(Collectors.toList());
+this.partitionPathFields =
+  
Arrays.stream(props.getString(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY()).split(",")).map(String::trim).collect(Collectors.toList());
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+//call function to get the record key
+String recordKey = getRecordKey(record);
+//call function to get the partition key based on the type for that 
partition path field
+String partitionPath = getPartitionPath(record);
+return new HoodieKey(recordKey, partitionPath);
+  }
+
+  private String getPartitionPath(GenericRecord record) {
+if (partitionPathFields == null) {
+  throw new HoodieKeyException("Unable to find field names for partition 
path in cfg");
+}
+
+String partitionPathField;
+StringBuilder partitionPath = new StringBuilder();
+
+//Corresponds to no partition case
+if (partitionPathFields.size() == 1 && 
partitionPathFields.get(0).isEmpty()) {
+  return "";
+}
+for (String field : partitionPathFields) {
+  String[] fieldWithType = field.split(SPLIT_REGEX);
+  if (fieldWithType.length != 2) {
+throw new HoodieKeyException("Unable to find field names for partition 
path in proper format");
+  }
+
+  partitionPathField = fieldWithType[0];
+  PartitionKeyType keyType = 
PartitionKeyType.valueOf(fieldWithType[1].toUpperCase());
+  switch (keyType) {
+case SIMPLE:
+  partitionPath.append(new 
SimpleKeyGenerator(properties).getPartitionPath(record, partitionPathField));
+  break;
+case TIMESTAMP:
+  partitionPath.append(new 
TimestampBasedKeyGenerator(properties).getPartitionPath(record, 
partitionPathField));
+  break;
+default
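The quoted file is cut off above. For reference, a hedged usage sketch of the proposed generator, assuming the standard hoodie.datasource.write.recordkey.field / hoodie.datasource.write.partitionpath.field keys and an Avro GenericRecord named record; timestamp-typed fields additionally need the TimestampBasedKeyGenerator properties:

{code:java}
// Sketch under the assumptions above; not taken from the PR itself.
TypedProperties props = new TypedProperties();
props.setProperty("hoodie.datasource.write.recordkey.field", "uuid");
// "field:type" pairs as described in the javadoc: one simple field, one timestamp-based field
props.setProperty("hoodie.datasource.write.partitionpath.field", "region:simple,createdAt:timestamp");

CustomKeyGenerator keyGen = new CustomKeyGenerator(props);
HoodieKey key = keyGen.getKey(record);  // partition path e.g. "US/2020/05/17"
{code}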

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1433: [HUDI-728]: Implement custom key generator

2020-05-17 Thread GitBox


nsivabalan commented on a change in pull request #1433:
URL: https://github.com/apache/incubator-hudi/pull/1433#discussion_r426313186



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/CustomKeyGenerator.java
##
@@ -0,0 +1,128 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.keygen;
+
+import org.apache.hudi.DataSourceWriteOptions;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.config.TypedProperties;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hudi.exception.HoodieDeltaStreamerException;
+import org.apache.hudi.exception.HoodieKeyException;
+
+import java.util.Arrays;
+import java.util.List;
+import java.util.stream.Collectors;
+
+/**
+ * This is a generic implementation of KeyGenerator where users can configure 
record key as a single field or a combination of fields.
+ * Similarly partition path can be configured to have multiple fields or only 
one field. This class expects value for prop
+ * "hoodie.datasource.write.partitionpath.field" in a specific format. For 
example:
+ *
+ * properties.put("hoodie.datasource.write.partitionpath.field", 
"field1:PartitionKeyType1,field2:PartitionKeyType2").
+ *
+ * The complete partition path is created as <value for field1>/<value for field2> and so on.
+ *
+ * Few points to consider:
+ * 1. If you want to customise some partition path field on a timestamp basis, 
you can use field1:timestampBased
+ * 2. If you simply want to have the value of your configured field in the 
partition path, use field1:simple
+ * 3. If you want your table to be non partitioned, simply leave it as blank.
+ *
+ * RecordKey is internally generated using either SimpleKeyGenerator or 
ComplexKeyGenerator.
+ */
+public class CustomKeyGenerator extends KeyGenerator {
+
+  protected final List<String> recordKeyFields;
+  protected final List<String> partitionPathFields;
+  protected final TypedProperties properties;
+  private static final String DEFAULT_PARTITION_PATH_SEPARATOR = "/";
+  private static final String SPLIT_REGEX = ":";
+
+  /**
+   * Used as a part of config in CustomKeyGenerator.java.
+   */
+  public enum PartitionKeyType {
+SIMPLE, TIMESTAMP
+  }
+
+  public CustomKeyGenerator(TypedProperties props) {
+super(props);
+this.properties = props;
+this.recordKeyFields = 
Arrays.stream(props.getString(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY()).split(",")).map(String::trim).collect(Collectors.toList());
+this.partitionPathFields =
+  
Arrays.stream(props.getString(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY()).split(",")).map(String::trim).collect(Collectors.toList());
+  }
+
+  @Override
+  public HoodieKey getKey(GenericRecord record) {
+//call function to get the record key
+String recordKey = getRecordKey(record);
+//call function to get the partition key based on the type for that 
partition path field
+String partitionPath = getPartitionPath(record);
+return new HoodieKey(recordKey, partitionPath);
+  }
+
+  public String getPartitionPath(GenericRecord record) {
+if (partitionPathFields == null) {
+  throw new HoodieKeyException("Unable to find field names for partition 
path in cfg");
+}
+
+String partitionPathField;
+StringBuilder partitionPath = new StringBuilder();
+
+//Corresponds to no partition case
+if (partitionPathFields.size() == 1 && 
partitionPathFields.get(0).isEmpty()) {

Review comment:
   I get it, but I have a follow-up question and a comment. 
   - I am not sure why someone would set an empty string. If users do not want 
any partitions, they might as well not set the property at all? Anyway, I guess you 
discussed that with Vinoth, so I will let you folks decide what's better. 
   - Assuming a user might set it to empty: at line 69, where we generate 
partitionPathFields, we trim for empty strings, don't we? So the list should be 
empty in that case, in my understanding (see the sketch below). Correct me if I am wrong. 
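A quick check of the split/trim behavior the comment refers to (plain Java, no Hudi code involved): an empty property value does not yield an empty list, it yields a single empty element, which is exactly what the size() == 1 && isEmpty() guard in the patch matches.

{code:java}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SplitCheck {
  public static void main(String[] args) {
    // Same split/trim pipeline as the CustomKeyGenerator constructor, applied to an empty value.
    List<String> fields = Arrays.stream("".split(",")).map(String::trim).collect(Collectors.toList());
    System.out.println(fields.size());            // 1 -- not 0
    System.out.println(fields.get(0).isEmpty());  // true
  }
}
{code}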





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, pleas

[GitHub] [incubator-hudi] n3nash commented on pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2020-05-17 Thread GitBox


n3nash commented on pull request #1100:
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-629869173


   @yanghua addressed comments, rebased. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[incubator-hudi] branch hudi_test_suite_refactor updated (bbd4429 -> e9ee88c)

2020-05-17 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


 discard bbd4429  [HUDI-394] Provide a basic implementation of test suite
 add 25e0b75  [HUDI-723] Register avro schema if infered from SQL 
transformation (#1518)
 add 2ada2ef  [HUDI-902] Avoid exception when getSchemaProvider (#1584)
 add 148b245  [MINOR] Increase heap space for surefire (#1623)
 add 25a0080  [HUDI-714]Add javadoc and comments to hudi write method link 
(#1409)
 add e9ee88c  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (bbd4429)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (e9ee88c)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../hudi/client/AbstractHoodieWriteClient.java |  6 ++
 .../org/apache/hudi/table/WorkloadProfile.java | 11 +++-
 .../org/apache/hudi/common/model/HoodieKey.java|  1 -
 .../hudi/common/table/HoodieTableMetaClient.java   |  4 +-
 .../org/apache/hudi/integ/ITTestHoodieDemo.java|  4 +-
 .../main/java/org/apache/hudi/DataSourceUtils.java |  7 ++
 .../hudi/utilities/deltastreamer/DeltaSync.java| 74 ++
 .../deltastreamer/HoodieDeltaStreamer.java |  6 +-
 .../HoodieMultiTableDeltaStreamer.java | 15 +++--
 ...Provider.java => DelegatingSchemaProvider.java} | 21 --
 .../apache/hudi/utilities/sources/InputBatch.java  | 24 ++-
 .../hudi/utilities/sources/TestInputBatch.java | 37 +++
 pom.xml|  1 +
 13 files changed, 162 insertions(+), 49 deletions(-)
 copy 
hudi-utilities/src/main/java/org/apache/hudi/utilities/schema/{NullTargetSchemaRegistryProvider.java
 => DelegatingSchemaProvider.java} (61%)
 create mode 100644 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestInputBatch.java



[jira] [Updated] (HUDI-890) Prepare for 0.5.3 patch release

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-890:
---
Description: 
The following commits are included in this release.
 * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break 
the inheritance chain
 * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient
 * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in 
NumericUtils.java
 * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
filterDupes is enabled on UPSERT mode.
 * #1517 HUDI-799 Use appropriate FS when loading configs
 * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema
 * #1394 HUDI-656[Performance] Return a dummy Spark relation after writing the 
DataFrame
 * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode
 * #1421 HUDI-724 Parallelize getSmallFiles for partitions
 * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by 
Date type columns
 * #1413 Add constructor to HoodieROTablePathFilter
 * #1415 HUDI-539 Make ROPathFilter conf member serializable
 * #1578 Add changes for presto mor queries
 * #1506 HUDI-782 Add support of Aliyun object storage service.
 * #1432 HUDI-716 Exception: Not an Avro data file when running 
HoodieCleanClient.runClean
 * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction
 * #1448 [MINOR] Update DOAP with 0.5.2 Release
 * #1466 HUDI-742 Fix Java Math Exception
 * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements.
 * #1427 HUDI-727: Copy default values of fields if not present when rewriting 
incoming record with new schema
 * #1515 HUDI-795 Handle auto-deleted empty aux folder
 * #1547 [MINOR]: Fix cli docs for DeltaStreamer
 * #1580 HUDI-852 adding check for table name for Append Save mode
 * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
HoodieGlobalBloomIndex class
 * #1434 HUDI-616 Fixed parquet files getting created on local FS
 * #1633 HUDI-858 Allow multiple operations to be executed within a single 
commit

 * #1634 HUDI-846 Enable Incremental cleaning and embedded timeline-server by 
default

 * #1596 HUDI-863 get decimal properties from derived spark DataType

 * #1602 HUDI-494 fix incorrect record size estimation

 * #1636 HUDI-895 Remove unnecessary listing .hoodie folder when using timeline 
server

 * #1584 HUDI-902 Avoid exception when getSchemaProvider

 * #1612 HUDI-528 Handle empty commit in incremental pulling

 * #1511 HUDI-789 Adjust logic of upsert in HDFSParquetImporter

 * #1627 HUDI-889 Writer supports useJdbc configuration when hive 
synchronization is enabled

  was:
The following commits are included in this release.
 * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break 
the inheritance chain
 * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient
 * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in 
NumericUtils.java
 * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
filterDupes is enabled on UPSERT mode.
 * #1517 HUDI-799 Use appropriate FS when loading configs
 * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema
 * #1394 HUDI-656[Performance] Return a dummy Spark relation after writing the 
DataFrame
 * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode
 * #1421 HUDI-724 Parallelize getSmallFiles for partitions
 * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by 
Date type columns
 * #1413 Add constructor to HoodieROTablePathFilter
 * #1415 HUDI-539 Make ROPathFilter conf member serializable
 * #1578 Add changes for presto mor queries
 * #1506 HUDI-782 Add support of Aliyun object storage service.
 * #1432 HUDI-716 Exception: Not an Avro data file when running 
HoodieCleanClient.runClean
 * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction
 * #1448 [MINOR] Update DOAP with 0.5.2 Release
 * #1466 HUDI-742 Fix Java Math Exception
 * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements.
 * #1427 HUDI-727: Copy default values of fields if not present when rewriting 
incoming record with new schema
 * #1515 HUDI-795 Handle auto-deleted empty aux folder
 * #1547 [MINOR]: Fix cli docs for DeltaStreamer
 * #1580 HUDI-852 adding check for table name for Append Save mode
 * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
HoodieGlobalBloomIndex class
 * #1434 HUDI-616 Fixed parquet files getting created on local FS
 * #1633 [HUDI-858] Allow multiple operations to be executed within a single 
commit

 * #1634 [HUDI-846][HUDI-848] Enable Incremental cleaning and embedded 
timeline-server by default

 * #1596 [HUDI-863] get decimal properties from derived spark DataType

 * #1602 [HUDI-494] fix incorrect record size estimation

 * #1636 [HUDI-895] Remove unnecessary listing .hoodie folder wh

[incubator-hudi] branch hudi_test_suite_refactor updated (6f4547d -> bbd4429)

2020-05-17 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


omit 6f4547d  [MINOR] Code cleanup for dag package
omit 33590b7  [MINOR] Code cleanup for DeltaConfig
omit 7db66af  [HUDI-394] Provide a basic implementation of test suite
 add 19ca0b5  [HUDI-785] Refactor compaction/savepoint execution based on 
ActionExector abstraction (#1548)
 add 6de9f5d  [HUDI-819] Fix a bug with MergeOnReadLazyInsertIterable.
 add 06dae30  [HUDI-810] Migrate ClientTestHarness to JUnit 5 (#1553)
 add 69b1630  [HUDI-814] Migrate hudi-client tests to JUnit 5 (#1570)
 add 9059bce  [HUDI-702] Add test for HoodieLogFileCommand (#1522)
 add c4b7162  [MINOR] Reorder HoodieTimeline#compareTimestamp arguments for 
better readability (#1575)
 add 506447f  [HUDI-850] Avoid unnecessary listings in incremental cleaning 
mode (#1576)
 add 096f7f5  [HUDI-813] Migrate hudi-utilities tests to JUnit 5 (#1589)
 add 5e0f5e5  [HUDI-852] adding check for table name for Append Save mode  
(#1580)
 add e21441a  Add changes for presto mor queries (#1578)
 add 366bb10  [HUDI-812] Migrate hudi common tests to JUnit 5 (#1590)
 add f921469  [HUDI-704] Add test for RepairsCommand (#1554)
 add e783ab1  [HUDI-784] Adressing issue with log reader on GCS (#1516)
 add d54b4b8  [HUDI-838] Support schema from HoodieCommitMetadata for 
HiveSync (#1559)
 add fa6aba7  [MINOR] fixed building IndexFileFilter with a wrong condition 
in HoodieGlobalBloomIndex class (#1537)
 add f92b9fd  [MINOR] Fix hardcoding of ports in TestHoodieJmxMetrics 
(#1606)
 add 8d0e231  [HUDI-820] cleaner repair command should only inspect clean 
metadata files (#1542)
 add 6dac101  [HUDI-870] Remove spark context in ClientUtils and 
HoodieIndex (#1609)
 add 5d37e66  [MINOR] Fix HoodieNotSupportedException description in 
KafkaOffsetGen  (#1615)
 add 295d00b  [HUDI-880] Replace part of spark context by hadoop 
configuration in HoodieTable. (#1614)
 add b54517a  [HUDI-886] Replace jsc.hadoopConfiguration by hadoop 
configuration in hudi-client testcase (#1621)
 add e8ffc6f  [HUDI-881] Replace part of spark context by hadoop 
configuration in AbstractHoodieClient and HoodieReadClient (#1620)
 add 404c7e8  [HUDI-884] Shade avro and parquet-avro in 
hudi-hive-sync-bundle (#1618)
 add 32ea4c7  [HUDI-869] Add support for alluxio (#1608)
 add 32bada2  [HUDI-889] Writer supports useJdbc configuration when hive 
synchronization is enabled (#1627)
 add 0d4848b  [HUDI-811] Restructure test packages (#1607)
 add 83796b3  [HUDI-793] Adding proper default to hudi metadata fields and 
proper handling to rewrite routine (#1513)
 add 3a2fe13  [HUDI-701] Add unit test for HDFSParquetImportCommand (#1574)
 add f094f42  [HUDI-843] Add ability to specify time unit for  
TimestampBasedKeyGenerator (#1541)
 add a64afdf  HUDI-528 Handle empty commit in incremental pulling (#1612)
 add bbd4429  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (6f4547d)
            \
             N -- N -- N   refs/heads/hudi_test_suite_refactor (bbd4429)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 hudi-cli/pom.xml   |   7 +
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  21 +-
 .../apache/hudi/cli/commands/CommitsCommand.java   |   2 +-
 .../hudi/cli/commands/CompactionCommand.java   |   2 +-
 .../hudi/cli/commands/FileSystemViewCommand.java   |   4 +-
 .../cli/commands/HDFSParquetImportCommand.java |   8 +-
 .../hudi/cli/commands/HoodieLogFileCommand.java|  19 +-
 .../hudi/cli/commands/HoodieSyncCommand.java   |   2 +-
 .../apache/hudi/cli/commands/RepairsCommand.java   |  59 ++-
 .../hudi/cli/commands/SavepointsCommand.java   |   8 +-
 .../org/apache/hudi/cli/commands/SparkMain.java|  35 +-
 .../java/org/apache/hudi/cli/utils/CommitUtil.java |   2 +-
 .../scala/org/apache/hudi/cli/SparkHelpers.scala   |   4 +
 .../hudi/cli/AbstractShellIntegrationTest.java |  17 +-
 .../cli/commands/TestArchivedCommitsCommand.java   |  28 +-
 .../hudi/cli/commands/TestCleansCommand.java   |  49 +-
 .../cli/commands/TestFileSystemViewCommand.java|  42 +-
 .../cli/command

[jira] [Updated] (HUDI-890) Prepare for 0.5.3 patch release

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-890:
---
Description: 
The following commits are included in this release.
 * #1372 HUDI-652 Decouple HoodieReadClient and AbstractHoodieClient to break 
the inheritance chain
 * #1388 HUDI-681 Remove embeddedTimelineService from HoodieReadClient
 * #1350 HUDI-629: Replace Guava's Hashing with an equivalent in 
NumericUtils.java
 * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
filterDupes is enabled on UPSERT mode.
 * #1517 HUDI-799 Use appropriate FS when loading configs
 * #1406 HUDI-713 Fix conversion of Spark array of struct type to Avro schema
 * #1394 HUDI-656[Performance] Return a dummy Spark relation after writing the 
DataFrame
 * #1576 HUDI-850 Avoid unnecessary listings in incremental cleaning mode
 * #1421 HUDI-724 Parallelize getSmallFiles for partitions
 * #1330 HUDI-607 Fix to allow creation/syncing of Hive tables partitioned by 
Date type columns
 * #1413 Add constructor to HoodieROTablePathFilter
 * #1415 HUDI-539 Make ROPathFilter conf member serializable
 * #1578 Add changes for presto mor queries
 * #1506 HUDI-782 Add support of Aliyun object storage service.
 * #1432 HUDI-716 Exception: Not an Avro data file when running 
HoodieCleanClient.runClean
 * #1422 HUDI-400 Check upgrade from old plan to new plan for compaction
 * #1448 [MINOR] Update DOAP with 0.5.2 Release
 * #1466 HUDI-742 Fix Java Math Exception
 * #1416 HUDI-717 Fixed usage of HiveDriver for DDL statements.
 * #1427 HUDI-727: Copy default values of fields if not present when rewriting 
incoming record with new schema
 * #1515 HUDI-795 Handle auto-deleted empty aux folder
 * #1547 [MINOR]: Fix cli docs for DeltaStreamer
 * #1580 HUDI-852 adding check for table name for Append Save mode
 * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
HoodieGlobalBloomIndex class
 * #1434 HUDI-616 Fixed parquet files getting created on local FS
 * #1633 [HUDI-858] Allow multiple operations to be executed within a single 
commit

 * #1634 [HUDI-846][HUDI-848] Enable Incremental cleaning and embedded 
timeline-server by default

 * #1596 [HUDI-863] get decimal properties from derived spark DataType

 * #1602 [HUDI-494] fix incorrect record size estimation

 * #1636 [HUDI-895] Remove unnecessary listing .hoodie folder when using 
timeline server

 * #1584 [HUDI-902] Avoid exception when getSchemaProvider

 * #1612 [HUDI-528] Handle empty commit in incremental pulling

 * #1511 [HUDI-789] Adjust logic of upsert in HDFSParquetImporter

 * #1627 [HUDI-889] Writer supports useJdbc configuration when hive 
synchronization is enabled

  was:
The following commits are included in this release.
 * #1372 [HUDI-652] Decouple HoodieReadClient and AbstractHoodieClient to break 
the inheritance chain
 * #1388 [HUDI-681] Remove embeddedTimelineService from HoodieReadClient
 * #1350 [HUDI-629]: Replace Guava's Hashing with an equivalent in 
NumericUtils.java
 * #1505 [HUDI - 738] Add validation to DeltaStreamer to fail fast when 
filterDupes is enabled on UPSERT mode.
 * #1517 [HUDI-799] Use appropriate FS when loading configs
 * #1406 [HUDI-713] Fix conversion of Spark array of struct type to Avro schema
 * #1394 [HUDI-656][Performance] Return a dummy Spark relation after writing 
the DataFrame
 * #1576 [HUDI-850] Avoid unnecessary listings in incremental cleaning mode
 * #1421 [HUDI-724] Parallelize getSmallFiles for partitions
 * #1330 [HUDI-607] Fix to allow creation/syncing of Hive tables partitioned by 
Date type columns
 * #1413 Add constructor to HoodieROTablePathFilter
 * #1415 [HUDI-539] Make ROPathFilter conf member serializable
 * #1578 Add changes for presto mor queries
 * #1506 [HUDI-782] Add support of Aliyun object storage service.
 * #1432 [HUDI-716] Exception: Not an Avro data file when running 
HoodieCleanClient.runClean
 * #1422 [HUDI-400] Check upgrade from old plan to new plan for compaction
 * #1448 [MINOR] Update DOAP with 0.5.2 Release
 * #1466 [HUDI-742] Fix Java Math Exception
 * #1416 [HUDI-717] Fixed usage of HiveDriver for DDL statements.
 * #1427 [HUDI-727]: Copy default values of fields if not present when 
rewriting incoming record with new schema
 * #1515 [HUDI-795] Handle auto-deleted empty aux folder
 * #1547 [MINOR]: Fix cli docs for DeltaStreamer
 * #1580 [HUDI-852] adding check for table name for Append Save mode
 * #1537 [MINOR] fixed building IndexFileFilter with a wrong condition in 
HoodieGlobalBloomIndex class
 * #1434 [HUDI-616] Fixed parquet files getting created on local FS


> Prepare for 0.5.3 patch release
> ---
>
> Key: HUDI-890
> URL: https://issues.apache.org/jira/browse/HUDI-890
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>Reporter: Bhavani Sudha
>Assignee: Bha

[jira] [Updated] (HUDI-889) Writer supports useJdbc configuration when hive synchronization is enabled

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-889:
---
Fix Version/s: 0.5.3
   0.6.0

> Writer supports useJdbc configuration when hive synchronization is enabled
> --
>
> Key: HUDI-889
> URL: https://issues.apache.org/jira/browse/HUDI-889
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: dzcxzl
>Priority: Trivial
> Fix For: 0.6.0, 0.5.3
>
>
> hudi-hive-sync supports the useJdbc = false configuration, but the writer 
> does not provide this configuration at this stage
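A hedged sketch of how such a writer-side toggle might be used from a Spark datasource write, assuming the option surfaces as hoodie.datasource.hive_sync.use_jdbc (check DataSourceWriteOptions in the linked PR for the exact key); df (a Dataset<Row>) and basePath are placeholders:

{code:java}
// Sketch under the assumptions above, not the exact API introduced by the PR.
df.write().format("hudi")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.hive_sync.enable", "true")
    // sync through the metastore client instead of HiveServer2 JDBC
    .option("hoodie.datasource.hive_sync.use_jdbc", "false")
    .mode(org.apache.spark.sql.SaveMode.Append)
    .save(basePath);
{code}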



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-894) Allow ability to use hive metastore thrift connection to register tables

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-894:
---
Fix Version/s: 0.6.0

> Allow ability to use hive metastore thrift connection to register tables
> 
>
> Key: HUDI-894
> URL: https://issues.apache.org/jira/browse/HUDI-894
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>
> At the moment, we have 2 ways to register the table with HMS 
> 1) Thrift based HMS
> 2) JDBC through hive server
> For secure clusters, the thrift-based HMS works out of the box as long as the 
> correct namespace and connection string are provided; JDBC does not work out of 
> the box. For users who want to register tables in secure clusters, we want to 
> allow the ability to toggle between these two approaches.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-789) Adjust logic of upsert in HDFSParquetImporter

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-789:
---
Fix Version/s: 0.5.3

> Adjust logic of upsert in HDFSParquetImporter
> -
>
> Key: HUDI-789
> URL: https://issues.apache.org/jira/browse/HUDI-789
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Utilities
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In HDFSParquetImporter, upsert is equivalent to insert (remove old metadata, 
> then insert). But upsert means update and insert on old data. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-528) Incremental Pull fails when latest commit is empty

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-528:
---
Fix Version/s: 0.6.0

> Incremental Pull fails when latest commit is empty
> --
>
> Key: HUDI-528
> URL: https://issues.apache.org/jira/browse/HUDI-528
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Incremental Pull
>Reporter: Javier Vega
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0, help-requested, pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>
> When trying to create an incremental view of a dataset, an exception is 
> thrown when the latest commit in the time range is empty. In order to 
> determine the schema of the dataset, Hudi will grab the [latest commit file, 
> parse it, and grab the first metadata file 
> path|https://github.com/apache/incubator-hudi/blob/480fc7869d4d69e1219bf278fd9a37f27ac260f6/hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala#L78-L80].
>  If the latest commit was empty though, the field which is used to determine 
> file paths (partitionToWriteStats) will be empty causing the following 
> exception:
>  
>  
> {code:java}
> java.util.NoSuchElementException
>   at java.util.HashMap$HashIterator.nextNode(HashMap.java:1447)
>   at java.util.HashMap$ValueIterator.next(HashMap.java:1474)
>   at org.apache.hudi.IncrementalRelation.(IncrementalRelation.scala:80)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:65)
>   at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:46)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
>   at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
> {code}
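A minimal sketch of the kind of guard that avoids this, assuming HoodieCommitMetadata#getPartitionToWriteStats() and the 0.5.x timeline API (HoodieInstant, getReverseOrderedInstants, getInstantDetails); this is illustrative and not necessarily the merged fix in #1612:

{code:java}
// Illustrative guard: pick the newest completed commit that actually wrote files
// before using it to derive the schema, instead of blindly reading
// partitionToWriteStats from the very latest (possibly empty) commit.
java.util.Optional<HoodieInstant> schemaInstant =
    timeline.getCommitsTimeline().filterCompletedInstants()
        .getReverseOrderedInstants()
        .filter(instant -> {
          try {
            HoodieCommitMetadata meta = HoodieCommitMetadata.fromBytes(
                timeline.getInstantDetails(instant).get(), HoodieCommitMetadata.class);
            return !meta.getPartitionToWriteStats().isEmpty();  // skip empty commits
          } catch (java.io.IOException e) {
            return false;
          }
        })
        .findFirst();
{code}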



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-902) Avoid exception for getting SchemaProvider when no new input data

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-902:
---
Fix Version/s: 0.6.0

> Avoid exception for getting SchemaProvider when no new input data
> -
>
> Key: HUDI-902
> URL: https://issues.apache.org/jira/browse/HUDI-902
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-494:
---
Fix Version/s: 0.6.0

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: lamber-ken
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0, 0.5.3
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job is writing the files into 
> HDFS. It seems related to the input size: with 7.7 GB of input it was 3.2 
> million tasks, with 9 GB of input it was 3.7 million, both with a parallelism of 10. 
> I am also seeing a huge number of 0-byte files being written into the .hoodie/.temp/ 
> folder in my HDFS. In the Spark UI, each task writes fewer than 10 
> records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers the repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-863) nested structs containing decimal types lead to null pointer exception

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-863:
---
Fix Version/s: 0.5.3

> nested structs containing decimal types lead to null pointer exception
> --
>
> Key: HUDI-863
> URL: https://issues.apache.org/jira/browse/HUDI-863
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Roland Johann
>Assignee: Roland Johann
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>
> Currently the Avro schema gets passed to 
> AvroConversionHelper.createConverterToAvro, which itself processes the passed Spark 
> SQL DataTypes recursively to resolve structs, arrays, etc. The Avro schema 
> gets passed down the recursion, but without selecting the relevant field and 
> therefore the schema of that field. That leads to a null pointer exception when 
> decimal types are processed, because in that case the schema of the 
> field is retrieved by calling getField on the root schema, which is not 
> defined when we deal with nested records.
> [AvroConversionHelper.scala#L291|https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L291]
> The proposed solution is to remove the dependency on the avro schema and 
> derive the particular avro schema for the decimal converter creator case only.
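A small sketch of the proposed direction, assuming the standard Avro LogicalTypes API and Spark's DecimalType; the converter derives the decimal Avro schema from the Spark type itself rather than looking the field up on the (possibly nested) root schema:

{code:java}
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;
import org.apache.spark.sql.types.DecimalType;

class DecimalSchemaSketch {
  // Sketch only: build the Avro schema for a decimal field locally from the Spark DecimalType,
  // avoiding the getField() lookup on the root record schema that causes the NPE for nested structs.
  static Schema decimalAvroSchema(DecimalType decimalType) {
    return LogicalTypes.decimal(decimalType.precision(), decimalType.scale())
        .addToSchema(Schema.create(Schema.Type.BYTES));
  }
}
{code}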



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-616) Parquet files not getting created on DFS docker instance but on local FS in TestHoodieDeltaStreamer

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-616:
---
Fix Version/s: 0.5.3

> Parquet files not getting created on DFS docker instance but on local FS in 
> TestHoodieDeltaStreamer
> ---
>
> Key: HUDI-616
> URL: https://issues.apache.org/jira/browse/HUDI-616
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer, Testing
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In TestHoodieDeltaStreamer, 
> PARQUET_SOURCE_ROOT gets initialized even before the function annotated with 
> @BeforeClass gets called, as below: 
> private static final String PARQUET_SOURCE_ROOT = dfsBasePath + 
> "/parquetFiles";
> At this point, the dfsBasePath variable is still null and, as a result, the parquet 
> files get created on the local FS and need to be cleared manually after testing. 
> This needs to be rectified (a sketch of a possible fix follows below).
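A minimal sketch of the rectification described above (JUnit 4 style, matching the @BeforeClass mentioned; field and method names are placeholders): defer building the path until dfsBasePath has been initialized.

{code:java}
// Sketch only: instead of a static final field evaluated at class-load time,
// assign the path inside the @BeforeClass setup, after dfsBasePath is set.
private static String parquetSourceRoot;

@BeforeClass
public static void initClass() throws Exception {
  // the existing setup initializes dfsBasePath (the DFS docker instance base path) first
  parquetSourceRoot = dfsBasePath + "/parquetFiles";
}
{code}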



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-742) Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-742:
---
Fix Version/s: 0.5.3

> Fix java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> -
>
> Key: HUDI-742
> URL: https://issues.apache.org/jira/browse/HUDI-742
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: lamber-ken
>Assignee: edwinguo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> *ISSUE* : https://github.com/apache/incubator-hudi/issues/1455
> {code:java}
> at org.apache.hudi.client.HoodieWriteClient.upsert(HoodieWriteClient.java:193)
> at org.apache.hudi.DataSourceUtils.doWriteOperation(DataSourceUtils.java:206)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:144)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> ... 49 elided
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 44 in stage 11.0 failed 4 times, most recent failure: Lost task 44.3 in 
> stage 11.0 (TID 975, ip-10-81-135-85.ec2.internal, executor 6): 
> java.lang.NoSuchMethodError: java.lang.Math.floorMod(JI)I
> at 
> org.apache.hudi.index.bloom.BucketizedBloomCheckPartitioner.getPartition(BucketizedBloomCheckPartitioner.java:148)
> at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:123)
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2041)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2029)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2028)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2028)
> at 
> org.apache.spar
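The descriptor in the error, floorMod(JI)I, is the long/int overload of Math.floorMod, which only exists on Java 9+; running bytecode compiled against it on a Java 8 runtime fails with NoSuchMethodError. A Java 8-compatible alternative (illustrative, not necessarily the merged fix) is to use the long/long overload and narrow the result:

{code:java}
// Math.floorMod(long, long) exists since Java 8, while Math.floorMod(long, int)
// was only added in Java 9. hashOfKey (long) and numPartitions (int) are placeholders.
int partition = (int) Math.floorMod(hashOfKey, (long) numPartitions);
{code}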

[jira] [Updated] (HUDI-795) HoodieCommitArchiveLog.deleteAllInstantsOlderorEqualsInAuxMetaFolder breaks when aux is not present

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-795:
---
Fix Version/s: 0.5.3

> HoodieCommitArchiveLog.deleteAllInstantsOlderorEqualsInAuxMetaFolder breaks 
> when aux is not present 
> 
>
> Key: HUDI-795
> URL: https://issues.apache.org/jira/browse/HUDI-795
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Seeing this on GCS. Something removes the aux folder and then the delta streamer 
> fails on this call.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-852) Add validation to check Table name when Append Mode is used in DataSource writer

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-852:
---
Fix Version/s: 0.5.3

> Add validation to check Table name when Append Mode is used in DataSource 
> writer
> 
>
> Key: HUDI-852
> URL: https://issues.apache.org/jira/browse/HUDI-852
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie, Writer Core
>Reporter: Bhavani Sudha
>Assignee: Aakash Pradeep
>Priority: Minor
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>
> Copied from user's description in mailing list:
> Table name is not respected while inserting records with a different table name 
> in Append mode
>  
> {code:java}
> // While running commands from Hudi quick start guide, I found that the
> library does not check for the table name in the request against the table
> name in the metadata available in the base path, I think it should throw
> TableAlreadyExist, In case of Save mode: *overwrite *it warns.
> *spark-2.4.4-bin-hadoop2.7/bin/spark-shell   --packages
> org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
>  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'*
> scala> df.write.format("hudi").
>      |     options(getQuickstartWriteConfigs).
>      |     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>      |     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>      |     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> *     |     option(TABLE_NAME, "test_table").*
>      |     mode(*Append*).
>      |     save(basePath)
> 20/04/29 17:23:42 WARN DefaultSource: Snapshot view not supported yet via
> data source, for MERGE_ON_READ tables. Please query the Hive table
> registered using Spark SQL.
> scala>
> No exception is thrown if we run this
> scala> df.write.format("hudi").
>      |     options(getQuickstartWriteConfigs).
>      |     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>      |     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>      |     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
> *     |     option(TABLE_NAME, "foo_table").*
>      |     mode(*Append*).
>      |     save(basePath)
> 20/04/29 17:24:37 WARN DefaultSource: Snapshot view not supported yet via
> data source, for MERGE_ON_READ tables. Please query the Hive table
> registered using Spark SQL.
> scala>
> scala> df.write.format("hudi").
>      |   options(getQuickstartWriteConfigs).
>      |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
>      |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
>      |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
>      |   option(TABLE_NAME, *tableName*).
>      |   mode(*Overwrite*).
>      |   save(basePath)
> *20/04/29 22:25:16 WARN HoodieSparkSqlWriter$: hoodie table at
> file:/tmp/hudi_trips_cow already exists. Deleting existing data &
> overwriting with new data.*
> 20/04/29 22:25:18 WARN DefaultSource: Snapshot view not supported yet via
> data source, for MERGE_ON_READ tables. Please query the Hive table
> registered using Spark SQL.
> scala>
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-717) Fix HudiHiveClient for Hive 2.x

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-717:
---
Fix Version/s: 0.5.3

> Fix HudiHiveClient for Hive 2.x
> ---
>
> Key: HUDI-717
> URL: https://issues.apache.org/jira/browse/HUDI-717
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>   Original Estimate: 4h
>  Time Spent: 20m
>  Remaining Estimate: 3h 40m
>
> When using the HiveDriver mode in HudiHiveClient, Hive 2.x DDL operations 
> like ALTER may fail. This is because Hive 2.x doesn't like `db`.`table_name` 
> for operations.
> There are two ways to fix this:
> 1. Precede all DDL statements with "USE <database_name>;"
> 2. Set the name of the database in the SessionState created for the Driver 
> (a sketch of both options follows below).
>  
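A hedged sketch of the two options, assuming Hive's org.apache.hadoop.hive.ql.Driver and SessionState APIs; driver, databaseName, and tableName are placeholders:

{code:java}
// Option 1: issue an explicit USE before each DDL statement, so Hive 2.x never sees db.table.
driver.run("USE " + databaseName);
driver.run("ALTER TABLE " + tableName + " ADD COLUMNS (new_col string)");

// Option 2: set the database on the SessionState created for the Driver.
SessionState.get().setCurrentDatabase(databaseName);
{code}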



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-727) Copy default values of fields if not present when rewriting incoming record with new schema

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-727:
---
Fix Version/s: 0.5.3

> Copy default values of fields if not present when rewriting incoming record 
> with new schema
> ---
>
> Key: HUDI-727
> URL: https://issues.apache.org/jira/browse/HUDI-727
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Utilities
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently we recommend that users evolve the schema in a backwards-compatible way. 
> When doing so, one of the most important things is to define a default value for 
> newly added columns, so that records published with the previous schema can still 
> be consumed properly. 
>  
> However, just before actually writing a record to the Hudi dataset, we rewrite the 
> record with the new Avro schema that carries the Hudi metadata columns [1]. That 
> function only reads the values from the record without considering the field's 
> default value, so schema validation fails. IMO, this piece of code should also 
> take the default value into account when the field's actual value is null (a 
> sketch follows below). 
>  
> [1] 
> [https://github.com/apache/incubator-hudi/blob/078d4825d909b2c469398f31c97d2290687321a8/hudi-common/src/main/java/org/apache/hudi/common/util/HoodieAvroUtils.java#L205].
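A minimal sketch of the behavior being asked for, assuming Avro 1.8's Schema.Field#defaultVal() and GenericRecord; oldRecord, newRecord, and newSchema are placeholders, and this is illustrative rather than the exact patch:

{code:java}
// For each field of the new (metadata-bearing) schema, fall back to the field's default
// when the incoming record carries no value, so backwards-compatible defaults survive the rewrite.
for (Schema.Field field : newSchema.getFields()) {
  Object value = oldRecord.get(field.name());
  if (value == null && field.defaultVal() != null) {
    value = field.defaultVal();
  }
  newRecord.put(field.name(), value);
}
{code}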



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-400) Add more checks to TestCompactionUtils#testUpgradeDowngrade

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-400:
---
Fix Version/s: 0.5.3
   0.6.0

> Add more checks to TestCompactionUtils#testUpgradeDowngrade
> ---
>
> Key: HUDI-400
> URL: https://issues.apache.org/jira/browse/HUDI-400
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: newbie, Testing
>Reporter: leesf
>Assignee: jerry
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Currently, the TestCompactionUtils#testUpgradeDowngrade does not check 
> upgrade from old plan to new plan, it is proper to add some checks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-716) Exception: Not an Avro data file when running HoodieCleanClient.runClean

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-716:
---
Fix Version/s: 0.5.3

> Exception: Not an Avro data file when running HoodieCleanClient.runClean
> 
>
> Key: HUDI-716
> URL: https://issues.apache.org/jira/browse/HUDI-716
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
> Attachments: image-2020-03-21-02-45-25-099.png, 
> image-2020-03-21-13-37-17-039.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Just upgraded to upstream master from 0.5 and seeing an issue at the end of 
> the delta sync run: 
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
> 20/03/17 02:13:49 ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down
> org.apache.hudi.exception.HoodieIOException: Not an Avro data file at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:144) 
> at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580) 
> at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86) 
> at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843) 
> at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:395)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) 
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at 
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)Caused by: 
> java.io.IOException: Not an Avro data file at 
> org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50) at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87) 
> at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141) 
> ... 24 more
>  
> It is attempting to read an old cleanup file (2 months old) and crashing.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-782) Add support for aliyun OSS

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-782:
---
Fix Version/s: 0.5.3

> Add support for aliyun OSS
> --
>
> Key: HUDI-782
> URL: https://issues.apache.org/jira/browse/HUDI-782
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: leesf
>Assignee: Hong Shen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Aliyun OSS is a widely used Object Storage Service, and many users use OSS as 
> the backend storage system, so we should support OSS as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-607) Hive sync fails to register tables partitioned by Date Type column

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-607:
---
Fix Version/s: 0.5.3

> Hive sync fails to register tables partitioned by Date Type column
> --
>
> Key: HUDI-607
> URL: https://issues.apache.org/jira/browse/HUDI-607
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> h2. Issue Description
> As part of the Spark-to-Avro conversion, Spark's *Date* type is represented as 
> the corresponding *Date Logical Type* in Avro, which is physically backed by an 
> *Integer*. For this reason, when forming Avro records from Spark rows, the date 
> is converted to the corresponding *epoch day* and stored as an *Integer* value 
> in the parquet files.
> However, this becomes a problem when a *Date* type column is chosen as the 
> partition column. In that case Hudi's partition column *_hoodie_partition_path* 
> also gets the corresponding *epoch day integer* value when the partition field 
> is read from the Avro record, and as a result syncing partitions for the Hudi 
> table issues a command like the following, where the date is an integer:
> {noformat}
> ALTER TABLE uditme_hudi.uditme_hudi_events_cow_feb05_00 ADD IF NOT EXISTS   
> PARTITION (event_date='17897') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17897'
>PARTITION (event_date='17898') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17898'
>PARTITION (event_date='17899') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17899'
>PARTITION (event_date='17900') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_00/17900'{noformat}
> Hive cannot make sense of partition field values like *17897*, because it 
> cannot convert such a string to the corresponding date; it expects the actual 
> date to be represented in string form.
> So, we need to make sure that Hudi's partition field gets the actual date 
> value in string form instead of the integer. This change makes sure that when 
> a field's value is retrieved from the Avro record, we check whether it is of 
> the *Date Logical Type* and, if so, return the actual date value instead of 
> the epoch day. After this change, the command issued for syncing partitions 
> looks like:
> {noformat}
> ALTER TABLE `uditme_hudi`.`uditme_hudi_events_cow_feb05_01` ADD IF NOT EXISTS 
>   PARTITION (`event_date`='2019-01-01') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-01'
>PARTITION (`event_date`='2019-01-02') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-02'
>PARTITION (`event_date`='2019-01-03') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-03'
>PARTITION (`event_date`='2019-01-04') LOCATION 
> 's3://emr-users/uditme/hudi/tables/events/uditme_hudi_events_cow_feb05_01/2019-01-04'{noformat}
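A minimal sketch of the conversion described above (an illustrative helper, not the actual Hudi change; for nullable union schemas the logical type would have to be read from the non-null branch):

{code:java}
import java.time.LocalDate;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class DatePartitionValue {
  // Avro's 'date' logical type stores the value as an int counting days since
  // the epoch; convert it back to an ISO date string so Hive can parse the
  // partition value (e.g. 17897 -> "2019-01-01").
  public static Object toPartitionValue(Schema fieldSchema, Object value) {
    if (value instanceof Integer
        && fieldSchema.getLogicalType() instanceof LogicalTypes.Date) {
      return LocalDate.ofEpochDay((Integer) value).toString();
    }
    return value;
  }
}
{code}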
> h2. Stack Trace
> {noformat}
> 20/01/13 23:28:04 INFO HoodieHiveClient: Last commit time synced is not 
> known, listing all partitions in 
> s3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar,FS
>  :com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem@1f0c8e1f
> 20/01/13 23:28:08 INFO HiveSyncTool: Storage partitions scan complete. Found 
> 31
> 20/01/13 23:28:08 INFO HiveSyncTool: New Partitions [18206, 18207, 18208, 
> 18209, 18210, 18211, 18212, 18213, 18214, 18215, 18216, 18217, 18218, 18219, 
> 18220, 18221, 18222, 18223, 18224, 18225, 18226, 18227, 18228, 18229, 18230, 
> 18231, 18232, 18233, 18234, 18235, 18236]
> 20/01/13 23:28:08 INFO HoodieHiveClient: Adding partitions 31 to table 
> fact_hourly_search_term_conversions_hudi_mor_hudi_jar
> 20/01/13 23:28:08 INFO HoodieHiveClient: Executing SQL ALTER TABLE 
> default.fact_hourly_search_term_conversions_hudi_mor_hudi_jar ADD IF NOT 
> EXISTS   PARTITION (dim_date='18206') LOCATION 
> 's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18206'
>PARTITION (dim_date='18207') LOCATION 
> 's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18207'
>PARTITION (dim_date='18208') LOCATION 
> 's3://feichi-test/fact_hourly_search_term_conversions/merge_on_read_aws_hudi_jar/18208'
>PARTITION (dim_date='18209') LOCATION 
> 's3://feichi-test/fact_hourly_search_term_conversions/merge_o
> n_read_aws_hudi_jar/18209'   PARTITION (d

[jira] [Updated] (HUDI-539) RO Path filter does not pick up hadoop configs from the spark context

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-539:
---
Fix Version/s: 0.5.3

> RO Path filter does not pick up hadoop configs from the spark context
> -
>
> Key: HUDI-539
> URL: https://issues.apache.org/jira/browse/HUDI-539
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.5.1
> Environment: Spark version : 2.4.4
> Hadoop version : 2.7.3
> Databricks Runtime: 6.1
>Reporter: Sam Somuah
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hi,
>  I'm trying to use Hudi to write to one of the Azure storage container file 
> systems, ADLS Gen 2 (abfs://). ABFS:// is one of the whitelisted file schemes. 
> The issue I'm facing is that {{HoodieROTablePathFilter}} resolves the file 
> system for a path by passing in a blank Hadoop configuration. This manifests 
> as {{java.io.IOException: No FileSystem for scheme: abfss}} because none of 
> the configuration from the environment is available.
> The problematic line is
> [https://github.com/apache/incubator-hudi/blob/2bb0c21a3dd29687e49d362ed34f050380ff47ae/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L96]
>  
> {code:java}
>  Stacktrace
>  java.io.IOException: No FileSystem for scheme: abfss
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
>  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>  at 
> org.apache.hudi.hadoop.HoodieROTablePathFilter.accept(HoodieROTablePathFilter.java:96)
>  at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$16.apply(InMemoryFileIndex.scala:349){code}
>  
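A minimal sketch of the idea behind a fix (illustrative only, not the actual patch): resolve the FileSystem with the Hadoop configuration carried by the Spark context rather than a freshly constructed, empty Configuration.

{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkContext;

public class PathFilterFileSystem {
  // Use the configuration already populated by the environment (spark.hadoop.*
  // settings and core-site.xml), so schemes like abfss:// can be resolved.
  public static FileSystem fileSystemFor(Path path, SparkContext sc) throws IOException {
    Configuration hadoopConf = sc.hadoopConfiguration();
    return path.getFileSystem(hadoopConf);
  }
}
{code}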



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-850) Avoid unnecessary listings in incremental cleaning mode

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-850:
---
Fix Version/s: 0.5.3

> Avoid unnecessary listings in incremental cleaning mode
> ---
>
> Key: HUDI-850
> URL: https://issues.apache.org/jira/browse/HUDI-850
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Cleaner, Performance
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>
> Came up during https://github.com/apache/incubator-hudi/issues/1552 
> Even with incremental cleaning turned on, there can be a scenario where there 
> are no commits to clean yet, but we still end up listing needlessly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-724) Parallelize GetSmallFiles For Partitions

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-724:
---
Fix Version/s: 0.5.3

> Parallelize GetSmallFiles For Partitions
> 
>
> Key: HUDI-724
> URL: https://issues.apache.org/jira/browse/HUDI-724
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Feichi Feng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
> Attachments: gap.png, nogapAfterImprovement.png
>
>   Original Estimate: 48h
>  Time Spent: 40m
>  Remaining Estimate: 47h 20m
>
> When writing data, a gap was observed between Spark stages. Tracking down 
> where the time was spent on the Spark driver showed it was the get-small-files 
> operation for the partitions.
> When creating the UpsertPartitioner and trying to assign insert records, it 
> uses a plain for-loop to get the list of small files for every partition the 
> write is going to load data into, and the process is very slow when there are 
> a lot of partitions to go through. While this operation runs on the Spark 
> driver process, all the other worker nodes sit idle waiting for tasks.
> The partitions don't affect each other, so the get-small-files operations can 
> be parallelized. The change I made is to pass the JavaSparkContext to the 
> UpsertPartitioner, create an RDD of the partitions, and eventually fan the 
> get-small-files operations out to multiple tasks.
>  
> Screenshots attached: the gap without the improvement, and the Spark stage 
> with the improvement (no gap).
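A minimal sketch of the parallelization described above (getSmallFiles is a placeholder for the existing per-partition lookup, and the real change has to keep the mapped function serializable):

{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class ParallelSmallFileLookup {
  // Fan the per-partition small-file lookup out as Spark tasks instead of
  // looping sequentially on the driver, then collect the results back.
  public static Map<String, List<String>> smallFilesByPartition(
      JavaSparkContext jsc, List<String> partitionPaths) {
    int parallelism = Math.max(1, Math.min(partitionPaths.size(), 100));
    return jsc.parallelize(partitionPaths, parallelism)
        .mapToPair(partition -> new Tuple2<>(partition, getSmallFiles(partition)))
        .collectAsMap();
  }

  // Placeholder for the logic that lists a partition and keeps only the files
  // below the configured small-file size limit.
  private static List<String> getSmallFiles(String partitionPath) {
    return Collections.emptyList();
  }
}
{code}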



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-713) Datasource Writer throws error on resolving array of struct fields

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-713:
---
Fix Version/s: 0.5.3

> Datasource Writer throws error on resolving array of struct fields
> --
>
> Key: HUDI-713
> URL: https://issues.apache.org/jira/browse/HUDI-713
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Similar to [https://issues.apache.org/jira/browse/HUDI-530]. With migration 
> of Hudi to spark 2.4.4 and using Spark's native spark-avro module, this issue 
> now exists in Hudi master.
> Steps to reproduce:
> Run the following script:
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.hive.MultiPartKeysValueExtractor
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.types._
> import spark.implicits._
> val sample = """
> [{
>   "partition": 0,
>   "offset": 5,
>   "timestamp": "1581508884",
>   "value": {
> "prop1": "val1",
> "prop2": [{"withinProp1": "val2", "withinProp2": 1}]
>   }
> }, {
>   "partition": 1,
>   "offset": 10,
>   "timestamp": "1581108884",
>   "value": {
> "prop1": "val4",
> "prop2": [{"withinProp1": "val5", "withinProp2": 2}]
>   }
> }]
> """
> val df = spark.read.option("dropFieldIfAllNull", 
> "true").json(Seq(sample).toDS)
> val dfcol1 = df.withColumn("op_ts", from_unixtime(col("timestamp")))
> val dfcol2 = dfcol1.withColumn("year_partition", 
> year(col("op_ts"))).withColumn("id", concat($"partition", lit("-"), 
> $"offset"))
> val dfcol3 = dfcol2.drop("timestamp")
> val hudiOptions: Map[String, String] =
> Map[String, String](
> DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY -> "test",
> DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL,
> DataSourceWriteOptions.OPERATION_OPT_KEY -> 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL,
> DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "op_ts",
> DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true",
> DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
> classOf[MultiPartKeysValueExtractor].getName,
> "hoodie.parquet.max.file.size" -> String.valueOf(1024 * 1024 * 
> 1024),
> "hoodie.parquet.compression.ratio" -> String.valueOf(0.5),
> "hoodie.insert.shuffle.parallelism" -> String.valueOf(2)
>   )
> dfcol3.write.format("org.apache.hudi")
>   .options(hudiOptions)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "id")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "year_partition")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "year_partition")
>   .option(HoodieWriteConfig.TABLE_NAME, "AWS_TEST")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, "AWS_TEST")
>   .mode(SaveMode.Append).save("s3://xxx/AWS_TEST/")
> {code}
> It will throw a "Not in union" exception:
> {code:java}
> Caused by: org.apache.avro.UnresolvedUnionException: Not in union 
> [{"type":"record","name":"prop2","namespace":"hoodie.AWS_TEST.AWS_TEST_record.value","fields":[{"name":"withinProp1","type":["string","null"]},{"name":"withinProp2","type":["long","null"]}]},"null"]:
>  {"withinProp1": "val2", "withinProp2": 1}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-738) Add error msg in DeltaStreamer if `filterDupes=true` is enabled for `operation=UPSERT`.

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-738:
---
Fix Version/s: 0.5.3

> Add error msg in DeltaStreamer if  `filterDupes=true` is enabled for 
> `operation=UPSERT`. 
> -
>
> Key: HUDI-738
> URL: https://issues.apache.org/jira/browse/HUDI-738
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: DeltaStreamer, newbie, Usability
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> With filterDupes enabled, incoming records are de-duplicated against the 
> records already in the table, so updates are silently dropped. 
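A minimal sketch of the kind of guard the ticket asks for (method and parameter names are illustrative, not the actual DeltaStreamer config handling):

{code:java}
public class DeltaStreamerConfigChecks {
  // Reject the filterDupes + UPSERT combination up front, since de-duplicating
  // incoming records against the table would silently discard updates.
  public static void validateOperation(boolean filterDupes, String operation) {
    if (filterDupes && "UPSERT".equals(operation)) {
      throw new IllegalArgumentException(
          "--filter-dupes cannot be combined with --op UPSERT, as it would drop updates");
    }
  }
}
{code}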



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-799) DeltaStreamer must use appropriate FS when loading configs

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-799:
---
Fix Version/s: 0.5.3

> DeltaStreamer must use appropriate FS when loading configs
> --
>
> Key: HUDI-799
> URL: https://issues.apache.org/jira/browse/HUDI-799
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-681) Remove the dependency of EmbeddedTimelineService from HoodieReadClient

2020-05-17 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-681:
---
Fix Version/s: 0.5.3
   0.6.0

> Remove the dependency of EmbeddedTimelineService from HoodieReadClient
> --
>
> Key: HUDI-681
> URL: https://issues.apache.org/jira/browse/HUDI-681
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.5.3
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> After decoupling {{HoodieReadClient}} and {{AbstractHoodieClient}}, we can 
> remove the {{EmbeddedTimelineService}} from {{HoodieReadClient}} so that we 
> can move {{HoodieReadClient}} into the hudi-spark module.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1639: [MINOR] Fix apache-rat violations

2020-05-17 Thread GitBox


codecov-io edited a comment on pull request #1639:
URL: https://github.com/apache/incubator-hudi/pull/1639#issuecomment-629859704


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=h1) 
Report
   > Merging 
[#1639](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/25a0080b2f6ddce0e528b2a72aea33a565f0e565&el=desc)
 will **increase** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1639/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1639  +/-   ##
   
   + Coverage 16.71%   16.72%   +0.01% 
   - Complexity  795  796   +1 
   
 Files   340  340  
 Lines 1503015030  
 Branches   1499 1499  
   
   + Hits   2512 2514   +2 
   + Misses1218812186   -2 
 Partials330  330  
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/common/util/ObjectSizeCalculator.java](https://codecov.io/gh/apache/incubator-hudi/pull/1639/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvT2JqZWN0U2l6ZUNhbGN1bGF0b3IuamF2YQ==)
 | `77.61% <ø> (ø)` | `25.00 <0.00> (ø)` | |
   | 
[...apache/hudi/common/fs/HoodieWrapperFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1639/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0hvb2RpZVdyYXBwZXJGaWxlU3lzdGVtLmphdmE=)
 | `22.69% <0.00%> (+0.70%)` | `29.00% <0.00%> (+1.00%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=footer).
 Last update 
[25a0080...b50fc84](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] codecov-io commented on pull request #1639: [MINOR] Fix apache-rat violations

2020-05-17 Thread GitBox


codecov-io commented on pull request #1639:
URL: https://github.com/apache/incubator-hudi/pull/1639#issuecomment-629859704


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=h1) 
Report
   > Merging 
[#1639](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/25a0080b2f6ddce0e528b2a72aea33a565f0e565&el=desc)
 will **increase** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1639/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1639  +/-   ##
   
   + Coverage 16.71%   16.72%   +0.01% 
   - Complexity  795  796   +1 
   
 Files   340  340  
 Lines 1503015030  
 Branches   1499 1499  
   
   + Hits   2512 2514   +2 
   + Misses1218812186   -2 
 Partials330  330  
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/common/util/ObjectSizeCalculator.java](https://codecov.io/gh/apache/incubator-hudi/pull/1639/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvT2JqZWN0U2l6ZUNhbGN1bGF0b3IuamF2YQ==)
 | `77.61% <ø> (ø)` | `25.00 <0.00> (ø)` | |
   | 
[...apache/hudi/common/fs/HoodieWrapperFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1639/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0hvb2RpZVdyYXBwZXJGaWxlU3lzdGVtLmphdmE=)
 | `22.69% <0.00%> (+0.70%)` | `29.00% <0.00%> (+1.00%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=footer).
 Last update 
[25a0080...b50fc84](https://codecov.io/gh/apache/incubator-hudi/pull/1639?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] jfrazee opened a new pull request #1639: [MINOR] Fix apache-rat violations

2020-05-17 Thread GitBox


jfrazee opened a new pull request #1639:
URL: https://github.com/apache/incubator-hudi/pull/1639


   This fixes a few apache-rat violations and adds exclusions for the GitHub PR 
template type stuff. Note there is already a general attribution for Twitter in 
the NOTICE so I don't think we need to add another.
   
   ```

hudi-common/src/main/java/org/apache/hudi/common/util/ObjectSizeCalculator.java

hudi-utilities/src/main/java/org/apache/hudi/utilities/exception/HoodieSnapshotExporterException.java

hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestInputBatch.java
   .github/**
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-17 Thread GitBox


codecov-io edited a comment on pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633#issuecomment-629848790


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=h1) 
Report
   > Merging 
[#1633](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/25e0b75b3d03b6d460dc18d1a5fce7b881b0e019&el=desc)
 will **decrease** coverage by `55.07%`.
   > The diff coverage is `41.17%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1633/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#1633   +/-   ##
   =
   - Coverage 71.81%   16.73%   -55.08% 
   + Complexity 1092  798  -294 
   =
 Files   386  340   -46 
 Lines 1660815042 -1566 
 Branches   1667 1501  -166 
   =
   - Hits  11927 2518 -9409 
   - Misses 395512192 +8237 
   + Partials726  332  -394 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...che/hudi/table/action/commit/BulkInsertHelper.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9CdWxrSW5zZXJ0SGVscGVyLmphdmE=)
 | `0.00% <0.00%> (-85.00%)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `42.97% <25.00%> (-41.87%)` | `48.00 <1.00> (+1.00)` | :arrow_down: |
   | 
[...di/common/table/timeline/HoodieActiveTimeline.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZUFjdGl2ZVRpbWVsaW5lLmphdmE=)
 | `28.49% <44.44%> (-54.40%)` | `17.00 <1.00> (+1.00)` | :arrow_down: |
   | 
[.../table/action/commit/BaseCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9CYXNlQ29tbWl0QWN0aW9uRXhlY3V0b3IuamF2YQ==)
 | `46.01% <100.00%> (-38.81%)` | `14.00 <0.00> (ø)` | |
   | 
[...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [310 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `

[GitHub] [incubator-hudi] codecov-io commented on pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-17 Thread GitBox


codecov-io commented on pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633#issuecomment-629848790


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=h1) 
Report
   > Merging 
[#1633](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/25e0b75b3d03b6d460dc18d1a5fce7b881b0e019&el=desc)
 will **decrease** coverage by `55.07%`.
   > The diff coverage is `41.17%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1633/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=tree)
   
   ```diff
   @@  Coverage Diff  @@
   ## master#1633   +/-   ##
   =
   - Coverage 71.81%   16.73%   -55.08% 
   + Complexity 1092  798  -294 
   =
 Files   386  340   -46 
 Lines 1660815042 -1566 
 Branches   1667 1501  -166 
   =
   - Hits  11927 2518 -9409 
   - Misses 395512192 +8237 
   + Partials726  332  -394 
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...che/hudi/table/action/commit/BulkInsertHelper.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9CdWxrSW5zZXJ0SGVscGVyLmphdmE=)
 | `0.00% <0.00%> (-85.00%)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `42.97% <25.00%> (-41.87%)` | `48.00 <1.00> (+1.00)` | :arrow_down: |
   | 
[...di/common/table/timeline/HoodieActiveTimeline.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZUFjdGl2ZVRpbWVsaW5lLmphdmE=)
 | `28.49% <44.44%> (-54.40%)` | `17.00 <1.00> (+1.00)` | :arrow_down: |
   | 
[.../table/action/commit/BaseCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9CYXNlQ29tbWl0QWN0aW9uRXhlY3V0b3IuamF2YQ==)
 | `46.01% <100.00%> (-38.81%)` | `14.00 <0.00> (ø)` | |
   | 
[...n/java/org/apache/hudi/io/AppendHandleFactory.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vQXBwZW5kSGFuZGxlRmFjdG9yeS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [310 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1633/diff?src=pr&el=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1633?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = mis

[GitHub] [incubator-hudi] bvaradar commented on pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-17 Thread GitBox


bvaradar commented on pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633#issuecomment-629844386


   @leesf : Addressed review comments.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-17 Thread GitBox


bvaradar commented on a change in pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633#discussion_r426293139



##
File path: 
hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientOnCopyOnWriteStorage.java
##
@@ -988,6 +988,44 @@ public void testRollbackAfterConsistencyCheckFailure() 
throws Exception {
 return Pair.of(markerFilePath, result);
   }
 
+  @Test
+  public void testMultiOperationsPerCommit() throws IOException {
+HoodieWriteConfig cfg = getConfigBuilder().withAutoCommit(false)
+.withAllowUnsafeMultiOperationsPerCommit(true)
+.build();
+HoodieWriteClient client = getHoodieWriteClient(cfg);
+String firstInstantTime = "";
+client.startCommitWithTime(firstInstantTime);
+int numRecords = 200;
+JavaRDD writeRecords = 
jsc.parallelize(dataGen.generateInserts(firstInstantTime, numRecords), 1);
+JavaRDD result = client.bulkInsert(writeRecords, 
firstInstantTime);
+assertTrue(client.commit(firstInstantTime, result), "Commit should 
succeed");
+assertTrue(HoodieTestUtils.doesCommitExist(basePath, firstInstantTime),
+"After explicit commit, commit file should be created");
+
+// Check the entire dataset has all records still
+String[] fullPartitionPaths = new 
String[dataGen.getPartitionPaths().length];
+for (int i = 0; i < fullPartitionPaths.length; i++) {
+  fullPartitionPaths[i] = String.format("%s/%s/*", basePath, 
dataGen.getPartitionPaths()[i]);
+}
+assertEquals(numRecords,
+HoodieClientTestUtils.read(jsc, basePath, sqlContext, fs, 
fullPartitionPaths).count(),
+"Must contain " + numRecords + " records");
+
+String nextInstantTime = "0001";
+client.startCommitWithTime(nextInstantTime);
+JavaRDD updateRecords = 
jsc.parallelize(dataGen.generateUpdates(nextInstantTime, numRecords), 1);
+JavaRDD insertRecords = 
jsc.parallelize(dataGen.generateInserts(nextInstantTime, numRecords), 1);
+JavaRDD inserts = client.bulkInsert(insertRecords, 
nextInstantTime);
+JavaRDD upserts = client.upsert(updateRecords, 
nextInstantTime);

Review comment:
   Fixed.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1633: [HUDI-858] Allow multiple operations to be executed within a single commit

2020-05-17 Thread GitBox


bvaradar commented on a change in pull request #1633:
URL: https://github.com/apache/incubator-hudi/pull/1633#discussion_r426290846



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
##
@@ -322,7 +327,15 @@ private void transitionState(HoodieInstant fromInstant, 
HoodieInstant toInstant,
 ValidationUtils.checkArgument(metaClient.getFs().exists(new 
Path(metaClient.getMetaPath(),
 fromInstant.getFileName(;
 // Use Write Once to create Target File
-createImmutableFileInPath(new Path(metaClient.getMetaPath(), 
toInstant.getFileName()), data);
+if (allowRedundantTransitions) {
+  createFileInPath(new Path(metaClient.getMetaPath(), 
toInstant.getFileName()), data);
+} else {
+  if (allowRedundantTransitions) {
+createFileInPath(new Path(metaClient.getMetaPath(), 
toInstant.getFileName()), data);

Review comment:
   Thanks for catching it. It was a bad merge.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-110) Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer

2020-05-17 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-110:

Status: In Progress  (was: Open)

> Better defaults for Partition extractor for Spark DataSOurce and DeltaStreamer
> --
>
> Key: HUDI-110
> URL: https://issues.apache.org/jira/browse/HUDI-110
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, Spark Integration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: bug-bash-0.6.0
>
> Currently, SlashEncodedDayPartitionValueExtractor is the default being used. 
> This is not a common format outside Uber.
>  
> Also, Spark DataSource provides a partitionBy clause, which has not been 
> integrated for the Hudi Data Source. We need to investigate how we can 
> leverage the partitionBy clause for partitioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io edited a comment on pull request #1624: [HUDI-706]Add unit test for SavepointsCommand

2020-05-17 Thread GitBox


codecov-io edited a comment on pull request #1624:
URL: https://github.com/apache/incubator-hudi/pull/1624#issuecomment-629604493


   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1624?src=pr&el=h1) 
Report
   > Merging 
[#1624](https://codecov.io/gh/apache/incubator-hudi/pull/1624?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/2ada2ef50fc373ed3083d0e7a96e5e644be52bfb&el=desc)
 will **decrease** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1624/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1624?src=pr&el=tree)
   
   ```diff
   @@ Coverage Diff  @@
   ## master#1624  +/-   ##
   
   - Coverage 16.72%   16.71%   -0.02% 
   + Complexity  796  795   -1 
   
 Files   340  340  
 Lines 1503015030  
 Branches   1499 1499  
   
   - Hits   2514 2512   -2 
   - Misses1218612188   +2 
 Partials330  330  
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1624?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/common/fs/HoodieWrapperFileSystem.java](https://codecov.io/gh/apache/incubator-hudi/pull/1624/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0hvb2RpZVdyYXBwZXJGaWxlU3lzdGVtLmphdmE=)
 | `21.98% <0.00%> (-0.71%)` | `28.00% <0.00%> (-1.00%)` | |
   | 
[...src/main/java/org/apache/hudi/DataSourceUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1624/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9EYXRhU291cmNlVXRpbHMuamF2YQ==)
 | `32.32% <0.00%> (ø)` | `0.00% <0.00%> (ø%)` | |
   | 
[...in/java/org/apache/hudi/table/WorkloadProfile.java](https://codecov.io/gh/apache/incubator-hudi/pull/1624/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvV29ya2xvYWRQcm9maWxlLmphdmE=)
 | `87.50% <0.00%> (ø)` | `9.00% <0.00%> (ø%)` | |
   | 
[...n/java/org/apache/hudi/common/model/HoodieKey.java](https://codecov.io/gh/apache/incubator-hudi/pull/1624/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUtleS5qYXZh)
 | `38.88% <0.00%> (ø)` | `4.00% <0.00%> (ø%)` | |
   | 
[.../apache/hudi/client/AbstractHoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1624/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllV3JpdGVDbGllbnQuamF2YQ==)
 | `52.83% <0.00%> (ø)` | `13.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1624?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1624?src=pr&el=footer).
 Last update 
[2ada2ef...91cbc59](https://codecov.io/gh/apache/incubator-hudi/pull/1624?src=pr&el=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] xushiyan commented on pull request #1592: [Hudi-69] Spark Datasource for MOR table

2020-05-17 Thread GitBox


xushiyan commented on pull request #1592:
URL: https://github.com/apache/incubator-hudi/pull/1592#issuecomment-629812714


   @garyli1019 sad to see this weird NPE persists. It would be helpful to 
enable debug mode in Travis, ssh into the container, and then investigate.
   
   
https://docs.travis-ci.com/user/running-build-in-debug-mode/#enabling-debug-mode
   
   > For public repositories, we have to enable it on a repository basis.
   To enable debug for your public repositories, please email us at 
supp...@travis-ci.com and let us know which repositories you want activated.
   
   @bhasudha @vinothchandar This is difficult to troubleshoot as the local tests 
are passing. As the docs describe, could we try to enable this debugging feature?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1565: [HUDI-73]: implemented vanilla AvroKafkaSource

2020-05-17 Thread GitBox


pratyakshsharma commented on pull request #1565:
URL: https://github.com/apache/incubator-hudi/pull/1565#issuecomment-629811530


   So handling schema evolutions without schema-registry is going to be really 
tricky. I tried googling around this stuff, and found the below 2 links. These 
might be useful in what we want to achieve - 
   
   1. 
https://stackoverflow.com/questions/37290303/producing-and-consuming-avro-messages-from-kafka-without-confluent-components
   2. https://github.com/farmdawgnation/registryless-avro-converter
   
   In particular, the second repository aims at serializing and deserializing 
avro data without schema-registry using Confluent and Avro libraries. At a high 
level, it looks like they are also not handling schema evolution in their code. 
I would need some time to go through it in depth though. 
   Also if you see the description of jira 
(https://issues.apache.org/jira/browse/HUDI-73), it mentions integration of 
AvroKafkaSource with FilebasedSchemaProvider (which is what is done in this PR 
:) ). If we really want to integrate it with FilebasedSchemaProvider, then I do 
not think it is feasible to handle schema evolution, since as a user, one 
cannot keep on changing schema files on every evolution. Thoughts? 
@vinothchandar 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] hddong commented on a change in pull request #1624: [HUDI-706]Add unit test for SavepointsCommand

2020-05-17 Thread GitBox


hddong commented on a change in pull request #1624:
URL: https://github.com/apache/incubator-hudi/pull/1624#discussion_r426268423



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##
@@ -281,6 +292,19 @@ private static int rollback(JavaSparkContext jsc, String 
instantTime, String bas
 }
   }
 
+  private static int createSavepoint(JavaSparkContext jsc, String commitTime, 
String user,
+  String comments, String basePath) throws Exception {
+HoodieWriteClient client = createHoodieClient(jsc, basePath);
+try {
+  client.savepoint(commitTime, user, comments);
+  LOG.info(String.format("The commit \"%s\" has been savepointed.", 
commitTime));
+  return 0;
+} catch (HoodieSavepointException se) {
+  LOG.info(String.format("Failed: Could not create savepoint \"%s\".", 
commitTime));

Review comment:
   > I think we could change to log.warn, wdyt?
   
   Yes, warn is better here.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nsivabalan edited a comment on issue #1625: [SUPPORT] MOR upsert table grows in size when ingesting same records

2020-05-17 Thread GitBox


nsivabalan edited a comment on issue #1625:
URL: https://github.com/apache/incubator-hudi/issues/1625#issuecomment-629790932


   @rolandjohann : I couldn't repro the ever-growing hudi table. Maybe I am 
missing something. Can you try my code below and let us know what you see. 
   @bvaradar : Can you think of any reason why roland is seeing the ever-growing 
hudi table?
   
   My initial insert (100k records) took 14MB in hudi. 
   A single batch of updates (2k records) takes 165KB on disk if I write it as 
plain parquet. 
   
   Here are my disk sizes after same batch updates repeatedly.
   
   | Round No | Total disk size (du -s -h basePath)|
   |--|--|
   |1 | 23Mb |
   |2 | 24 Mb |
   | 3| 34 Mb |
   |4 | 35Mb |
   | 5 | 46Mb |
   | 6 | 46Mb |
   | 7 | 43Mb |
   | 8 | 44Mb |
   | 9 | 43Mb |
   | 10 | 44Mb |
   | 11 | 45Mb|
   | 12 | 46Mb |
   
   
   Code to reproduce: 
   
   ```
   // spark-shell
   spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
 --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   ```
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   val tableName = "hudi_trips_mor"
   val basePath = "file:///tmp/hudi_trips_mor"
   val basePathParquet = "file:///tmp/parquet"
   val dataGen = new DataGenerator
   
   val inserts = convertToStringList(dataGen.generateInserts(10))
   val dfInsert = spark.read.json(spark.sparkContext.parallelize(inserts, 10))
   
dfInsert.write.format("hudi").options(getQuickstartWriteConfigs).option(PRECOMBINE_FIELD_OPT_KEY,
 "ts").option(RECORDKEY_FIELD_OPT_KEY, 
"uuid").option(PARTITIONPATH_FIELD_OPT_KEY, 
"partitionpath").option(STORAGE_TYPE_OPT_KEY, 
"MERGE_ON_READ").option(TABLE_NAME, tableName).mode(Append).save(basePath)
   
   val updates = convertToStringList(dataGen.generateUpdates(2000))
   val dfUpdates = spark.read.json(spark.sparkContext.parallelize(updates, 2))
   
   
dfUpdates.coalesce(1).write.format("parquet").mode(Append).save(basePathParquet)
   
   
dfUpdates.coalesce(1).write.format("org.apache.hudi").option("hoodie.insert.shuffle.parallelism",
 "2").option("hoodie.upsert.shuffle.parallelism", 
"2").option("hoodie.cleaner.commits.retained", 
"3").option("hoodie.cleaner.fileversions.retained", 
"2").option("hoodie.compact.inline", 
"true").option("hoodie.compact.inline.max.delta.commits", 
"2").option(OPERATION_OPT_KEY, 
UPSERT_OPERATION_OPT_VAL).option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL) 
.option(RECORDKEY_FIELD_OPT_KEY, "uuid").option(PARTITIONPATH_FIELD_OPT_KEY, 
"partitionpath").option(PRECOMBINE_FIELD_OPT_KEY, "ts").option(TABLE_NAME, 
tableName).mode(Append).save(basePath)
   
   
dfUpdates.coalesce(1).write.format("org.apache.hudi").option("hoodie.insert.shuffle.parallelism",
 "2").option("hoodie.upsert.shuffle.parallelism", 
"2").option("hoodie.cleaner.commits.retained", 
"3").option("hoodie.cleaner.fileversions.retained", 
"2").option("hoodie.compact.inline", 
"true").option("hoodie.compact.inline.max.delta.commits", 
"2").option(OPERATION_OPT_KEY, 
UPSERT_OPERATION_OPT_VAL).option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL) 
.option(RECORDKEY_FIELD_OPT_KEY, "uuid").option(PARTITIONPATH_FIELD_OPT_KEY, 
"partitionpath").option(PRECOMBINE_FIELD_OPT_KEY, "ts").option(TABLE_NAME, 
tableName).mode(Append).save(basePath)
   ```
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-794) Add support for maintaining separate table wise configs in HoodieDeltaStreamer similar to HoodieMultiTableDeltaStreamer

2020-05-17 Thread Pratyaksh Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pratyaksh Sharma closed HUDI-794.
-
Resolution: Fixed

This feature is only over-complicating the flow. Closing it. 

> Add support for maintaining separate table wise configs in 
> HoodieDeltaStreamer similar to HoodieMultiTableDeltaStreamer
> ---
>
> Key: HUDI-794
> URL: https://issues.apache.org/jira/browse/HUDI-794
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>
> Basically this requires introduction of --config-folder in 
> HoodieDeltaStreamer.Config class similar to how we have in 
> HoodieMultiTableDeltaStreamer.Config class. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nsivabalan commented on issue #1625: [SUPPORT] MOR upsert table grows in size when ingesting same records

2020-05-17 Thread GitBox


nsivabalan commented on issue #1625:
URL: https://github.com/apache/incubator-hudi/issues/1625#issuecomment-629790932


   @bvaradar : Tried to reproduce locally and couldn't. Is there a chance of 
some data skew?
   @rolandjohann : I couldn't repro the ever-growing hudi table. Maybe I am 
missing something. Can you try my code below and let us know what you see. 
   
   My initial insert (100k records) took 14MB in hudi. 
   A single batch of updates (2k records) takes 165KB on disk if I write it as 
plain parquet. 
   
   Here are my disk sizes after same batch updates repeatedly.
   
   | Round No | Total disk size (du -s -h basePath)|
   |--|--|
   |1 | 23Mb |
   |2 | 24 Mb |
   | 3| 34 Mb |
   |4 | 35Mb |
   | 5 | 46Mb |
   | 6 | 46Mb |
   | 7 | 43Mb |
   | 8 | 44Mb |
   | 9 | 43Mb |
   | 10 | 44Mb |
   | 11 | 45Mb|
   | 12 | 46Mb |
   
   
   Code to reproduce: 
   
   ```
   // spark-shell
   spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
 --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
 \
 --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
   ```
   ```
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   val tableName = "hudi_trips_mor"
   val basePath = "file:///tmp/hudi_trips_mor"
   val basePathParquet = "file:///tmp/parquet"
   val dataGen = new DataGenerator
   
   val inserts = convertToStringList(dataGen.generateInserts(10))
   val dfInsert = spark.read.json(spark.sparkContext.parallelize(inserts, 10))
   
dfInsert.write.format("hudi").options(getQuickstartWriteConfigs).option(PRECOMBINE_FIELD_OPT_KEY,
 "ts").option(RECORDKEY_FIELD_OPT_KEY, 
"uuid").option(PARTITIONPATH_FIELD_OPT_KEY, 
"partitionpath").option(STORAGE_TYPE_OPT_KEY, 
"MERGE_ON_READ").option(TABLE_NAME, tableName).mode(Append).save(basePath)
   
   val updates = convertToStringList(dataGen.generateUpdates(2000))
   val dfUpdates = spark.read.json(spark.sparkContext.parallelize(updates, 2))
   
   
dfUpdates.coalesce(1).write.format("parquet").mode(Append).save(basePathParquet)
   
   
   dfUpdates.coalesce(1).write.format("org.apache.hudi").
     option("hoodie.insert.shuffle.parallelism", "2").
     option("hoodie.upsert.shuffle.parallelism", "2").
     option("hoodie.cleaner.commits.retained", "3").
     option("hoodie.cleaner.fileversions.retained", "2").
     option("hoodie.compact.inline", "true").
     option("hoodie.compact.inline.max.delta.commits", "2").
     option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
     option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     option(TABLE_NAME, tableName).
     mode(Append).save(basePath)
   
   
   dfUpdates.coalesce(1).write.format("org.apache.hudi").
     option("hoodie.insert.shuffle.parallelism", "2").
     option("hoodie.upsert.shuffle.parallelism", "2").
     option("hoodie.cleaner.commits.retained", "3").
     option("hoodie.cleaner.fileversions.retained", "2").
     option("hoodie.compact.inline", "true").
     option("hoodie.compact.inline.max.delta.commits", "2").
     option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
     option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
     option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     option(TABLE_NAME, tableName).
     mode(Append).save(basePath)
   ```
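   
   To automate the rounds in the table above, here is a small follow-up sketch (not 
from the original comment; it assumes the same spark-shell session and the 
`dfUpdates`, `tableName` and `basePath` values defined earlier, and the sizes you 
get may differ):
   
   ```
   // Hypothetical helper: repeats the same upsert and prints the base-path size per
   // round, mirroring `du -s -h basePath` from the table above.
   import org.apache.hadoop.fs.{FileSystem, Path}

   def dirSizeMb(path: String): Long = {
     val p = new Path(path)
     val fs = p.getFileSystem(spark.sparkContext.hadoopConfiguration)
     fs.getContentSummary(p).getLength / (1024 * 1024)
   }

   (1 to 12).foreach { round =>
     dfUpdates.coalesce(1).write.format("org.apache.hudi").
       option("hoodie.insert.shuffle.parallelism", "2").
       option("hoodie.upsert.shuffle.parallelism", "2").
       option("hoodie.cleaner.commits.retained", "3").
       option("hoodie.cleaner.fileversions.retained", "2").
       option("hoodie.compact.inline", "true").
       option("hoodie.compact.inline.max.delta.commits", "2").
       option(OPERATION_OPT_KEY, UPSERT_OPERATION_OPT_VAL).
       option(TABLE_TYPE_OPT_KEY, MOR_TABLE_TYPE_OPT_VAL).
       option(RECORDKEY_FIELD_OPT_KEY, "uuid").
       option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
       option(PRECOMBINE_FIELD_OPT_KEY, "ts").
       option(TABLE_NAME, tableName).
       mode(Append).save(basePath)
     println(s"Round $round: ${dirSizeMb(basePath)} MB")
   }
   ```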
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-859) Improve documentation around key generators

2020-05-17 Thread Pratyaksh Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17109452#comment-17109452
 ] 

Pratyaksh Sharma commented on HUDI-859:
---

[~hongdongdong] Do you want to work on this, or should I take it to closure? If 
you want, let us connect on Slack; that is a good platform for one-on-one 
discussions. :)

> Improve documentation around key generators
> ---
>
> Key: HUDI-859
> URL: https://issues.apache.org/jira/browse/HUDI-859
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Pratyaksh Sharma
>Assignee: hong dongdong
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> Proper documentation is required to help users understand which key 
> generators are currently supported, how to use them, etc.
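
As an illustration of what such documentation could cover, here is a minimal sketch of 
picking a key generator through the Spark datasource (illustrative only: the 
key-generator class and its package have moved between releases, and `df` stands for 
any input DataFrame):

```
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.spark.sql.SaveMode._

df.write.format("hudi").
  option(RECORDKEY_FIELD_OPT_KEY, "uuid,source").           // composite record key
  option(PARTITIONPATH_FIELD_OPT_KEY, "region,country").    // multi-level partition path
  option("hoodie.datasource.write.keygenerator.class",
    "org.apache.hudi.keygen.ComplexKeyGenerator").          // assumed class name/package
  option(PRECOMBINE_FIELD_OPT_KEY, "ts").
  option(TABLE_NAME, "key_generator_demo").
  mode(Append).save("file:///tmp/key_generator_demo")
```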



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1597: [WIP] Added a MultiFormatTimestampBasedKeyGenerator that allows for multipl…

2020-05-17 Thread GitBox


pratyakshsharma commented on pull request #1597:
URL: https://github.com/apache/incubator-hudi/pull/1597#issuecomment-629788395


   I have included the changes from this PR in 
https://github.com/apache/incubator-hudi/pull/1433. 
   I guess we can close this one now @bhasudha @vinothchandar 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] pratyakshsharma commented on pull request #1433: [HUDI-728]: Implement custom key generator

2020-05-17 Thread GitBox


pratyakshsharma commented on pull request #1433:
URL: https://github.com/apache/incubator-hudi/pull/1433#issuecomment-629787509


   @nsivabalan I have tried to include the changes from 
https://github.com/apache/incubator-hudi/pull/1597 in this PR as well. Please take 
a pass.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-05-17 Thread GitBox


nsivabalan commented on a change in pull request #1469:
URL: https://github.com/apache/incubator-hudi/pull/1469#discussion_r426250288



##
File path: 
hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndexV2.java
##
@@ -0,0 +1,223 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.client.utils.LazyIterableIterator;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.HoodieBloomRangeInfoHandle;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import scala.Tuple2;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Simplified re-implementation of {@link HoodieBloomIndex} that does not rely on
+ * caching or incur the overhead of auto-tuning parallelism.
+ */
+public class HoodieGlobalBloomIndexV2<T extends HoodieRecordPayload> extends HoodieBloomIndexV2<T> {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieGlobalBloomIndexV2.class);
+
+  public HoodieGlobalBloomIndexV2(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
+      JavaSparkContext jsc,
+      HoodieTable<T> hoodieTable) {
+    return recordRDD
+        .sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()),
+            true, config.getBloomIndexV2Parallelism())
+        .mapPartitions((itr) -> new LazyRangeAndBloomChecker(itr, hoodieTable)).flatMap(List::iterator)
+        .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
+        .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
+        .filter(Option::isPresent)
+        .map(Option::get);
+  }
+
+  /**
+   * This is not global, since we depend on the partitionPath to do the lookup.
+   */
+  @Override
+  public boolean isGlobal() {
+    return true;
+  }
+
+  /**
+   * Given an iterator of hoodie records, returns a list of candidate (HoodieRecord, fileId)
+   * pairs by filtering on ranges and bloom filters for all records against all fileIds.
+   */
+  class LazyRangeAndBloomChecker extends
+      LazyIterableIterator<HoodieRecord<T>, List<Pair<HoodieRecord<T>, String>>> {
+
+    private HoodieTable<T> table;
+    private List<Pair<String, String>> partitionPathFileIDList;
+    private IndexFileFilter indexFileFilter;
+    private ExternalSpillableMap<String, BloomFilter> fileIDToBloomFilter;
+    private HoodieTimer hoodieTimer;
+    private long totalTimeMs;
+    private long totalCount;
+    private long totalMetadataReadTimeMs;
+    private long totalRangeCheckTimeMs;
+    private long totalBloomCheckTimeMs;
+    private long totalMatches;
+
+    public LazyRangeAndBloomChecker(Iterator<HoodieRecord<T>> in, final HoodieTable<T> table) {
+      super(in);
+      this.table = table;
+    }
+
+    @Override
+    protected List<Pair<HoodieRecord<T>, String>> computeNext() {
+
+      List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
+      if (!inputItr.hasNext()) {
+        return candidates;
+      }
+
+      HoodieRecord<T> record = inputItr.next();
+
+      //
+      hoodieTimer.startTimer();
+      Set<Pair<String, String>> matchingFiles = indexFileFilter

Review comment:
   bottom line, we need to consider all partitions
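
For context on that point, a minimal sketch (in Scala, not taken from the PR) of the 
difference between a partition-scoped lookup and the global lookup a class like 
HoodieGlobalBloomIndexV2 needs; the names and types here are illustrative only:

```
// Illustrative only: a global index must match an incoming key against candidate
// files in every partition, while a non-global index only looks under the record's
// own partitionPath.
def candidateFiles(
    partitionToFileIds: Map[String, Set[String]],  // partitionPath -> fileIds
    recordPartitionPath: String,
    global: Boolean): Set[(String, String)] = {
  if (global) {
    partitionToFileIds.toSeq.flatMap { case (p, ids) => ids.map(id => (p, id)) }.toSet
  } else {
    partitionToFileIds.getOrElse(recordPartitionPath, Set.empty[String])
      .map(id => (recordPartitionPath, id))
  }
}
```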

[GitHub] [incubator-hudi] bhasudha commented on pull request #1592: [Hudi-69] Spark Datasource for MOR table

2020-05-17 Thread GitBox


bhasudha commented on pull request #1592:
URL: https://github.com/apache/incubator-hudi/pull/1592#issuecomment-629781283


   @garyli1019  taking a look at the PR. will get back soon. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-714) Add javadoc, comments to hudi write method link

2020-05-17 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf resolved HUDI-714.

Fix Version/s: 0.6.0
   Resolution: Fixed

Fixed via master: 25a0080b2f6ddce0e528b2a72aea33a565f0e565

> Add javadoc, comments to hudi write method link
> ---
>
> Key: HUDI-714
> URL: https://issues.apache.org/jira/browse/HUDI-714
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>   Original Estimate: 72h
>  Time Spent: 10m
>  Remaining Estimate: 71h 50m
>
> Add some javadoc and comments to the hudi write method link, to help understand 
> the code logic:
> 1. Add javadoc;
> 2. Add some comments in methods;
> 3. Code cleanup



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-714) Add javadoc, comments to hudi write method link

2020-05-17 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-714.
--

> Add javadoc, comments to hudi write method link
> ---
>
> Key: HUDI-714
> URL: https://issues.apache.org/jira/browse/HUDI-714
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>   Original Estimate: 72h
>  Time Spent: 10m
>  Remaining Estimate: 71h 50m
>
> Add some javadoc and comments to the hudi write method link, to help understand 
> the code logic:
> 1. Add javadoc;
> 2. Add some comments in methods;
> 3. Code cleanup



--
This message was sent by Atlassian Jira
(v8.3.4#803005)