[GitHub] [hudi] bobgalvao opened a new issue #1723: [SUPPORT] - trouble using Apache Hudi with S3.

2020-06-09 Thread GitBox


bobgalvao opened a new issue #1723:
URL: https://github.com/apache/hudi/issues/1723


   Hi,
   
   I'm having trouble using Apache Hudi with S3.
   
   **Steps to reproduce the behavior:**
   
   1. Produce messages to a Kafka topic (about 2,000 records per window on average).
   2. Start the streaming job (sample code below).
   3. Errors start to occur intermittently.
   4. The streaming job has to be left consuming Kafka messages for a while 
before the error occurs; there is no fixed pattern.
   
   **Environment Description:**
   
   AWS EMR: emr-5.29.0
   Hudi version : 0.5.0-incubating
   Spark version : 2.4.4
   Hive version : 2.3.6
   Hadoop version : 2.8.5
   Storage  : S3
   Running on Docker? : No
   
   The errors occur intermittently and block subsequent writes. Error 1 
("Unrecognized token 'Objavro' ...") makes the dataset unusable afterwards. 
Errors 2 and 3 ("Could not find any data file written for commit ..." / "Failed 
to read schema from data ...") normalize on the next execution, but the 
streaming or batch job still ends with an error.
   
   Because of this problem with S3, I switched to HDFS with the same code and 
had no inconsistency issues there.
   
   I have already enabled EMRFS consistent view, but the same errors occur. I 
also enabled "hoodie.consistency.check.enabled", as recommended when using S3 
storage. The problem seems to be related to S3's eventual consistency.
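
   For reference, a minimal sketch of such a write configuration (the original 
sample code is truncated from this archive; the table name, key fields, and S3 
path below are placeholders, not values from the report):
   
   ```java
   // Hedged sketch only: the option keys are standard Hudi 0.5.x configs, but
   // every value here is a placeholder.
   inputDf.write()
       .format("org.apache.hudi")
       .option("hoodie.table.name", "aws_case")
       .option("hoodie.datasource.write.recordkey.field", "id")
       .option("hoodie.datasource.write.precombine.field", "ts")
       .option("hoodie.consistency.check.enabled", "true")   // recommended for S3
       .option("hoodie.datasource.hive_sync.enable", "true") // hive sync, per the stack trace
       .mode(SaveMode.Append)
       .save("s3://my-bucket/hudi/aws_case");
   ```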
   
   I often get the errors below:
   
   **Error 1 (after this error occurs, it is no longer possible to use the Hudi 
dataset):**
   
   20/05/21 17:49:36 ERROR JobScheduler: Error running job streaming job 
159008334 ms.0
   org.apache.hudi.hive.HoodieHiveSyncException: Failed to get dataset schema 
for AWS_CASE
 at 
org.apache.hudi.hive.HoodieHiveClient.getDataSchema(HoodieHiveClient.java:414)
 at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:93)
 at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:71)
 at 
org.apache.hudi.HoodieSparkSqlWriter$.syncHive(HoodieSparkSqlWriter.scala:236)
 at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:169)
 at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
 at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
 at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:83)
 at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
 at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
 at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:84)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:165)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
 at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
 at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
 at br.com.agi.bigdata.awscase.Main$.processRDD(Main.scala:91)
 at br.com.agi.bigdata.awscase.Main$$anonfun$main$1.apply(Main.scala:117)
 at br.com.agi.bigdata.awscase.Main$$anonfun$main$1.apply(Main.scala:114)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
 at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:628)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
 at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)

[GitHub] [hudi] garyli1019 commented on a change in pull request #1719: [HUDI-1006]deltastreamer use kafkaSource with offset reset strategy:latest can't consume data

2020-06-09 Thread GitBox


garyli1019 commented on a change in pull request #1719:
URL: https://github.com/apache/hudi/pull/1719#discussion_r437841744



##########
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroKafkaSource.java
##########
@@ -57,10 +57,10 @@ public AvroKafkaSource(TypedProperties props, JavaSparkContext sparkContext, Spa
   protected InputBatch<JavaRDD<GenericRecord>> fetchNewData(Option<String> lastCheckpointStr, long sourceLimit) {
     OffsetRange[] offsetRanges = offsetGen.getNextOffsetRanges(lastCheckpointStr, sourceLimit);
     long totalNewMsgs = CheckpointUtils.totalNewMessages(offsetRanges);
+    LOG.info("About to read " + totalNewMsgs + " from Kafka for topic :" + offsetGen.getTopicName());
     if (totalNewMsgs <= 0) {
-      return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : "");

Review comment:
   hmmm interesting... so right now if we use `LATEST` as the reset key, we will 
fall into a dead loop unless we are lucky enough to have a message land in 
between two `consumer.endOffsets(topicPartitions)` calls.
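
   A hedged illustration of that dead loop (a simplified sketch, not the actual 
`KafkaOffsetGen` code):
   
   ```java
   // With auto.offset.reset = latest and an empty checkpoint, every round starts
   // from the current end offsets, so nothing is ever read unless a message
   // lands between two endOffsets() calls.
   static long visibleNewMessages(KafkaConsumer<?, ?> consumer, List<TopicPartition> parts) {
     Map<TopicPartition, Long> from = consumer.endOffsets(parts); // empty checkpoint -> reset to latest
     Map<TopicPartition, Long> to = consumer.endOffsets(parts);   // next round: same offsets again
     return parts.stream().mapToLong(tp -> to.get(tp) - from.get(tp)).sum(); // ~0 every round
   }
   ```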

##########
File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroKafkaSource.java
##########
@@ -57,10 +57,10 @@ public AvroKafkaSource(TypedProperties props, JavaSparkContext sparkContext, Spa
   protected InputBatch<JavaRDD<GenericRecord>> fetchNewData(Option<String> lastCheckpointStr, long sourceLimit) {
     OffsetRange[] offsetRanges = offsetGen.getNextOffsetRanges(lastCheckpointStr, sourceLimit);
     long totalNewMsgs = CheckpointUtils.totalNewMessages(offsetRanges);
+    LOG.info("About to read " + totalNewMsgs + " from Kafka for topic :" + offsetGen.getTopicName());
     if (totalNewMsgs <= 0) {
-      return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : "");
-    } else {
-      LOG.info("About to read " + totalNewMsgs + " from Kafka for topic :" + offsetGen.getTopicName());
+      return new InputBatch<>(Option.empty(),
+          lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : CheckpointUtils.offsetsToStr(offsetRanges));

Review comment:
   could `lastCheckpointStr` be `""` here? 
   Also, can we add a test for this case?
   
https://github.com/apache/hudi/blob/master/hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestKafkaSource.java#L107
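
   For illustration, a minimal sketch of such a test, assuming the fixtures and 
the `createPropsForJsonSource` helper already present in `TestKafkaSource` (the 
exact constructor and helper signatures are assumptions, not the final test):
   
   ```java
   @Test
   public void testNoNewMessagesKeepsNonEmptyCheckpoint() {
     TypedProperties props = createPropsForJsonSource(Long.MAX_VALUE, "latest");
     Source jsonSource = new JsonKafkaSource(props, jsc, sparkSession, schemaProvider);
     SourceFormatAdapter kafkaSource = new SourceFormatAdapter(jsonSource);
   
     // Nothing published yet: the batch is empty, but the checkpoint should
     // carry the current end offsets instead of "".
     InputBatch<JavaRDD<GenericRecord>> fetch =
         kafkaSource.fetchNewDataInAvroFormat(Option.empty(), Long.MAX_VALUE);
     assertFalse(fetch.getBatch().isPresent());
     assertNotEquals("", fetch.getCheckpointForNextBatch());
   }
   ```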





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1018) Handle empty checkpoint better in delta streamer

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1018:
-
Component/s: DeltaStreamer

> Handle empty checkpoint better in delta streamer
> 
>
> Key: HUDI-1018
> URL: https://issues.apache.org/jira/browse/HUDI-1018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Priority: Major
>
> Right now we are handling the empty-string checkpoint in 
> DeltaSync([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L260])
>  and in the individual 
> sources([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L58]).
>  
> We should clean this up and put all the logic in one place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1018) Handle empty checkpoint better in delta streamer

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1018:
-
Status: Open  (was: New)

> Handle empty checkpoint better in delta streamer
> 
>
> Key: HUDI-1018
> URL: https://issues.apache.org/jira/browse/HUDI-1018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Priority: Major
>
> Right now we are handling the empty-string checkpoint in 
> DeltaSync([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L260])
>  and in the individual 
> sources([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L58]).
>  
> We should clean this up and put all the logic in one place.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1018) Handle empty checkpoint better in delta streamer

2020-06-09 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1018:


 Summary: Handle empty checkpoint better in delta streamer
 Key: HUDI-1018
 URL: https://issues.apache.org/jira/browse/HUDI-1018
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yanjia Gary Li


Right now we are handling the empty-string checkpoint in 
DeltaSync([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L260])
 and in the individual 
sources([https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java#L58]).
 

We should clean this up and put all the logic in one place.
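
For illustration, one shape the consolidation could take is a single shared 
normalization helper; the class and method names below are hypothetical, not 
the actual refactoring:

{code:java}
// Hypothetical helper: treat null and "" uniformly as "no checkpoint" so each
// source no longer needs its own empty-string special case.
public final class CheckpointNormalizer {
  private CheckpointNormalizer() {}

  public static Option<String> normalize(Option<String> lastCheckpoint) {
    return lastCheckpoint.isPresent() && !lastCheckpoint.get().isEmpty()
        ? lastCheckpoint
        : Option.empty();
  }
}
{code}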



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-09 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-1010:
-

Assignee: Nishith Agarwal

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Yanjia Gary Li
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: help-wanted
> Fix For: 0.6.0
>
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. Memory consumption accumulates over 
> time and leads to the Travis CI failures. 
> Using the IntelliJ memory analysis tool, we found the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PRs: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!
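
The usual fix for this kind of leak is to release such resources in the test 
teardown; a hedged sketch, assuming a test class with these hypothetical fields 
(not taken from the Hudi tests):

{code:java}
@AfterEach
public void closeResources() throws IOException {
  if (logFormatWriter != null) {
    logFormatWriter.close(); // HoodieLogFormatWriter holds an open output stream
    logFormatWriter = null;
  }
  if (logFileReader != null) {
    logFileReader.close();   // HoodieLogFileReader holds an open input stream
    logFileReader = null;
  }
}
{code}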



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-994) Identify functional tests that are convertible to unit tests with mocks

2020-06-09 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-994:


Assignee: Prashant Wason

> Identify functional tests that are convertible to unit tests with mocks
> ---
>
> Key: HUDI-994
> URL: https://issues.apache.org/jira/browse/HUDI-994
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Prashant Wason
>Priority: Major
>
> * Identify convertible functional tests and re-implement them using mocks
>  * Remove/merge duplicate/overlapping functional tests where possible



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-999) Parallelize listing of Source dataset partitions

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-999:

Fix Version/s: 0.6.0

> Parallelize listing of Source dataset partitions 
> -
>
> Key: HUDI-999
> URL: https://issues.apache.org/jira/browse/HUDI-999
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Currently, we use a single thread in the driver to list all partitions in the 
> source dataset. This is a bottleneck when doing metadata bootstrap.
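
A hedged sketch of the parallelization (the FSUtils helper names follow the 
codebase loosely and the surrounding wiring and parallelism cap are assumptions, 
not the actual patch):

{code:java}
// Listing partition paths on the driver is cheap; listing the files inside each
// partition is not, so distribute the per-partition listing across executors.
List<String> partitionPaths = FSUtils.getAllPartitionPaths(fs, basePath, false);
int parallelism = Math.max(1, Math.min(partitionPaths.size(), 500)); // assumed cap
List<List<String>> filesPerPartition = jsc.parallelize(partitionPaths, parallelism)
    .map(partition -> {
      FileSystem executorFs = FSUtils.getFs(basePath, serializableConf.get());
      return Arrays.stream(executorFs.listStatus(new Path(basePath, partition)))
          .map(status -> status.getPath().toString())
          .collect(Collectors.toList());
    })
    .collect();
{code}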



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-807) Spark DS Support for incremental queries for bootstrapped tables

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-807:

Fix Version/s: 0.6.0

> Spark DS Support for incremental queries for bootstrapped tables
> 
>
> Key: HUDI-807
> URL: https://issues.apache.org/jira/browse/HUDI-807
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 120h
>  Remaining Estimate: 0h
>
> Investigate and figure out the changes required in Spark integration code to 
> make incremental queries work seamlessly for bootstrapped tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-954) Test COW : Presto Read Optimized Query with metadata bootstrap

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-954:

Fix Version/s: 0.6.0

> Test COW : Presto Read Optimized Query with metadata bootstrap
> --
>
> Key: HUDI-954
> URL: https://issues.apache.org/jira/browse/HUDI-954
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-806) Implement support for bootstrapping via Spark datasource API

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-806:

Fix Version/s: 0.6.0

> Implement support for bootstrapping via Spark datasource API
> 
>
> Key: HUDI-806
> URL: https://issues.apache.org/jira/browse/HUDI-806
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>  Time Spent: 336h
>  Remaining Estimate: 0h
>
> This Jira tracks the work required to perform bootstrapping through Spark 
> data source API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-956) Test COW : Presto Realtime Query with metadata bootstrap

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-956:

Fix Version/s: 0.6.0

> Test COW : Presto Realtime Query with metadata bootstrap
> 
>
> Key: HUDI-956
> URL: https://issues.apache.org/jira/browse/HUDI-956
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-781) Re-design test utilities

2020-06-09 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130049#comment-17130049
 ] 

Nishith Agarwal commented on HUDI-781:
--

[~pwason] Can you help with #2? Like we talked about, mocks can be helpful to 
reduce the build time, especially for client tests. 

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> Test utility classes are to be re-designed with considerations like
>  * Use more mocking
>  * Reduce Spark context setup
>  * Improve/clean up the data generator
> An RFC would be preferred for illustrating the design work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-955) Test MOR : Presto Read Optimized Query with metadata bootstrap

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-955:

Fix Version/s: 0.6.0

> Test MOR : Presto Read Optimized Query with metadata bootstrap
> --
>
> Key: HUDI-955
> URL: https://issues.apache.org/jira/browse/HUDI-955
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-619) Investigate and implement mechanism to have hive/presto/sparksql queries avoid stitching and return null values for hoodie columns

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-619:

Fix Version/s: 0.6.0

> Investigate and implement mechanism to have hive/presto/sparksql queries 
> avoid stitching and return null values for hoodie columns 
> ---
>
> Key: HUDI-619
> URL: https://issues.apache.org/jira/browse/HUDI-619
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration, Presto Integration, Spark Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> This idea was suggested by Vinoth during the RFC review. This ticket tracks 
> the feasibility and implementation of it. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-971:

Fix Version/s: 0.6.0

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return 
> unclean partitions because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-992:

Fix Version/s: 0.6.0

> For hive-style partitioned source data, partition columns synced with Hive 
> will always have String type
> ---
>
> Key: HUDI-992
> URL: https://issues.apache.org/jira/browse/HUDI-992
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Currently the bootstrap implementation is not able to handle partition columns 
> correctly when the source data has *hive-style partitioning*, as is also 
> mentioned in https://jira.apache.org/jira/browse/HUDI-915
> The schema inferred while performing bootstrap and stored in the commit 
> metadata does not have the partition column schema (in the case of 
> hive-partitioned data). As a result, during hive-sync, when hudi tries to 
> determine the type of a partition column from that schema, it does not find it 
> and assumes the default data type *string*.
> Here is where the partition column schema is determined for hive-sync:
> [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417]
>  
> Thus no matter what the data type of the partition column is in the source 
> data (at least what Spark infers it as from the path), it will always be 
> synced as string.
>  
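
For reference, the fallback behaves roughly like the sketch below (a simplified 
approximation of the HiveSchemaUtil logic linked above, not the exact code):

{code:java}
// If the commit-metadata schema lacks the partition column (the bootstrap
// case), the default "String" wins regardless of what Spark inferred from
// the path.
private static String getPartitionKeyType(Map<String, String> hiveSchema, String partitionKey) {
  if (hiveSchema.containsKey(partitionKey)) {
    return hiveSchema.get(partitionKey);
  }
  return "String"; // default fallback
}
{code}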



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-806) Implement support for bootstrapping via Spark datasource API

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-806:

Priority: Blocker  (was: Major)

> Implement support for bootstrapping via Spark datasource API
> 
>
> Key: HUDI-806
> URL: https://issues.apache.org/jira/browse/HUDI-806
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Time Spent: 336h
>  Remaining Estimate: 0h
>
> This Jira tracks the work required to perform bootstrapping through Spark 
> data source API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-999) Parallelize listing of Source dataset partitions

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-999:

Priority: Blocker  (was: Major)

> Parallelize listing of Source dataset partitions 
> -
>
> Key: HUDI-999
> URL: https://issues.apache.org/jira/browse/HUDI-999
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>
> Currently, we use a single thread in the driver to list all partitions in the 
> source dataset. This is a bottleneck when doing metadata bootstrap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-807) Spark DS Support for incremental queries for bootstrapped tables

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-807:

Priority: Blocker  (was: Major)

> Spark DS Support for incremental queries for bootstrapped tables
> 
>
> Key: HUDI-807
> URL: https://issues.apache.org/jira/browse/HUDI-807
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
>  Time Spent: 120h
>  Remaining Estimate: 0h
>
> Investigate and figure out the changes required in Spark integration code to 
> make incremental queries work seamlessly for bootstrapped tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-971:

Priority: Blocker  (was: Major)

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Priority: Blocker
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return 
> unclean partitions because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-992) For hive-style partitioned source data, partition columns synced with Hive will always have String type

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-992:

Priority: Blocker  (was: Major)

> For hive-style partitioned source data, partition columns synced with Hive 
> will always have String type
> ---
>
> Key: HUDI-992
> URL: https://issues.apache.org/jira/browse/HUDI-992
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Priority: Blocker
>
> Currently the bootstrap implementation is not able to handle partition columns 
> correctly when the source data has *hive-style partitioning*, as is also 
> mentioned in https://jira.apache.org/jira/browse/HUDI-915
> The schema inferred while performing bootstrap and stored in the commit 
> metadata does not have the partition column schema (in the case of 
> hive-partitioned data). As a result, during hive-sync, when hudi tries to 
> determine the type of a partition column from that schema, it does not find it 
> and assumes the default data type *string*.
> Here is where the partition column schema is determined for hive-sync:
> [https://github.com/apache/hudi/blob/master/hudi-hive-sync/src/main/java/org/apache/hudi/hive/util/HiveSchemaUtil.java#L417]
>  
> Thus no matter what the data type of the partition column is in the source 
> data (at least what Spark infers it as from the path), it will always be 
> synced as string.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-956) Test COW : Presto Realtime Query with metadata bootstrap

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-956:

Priority: Blocker  (was: Major)

> Test COW : Presto Realtime Query with metadata bootstrap
> 
>
> Key: HUDI-956
> URL: https://issues.apache.org/jira/browse/HUDI-956
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-954) Test COW : Presto Read Optimized Query with metadata bootstrap

2020-06-09 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-954:

Priority: Blocker  (was: Major)

> Test COW : Presto Read Optimized Query with metadata bootstrap
> --
>
> Key: HUDI-954
> URL: https://issues.apache.org/jira/browse/HUDI-954
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Balaji Varadarajan
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-06-09 Thread GitBox


vinothchandar commented on pull request #1722:
URL: https://github.com/apache/hudi/pull/1722#issuecomment-641708414


   @umehrot2 take a look as well? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1687: [WIP] [HUDI-684] Introduced abstraction for writing and reading different types of base file formats.

2020-06-09 Thread GitBox


vinothchandar commented on a change in pull request #1687:
URL: https://github.com/apache/hudi/pull/1687#discussion_r437846366



##########
File path: 
hudi-client/src/main/java/org/apache/hudi/io/storage/HoodieStorageWriterFactory.java
##########
@@ -66,4 +67,21 @@
 
     return new HoodieParquetWriter<>(instantTime, path, parquetConfig, schema, sparkTaskContextSupplier);
   }
+
+  private static <T extends HoodieRecordPayload, R extends IndexedRecord> HoodieStorageWriter<R> newHFileStorageWriter(
+      String instantTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable,
+      SparkTaskContextSupplier sparkTaskContextSupplier) throws IOException {
+
+    BloomFilter filter = createBloomFilter(config);

Review comment:
   okay. I was thinking about dynamic sizing as well. 2. 
   
   So IIUC, HFile does natively support serializing a bloom filter with the 
file, and that's what we take advantage of?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





Build failed in Jenkins: hudi-snapshot-deployment-0.5 #304

2020-06-09 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.42 KB...]
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[jira] [Resolved] (HUDI-1005) NPE in HoodieWriteClient.clean

2020-06-09 Thread Hong Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen resolved HUDI-1005.
-
Resolution: Fixed

> NPE in HoodieWriteClient.clean 
> ---
>
> Key: HUDI-1005
> URL: https://issues.apache.org/jira/browse/HUDI-1005
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Hong Shen
>Assignee: Hong Shen
>Priority: Major
>  Labels: pull-request-available
>
> HoodieWriteClient.clean will throw NullPointerException, here is the error 
> message.
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:495)
>   at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:344)
>   at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:123)
>   at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:94)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:399)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:232)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:422)
>   ... 5 more
> {code}
> When metrics is on and the returned HoodieCleanMetadata is null, it will throw an NPE.
> {code}
>   public HoodieCleanMetadata clean(String cleanInstantTime) throws 
> HoodieIOException {
> LOG.info("Cleaner started");
> final Timer.Context context = metrics.getCleanCtx();
> HoodieCleanMetadata metadata = HoodieTable.create(config, 
> hadoopConf).clean(jsc, cleanInstantTime);
> if (context != null) {
>   long durationMs = metrics.getDurationInMs(context.stop());
>   metrics.updateCleanMetrics(durationMs, metadata.getTotalFilesDeleted());
>   LOG.info("Cleaned " + metadata.getTotalFilesDeleted() + " files"
>   + " Earliest Retained Instant :" + 
> metadata.getEarliestCommitToRetain()
>   + " cleanerElaspsedMs" + durationMs);
> }
> return metadata;
>   }
> {code}
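
A hedged sketch of the guard (the actual fix in the linked pull request may 
differ): treat a null HoodieCleanMetadata as "nothing cleaned" instead of 
dereferencing it.

{code:java}
if (context != null && metadata != null) {
  long durationMs = metrics.getDurationInMs(context.stop());
  metrics.updateCleanMetrics(durationMs, metadata.getTotalFilesDeleted());
  LOG.info("Cleaned " + metadata.getTotalFilesDeleted() + " files"
      + " Earliest Retained Instant :" + metadata.getEarliestCommitToRetain()
      + " cleanerElaspsedMs" + durationMs);
}
{code}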



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1005) NPE in HoodieWriteClient.clean

2020-06-09 Thread Hong Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated HUDI-1005:

Status: Open  (was: New)

> NPE in HoodieWriteClient.clean 
> ---
>
> Key: HUDI-1005
> URL: https://issues.apache.org/jira/browse/HUDI-1005
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Hong Shen
>Assignee: Hong Shen
>Priority: Major
>  Labels: pull-request-available
>
> HoodieWriteClient.clean will throw NullPointerException, here is the error 
> message.
> {code}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:495)
>   at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:344)
>   at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:123)
>   at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:94)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:399)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:232)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:422)
>   ... 5 more
> {code}
> When metrics is on and the returned HoodieCleanMetadata is null, it will throw an NPE.
> {code}
>   public HoodieCleanMetadata clean(String cleanInstantTime) throws 
> HoodieIOException {
> LOG.info("Cleaner started");
> final Timer.Context context = metrics.getCleanCtx();
> HoodieCleanMetadata metadata = HoodieTable.create(config, 
> hadoopConf).clean(jsc, cleanInstantTime);
> if (context != null) {
>   long durationMs = metrics.getDurationInMs(context.stop());
>   metrics.updateCleanMetrics(durationMs, metadata.getTotalFilesDeleted());
>   LOG.info("Cleaned " + metadata.getTotalFilesDeleted() + " files"
>   + " Earliest Retained Instant :" + 
> metadata.getEarliestCommitToRetain()
>   + " cleanerElaspsedMs" + durationMs);
> }
> return metadata;
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] shenh062326 commented on pull request #1714: [HUDI-1005] fix NPE in HoodieWriteClient.clean

2020-06-09 Thread GitBox


shenh062326 commented on pull request #1714:
URL: https://github.com/apache/hudi/pull/1714#issuecomment-641677856


   > I was wondering if there was a way to just throw an exception or make it 
an Option.. merged.. let's punt on this for now
   
   When I try to run HoodieDeltaStreamer with metrics on, there is no clean 
action in the hoodie meta, so HoodieTable.create(config, hadoopConf).clean(jsc, 
cleanInstantTime) returns null.
   ```
   hongdeMacBook-Pro hongshen$ ll .hoodie/
   total 56
   drwxr-xr-x  13 hongshen  wheel   416 Jun 10 10:11 ./
   drwxr-xr-x   4 hongshen  wheel   128 Jun 10 10:11 ../
   -rw-r--r--   1 hongshen  wheel    32 Jun 10 10:11 .20200610101112.commit.crc
   -rw-r--r--   1 hongshen  wheel     8 Jun 10 10:11 .20200610101112.commit.requested.crc
   -rw-r--r--   1 hongshen  wheel    12 Jun 10 10:11 .20200610101112.inflight.crc
   drwxr-xr-x   2 hongshen  wheel    64 Jun 10 10:11 .aux/
   -rw-r--r--   1 hongshen  wheel    12 Jun 10 10:11 .hoodie.properties.crc
   drwxr-xr-x   2 hongshen  wheel    64 Jun 10 10:11 .temp/
   -rw-r--r--   1 hongshen  wheel  2676 Jun 10 10:11 20200610101112.commit
   -rw-r--r--   1 hongshen  wheel     0 Jun 10 10:11 20200610101112.commit.requested
   -rw-r--r--   1 hongshen  wheel   380 Jun 10 10:11 20200610101112.inflight
   drwxr-xr-x   2 hongshen  wheel    64 Jun 10 10:11 archived/
   -rw-r--r--   1 hongshen  wheel   205 Jun 10 10:11 hoodie.properties
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on pull request #1690: [HUDI-908] Add decimals to HoodieTestDataGenerator

2020-06-09 Thread GitBox


shenh062326 commented on pull request #1690:
URL: https://github.com/apache/hudi/pull/1690#issuecomment-641671278


   > @shenh062326 : It makes sense to cover other data-types in a single PR. 
Can you also add them to this PR. Also, Can you let us know what the missing 
data types are ?
   
   The missing Spark types:
   ```
   BinaryType
   IntegerType
   LongType
   FloatType
   ByteType
   ShortType
   DecimalType
   TimestampType
   DateType
   ```
   The missing Avro types:
   ```
ENUM,
FIXED,
BYTES,
INT,
LONG,
FLOAT,
NULL
   ```
   But I am not sure if enum and null types are needed.
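
   For illustration, a hedged sketch of how a couple of the missing types could 
be generated with Avro's `SchemaBuilder` (field names are illustrative, not from 
the PR):
   
   ```java
   Schema decimal = LogicalTypes.decimal(10, 2)
       .addToSchema(Schema.create(Schema.Type.BYTES));          // DecimalType
   Schema tsMillis = LogicalTypes.timestampMillis()
       .addToSchema(Schema.create(Schema.Type.LONG));           // TimestampType
   Schema extra = SchemaBuilder.record("extra").fields()
       .name("price").type(decimal).noDefault()
       .name("event_ts").type(tsMillis).noDefault()
       .endRecord();
   ```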
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1016) [Minor] Code optimization

2020-06-09 Thread Hong Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen updated HUDI-1016:

Status: Open  (was: New)

> [Minor] Code optimization
> -
>
> Key: HUDI-1016
> URL: https://issues.apache.org/jira/browse/HUDI-1016
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hong Shen
>Assignee: Hong Shen
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2020-06-09-13-04-15-008.png
>
>
> Some code can be optimized.
>  !image-2020-06-09-13-04-15-008.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1016) [Minor] Code optimization

2020-06-09 Thread Hong Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen resolved HUDI-1016.
-
Resolution: Fixed

> [Minor] Code optimization
> -
>
> Key: HUDI-1016
> URL: https://issues.apache.org/jira/browse/HUDI-1016
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hong Shen
>Assignee: Hong Shen
>Priority: Minor
>  Labels: pull-request-available
> Attachments: image-2020-06-09-13-04-15-008.png
>
>
> Some code can be optimized.
>  !image-2020-06-09-13-04-15-008.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-commenter edited a comment on pull request #1721: Cache the explodeRecordRDDWithFileComparisons instead of commuting it…

2020-06-09 Thread GitBox


codecov-commenter edited a comment on pull request #1721:
URL: https://github.com/apache/hudi/pull/1721#issuecomment-641622744


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1721?src=pr=h1) Report
   > Merging 
[#1721](https://codecov.io/gh/apache/hudi/pull/1721?src=pr=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/37838cea6094ddc66191df42e8b2c84f132d1623=desc)
 will **decrease** coverage by `0.00%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1721/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1721?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1721      +/-   ##
   ============================================
   - Coverage     18.16%   18.15%   -0.01%     
     Complexity      860      860              
   ============================================
     Files           352      352              
     Lines         15411    15410       -1     
     Branches       1525     1524       -1     
   ============================================
   - Hits           2799     2798       -1     
     Misses        12254    12254              
     Partials        358      358              
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1721?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../org/apache/hudi/index/bloom/HoodieBloomIndex.java](https://codecov.io/gh/apache/hudi/pull/1721/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vSG9vZGllQmxvb21JbmRleC5qYXZh)
 | `58.16% <100.00%> (-0.43%)` | `16.00 <0.00> (ø)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/hudi/pull/1721?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/hudi/pull/1721?src=pr=footer). Last 
update 
[37838ce...ef08b1e](https://codecov.io/gh/apache/hudi/pull/1721?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org





[GitHub] [hudi] sathyaprakashg commented on pull request #1664: HUDI-942 Increase default value number of delta commits for inline compaction

2020-06-09 Thread GitBox


sathyaprakashg commented on pull request #1664:
URL: https://github.com/apache/hudi/pull/1664#issuecomment-641619354


   Thanks @vinothchandar. @bhasudha Please refer here for the issue I am facing: 
https://www.mail-archive.com/dev@hudi.apache.org/msg02967.html
   
   Please suggest how to fix this. Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1017) Integration test failure

2020-06-09 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1017:
-

 Summary: Integration test failure
 Key: HUDI-1017
 URL: https://issues.apache.org/jira/browse/HUDI-1017
 Project: Apache Hudi
  Issue Type: Bug
  Components: Testing
Affects Versions: 0.6.0
Reporter: sivabalan narayanan


Integration tests are failing intermittently on my local machine. This happened 
twice in 6 to 8 runs. 

 

3100295 [main] INFO  org.apache.hudi.integ.ITTestBase  - 

 

#

3100295 [main] INFO  org.apache.hudi.integ.ITTestBase  - Container : /adhoc-1, 
Running command :/var/hoodie/ws/hudi-cli/hudi-cli.sh --cmdfile 
/var/hoodie/ws/docker/demo/compaction.commands

3100295 [main] INFO  org.apache.hudi.integ.ITTestBase  - 

#

3104695 [dockerjava-jaxrs-async-79] INFO  org.apache.hudi.integ.ITTestBase  - 
onComplete called

3104700 [main] INFO  org.apache.hudi.integ.ITTestBase  - Exit code for command 
: 1

3104700 [main] ERROR org.apache.hudi.integ.ITTestBase  - 

 

 ## Stdout ###

Client jar location not set, please set it in conf/hudi-env.sh

 

3104700 [main] ERROR org.apache.hudi.integ.ITTestBase  - 

 

 ## Stderr ###

Error: Could not find or load main class 
.var.hoodie.ws.hudi-cli.target.hudi-cli-0.5.3-rc2.jar:

 

[*ERROR*] *Tests* *run: 1*, *Failures: 1*, Errors: 0, Skipped: 0, Time elapsed: 
452.147 s *<<< FAILURE!* - in org.apache.hudi.integ.*ITTestHoodieDemo*

[*ERROR*] testDemo(org.apache.hudi.integ.ITTestHoodieDemo)  Time elapsed: 
452.146 s  <<< FAILURE!

java.lang.AssertionError: Command ([/var/hoodie/ws/hudi-cli/hudi-cli.sh, 
--cmdfile, /var/hoodie/ws/docker/demo/compaction.commands]) expected to 
succeed. Exit (1) expected:<0> but was:<1>

 at 
org.apache.hudi.integ.ITTestHoodieDemo.scheduleAndRunCompaction(ITTestHoodieDemo.java:304)

 at org.apache.hudi.integ.ITTestHoodieDemo.testDemo(ITTestHoodieDemo.java:93)

 

[*INFO*] 

[*INFO*] Results:

[*INFO*] 

[*ERROR*] *Failures:* 

[*ERROR*]   
*ITTestHoodieDemo.testDemo:93->scheduleAndRunCompaction:304->ITTestBase.executeCommandStringInDocker:190->ITTestBase.executeCommandInDocker:168
 Command ([/var/hoodie/ws/hudi-cli/hudi-cli.sh, --cmdfile, 
/var/hoodie/ws/docker/demo/compaction.commands]) expected to succeed. Exit (1) 
expected:<0> but was:<1>*

[*INFO*] 

[*ERROR*] *Tests run: 7, Failures: 1, Errors: 0, Skipped: 0*

[*INFO*]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xushiyan commented on pull request #1095: [HUDI-210] Implement prometheus metrics reporter

2020-06-09 Thread GitBox


xushiyan commented on pull request #1095:
URL: https://github.com/apache/hudi/pull/1095#issuecomment-641542940


   > @xushiyan hello, how is the progress?
   
   Unfortunately I have to de-prioritize this, as the test improvements are more 
urgently needed at the moment. I may only be able to come back to this after a 
while. Please feel free to pick up the ticket in the meantime. Thanks.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #1719: [HUDI-1006]deltastreamer use kafkaSource with offset reset strategy:latest can't consume data

2020-06-09 Thread GitBox


garyli1019 commented on pull request #1719:
URL: https://github.com/apache/hudi/pull/1719#issuecomment-641539959


   @Litianye Thanks for making this PR. Will review soon.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-905) Support PrunedFilteredScan for Spark Datasource

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-905.
---
Resolution: Not A Problem

TableScan already supports filter and projection pushdown.

> Support PrunedFilteredScan for Spark Datasource
> ---
>
> Key: HUDI-905
> URL: https://issues.apache.org/jira/browse/HUDI-905
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>
> The Hudi Spark Datasource incremental view currently uses 
> DataSourceReadOptions.PUSH_DOWN_INCR_FILTERS_OPT_KEY to push down filters.
> If we want to use Spark predicate pushdown natively, we need to 
> implement PrunedFilteredScan for the Hudi Datasource.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-610) MOR table Impala read support

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-610:

Summary: MOR table Impala read support  (was: Impala near real time table 
support)

> MOR table Impala read support
> -
>
> Key: HUDI-610
> URL: https://issues.apache.org/jira/browse/HUDI-610
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Major
>
> Impala uses the Java-based module called "frontend" to list all the files to 
> scan and lets the C++-based "backend" do all the file scanning. 
> Merging Avro and Parquet could be difficult, because it might need custom 
> merging logic like RealtimeCompactedRecordReader implemented in the backend 
> in C++, but I think it will be doable to have something like 
> RealtimeUnmergedRecordReader, which only needs some changes in the frontend. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-610) Impala near real time table support

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li reassigned HUDI-610:
---

Assignee: (was: Yanjia Gary Li)

> Impala near real time table support
> --
>
> Key: HUDI-610
> URL: https://issues.apache.org/jira/browse/HUDI-610
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Priority: Major
>
> Impala uses the Java-based module called "frontend" to list all the files to 
> scan and lets the C++-based "backend" do all the file scanning. 
> Merging Avro and Parquet could be difficult, because it might need custom 
> merging logic like RealtimeCompactedRecordReader implemented in the backend 
> in C++, but I think it will be doable to have something like 
> RealtimeUnmergedRecordReader, which only needs some changes in the frontend. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-494.
-
Resolution: Fixed

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems to be related to the input size: with 7.7 GB of input it was 
> 3.2 million tasks, with 9 GB it was 3.7 million, both with parallelism 10. 
> I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task only writes 
> fewer than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-494) [DEBUGGING] Huge amount of tasks when writing files into HDFS

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-494.
---

> [DEBUGGING] Huge amount of tasks when writing files into HDFS
> -
>
> Key: HUDI-494
> URL: https://issues.apache.org/jira/browse/HUDI-494
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-01-02 at 8.53.24 PM.png, Screen Shot 
> 2020-01-02 at 8.53.44 PM.png, example2_hdfs.png, example2_sparkui.png, 
> image-2020-01-05-07-30-53-567.png
>
>
> I am using a manually built master after the 
> [https://github.com/apache/incubator-hudi/commit/36b3b6f5dd913d3f1c9aa116aff8daf6540fed65]
>  commit. EDIT: tried with the latest master but got the same result.
> I am seeing 3 million tasks when the Hudi Spark job writes the files into 
> HDFS. It seems to be related to the input size: with 7.7 GB of input it was 
> 3.2 million tasks, with 9 GB it was 3.7 million, both with parallelism 10. 
> I am seeing a huge number of 0-byte files being written into the 
> .hoodie/.temp/ folder in my HDFS. In the Spark UI, each task only writes 
> fewer than 10 records in
> {code:java}
> count at HoodieSparkSqlWriter{code}
>  All the stages before this seem normal. Any idea what happened here? My 
> first guess would be something related to the bloom filter index. Maybe 
> something triggers repartitioning with the bloom filter index? But I am 
> not really familiar with that part of the code. 
> Thanks
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-822) Decouple hoodie related methods with Hoodie Input Formats

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-822.
-
Resolution: Fixed

> Decouple hoodie related methods with Hoodie Input Formats
> -
>
> Key: HUDI-822
> URL: https://issues.apache.org/jira/browse/HUDI-822
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
>
> In order to support multiple query engines, we need to generalize the Hudi 
> input format and the Hudi record merging logic, and decouple them from 
> MapredParquetInputFormat, which depends on Hive. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-822) Decouple hoodie related methods with Hoodie Input Formats

2020-06-09 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li closed HUDI-822.
---

> Decouple hoodie related methods with Hoodie Input Formats
> -
>
> Key: HUDI-822
> URL: https://issues.apache.org/jira/browse/HUDI-822
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
>
> In order to support multiple query engines, we need to generalize the Hudi 
> input format and the Hudi record merging logic, and decouple them from 
> MapredParquetInputFormat, which depends on Hive. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] garyli1019 closed pull request #1700: [Draft]Hudi 69 draft

2020-06-09 Thread GitBox


garyli1019 closed pull request #1700:
URL: https://github.com/apache/hudi/pull/1700


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 opened a new pull request #1722: [HUDI-69] Support Spark Datasource for MOR table

2020-06-09 Thread GitBox


garyli1019 opened a new pull request #1722:
URL: https://github.com/apache/hudi/pull/1722


   ## What is the purpose of the pull request
   
   This PR implements Spark Datasource support for MOR tables.
   
   ## Brief change log
   
 - Implemented realtimeRelation
 - Implemented HoodieRealtimeInputFormat on top of Spark SQL internal 
ParquetFileFormat
 - Implemented HoodieParquetRecordReaderIterator and RecordReader
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - *Added TestRealtimeDataSource to verify this feature.*
 
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-06-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-69:
---
Labels: pull-request-available  (was: )

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-06-09 Thread Bhavani Sudha (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129734#comment-17129734
 ] 

Bhavani Sudha commented on HUDI-651:


I have pushed it to my repo in this branch - 
https://github.com/bhasudha/hudi/tree/incr-on-hive-via-spark-sql

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them like Hive QL. Something is amiss.
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> 

[GitHub] [hudi] EdwinGuo opened a new pull request #1721: Cache the explodeRecordRDDWithFileComparisons instead of computing it…

2020-06-09 Thread GitBox


EdwinGuo opened a new pull request #1721:
URL: https://github.com/apache/hudi/pull/1721


   … twice in lookUpIndex
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   Cache the explodeRecordRDDWithFileComparisons result instead of computing 
it twice in lookUpIndex; a minimal sketch of the pattern is shown below.
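   A hedged sketch of the persist-and-reuse pattern (all names here are 
hypothetical, not Hudi's actual code; only the caching shape is the point):
   
   ```java
   import org.apache.spark.api.java.JavaRDD;
   import org.apache.spark.storage.StorageLevel;
   
   // Persist an RDD that two downstream actions consume, so it is
   // materialized once instead of being recomputed by each action.
   final class IndexLookupSketch {
     static long[] twoPasses(JavaRDD<String> fileComparisons) {
       fileComparisons.persist(StorageLevel.MEMORY_AND_DISK_SER());
       long total = fileComparisons.count();                               // first action
       long nonEmpty = fileComparisons.filter(s -> !s.isEmpty()).count();  // second action, served from cache
       fileComparisons.unpersist();
       return new long[] {total, nonEmpty};
     }
   }
   ```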
   
   ## Brief change log
   
 - Cache the result of explodeRecordRDDWithFileComparisons in lookUpIndex so 
it is computed only once.
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-781) Re-design test utilities

2020-06-09 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129642#comment-17129642
 ] 

Raymond Xu commented on HUDI-781:
-

[~vinoth] Makes sense. I've paused #1 as it's targeting the problem from a 
different angle. I've talked to [~garyli1019]; he tried to eradicate the 
leaking but it turned out to be difficult, probably due to too many resource 
initializations. We definitely will keep watching the issues and fix some 
whenever we can.

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> Test utility classes are to be re-designed with considerations like
>  * Use more mocking
>  * Reduce spark context setup
>  * Improve/clean up data generator
> An RFC would be preferred for illustrating the design work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar merged pull request #1592: [HUDI-822] decouple Hudi related logics from HoodieInputFormat

2020-06-09 Thread GitBox


vinothchandar merged pull request #1592:
URL: https://github.com/apache/hudi/pull/1592


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1720: [HUDI-1003] Handle partitions correctly for syncing hudi non-partitioned table to hive

2020-06-09 Thread GitBox


vinothchandar commented on a change in pull request #1720:
URL: https://github.com/apache/hudi/pull/1720#discussion_r437539185



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -247,7 +247,13 @@ private[hudi] object HoodieSparkSqlWriter {
 hiveSyncConfig.hivePass = parameters(HIVE_PASS_OPT_KEY)
 hiveSyncConfig.jdbcUrl = parameters(HIVE_URL_OPT_KEY)
 hiveSyncConfig.partitionFields =
-  
ListBuffer(parameters(HIVE_PARTITION_FIELDS_OPT_KEY).split(",").map(_.trim).filter(!_.isEmpty).toList:
 _*)
+  // Reset partition_fields to empty when syncing a hudi non-partitioned 
table to hive.
+  if 
(classOf[NonPartitionedExtractor].getName.equals(parameters(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY)))
 {

Review comment:
   can we add a test case for this?
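   For illustration, a minimal sketch of such a test (`buildHiveSyncConfig` is 
an assumed helper standing in for HoodieSparkSqlWriter's config construction; 
the test class and body are hypothetical):
   
   ```java
   import static org.junit.Assert.assertTrue;
   
   import java.util.HashMap;
   import java.util.Map;
   import org.junit.Test;
   
   // Hypothetical test: partition fields should end up empty whenever the
   // NonPartitionedExtractor is configured, regardless of what was passed.
   public class TestNonPartitionedHiveSync {
     @Test
     public void partitionFieldsResetForNonPartitionedTables() {
       Map<String, String> params = new HashMap<>();
       params.put("hoodie.datasource.hive_sync.partition_fields", "year,month");
       params.put("hoodie.datasource.hive_sync.partition_extractor_class",
           "org.apache.hudi.hive.NonPartitionedExtractor");
       HiveSyncConfig cfg = buildHiveSyncConfig(params);  // assumed helper
       assertTrue(cfg.partitionFields.isEmpty());
     }
   }
   ```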





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Litianye opened a new pull request #1719: [HUDI-1006]deltastreamer use kafkaSource with offset reset strategy:latest can't consume data

2020-06-09 Thread GitBox


Litianye opened a new pull request #1719:
URL: https://github.com/apache/hudi/pull/1719


   ## What is the purpose of the pull request
   This pull request fixes the issue where DeltaStreamer using a Kafka source 
(such as JsonKafkaSource / AvroKafkaSource) with offset reset strategy 
`latest` can't consume data, because the checkpoint string stored in the 
.commit file will always be an empty string.
   
   For example, I want to ingest data from Kafka into a new hudi table. 
   From `org.apache.hudi.utilities.deltastreamer.DeltaSync#readFromSource`, 
the first time we consume, `resumeCheckpointStr` will be `Option.empty()`, and 
the `lastCkptStr` used in 
`org.apache.hudi.utilities.sources.Source#fetchNewData` will also be 
`Option.empty()`.
   The fetch-new-data code looks like this:
   `protected InputBatch<JavaRDD<GenericRecord>> fetchNewData(Option<String> lastCheckpointStr, long sourceLimit) {
     OffsetRange[] offsetRanges = offsetGen.getNextOffsetRanges(lastCheckpointStr, sourceLimit);
     long totalNewMsgs = CheckpointUtils.totalNewMessages(offsetRanges);
     if (totalNewMsgs <= 0) {
       return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : "");
     } else {
       LOG.info("About to read " + totalNewMsgs + " from Kafka for topic :" + offsetGen.getTopicName());
     }
     JavaRDD<GenericRecord> newDataRDD = toRDD(offsetRanges);
     return new InputBatch<>(Option.of(newDataRDD), KafkaOffsetGen.CheckpointUtils.offsetsToStr(offsetRanges));
   }`
   
   When getting the next offset ranges in 
`org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen#getNextOffsetRanges`, 
the code looks like this:
   `// Determine the offset ranges to read from
   if (lastCheckpointStr.isPresent() && !lastCheckpointStr.get().isEmpty()) {
     fromOffsets = checkupValidOffsets(consumer, lastCheckpointStr, topicPartitions);
   } else {
     KafkaResetOffsetStrategies autoResetValue = KafkaResetOffsetStrategies
         .valueOf(props.getString("auto.offset.reset", Config.DEFAULT_AUTO_RESET_OFFSET.toString()).toUpperCase());
     switch (autoResetValue) {
       case EARLIEST:
         fromOffsets = consumer.beginningOffsets(topicPartitions);
         break;
       case LATEST:
         fromOffsets = consumer.endOffsets(topicPartitions);
         break;
       default:
         throw new HoodieNotSupportedException("Auto reset value must be one of 'earliest' or 'latest' ");
     }
   }

   // Obtain the latest offsets.
   toOffsets = consumer.endOffsets(topicPartitions);`
   
   Because `lastCkptStr` is `Option.empty()`, `fromOffsets` and `toOffsets` 
will both be the consumer's end offsets, `totalNewMsgs` is 0, and the 
checkpoint string returned the first time is an empty string. The next consume 
operation will get this empty-string checkpoint, so in `KafkaOffsetGen` the 
offset range will always be reset to latest, returning yet another empty 
checkpoint.
   
   From observation, the checkpoint only becomes normal if the latest Kafka 
offset changes between the reads of `fromOffsets` and `toOffsets`.
   
   ## Brief change log
   - Modify the checkpoint string returned by the JsonKafkaSource & 
AvroKafkaSource fetchNewData() method when the offsetRanges total message 
count is <= 0.
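   A minimal sketch of that change, reusing the snippet quoted above 
(illustrative, not the exact patch):
   
   ```java
   // When no new messages arrived, fall back to the offset ranges just
   // computed instead of an empty string, so the next run resumes from
   // "latest" rather than resetting again.
   if (totalNewMsgs <= 0) {
     return new InputBatch<>(Option.empty(),
         lastCheckpointStr.isPresent()
             ? lastCheckpointStr.get()
             : CheckpointUtils.offsetsToStr(offsetRanges));
   }
   ```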
   
   ## Verify this pull request
   This pull request is already covered by existing tests: run 
TestKafkaSource.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf merged pull request #1718: [HUDI-1016] [Minor] Code optimization

2020-06-09 Thread GitBox


leesf merged pull request #1718:
URL: https://github.com/apache/hudi/pull/1718


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1664: HUDI-942 Increase default value number of delta commits for inline compaction

2020-06-09 Thread GitBox


vinothchandar commented on pull request #1664:
URL: https://github.com/apache/hudi/pull/1664#issuecomment-640550723


   @bhasudha Will help you out with the integration test issue on your local 
machine. Must be something environmental.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #1592: [HUDI-822] decouple Hudi related logics from HoodieInputFormat

2020-06-09 Thread GitBox


codecov-commenter edited a comment on pull request #1592:
URL: https://github.com/apache/hudi/pull/1592#issuecomment-632985999







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #1717: [HUDI-1012] Add unit test for snapshot reads

2020-06-09 Thread GitBox


codecov-commenter edited a comment on pull request #1717:
URL: https://github.com/apache/hudi/pull/1717#issuecomment-640854839


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1717?src=pr=h1) Report
   > Merging 
[#1717](https://codecov.io/gh/apache/hudi/pull/1717?src=pr=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/97ab97b72635164db5ac2a4f93e72e224603ffe0=desc)
 will **not change** coverage.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1717/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1717?src=pr=tree)
   
   ```diff
   @@            Coverage Diff            @@
   ##           master    #1717   +/-   ##
   =========================================
     Coverage     18.18%   18.18%          
     Complexity      857      857          
   =========================================
     Files           348      348          
     Lines         15361    15361          
     Branches       1525     1525          
   =========================================
     Hits           2794     2794          
     Misses        12209    12209          
     Partials        358      358          
   ```
   
   
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/hudi/pull/1717?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/hudi/pull/1717?src=pr=footer). Last 
update 
[97ab97b...59b1d02](https://codecov.io/gh/apache/hudi/pull/1717?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nandini57 edited a comment on issue #1705: Tracking Hudi Data along transaction time and buisness time

2020-06-09 Thread GitBox


nandini57 edited a comment on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-640599130


   Yes Balaji. Each record can have 4 columns: IN_Z, OUT_Z (system dimension), 
FROM_Z, THRU_Z (business dimension). If you see the code above, I am creating 
different unique keys and splitting the merge operation into delete + insert.
   
   Hudi merge creates one commit timestamp, but delete and insert here will 
create 2 commit timestamps, and if either or both operations fail, I need to 
roll back the commits.
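   For illustration, a rough sketch of coordinating the two phases with the 
0.5.x write client (assuming `hoodie.auto.commit` is left at its default of 
true; error handling trimmed, variable names hypothetical):
   
   ```java
   // Hedged sketch: run delete + insert as two commits and roll both back
   // if the second phase fails.
   String deleteTime = client.startCommit();
   client.delete(keysToRemove, deleteTime);    // phase 1: retire old versions
   
   String insertTime = client.startCommit();
   try {
     client.insert(newVersions, insertTime);   // phase 2: write new versions
   } catch (Exception e) {
     client.rollback(insertTime);              // undo whatever phase 2 wrote
     client.rollback(deleteTime);              // restore the deleted records
     throw e;
   }
   ```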



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1711: [HUDI-974] fix fields out of order in MOR mode when using Hive

2020-06-09 Thread GitBox


vinothchandar commented on a change in pull request #1711:
URL: https://github.com/apache/hudi/pull/1711#discussion_r436633121



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeUnmergedRecordReader.java
##
@@ -82,7 +82,7 @@ public RealtimeUnmergedRecordReader(HoodieRealtimeFileSplit 
split, JobConf job,
 false, jobConf.getInt(MAX_DFS_STREAM_BUFFER_SIZE_PROP, 
DEFAULT_MAX_DFS_STREAM_BUFFER_SIZE), record -> {
   // convert Hoodie log record to Hadoop AvroWritable and buffer
   GenericRecord rec = (GenericRecord) 
record.getData().getInsertValue(getReaderSchema()).get();
-  ArrayWritable aWritable = (ArrayWritable) avroToArrayWritable(rec, 
getWriterSchema());
+  ArrayWritable aWritable = (ArrayWritable) avroToArrayWritable(rec, 
getHiveSchema());

Review comment:
   what do we use in the merged record reader? cc @bvaradar 

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeUnmergedRecordReader.java
##
@@ -82,7 +82,7 @@ public RealtimeUnmergedRecordReader(HoodieRealtimeFileSplit 
split, JobConf job,
 false, jobConf.getInt(MAX_DFS_STREAM_BUFFER_SIZE_PROP, 
DEFAULT_MAX_DFS_STREAM_BUFFER_SIZE), record -> {
   // convert Hoodie log record to Hadoop AvroWritable and buffer
   GenericRecord rec = (GenericRecord) 
record.getData().getInsertValue(getReaderSchema()).get();
-  ArrayWritable aWritable = (ArrayWritable) avroToArrayWritable(rec, 
getWriterSchema());
+  ArrayWritable aWritable = (ArrayWritable) avroToArrayWritable(rec, 
getHiveSchema());

Review comment:
   sg.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar edited a comment on issue #1550: Hudi 0.5.2 inability save complex type with nullable = true [SUPPORT]

2020-06-09 Thread GitBox


vinothchandar edited a comment on issue #1550:
URL: https://github.com/apache/hudi/issues/1550#issuecomment-640542938


   @nsivabalan is driving the release. We are planning to do a 0.5.3 this 
week, right Siva? This release will have the fix. @nikitap95 if interested, 
you can join the mailing list and help validate the release candidate :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] sbernauer commented on pull request #1647: [HUDI-867]: fixed IllegalArgumentException from graphite metrics in deltaStreamer continuous mode

2020-06-09 Thread GitBox


sbernauer commented on pull request #1647:
URL: https://github.com/apache/hudi/pull/1647#issuecomment-641278957


   If I read https://stackoverflow.com/a/55753138 correctly, normally you 
register a gauge only at startup (or on the first metric write) and then just 
update its value in every loop. Currently the DeltaStreamer tries to register 
the gauge in every loop. 
(https://github.com/apache/hudi/blob/3387b3841f784fef2bb8548375b8c2078bc03506/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamerMetrics.java#L70)
   It would also be necessary to close the gauge at shutdown of the 
DeltaStreamer, so that if the DeltaStreamer gets restarted the metric can be 
registered again.
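   A minimal sketch of the register-once pattern (plain Dropwizard Metrics 
API; the class shape is illustrative, not Hudi's code):
   
   ```java
   import com.codahale.metrics.Gauge;
   import com.codahale.metrics.MetricRegistry;
   import java.util.concurrent.atomic.AtomicLong;
   
   // Register the gauge exactly once; later loop iterations only update the
   // backing value. Registering the same name twice throws
   // IllegalArgumentException, which is the failure described above.
   class DeltaSyncDurationGauge {
     private final AtomicLong lastValue = new AtomicLong();
   
     DeltaSyncDurationGauge(MetricRegistry registry, String name) {
       registry.register(name, (Gauge<Long>) lastValue::get);
     }
   
     void update(long millis) {
       lastValue.set(millis);
     }
   }
   ```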
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on pull request #1719: [HUDI-1006]deltastreamer use kafkaSource with offset reset strategy:latest can't consume data

2020-06-09 Thread GitBox


leesf commented on pull request #1719:
URL: https://github.com/apache/hudi/pull/1719#issuecomment-641222330


   @garyli1019 would you please review this one?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-06-09 Thread GitBox


codecov-commenter edited a comment on pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#issuecomment-632410484


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1602?src=pr=h1) Report
   > Merging 
[#1602](https://codecov.io/gh/apache/hudi/pull/1602?src=pr=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/97ab97b72635164db5ac2a4f93e72e224603ffe0=desc)
 will **increase** coverage by `0.04%`.
   > The diff coverage is `57.14%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1602/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1602?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##           master    #1602      +/-   ##
   ============================================
   + Coverage     18.18%   18.23%   +0.04%     
   - Complexity      857      860       +3     
   ============================================
     Files           348      348              
     Lines         15361    15348      -13     
     Branches       1525     1523       -2     
   ============================================
   + Hits           2794     2799       +5     
   + Misses        12209    12191      -18     
     Partials        358      358              
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1602?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/hudi/pull/1602/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `7.14% <ø> (+1.39%)` | `4.00 <0.00> (ø)` | |
   | 
[...org/apache/hudi/config/HoodieCompactionConfig.java](https://codecov.io/gh/apache/hudi/pull/1602/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZUNvbXBhY3Rpb25Db25maWcuamF2YQ==)
 | `55.33% <33.33%> (-0.67%)` | `3.00 <0.00> (ø)` | |
   | 
[...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/hudi/pull/1602/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh)
 | `55.39% <66.66%> (ø)` | `15.00 <1.00> (ø)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/hudi/pull/1602/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `40.71% <100.00%> (+0.63%)` | `50.00 <1.00> (+2.00)` | |
   | 
[...apache/hudi/common/fs/HoodieWrapperFileSystem.java](https://codecov.io/gh/apache/hudi/pull/1602/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL0hvb2RpZVdyYXBwZXJGaWxlU3lzdGVtLmphdmE=)
 | `22.69% <0.00%> (+0.70%)` | `29.00% <0.00%> (+1.00%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/hudi/pull/1602?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/hudi/pull/1602?src=pr=footer). Last 
update 
[97ab97b...9099c0e](https://codecov.io/gh/apache/hudi/pull/1602?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #1719: [HUDI-1006]deltastreamer use kafkaSource with offset reset strategy:latest can't consume data

2020-06-09 Thread GitBox


codecov-commenter edited a comment on pull request #1719:
URL: https://github.com/apache/hudi/pull/1719#issuecomment-641096930


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=h1) Report
   > Merging 
[#1719](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/9e07cebece3b4c8b964ddca2f40053734a392ce2=desc)
 will **increase** coverage by `0.03%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1719/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##           master    #1719      +/-   ##
   ============================================
   + Coverage     18.17%   18.21%   +0.03%     
   - Complexity      857      859       +2     
   ============================================
     Files           348      348              
     Lines         15369    15356      -13     
     Branches       1525     1523       -2     
   ============================================
   + Hits           2794     2797       +3     
   + Misses        12217    12201      -16     
     Partials        358      358              
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...org/apache/hudi/config/HoodieCompactionConfig.java](https://codecov.io/gh/apache/hudi/pull/1719/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZUNvbXBhY3Rpb25Db25maWcuamF2YQ==)
 | `55.33% <0.00%> (-0.67%)` | `3.00% <0.00%> (ø%)` | |
   | 
[...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/hudi/pull/1719/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh)
 | `55.39% <0.00%> (ø)` | `15.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/hudi/pull/1719/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `40.71% <0.00%> (+0.63%)` | `50.00% <0.00%> (+2.00%)` | |
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/hudi/pull/1719/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `7.14% <0.00%> (+1.39%)` | `4.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=footer). Last 
update 
[9e07ceb...565f9b4](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-896) Parallelize CI testing to reduce CI wait time

2020-06-09 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17129562#comment-17129562
 ] 

Raymond Xu commented on HUDI-896:
-

[~vinoth] There is a problem with the current codecov report generation: it 
depends on the last-finished test task to cover all modules. We're good now, 
as the unit test task covers all modules and finishes later, but if the unit 
tests ever finish faster, the integration test report would be generated as 
the overall report.

My initial intention was to fix the coverage generation so as to enable 
parallelization of the modularized tests; then the CI feedback loop will be a 
lot shorter.

I do agree that a shared spark context targets the slowness at its root cause 
and will be more effective. Will prioritize the spark context work then.

> Parallelize CI testing to reduce CI wait time
> -
>
> Key: HUDI-896
> URL: https://issues.apache.org/jira/browse/HUDI-896
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> - 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on a change in pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-06-09 Thread GitBox


vinothchandar commented on a change in pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#discussion_r436636150



##
File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
##
@@ -54,6 +54,12 @@
   public static final String PARQUET_SMALL_FILE_LIMIT_BYTES = 
"hoodie.parquet.small.file.limit";
   // By default, treat any file <= 100MB as a small file.
   public static final String DEFAULT_PARQUET_SMALL_FILE_LIMIT_BYTES = 
String.valueOf(104857600);
+  // Hudi will use the previous commit to calculate the estimated record size 
by totalBytesWritten/totalRecordsWritten.
+  // If the previous commit is too small to make an accurate estimation, Hudi 
will search commits in the reverse order,
+  // until find a commit has totalBytesWritten larger than 
(PARQUET_SMALL_FILE_LIMIT_BYTES * RECORD_SIZE_ESTIMATION_THRESHOLD)
+  public static final String RECORD_SIZE_ESTIMATION_THRESHOLD = 
"hoodie.record.size.estimation.threshold";

Review comment:
   rename:  RECORD_SIZE_ESTIMATION_THRESHOLD_PROP
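   For context, a sketch of the estimation rule the new comment describes 
(accessor and variable names are assumptions for illustration, not the patch 
itself):
   
   ```java
   // Walk commits newest-first and use bytes/records from the first commit
   // that wrote enough data for the average to be trustworthy.
   long smallFileLimit = 104857600L;  // PARQUET_SMALL_FILE_LIMIT_BYTES default (100MB)
   double threshold = 1.0;            // hoodie.record.size.estimation.threshold (assumed default)
   for (HoodieCommitMetadata commit : commitsNewestFirst) {
     long bytes = commit.fetchTotalBytesWritten();
     long records = commit.fetchTotalRecordsWritten();
     if (bytes > smallFileLimit * threshold && records > 0) {
       return (int) (bytes / records);  // estimated average record size
     }
   }
   return configuredDefaultRecordSize;  // fall back to the configured estimate
   ```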





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nikitap95 edited a comment on issue #1550: Hudi 0.5.2 inability save complex type with nullable = true [SUPPORT]

2020-06-09 Thread GitBox


nikitap95 edited a comment on issue #1550:
URL: https://github.com/apache/hudi/issues/1550#issuecomment-640574748


   @vinothchandar Thanks for your prompt response. Will wait for the release in 
that case rather than using the patch. 
   Sure, I'll get myself added to it, would be great to be a part of it!
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1006) deltastreamer use kafkaSource with offset reset strategy: latest can't consume data

2020-06-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1006:
-
Labels: pull-request-available  (was: )

> deltastreamer use kafkaSource with offset reset strategy: latest can't 
> consume data
> ---
>
> Key: HUDI-1006
> URL: https://issues.apache.org/jira/browse/HUDI-1006
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: Tianye Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> org.apache.hudi.utilities.sources.JsonKafkaSource#fetchNewData
> if (totalNewMsgs <= 0) {
>  return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? 
> lastCheckpointStr.get() : "");
> }
> I think it should not be empty here, it should be 
> if (totalNewMsgs <= 0) {
>  return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? 
> lastCheckpointStr.get() : CheckpointUtils.offsetsToStr(offsetRanges));
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] codecov-commenter commented on pull request #1719: [HUDI-1006]deltastreamer use kafkaSource with offset reset strategy:latest can't consume data

2020-06-09 Thread GitBox


codecov-commenter commented on pull request #1719:
URL: https://github.com/apache/hudi/pull/1719#issuecomment-641096930


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=h1) Report
   > Merging 
[#1719](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/9e07cebece3b4c8b964ddca2f40053734a392ce2=desc)
 will **increase** coverage by `0.03%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1719/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##           master    #1719      +/-   ##
   ============================================
   + Coverage     18.17%   18.21%   +0.03%     
   - Complexity      857      859       +2     
   ============================================
     Files           348      348              
     Lines         15369    15356      -13     
     Branches       1525     1523       -2     
   ============================================
   + Hits           2794     2797       +3     
   + Misses        12217    12201      -16     
     Partials        358      358              
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...org/apache/hudi/config/HoodieCompactionConfig.java](https://codecov.io/gh/apache/hudi/pull/1719/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZUNvbXBhY3Rpb25Db25maWcuamF2YQ==)
 | `55.33% <0.00%> (-0.67%)` | `3.00% <0.00%> (ø%)` | |
   | 
[...he/hudi/table/action/commit/UpsertPartitioner.java](https://codecov.io/gh/apache/hudi/pull/1719/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9VcHNlcnRQYXJ0aXRpb25lci5qYXZh)
 | `55.39% <0.00%> (ø)` | `15.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/config/HoodieWriteConfig.java](https://codecov.io/gh/apache/hudi/pull/1719/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29uZmlnL0hvb2RpZVdyaXRlQ29uZmlnLmphdmE=)
 | `40.71% <0.00%> (+0.63%)` | `50.00% <0.00%> (+2.00%)` | |
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/hudi/pull/1719/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `7.14% <0.00%> (+1.39%)` | `4.00% <0.00%> (ø%)` | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=footer). Last 
update 
[9e07ceb...565f9b4](https://codecov.io/gh/apache/hudi/pull/1719?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #1592: [HUDI-822] decouple Hudi related logics from HoodieInputFormat

2020-06-09 Thread GitBox


garyli1019 commented on pull request #1592:
URL: https://github.com/apache/hudi/pull/1592#issuecomment-640758035


   @vinothchandar this one passed with rebase too



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on pull request #1690: [HUDI-908] Add decimals to HoodieTestDataGenerator

2020-06-09 Thread GitBox


bvaradar commented on pull request #1690:
URL: https://github.com/apache/hudi/pull/1690#issuecomment-641301829


   @shenh062326 : It makes sense to cover the other data types in a single PR. 
Can you also add them to this PR? Also, can you let us know what the missing 
data types are?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nandini57 commented on issue #1705: Tracking Hudi Data along transaction time and buisness time

2020-06-09 Thread GitBox


nandini57 commented on issue #1705:
URL: https://github.com/apache/hudi/issues/1705#issuecomment-640599130


   Yes Balaji. Each record can have 4 columns: IN_Z, OUT_Z (system dimension), 
FROM_Z, THRU_Z (business dimension). If you see the code above, I am creating 
different unique keys and splitting the merge operation into delete + insert.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf merged pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-06-09 Thread GitBox


leesf merged pull request #1652:
URL: https://github.com/apache/hudi/pull/1652


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #1714: [HUDI-1005] fix NPE in HoodieWriteClient.clean

2020-06-09 Thread GitBox


vinothchandar merged pull request #1714:
URL: https://github.com/apache/hudi/pull/1714


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] UZi5136225 commented on pull request #1095: [HUDI-210] Implement prometheus metrics reporter

2020-06-09 Thread GitBox


UZi5136225 commented on pull request #1095:
URL: https://github.com/apache/hudi/pull/1095#issuecomment-641078938


   @xushiyan hello, how is the progress?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1683: Updating release docs for release-0.5.3

2020-06-09 Thread GitBox


nsivabalan commented on a change in pull request #1683:
URL: https://github.com/apache/hudi/pull/1683#discussion_r436711094



##
File path: docs/_pages/releases.md
##
@@ -3,8 +3,40 @@ title: "Releases"
 permalink: /releases
 layout: releases
 toc: true
-last_modified_at: 2019-12-30T15:59:57-04:00
+last_modified_at: 2020-05-28T08:40:00-07:00
 ---
+## [Release 0.5.3](https://github.com/apache/hudi/releases/tag/release-0.5.3) 
([docs](/docs/0.5.3-quick-start-guide.html))
+
+### Download Information
+ * Source Release : [Apache Hudi 0.5.3 Source 
Release](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz.sha512))
+ * Apache Hudi jars corresponding to this release are available 
[here](https://repository.apache.org/#nexus-search;quick~hudi)
+
+### Release Highlights
+ * Since this is a bug fix release, there are not many new features as such. 
We will call out a few features and then go over some of the improvements and 
notable bug fixes. 
+ * `Features`: 
+ * Added support for `aliyun OSS` and `Presto MOR query` support to Hudi
+ 
+ * `Improvements`: 
+ * Improved write performance in creating Parquet DataSource after writes

Review comment:
   I didn't list all bugs as such. I probably need someone to assist me here; 
I can work with them to get the right ones. 
   
   btw, wrt upgrading, there is nothing required to be done to upgrade. So I 
did have a line saying the same, but @leesf suggested removing it if there 
isn't anything to be done. So, in my updates to the patch, I removed it. 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #1716: [HUDI-875] Introduce a new pom module named hudi-common-sync

2020-06-09 Thread GitBox


codecov-commenter edited a comment on pull request #1716:
URL: https://github.com/apache/hudi/pull/1716#issuecomment-641229684







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] luoyajun526 opened a new pull request #1720: [HUDI-1003] Handle partitions correctly for syncing hudi non-partitioned table to hive

2020-06-09 Thread GitBox


luoyajun526 opened a new pull request #1720:
URL: https://github.com/apache/hudi/pull/1720


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   This PR aims to create the table normally, without partitions, when syncing 
a hudi non-partitioned table to hive.
   JIRA: https://issues.apache.org/jira/browse/HUDI-1003
   
   ## Brief change log
   
  - Decide whether to reset partitionFields to empty according to 
HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY when building HiveSyncConfig.
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-commenter edited a comment on pull request #1711: [HUDI-974] fix fields out of order in MOR mode when using Hive

2020-06-09 Thread GitBox


codecov-commenter edited a comment on pull request #1711:
URL: https://github.com/apache/hudi/pull/1711#issuecomment-640326551


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1711?src=pr=h1) Report
   > Merging 
[#1711](https://codecov.io/gh/apache/hudi/pull/1711?src=pr=desc) into 
[master](https://codecov.io/gh/apache/hudi/commit/acb1ada2f756b49d9f9a0aa152f99fcc9e86dde7=desc)
 will **decrease** coverage by `54.05%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/1711/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1711?src=pr=tree)
   
   ```diff
   @@              Coverage Diff               @@
   ##           master    #1711       +/-   ##
   =============================================
   - Coverage     72.25%   18.20%   -54.06%     
   - Complexity      294      858      +564     
   =============================================
     Files           374      348       -26     
     Lines         16371    15361     -1010     
     Branches       1654     1525      -129     
   =============================================
   - Hits          11829     2796     -9033     
   - Misses         3806    12207     +8401     
   + Partials        736      358      -378     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/1711?src=pr=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../hadoop/realtime/RealtimeUnmergedRecordReader.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3JlYWx0aW1lL1JlYWx0aW1lVW5tZXJnZWRSZWNvcmRSZWFkZXIuamF2YQ==)
 | `0.00% <0.00%> (-96.97%)` | `0.00 <0.00> (ø)` | |
   | 
[.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | 
[...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [336 
more](https://codecov.io/gh/apache/hudi/pull/1711/diff?src=pr=tree-more) | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/hudi/pull/1711?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute  (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/hudi/pull/1711?src=pr=footer). Last 
update 
[acb1ada...132c811](https://codecov.io/gh/apache/hudi/pull/1711?src=pr=lastupdated).
 Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



[GitHub] [hudi] n3nash commented on pull request #1638: HUDI-515 Resolve API conflict for Hive 2 & Hive 3

2020-06-09 Thread GitBox


n3nash commented on pull request #1638:
URL: https://github.com/apache/hudi/pull/1638#issuecomment-640892145


   LGTM







[GitHub] [hudi] wangxianghu closed pull request #1665: [HUDI-910]Introduce HoodieWriteInput for hudi write client

2020-06-09 Thread GitBox


wangxianghu closed pull request #1665:
URL: https://github.com/apache/hudi/pull/1665


   







[GitHub] [hudi] garyli1019 commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-06-09 Thread GitBox


garyli1019 commented on pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#issuecomment-640757660


   @vinothchandar CI passed with rebase.







[GitHub] [hudi] leesf commented on pull request #1652: [HUDI-918] Fix kafkaOffsetGen can not read kafka data bug

2020-06-09 Thread GitBox


leesf commented on pull request #1652:
URL: https://github.com/apache/hudi/pull/1652#issuecomment-640580726


   merging this. cc @garyli1019 







[GitHub] [hudi] codecov-commenter commented on pull request #1716: [HUDI-875] Introduce a new pom module named hudi-common-sync

2020-06-09 Thread GitBox


codecov-commenter commented on pull request #1716:
URL: https://github.com/apache/hudi/pull/1716#issuecomment-641229684


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1716?src=pr&el=h1) Report
   > Merging [#1716](https://codecov.io/gh/apache/hudi/pull/1716?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/acb1ada2f756b49d9f9a0aa152f99fcc9e86dde7?el=desc) will **decrease** coverage by `54.04%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/1716/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1716?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1716       +/-   ##
   =============================================
   - Coverage     72.25%   18.21%   -54.05%     
   - Complexity      294      859      +565     
   =============================================
     Files           374      348       -26     
     Lines         16371    15356     -1015     
     Branches       1654     1523      -131     
   =============================================
   - Hits          11829     2797     -9032     
   - Misses         3806    12201     +8395     
   + Partials        736      358      -378     
   ```
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/1716?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/metrics/MetricsReporter.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0cmljcy9NZXRyaWNzUmVwb3J0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/common/model/ActionType.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0FjdGlvblR5cGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...java/org/apache/hudi/io/HoodieRangeInfoHandle.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vSG9vZGllUmFuZ2VJbmZvSGFuZGxlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [.../java/org/apache/hudi/hadoop/InputPathHandler.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL0lucHV0UGF0aEhhbmRsZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...a/org/apache/hudi/exception/HoodieIOException.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUlPRXhjZXB0aW9uLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...org/apache/hudi/table/action/commit/SmallFile.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NvbW1pdC9TbWFsbEZpbGUuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...rg/apache/hudi/index/bloom/KeyRangeLookupTree.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW5kZXgvYmxvb20vS2V5UmFuZ2VMb29rdXBUcmVlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...g/apache/hudi/exception/HoodieInsertException.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUluc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | [...g/apache/hudi/exception/HoodieUpsertException.java](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVVwc2VydEV4Y2VwdGlvbi5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (ø%)` | |
   | ... and [335 more](https://codecov.io/gh/apache/hudi/pull/1716/diff?src=pr&el=tree-more) | |
   
   --
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/hudi/pull/1716?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/hudi/pull/1716?src=pr&el=footer). Last update [acb1ada...48f17bc](https://codecov.io/gh/apache/hudi/pull/1716?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   



[GitHub] [hudi] leesf merged pull request #1711: [HUDI-974] fix fields out of order in MOR mode when using Hive

2020-06-09 Thread GitBox


leesf merged pull request #1711:
URL: https://github.com/apache/hudi/pull/1711


   







[GitHub] [hudi] vinothchandar commented on pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-06-09 Thread GitBox


vinothchandar commented on pull request #1602:
URL: https://github.com/apache/hudi/pull/1602#issuecomment-640555982











[GitHub] [hudi] wangxianghu commented on pull request #1665: [HUDI-910]Introduce HoodieWriteInput for hudi write client

2020-06-09 Thread GitBox


wangxianghu commented on pull request #1665:
URL: https://github.com/apache/hudi/pull/1665#issuecomment-641275707











[GitHub] [hudi] vinothchandar commented on issue #1550: Hudi 0.5.2 inability save complex type with nullable = true [SUPPORT]

2020-06-09 Thread GitBox


vinothchandar commented on issue #1550:
URL: https://github.com/apache/hudi/issues/1550#issuecomment-640542938


   @nsivabalan is driving the release. We are planning to do a 0.5.3 release this week, right Siva?







[GitHub] [hudi] xushiyan commented on pull request #1707: [HUDI-988] fix more unit tests flakiness

2020-06-09 Thread GitBox


xushiyan commented on pull request #1707:
URL: https://github.com/apache/hudi/pull/1707#issuecomment-640766975











[GitHub] [hudi] vinothchandar merged pull request #1602: [HUDI-494] fix incorrect record size estimation

2020-06-09 Thread GitBox


vinothchandar merged pull request #1602:
URL: https://github.com/apache/hudi/pull/1602


   







[GitHub] [hudi] nsivabalan commented on a change in pull request #1712: Cherry picking HUDI-988 and HUDI-990 to release-0.5.3

2020-06-09 Thread GitBox


nsivabalan commented on a change in pull request #1712:
URL: https://github.com/apache/hudi/pull/1712#discussion_r436940846



##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/AbstractShellIntegrationTest.java
##
@@ -58,4 +58,13 @@ public void teardown() throws Exception {
   protected static JLineShellComponent getShell() {
 return shell;
   }
-}
\ No newline at end of file
+
+  /**
+   * Helper to prepare string for matching.
+   * @param str Input string.
+   * @return pruned string with non word characters removed.

Review comment:
   This method is not used here; it is used by test classes which are not added to 0.5.3.

##
File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRepairsCommand.java
##
@@ -95,8 +95,9 @@ public void testAddPartitionMetaWithDryRun() throws 
IOException {
 .toArray(String[][]::new);
 String expected = HoodiePrintHelper.print(new String[] 
{HoodieTableHeaderFields.HEADER_PARTITION_PATH,
 HoodieTableHeaderFields.HEADER_METADATA_PRESENT, 
HoodieTableHeaderFields.HEADER_REPAIR_ACTION}, rows);
-
-assertEquals(expected, cr.getResult().toString());
+expected = removeNonWordAndStripSpace(expected);

Review comment:
   Actually, I added this entire class TestRepairsCommand by mistake. 0.5.3 doesn't pull in HUDI-704, and hence I have deleted this file in my latest commit.

##
File path: hudi-client/src/test/java/org/apache/hudi/index/TestHoodieIndex.java
##
@@ -43,9 +44,11 @@ public void setUp() throws Exception {
   }
 
   @After
-  public void tearDown() {
+  public void tearDown() throws IOException {
 cleanupSparkContexts();
-cleanupMetaClient();
+cleanupFileSystem();

Review comment:
   cleanupFileSystem and cleanupTestDataGenerator were added as part of SimpleHoodieIndex. They are not required for 0.5.3.
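
   A minimal sketch of the `removeNonWordAndStripSpace` helper referenced in the diffs above (hypothetical, since the method body is truncated in this excerpt; the real implementation in the Hudi test utilities may differ):

   ```java
   public class StringPruneDemo {
     // Hypothetical sketch matching the quoted javadoc: strip every non-word
     // character (including whitespace and table borders) so CLI table output
     // can be compared without depending on formatting.
     static String removeNonWordAndStripSpace(String str) {
       return str.replaceAll("\\W", "");
     }

     public static void main(String[] args) {
       // "| partition | present |" and "|partition|present|" prune to the
       // same string, so spacing changes no longer break assertions.
       System.out.println(removeNonWordAndStripSpace("| partition | present |"));
       // -> partitionpresent
     }
   }
   ```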









[GitHub] [hudi] shenh062326 commented on pull request #1690: [HUDI-908] Add decimals to HoodieTestDataGenerator

2020-06-09 Thread GitBox


shenh062326 commented on pull request #1690:
URL: https://github.com/apache/hudi/pull/1690#issuecomment-640978958


   @bvaradar Should I add all data types to this PR, or open another PR? My original idea was that this PR fixes the bug in parsing the decimal type, and a separate PR would add the other data types.







[GitHub] [hudi] codecov-commenter commented on pull request #1720: [HUDI-1003] Handle partitions correctly for syncing hudi non-parititioned table to hive

2020-06-09 Thread GitBox


codecov-commenter commented on pull request #1720:
URL: https://github.com/apache/hudi/pull/1720#issuecomment-641194386


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/1720?src=pr&el=h1) Report
   > Merging [#1720](https://codecov.io/gh/apache/hudi/pull/1720?src=pr&el=desc) into [master](https://codecov.io/gh/apache/hudi/commit/22cd824d993bf43d88121ea89bad3a1f23a28518?el=desc) will **decrease** coverage by `0.00%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/1720/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/1720?src=pr&el=tree)
   
   ```diff
   @@            Coverage Diff             @@
   ##            master    #1720      +/-   ##
   ============================================
   - Coverage    18.21%   18.21%   -0.01%     
     Complexity     859      859             
   ============================================
     Files          348      348             
     Lines        15356    15359      +3     
     Branches      1523     1524      +1     
   ============================================
     Hits          2797     2797             
   - Misses       12201    12204      +3     
     Partials       358      358             
   ```
   
   | [Impacted Files](https://codecov.io/gh/apache/hudi/pull/1720?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | [...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala](https://codecov.io/gh/apache/hudi/pull/1720/diff?src=pr&el=tree#diff-aHVkaS1zcGFyay9zcmMvbWFpbi9zY2FsYS9vcmcvYXBhY2hlL2h1ZGkvSG9vZGllU3BhcmtTcWxXcml0ZXIuc2NhbGE=) | `46.74% <0.00%> (-0.85%)` | `0.00 <0.00> (ø)` | |
   
   --
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/hudi/pull/1720?src=pr&el=continue).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/hudi/pull/1720?src=pr&el=footer). Last update [22cd824...d92ce16](https://codecov.io/gh/apache/hudi/pull/1720?src=pr&el=lastupdated). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments).
   







[GitHub] [hudi] shenh062326 commented on pull request #1714: [HUDI-1005] fix NPE in HoodieWriteClient.clean

2020-06-09 Thread GitBox


shenh062326 commented on pull request #1714:
URL: https://github.com/apache/hudi/pull/1714#issuecomment-640974416


   @vinothchandar can you take a look at this?







[GitHub] [hudi] leesf commented on a change in pull request #1711: [HUDI-974] fix fields out of order in MOR mode when using Hive

2020-06-09 Thread GitBox


leesf commented on a change in pull request #1711:
URL: https://github.com/apache/hudi/pull/1711#discussion_r436636431



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeUnmergedRecordReader.java
##
@@ -82,7 +82,7 @@ public RealtimeUnmergedRecordReader(HoodieRealtimeFileSplit 
split, JobConf job,
 false, jobConf.getInt(MAX_DFS_STREAM_BUFFER_SIZE_PROP, 
DEFAULT_MAX_DFS_STREAM_BUFFER_SIZE), record -> {
   // convert Hoodie log record to Hadoop AvroWritable and buffer
   GenericRecord rec = (GenericRecord) 
record.getData().getInsertValue(getReaderSchema()).get();
-  ArrayWritable aWritable = (ArrayWritable) avroToArrayWritable(rec, 
getWriterSchema());
+  ArrayWritable aWritable = (ArrayWritable) avroToArrayWritable(rec, 
getHiveSchema());

Review comment:
   @vinothchandar This is a bug when using RealtimeUnmergedRecordReader; the merged record reader works fine. The unmerged record reader should use the Hive schema rather than the writer schema here.
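
   A minimal, hypothetical sketch of the ordering issue described above (illustrative only, with made-up field names; not code from this PR). Hive maps columns by position, so a conversion that walks schema fields in order produces different layouts for a writer schema and a Hive schema that list the same fields in different orders:

   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaBuilder;

   public class SchemaOrderDemo {
     public static void main(String[] args) {
       // Hypothetical writer schema: fields in the order they were written.
       Schema writer = SchemaBuilder.record("rec").fields()
           .requiredString("name").requiredDouble("price").requiredLong("ts")
           .endRecord();
       // Hypothetical Hive schema: the same fields in the table's column order.
       Schema hive = SchemaBuilder.record("rec").fields()
           .requiredLong("ts").requiredString("name").requiredDouble("price")
           .endRecord();
       // A positional conversion (such as avroToArrayWritable) emits values in
       // schema-field order, so converting with the writer schema while Hive
       // reads columns by position puts values under the wrong columns.
       System.out.println(writer.getFields()); // name, price, ts
       System.out.println(hive.getFields());   // ts, name, price
     }
   }
   ```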









[GitHub] [hudi] shenh062326 opened a new pull request #1718: [HUDI-1016] [Minor] Code optimization

2020-06-09 Thread GitBox


shenh062326 opened a new pull request #1718:
URL: https://github.com/apache/hudi/pull/1718


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   * Code optimization in MergeOnReadRollbackActionExecutor
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] vinothchandar commented on a change in pull request #1683: Updating release docs for release-0.5.3

2020-06-09 Thread GitBox


vinothchandar commented on a change in pull request #1683:
URL: https://github.com/apache/hudi/pull/1683#discussion_r434991598



##
File path: docs/_pages/releases.md
##
@@ -3,8 +3,40 @@ title: "Releases"
 permalink: /releases
 layout: releases
 toc: true
-last_modified_at: 2019-12-30T15:59:57-04:00
+last_modified_at: 2020-05-28T08:40:00-07:00
 ---
+## [Release 0.5.3](https://github.com/apache/hudi/releases/tag/release-0.5.3) 
([docs](/docs/0.5.3-quick-start-guide.html))
+
+### Download Information
+ * Source Release : [Apache Hudi 0.5.3 Source 
Release](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz.sha512))
+ * Apache Hudi jars corresponding to this release is available 
[here](https://repository.apache.org/#nexus-search;quick~hudi)
+
+### Release Highlights
+ * Since this is a bug fix release, not a lot of new features as such. Will 
call out a few features and then will go over some of the improvements and some 
of the notable bug fixes. 
+ * `Features`: 
+ * Added support for `aliyun OSS` and `Presto MOR query` support to Hudi

Review comment:
   we should probably not announce Presto MOR support now? @bhasudha my 
understanding is that its not ready.. 

##
File path: docs/_pages/releases.md
##
@@ -3,8 +3,40 @@ title: "Releases"
 permalink: /releases
 layout: releases
 toc: true
-last_modified_at: 2019-12-30T15:59:57-04:00
+last_modified_at: 2020-05-28T08:40:00-07:00
 ---
+## [Release 0.5.3](https://github.com/apache/hudi/releases/tag/release-0.5.3) 
([docs](/docs/0.5.3-quick-start-guide.html))
+
+### Download Information
+ * Source Release : [Apache Hudi 0.5.3 Source 
Release](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz) 
([asc](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz.asc), 
[sha512](https://downloads.apache.org/hudi/0.5.3/hudi-0.5.3.src.tgz.sha512))
+ * Apache Hudi jars corresponding to this release is available 
[here](https://repository.apache.org/#nexus-search;quick~hudi)
+
+### Release Highlights
+ * Since this is a bug fix release, not a lot of new features as such. Will 
call out a few features and then will go over some of the improvements and some 
of the notable bug fixes. 
+ * `Features`: 
+ * Added support for `aliyun OSS` and `Presto MOR query` support to Hudi
+ 
+ * `Improvements`: 
+ * Improved write performance in creating Parquet DataSource after writes

Review comment:
   Can we make this user-facing and understandable like the previous notes? We need not list all the bugs/commit messages here per se.
   
   We also need a section on upgrading, which asks the user to consider the steps involved in upgrading from previous versions; i.e., if upgrading from 0.5.0 to 0.5.3, the 0.5.1 and 0.5.2 upgrade steps also need to be taken into consideration.








