[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1126: Fix Error: java.lang.IllegalArgumentException: Can not create a Path from an empty string

2019-12-23 Thread GitBox
cdmikechen commented on a change in pull request #1126: Fix Error: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string
URL: https://github.com/apache/incubator-hudi/pull/1126#discussion_r361095141
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -109,7 +109,7 @@ public HoodieCopyOnWriteTable(HoodieWriteConfig config, JavaSparkContext jsc) {
 Tuple2<String, String> partitionDelFileTuple = iter.next();
 String partitionPath = partitionDelFileTuple._1();
 String delFileName = partitionDelFileTuple._2();
-Path deletePath = new Path(new Path(basePath, partitionPath), delFileName);
+Path deletePath = FSUtils.getPartitionPath(FSUtils.getPartitionPath(basePath, partitionPath), delFileName);
 
 Review comment:
   @vinothchandar
   Yeah, next time I will open a JIRA issue first and then decide whether to submit a PR for the question.
   Going by the name, it would be better to just modify `FSUtils.getPartitionPath(basePath, partitionPath)`. Maybe I should revert and open another PR?
   In Hadoop 2.7+, I found that the `new Path()` API applies stricter checks than before; in Hadoop versions below 2.7, calling `new Path(path, null)` with a null child did not report an error.
   If it's a matter of API naming, would it be better to rename it to something like `buildMultiPath()`? Then all similar call sites could use it, ensuring that no exception occurs.
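   For illustration, a hedged sketch of the `buildMultiPath()` idea floated above (a hypothetical helper, not actual Hudi code): join any number of segments while skipping null or empty ones, so Hadoop's stricter `new Path(parent, child)` checks are never hit.
   ```java
   import org.apache.hadoop.fs.Path;

   public final class PathBuilderSketch {
     private PathBuilderSketch() {}

     // Hypothetical helper: append each non-null, non-empty segment to the base
     // path; empty segments (e.g. a non-partitioned table's partition path) are
     // simply skipped instead of triggering IllegalArgumentException.
     public static Path buildMultiPath(Path base, String... segments) {
       Path result = base;
       for (String segment : segments) {
         if (segment != null && !segment.isEmpty()) {
           result = new Path(result, segment);
         }
       }
       return result;
     }
   }
   ```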


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1126: Fix Error: java.lang.IllegalArgumentException: Can not create a Path from an empty string

2019-12-23 Thread GitBox
cdmikechen commented on a change in pull request #1126: Fix Error: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string
URL: https://github.com/apache/incubator-hudi/pull/1126#discussion_r361095095
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -109,7 +109,7 @@ public HoodieCopyOnWriteTable(HoodieWriteConfig config, JavaSparkContext jsc) {
 Tuple2<String, String> partitionDelFileTuple = iter.next();
 String partitionPath = partitionDelFileTuple._1();
 String delFileName = partitionDelFileTuple._2();
-Path deletePath = new Path(new Path(basePath, partitionPath), delFileName);
+Path deletePath = FSUtils.getPartitionPath(FSUtils.getPartitionPath(basePath, partitionPath), delFileName);
 
 Review comment:
   @vinothchandar
   Going by the name, it would be better to just modify `FSUtils.getPartitionPath(basePath, partitionPath)`. Maybe I should revert and open another PR?
   In Hadoop 2.7+, I found that the `new Path()` API applies stricter checks than before; in Hadoop versions below 2.7, calling `new Path(path, null)` with a null child did not report an error.
   If it's a matter of API naming, would it be better to rename it to something like `buildMultiPath()`? Then all similar call sites could use it, ensuring that no exception occurs.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch hudi_test_suite_refactor updated (66463ff -> 09c34a0)

2019-12-23 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 66463ff  [MINOR] Fix compile error about the deletion of 
HoodieActiveTimeline#createNewCommitTime
 add 09c34a0  [HUDI-442] Fix 
TestComplexKeyGenerator#testSingleValueKeyGenerator and 
testMultipleValueKeyGenerator NPE

No new revisions were added by this update.

Summary of changes:
 hudi-spark/src/main/java/org/apache/hudi/ComplexKeyGenerator.java | 5 -
 1 file changed, 5 deletions(-)



[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1127: [MINOR] Set info severity for ImportOrder temporarily

2019-12-23 Thread GitBox
lamber-ken opened a new pull request #1127: [MINOR] Set info severity for 
ImportOrder temporarily
URL: https://github.com/apache/incubator-hudi/pull/1127
 
 
   ## What is the purpose of the pull request
   
   Many developers in the community feel uncomfortable with this rule, even though it enforces the ordering/grouping of imports. We need to disable it until we find a good way to handle it.
   
   ## Brief change log
   
 - Set info severity for the ImportOrder rule
   
   ## Verify this pull request
   
   This pull request is a code cleanup without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1126: Fix Error: java.lang.IllegalArgumentException: Can not create a Path from an empty string

2019-12-23 Thread GitBox
yanghua commented on a change in pull request #1126: Fix Error: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string
URL: https://github.com/apache/incubator-hudi/pull/1126#discussion_r361082265
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -109,7 +109,7 @@ public HoodieCopyOnWriteTable(HoodieWriteConfig config, JavaSparkContext jsc) {
 Tuple2<String, String> partitionDelFileTuple = iter.next();
 String partitionPath = partitionDelFileTuple._1();
 String delFileName = partitionDelFileTuple._2();
-Path deletePath = new Path(new Path(basePath, partitionPath), delFileName);
+Path deletePath = FSUtils.getPartitionPath(FSUtils.getPartitionPath(basePath, partitionPath), delFileName);
 
 Review comment:
   > Does this simply need a `new Path(FSUtils.getPartitionPath(basePath, partitionPath), delFileName)`? We are overloading the use of `FSUtils.getPartitionPath` for something that's not a partition path?
   
   I saw that `FSUtils#getPartitionPath` only guards against the second argument being null. However, for the outer `new Path(..., delFileName)`, can `delFileName` not be null?
   
   > Also can we please ensure there is a JIRA for the fix.. or we follow the `[MINOR] ...` convention?
   > 
   
   Agree, I mentioned this in the approval comment. I am not sure whether `[MINOR]` is an accepted prefix when there is no Jira issue to track the change. IMO, we should clearly record this somewhere. Or does such a convention exist and I just don't know about it?
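   For reference, a hedged sketch of the kind of guard `FSUtils#getPartitionPath` is assumed to provide (not the actual Hudi implementation): only the partition-path argument is protected, so a null or empty `delFileName` handed to an outer `new Path(...)` would still fail.
   ```java
   import org.apache.hadoop.fs.Path;

   public final class PartitionPathGuardSketch {
     // Hedged sketch: fall back to the base path when the partition path is
     // null or empty, instead of letting new Path(base, "") throw.
     public static Path getPartitionPath(Path basePath, String partitionPath) {
       return (partitionPath == null || partitionPath.isEmpty())
           ? basePath
           : new Path(basePath, partitionPath);
     }
   }
   ```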
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1126: Fix Error: java.lang.IllegalArgumentException: Can not create a Path from an empty string

2019-12-23 Thread GitBox
vinothchandar commented on a change in pull request #1126: Fix Error: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string
URL: https://github.com/apache/incubator-hudi/pull/1126#discussion_r361079699
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
 ##
 @@ -109,7 +109,7 @@ public HoodieCopyOnWriteTable(HoodieWriteConfig config, JavaSparkContext jsc) {
 Tuple2<String, String> partitionDelFileTuple = iter.next();
 String partitionPath = partitionDelFileTuple._1();
 String delFileName = partitionDelFileTuple._2();
-Path deletePath = new Path(new Path(basePath, partitionPath), delFileName);
+Path deletePath = FSUtils.getPartitionPath(FSUtils.getPartitionPath(basePath, partitionPath), delFileName);
 
 Review comment:
   Does this simply need a `new Path(FSUtils.getPartitionPath(basePath, partitionPath), delFileName)`? We are overloading the use of `FSUtils.getPartitionPath` for something that's not a partition path?
   
   Also can we please ensure there is a JIRA for the fix.. or we follow the `[MINOR] ...` convention?
   
   @cdmikechen @yanghua wdyt


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

2019-12-23 Thread GitBox
cdmikechen commented on a change in pull request #1119: Fix: 
HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361078012
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
 long totalInsertRecordsWritten = 0;
 for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
   for (HoodieWriteStat stat : stats) {
-if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash 
   It may be in 
https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java#L165
   ```java
 HoodieWriteStat stat = new HoodieWriteStat();
 stat.setPartitionPath(writeStatus.getPartitionPath());
 stat.setNumWrites(recordsWritten);
 stat.setNumDeletes(recordsDeleted);
 stat.setNumInserts(insertRecordsWritten);
 stat.setPrevCommit(HoodieWriteStat.NULL_COMMIT);
 stat.setFileId(writeStatus.getFileId());
 stat.setPath(new Path(config.getBasePath()), path);
 long fileSizeInBytes = FSUtils.getFileSize(fs, path);
 stat.setTotalWriteBytes(fileSizeInBytes);
 stat.setFileSizeInBytes(fileSizeInBytes);
 stat.setTotalWriteErrors(writeStatus.getTotalErrorRecords());
 RuntimeStats runtimeStats = new RuntimeStats();
 runtimeStats.setTotalCreateTime(timer.endTimer());
 stat.setRuntimeStats(runtimeStats);
 writeStatus.setStat(stat);
   ```
   In `org.apache.hudi.common.model.HoodieWriteStat`, the first commit's prevCommit is the literal string "null"; in other cases prevCommit is set to a real commit time.
   ```java
   public static final String NULL_COMMIT = "null";
   ```
   So `fetchTotalFilesInsert` can recognize files from the first commit and pass the condition, while `fetchTotalInsertRecordsWritten` cannot.
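   As a hedged sketch of the kind of change being discussed (not necessarily the exact patch, and assuming a `getNumInserts()` accessor matching the `setNumInserts` call above), the counter could simply stop requiring the literal "null" marker:
   ```java
   public long fetchTotalInsertRecordsWritten() {
     long totalInsertRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
         // Count inserts whether prevCommit is the "null" marker (first commit)
         // or a real commit time.
         if (stat.getPrevCommit() != null) {
           totalInsertRecordsWritten += stat.getNumInserts();
         }
       }
     }
     return totalInsertRecordsWritten;
   }
   ```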


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (350b0ec -> 8172197)

2019-12-23 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 350b0ec  [HUDI-311] : Support for AWS Database Migration Service in 
DeltaStreamer
 add 8172197  Fix Error: java.lang.IllegalArgumentException: Can not create 
a Path from an empty string in HoodieCopyOnWrite#deleteFilesFunc (#1126)

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [incubator-hudi] yanghua merged pull request #1126: Fix Error: java.lang.IllegalArgumentException: Can not create a Path from an empty string

2019-12-23 Thread GitBox
yanghua merged pull request #1126: Fix Error: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string
URL: https://github.com/apache/incubator-hudi/pull/1126
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-311) Support AWS DMS source on DeltaStreamer

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-311:

Status: Closed  (was: Patch Available)

> Support AWS DMS source on DeltaStreamer
> ---
>
> Key: HUDI-311
> URL: https://issues.apache.org/jira/browse/HUDI-311
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://aws.amazon.com/dms/ seems like a one-stop shop for database change 
> logs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-311) Support AWS DMS source on DeltaStreamer

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-311:

Status: Patch Available  (was: In Progress)

> Support AWS DMS source on DeltaStreamer
> ---
>
> Key: HUDI-311
> URL: https://issues.apache.org/jira/browse/HUDI-311
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://aws.amazon.com/dms/ seems like a one-stop shop for database change 
> logs. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-445) Refactor the codes based on scala codestyle BlockImportChecker rule

2019-12-23 Thread Jiaqi Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002683#comment-17002683
 ] 

Jiaqi Li commented on HUDI-445:
---

Ok. Thanks!:)

> Refactor the codes based on scala codestyle BlockImportChecker rule
> ---
>
> Key: HUDI-445
> URL: https://issues.apache.org/jira/browse/HUDI-445
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Critical
>
> Refactor the codes based on scala codestyle BlockImportChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch hudi_test_suite_refactor updated (1d2ecbc -> 66463ff)

2019-12-23 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 1d2ecbc  [HUDI-391] Rename module name from hudi-bench to 
hudi-test-suite and fix some checkstyle issues (#1102)
 add 66463ff  [MINOR] Fix compile error about the deletion of 
HoodieActiveTimeline#createNewCommitTime

No new revisions were added by this update.

Summary of changes:
 .../src/main/java/org/apache/hudi/testsuite/writer/DeltaWriter.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[jira] [Updated] (HUDI-383) Introduce TransactionHandle abstraction to manage state transitions in hudi clients

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-383:

Component/s: code cleanup

> Introduce TransactionHandle abstraction to manage state transitions in hudi 
> clients
> ---
>
> Key: HUDI-383
> URL: https://issues.apache.org/jira/browse/HUDI-383
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Cleaner, code cleanup, Compaction, Write Client
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Minor
>
> Came up in review comment. 
> https://github.com/apache/incubator-hudi/pull/1009/files#r347705820



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-48) Re-factor/clean up lazyBlockReading use in HoodieCompactedLogScanner #339

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-48:
---
Component/s: code cleanup

> Re-factor/clean up lazyBlockReading use in HoodieCompactedLogScanner #339
> -
>
> Key: HUDI-48
> URL: https://issues.apache.org/jira/browse/HUDI-48
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: code cleanup, Compaction, Storage Management, Write 
> Client
>Reporter: Vinoth Chandar
>Priority: Major
>
> https://github.com/uber/hudi/issues/339



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-284) Need Tests for Hudi handling of schema evolution

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-284:

Component/s: newbie

> Need  Tests for Hudi handling of schema evolution
> -
>
> Key: HUDI-284
> URL: https://issues.apache.org/jira/browse/HUDI-284
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Common Core, newbie
>Reporter: Balaji Varadarajan
>Priority: Major
>
> Context in : 
> https://github.com/apache/incubator-hudi/pull/927#pullrequestreview-293449514



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-401) Remove unnecessary use of spark in savepoint timeline

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-401.
---
Resolution: Fixed

> Remove unnecessary use of spark in savepoint timeline
> -
>
> Key: HUDI-401
> URL: https://issues.apache.org/jira/browse/HUDI-401
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI, Write Client
>Reporter: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, a JavaSparkContext is initialized when a savepoint is created, but it is not 
> necessary. The JavaSparkContext's only job there is to provide the Hadoop configuration, 
> yet it takes time and resources to initialize.
> So we can use the Hadoop config instead of the jsc.
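For illustration only, a hedged sketch of the idea (hypothetical, not the actual patch): a plain Hadoop Configuration is enough to reach the file system, so no JavaSparkContext needs to be initialized just for that.
{code:java}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SavepointFsSketch {
  // Hypothetical illustration: obtain the file system from a Hadoop
  // Configuration directly, with no Spark context involved.
  public static FileSystem openFs() throws IOException {
    return FileSystem.get(new Configuration());
  }
}
{code}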



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-248) CLI doesn't allow rolling back a Delta commit

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-248:

Component/s: Usability

> CLI doesn't allow rolling back a Delta commit
> -
>
> Key: HUDI-248
> URL: https://issues.apache.org/jira/browse/HUDI-248
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: CLI, Usability
>Reporter: Rahul Bhartia
>Priority: Minor
>  Labels: aws-emr
> Fix For: 0.5.1
>
>
> [https://github.com/apache/incubator-hudi/blob/master/hudi-cli/src/main/java/org/apache/hudi/cli/commands/CommitsCommand.java#L128]
>  
> When trying to find a match for the passed-in commit value, the "commit rollback" 
> command always defaults to using HoodieTimeline.COMMIT_ACTION - and hence 
> doesn't allow rolling back delta commits.
> Note: delta commits can be rolled back using a HoodieWriteClient, so it seems 
> like it's just a matter of matching against both COMMIT_ACTION and 
> DELTA_COMMIT_ACTION in the CLI.
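A hedged sketch of the suggested matching logic (a hypothetical helper, not the actual CLI code), using the HoodieInstant/HoodieTimeline APIs referenced above:
{code:java}
// Accept an instant for rollback if its action is either a regular commit or
// a delta commit, instead of hard-coding COMMIT_ACTION.
private static boolean matchesRollbackTarget(HoodieInstant instant, String commitTime) {
  boolean actionMatches = HoodieTimeline.COMMIT_ACTION.equals(instant.getAction())
      || HoodieTimeline.DELTA_COMMIT_ACTION.equals(instant.getAction());
  return actionMatches && instant.getTimestamp().equals(commitTime);
}
{code}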



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2019-12-23 Thread GitBox
yanghua commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-568665535
 
 
   > @yanghua Can you check why the build is failing now ?
   
   I saw that `HoodieActiveTimeline#createNewCommitTime` was renamed to `HoodieActiveTimeline#createNewInstantTime` after HUDI-308 was merged.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-376) AWS Glue dependency issue for EMR 5.28.0

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-376:

Component/s: (was: CLI)
 Usability

> AWS Glue dependency issue for EMR 5.28.0
> 
>
> Key: HUDI-376
> URL: https://issues.apache.org/jira/browse/HUDI-376
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Xing Pan
>Priority: Minor
> Fix For: 0.5.1
>
>
> Hi Hudi team, it's really encouraging that Hudi is finally an officially 
> supported application on AWS EMR. Great job!
> I found a *ClassNotFound* exception when using:
> {code:java}
> /usr/lib/hudi/bin/run_sync_tool.sh
> {code}
> on the EMR master.
> I think it is due to a missing AWS Glue Data Catalog SDK dependency. (I use AWS 
> Glue as the Hive metastore.)
> So I added a line to run_sync_tool.sh as a quick fix for this:
> {code:java}
> HIVE_JARS=$HIVE_JARS:/usr/lib/hive/auxlib/aws-glue-datacatalog-hive2-client.jar:/usr/share/aws/emr/emr-metrics-collector/lib/aws-java-sdk-glue-1.11.475.jar
> {code}
> Not sure if any more jars are needed, but these two jars fixed my problem.
>  
> I think it would be great to take Glue into consideration for the EMR scripts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-238) Make separate release for hudi spark/scala based packages for scala 2.12

2019-12-23 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-238:

Component/s: Usability

> Make separate release for hudi spark/scala based packages for scala 2.12 
> -
>
> Key: HUDI-238
> URL: https://issues.apache.org/jira/browse/HUDI-238
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: asf-migration, Usability
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> [https://github.com/apache/incubator-hudi/issues/881#issuecomment-528700749]
> Suspects: 
> h3. Hudi utilities package 
> bringing in spark-streaming-kafka-0.8* 
> {code:java}
> [INFO] Scanning for projects...
> [INFO] 
> [INFO] ---< org.apache.hudi:hudi-utilities 
> >---
> [INFO] Building hudi-utilities 0.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [INFO] 
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-utilities 
> ---
> [INFO] org.apache.hudi:hudi-utilities:jar:0.5.0-SNAPSHOT
> [INFO] ...
> [INFO] +- org.apache.hudi:hudi-client:jar:0.5.0-SNAPSHOT:compile
>...
> [INFO] 
> [INFO] +- org.apache.hudi:hudi-spark:jar:0.5.0-SNAPSHOT:compile
> [INFO] |  \- org.scala-lang:scala-library:jar:2.11.8:compile
> [INFO] +- log4j:log4j:jar:1.2.17:compile
>...
> [INFO] +- org.apache.spark:spark-core_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  +- com.twitter:chill_2.11:jar:0.8.0:provided
> [INFO] |  +- com.twitter:chill-java:jar:0.8.0:provided
> [INFO] |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  +- org.apache.spark:spark-launcher_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-common_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-unsafe_2.11:jar:2.1.0:provided
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.5:provided
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:provided
> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:provided
> [INFO] |  +- org.slf4j:slf4j-api:jar:1.7.16:compile
> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:provided
> [INFO] |  +- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.3.0:compile
> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:provided
> [INFO] |  +- commons-net:commons-net:jar:2.2:provided
>
> [INFO] +- org.apache.spark:spark-sql_2.11:jar:2.1.0:provided
> [INFO] |  +- com.univocity:univocity-parsers:jar:2.2.1:provided
> [INFO] |  +- org.apache.spark:spark-sketch_2.11:jar:2.1.0:provided
> [INFO] |  \- org.apache.spark:spark-catalyst_2.11:jar:2.1.0:provided
> [INFO] | +- org.codehaus.janino:janino:jar:3.0.0:provided
> [INFO] | +- org.codehaus.janino:commons-compiler:jar:3.0.0:provided
> [INFO] | \- org.antlr:antlr4-runtime:jar:4.5.3:provided
> [INFO] +- com.databricks:spark-avro_2.11:jar:4.0.0:provided
> [INFO] +- org.apache.spark:spark-streaming_2.11:jar:2.1.0:compile
> [INFO] +- org.apache.spark:spark-streaming-kafka-0-8_2.11:jar:2.1.0:compile
> [INFO] |  \- org.apache.kafka:kafka_2.11:jar:0.8.2.1:compile
> [INFO] | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.2:compile
> [INFO] | +- 
> org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.2:compile
> [INFO] | \- org.apache.kafka:kafka-clients:jar:0.8.2.1:compile
> [INFO] +- io.dropwizard.metrics:metrics-core:jar:4.0.2:compile
> [INFO] +- org.antlr:stringtemplate:jar:4.0.2:compile
> [INFO] |  \- org.antlr:antlr-runtime:jar:3.3:compile
> [INFO] +- com.beust:jcommander:jar:1.72:compile
> [INFO] +- com.twitter:bijection-avro_2.11:jar:0.9.2:compile
> [INFO] |  \- com.twitter:bijection-core_2.11:jar:0.9.2:compile
> [INFO] +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
> [INFO] +- io.confluent:common-config:jar:3.0.0:compile
> [INFO] +- io.confluent:common-utils:jar:3.0.0:compile
> [INFO] |  \- com.101tec:zkclient:jar:0.5:compile
> [INFO] +- io.confluent:kafka-schema-registry-client:jar:3.0.0:compile
> [INFO] \- 

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

2019-12-23 Thread GitBox
cdmikechen commented on a change in pull request #1119: Fix: 
HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361074375
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
 long totalInsertRecordsWritten = 0;
 for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
   for (HoodieWriteStat stat : stats) {
-if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash
   Sorry, I may have misunderstood what you meant. I'll take another look at what I've tested later.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-445) Refactor the codes based on scala codestyle BlockImportChecker rule

2019-12-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002678#comment-17002678
 ] 

lamber-ken commented on HUDI-445:
-

You can fix it first. Follow [http://hudi.apache.org/contributing.html]

> Refactor the codes based on scala codestyle BlockImportChecker rule
> ---
>
> Key: HUDI-445
> URL: https://issues.apache.org/jira/browse/HUDI-445
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Critical
>
> Refactor the codes based on scala codestyle BlockImportChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-445) Refactor the codes based on scala codestyle BlockImportChecker rule

2019-12-23 Thread Jiaqi Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002677#comment-17002677
 ] 

Jiaqi Li commented on HUDI-445:
---

Yes, I have applied for contributor permissions by email. No reply yet. ;)

> Refactor the codes based on scala codestyle BlockImportChecker rule
> ---
>
> Key: HUDI-445
> URL: https://issues.apache.org/jira/browse/HUDI-445
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Critical
>
> Refactor the codes based on scala codestyle BlockImportChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch hudi_test_suite_refactor updated (dcfbab1 -> 1d2ecbc)

2019-12-23 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


omit dcfbab1  [HUDI-441] Rename WorkflowDagGenerator and some class names 
in test package
omit 9151ccf  [HUDI-442] Fix 
TestComplexKeyGenerator#testSingleValueKeyGenerator and 
testMultipleValueKeyGenerator NPE (#1118)
omit ae5bd06  [HUDI-391] Rename module name from hudi-bench to 
hudi-test-suite and fix some checkstyle issues (#1102)
omit eaaf3f6  [HUDI-394] Provide a basic implementation of test suite
 add f324057  [MINOR] Unify Lists import (#1103)
 add 8963a68  [HUDI-398]Add spark env set/get for spark launcher (#1096)
 add 9a1f698  [HUDI-308] Avoid Renames for tracking state transitions of 
all actions on dataset
 add 7498ca7  [MINOR] Add slack invite icon in README (#1108)
 add 14881e9  [HUDI-106] Adding support for DynamicBloomFilter (#976)
 add 36b3b6f  [HUDI-415] Get commit time when Spark start (#1113)
 add b284091  [HUDI-386] Refactor hudi scala checkstyle rules (#1099)
 add 313fab5  [HUDI-444] Refactor the codes based on scala codestyle 
ReturnChecker rule (#1121)
 add 350b0ec  [HUDI-311] : Support for AWS Database Migration Service in 
DeltaStreamer
 add 9b55d37  [HUDI-394] Provide a basic implementation of test suite
 add 1d2ecbc  [HUDI-391] Rename module name from hudi-bench to 
hudi-test-suite and fix some checkstyle issues (#1102)

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (dcfbab1)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (1d2ecbc)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 LICENSE|  28 +++
 README.md  |   1 +
 .../main/java/org/apache/hudi/cli/HoodieCLI.java   |  15 +-
 .../hudi/cli/commands/CompactionCommand.java   |  30 +--
 .../apache/hudi/cli/commands/DatasetsCommand.java  |  10 +-
 .../apache/hudi/cli/commands/SparkEnvCommand.java  |  68 ++
 .../java/org/apache/hudi/cli/utils/SparkUtil.java  |   6 +-
 .../scala/org/apache/hudi/cli/DedupeSparkJob.scala |   4 +-
 .../scala/org/apache/hudi/cli/SparkHelpers.scala   |  10 +-
 .../org/apache/hudi/CompactionAdminClient.java |   8 +-
 .../java/org/apache/hudi/HoodieCleanClient.java|  26 ++-
 .../java/org/apache/hudi/HoodieWriteClient.java| 125 ++-
 .../org/apache/hudi/client/utils/ClientUtils.java  |   4 +-
 .../org/apache/hudi/config/HoodieIndexConfig.java  |  11 +-
 .../org/apache/hudi/config/HoodieWriteConfig.java  |  29 ++-
 .../org/apache/hudi/io/HoodieCommitArchiveLog.java |  42 +++-
 .../org/apache/hudi/io/HoodieKeyLookupHandle.java  |   8 +-
 .../io/storage/HoodieStorageWriterFactory.java |   8 +-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  | 128 ++-
 .../apache/hudi/table/HoodieMergeOnReadTable.java  |  42 ++--
 .../java/org/apache/hudi/table/HoodieTable.java|  10 +-
 .../java/org/apache/hudi/TestAsyncCompaction.java  |  10 +-
 .../src/test/java/org/apache/hudi/TestCleaner.java |  31 ++-
 .../java/org/apache/hudi/TestClientRollback.java   |   6 +-
 .../hudi/TestHoodieClientOnCopyOnWriteStorage.java |  47 +++-
 .../apache/hudi/common/HoodieClientTestUtils.java  |  18 +-
 .../hudi/common/HoodieTestDataGenerator.java   |  37 +--
 .../hudi/func/TestBoundedInMemoryExecutor.java |   2 +-
 .../apache/hudi/func/TestBoundedInMemoryQueue.java |   2 +-
 .../java/org/apache/hudi/index/TestHbaseIndex.java |  11 +-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   9 +-
 .../apache/hudi/io/TestHoodieCommitArchiveLog.java |  34 +--
 .../org/apache/hudi/io/TestHoodieCompactor.java|   6 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java|   2 +-
 .../apache/hudi/table/TestMergeOnReadTable.java|  12 +-
 hudi-common/pom.xml|   1 +
 .../src/main/avro/HoodieArchivedMetaEntry.avsc |  22 ++
 .../apache/hudi/avro/HoodieAvroWriteSupport.java   |   7 +-
 .../hudi/common/bloom/filter/BloomFilter.java  |  35 +--
 .../common/bloom/filter/BloomFilterFactory.java|  63 ++
 .../common/bloom/filter/BloomFilterTypeCode.java   |  10 +-
 .../filter/BloomFilterUtils.java}  |  34 +--
 .../filter/HoodieDynamicBoundedBloomFilter.java| 109 +
 

[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

2019-12-23 Thread GitBox
cdmikechen commented on a change in pull request #1119: Fix: 
HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361044029
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
 long totalInsertRecordsWritten = 0;
 for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
   for (HoodieWriteStat stat : stats) {
-if (stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash
   I think the processing logic of this method should be the same as that of `fetchTotalUpdateRecordsWritten`:
   ```java
   public long fetchTotalUpdateRecordsWritten() {
     long totalUpdateRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
         totalUpdateRecordsWritten += stat.getNumUpdateWrites();
       }
     }
     return totalUpdateRecordsWritten;
   }
   ```
   Let me give you an example from my test. I first create a Hudi dataset with 3 rows, then insert a row, and finally insert a row and delete a row. I print each fetch method's return value and check the numbers.
   ```java
   for (HoodieInstant commit : commits) {
   HoodieCommitMetadata commitMetadata = 
HoodieCommitMetadata.fromBytes(timeline.getInstantDetails(commit).get(),
   HoodieCommitMetadata.class);
   
   JSONObject data = new JSONObject();
   data.put("timestamp", commit.getTimestamp());
   data.put("fetchTotalBytesWritten", 
commitMetadata.fetchTotalBytesWritten());
   data.put("fetchTotalFilesInsert", 
commitMetadata.fetchTotalFilesInsert());
   data.put("fetchTotalFilesUpdated", 
commitMetadata.fetchTotalFilesUpdated());
   data.put("fetchTotalPartitionsWritten", 
commitMetadata.fetchTotalPartitionsWritten());
   data.put("fetchTotalRecordsWritten", 
commitMetadata.fetchTotalRecordsWritten());
   data.put("fetchTotalUpdateRecordsWritten", 
commitMetadata.fetchTotalUpdateRecordsWritten());
   data.put("fetchTotalInsertRecordsWritten", 
commitMetadata.fetchTotalInsertRecordsWritten());
   data.put("fetchTotalWriteErrors", 
commitMetadata.fetchTotalWriteErrors());
   
   long totalDeleteRecordsWritten = 0;
   for (List<HoodieWriteStat> stats : commitMetadata.getPartitionToWriteStats().values()) {
   for (HoodieWriteStat stat : stats) {
   if (stat.getPrevCommit() != null) {
   totalDeleteRecordsWritten += stat.getNumDeletes();
   }
   }
   }
   data.put("fetchTotalDeleteRecordsWritten", totalDeleteRecordsWritten);
   
   datas.add(data);
   }
   log.debug("datas = {}", datas);
   ```
   If I use the old condition `(stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null"))`, it will print:
   ```json
   [
{
"fetchTotalBytesWritten": 435258,
"fetchTotalDeleteRecordsWritten": 1,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 0,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923172508"
},
{
"fetchTotalBytesWritten": 435227,
"fetchTotalDeleteRecordsWritten": 0,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 0,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923160439"
},
{
"fetchTotalBytesWritten": 435478,
"fetchTotalDeleteRecordsWritten": 0,
"fetchTotalFilesInsert": 1,
"fetchTotalFilesUpdated": 0,
"fetchTotalInsertRecordsWritten": 3,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 3,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923160340"
}
   ]
   ```
   If I use the modified condition `(stat.getPrevCommit() != null)`, it will print:
   ```json
   [
{
"fetchTotalBytesWritten": 435258,
"fetchTotalDeleteRecordsWritten": 1,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 1,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
 

[jira] [Updated] (HUDI-453) Throw failed to archive commits error when writing data to MOR/COW table

2019-12-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-453:

Description: 
A "Failed to archive commits" error is thrown when writing data to a table; here are the 
reproduction steps.

*1, Build from latest source*
{code:java}
mvn clean package -DskipTests -DskipITs -Dcheckstyle.skip=true -Drat.skip=true
{code}
*2, Write Data*
{code:java}
export SPARK_HOME=/work/BigData/install/spark/spark-2.3.3-bin-hadoop2.6
${SPARK_HOME}/bin/spark-shell --jars `ls 
packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar` --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer'

import org.apache.spark.sql.SaveMode._

var datas = List("{ \"name\": \"kenken\", \"ts\": 1574297893836, \"age\": 12, 
\"location\": \"latitude\"}")
val df = spark.read.json(spark.sparkContext.parallelize(datas, 2))
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.table.name", "hudi_mor_table").
mode(Overwrite).
save("file:///tmp/hudi_mor_table")
{code}
*3, Append Data*
{code:java}
df.write.format("org.apache.hudi").
option("hoodie.insert.shuffle.parallelism", "10").
option("hoodie.upsert.shuffle.parallelism", "10").
option("hoodie.delete.shuffle.parallelism", "10").
option("hoodie.bulkinsert.shuffle.parallelism", "10").
option("hoodie.datasource.write.recordkey.field", "name").
option("hoodie.datasource.write.partitionpath.field", "location").
option("hoodie.datasource.write.precombine.field", "ts").
option("hoodie.keep.max.commits", "5").
option("hoodie.keep.min.commits", "4").
option("hoodie.cleaner.commits.retained", "3").
option("hoodie.table.name", "hudi_mor_table").
mode(Append).
save("file:///tmp/hudi_mor_table")

{code}
*4, Repeat the Append Data operation (above) about six times, and you will get the 
following stack trace*
{code:java}
19/12/23 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
.commit file: 20191224004558.clean.requested
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at 


[GitHub] [incubator-hudi] yanghua commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2019-12-23 Thread GitBox
yanghua commented on issue #1115: [HUDI-392] Introduce 
DIstributedTestDataSource to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#issuecomment-568662558
 
 
   > @yanghua IIUC, @bvaradar actually uses it to run a test job that generates random data on the cluster..
   
   I did not see any place where `DistributedTestDataSource` is used in the master branch.
   
   > So, maybe leave it in `hoodie-utilities` so that the bundle also has it.. It's, in general, a nice way to start running deltastreamer with some fake data.
   
   We can leave it in the `hoodie-utilities` module. However, it currently lives in the test package. As @n3nash mentioned, we had better avoid using test code from another module.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-445) Refactor the codes based on scala codestyle BlockImportChecker rule

2019-12-23 Thread lamber-ken (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002652#comment-17002652
 ] 

lamber-ken commented on HUDI-445:
-

Feel free to take it over. :) You need contributor permissions first.

> Refactor the codes based on scala codestyle BlockImportChecker rule
> ---
>
> Key: HUDI-445
> URL: https://issues.apache.org/jira/browse/HUDI-445
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Critical
>
> Refactor the codes based on scala codestyle BlockImportChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-445) Refactor the codes based on scala codestyle BlockImportChecker rule

2019-12-23 Thread Jiaqi Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002647#comment-17002647
 ] 

Jiaqi Li commented on HUDI-445:
---

I can try to finish this job. Can it be assigned to me?

> Refactor the codes based on scala codestyle BlockImportChecker rule
> ---
>
> Key: HUDI-445
> URL: https://issues.apache.org/jira/browse/HUDI-445
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: lamber-ken
>Priority: Critical
>
> Refactor the codes based on scala codestyle BlockImportChecker rule



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-23 Thread GitBox
nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r361068247
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieGlobalBloomIndex.java
 ##
 @@ -98,26 +92,34 @@ public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
   String recordKey = partitionRecordKeyPair._2();
   String partitionPath = partitionRecordKeyPair._1();
 
-  return indexFileFilter.getMatchingFiles(partitionPath, recordKey).stream()
-  .map(file -> new Tuple2<>(file, new HoodieKey(recordKey, indexToPartitionMap.get(file))))
+  return indexFileFilter.getMatchingFilesAndPartition(partitionPath, recordKey).stream()
+  .map(partitionFileIdPair -> new Tuple2<>(partitionFileIdPair.getRight(),
+  new HoodieKey(recordKey, partitionFileIdPair.getLeft())))
   .collect(Collectors.toList());
 }).flatMap(List::iterator);
   }
 
-
   /**
* Tagging for global index should only consider the record key.
*/
   @Override
   protected JavaRDD<HoodieRecord<T>> tagLocationBacktoRecords(
   JavaPairRDD<HoodieKey, HoodieRecordLocation> keyFilenamePairRDD, JavaRDD<HoodieRecord<T>> recordRDD) {
-JavaPairRDD<String, HoodieRecord<T>> rowKeyRecordPairRDD =
+
+JavaPairRDD<String, HoodieRecord<T>> incomingRowKeyRecordPairRDD =
 recordRDD.mapToPair(record -> new Tuple2<>(record.getRecordKey(), record));
 
-// Here as the recordRDD might have more data than rowKeyRDD (some rowKeys' fileId is null),
-// so we do left outer join.
-return rowKeyRecordPairRDD.leftOuterJoin(keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), p._2)))
-.values().map(value -> getTaggedRecord(value._1, Option.ofNullable(value._2.orNull())));
+JavaPairRDD<String, Tuple2<HoodieRecordLocation, HoodieKey>> existingRecordKeyToRecordLocationHoodieKeyMap =
+keyFilenamePairRDD.mapToPair(p -> new Tuple2<>(p._1.getRecordKey(), new Tuple2<>(p._2, p._1)));
+
+return incomingRowKeyRecordPairRDD.leftOuterJoin(existingRecordKeyToRecordLocationHoodieKeyMap).values().map(record -> {
+  if (record._2().isPresent()) {
+// Record key matched to file
+return getTaggedRecord(new HoodieRecord<>(record._2.get()._2, record._1.getData()), Option.ofNullable(record._2.get()._1));
 
 Review comment:
   Yeah, that's what we are doing. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] cdmikechen opened a new pull request #1126: - Fix Error: java.lang.IllegalArgumentException: Can not create a Path from an empty string

2019-12-23 Thread GitBox
cdmikechen opened a new pull request #1126: - Fix Error: 
java.lang.IllegalArgumentException: Can not create a Path from an empty string
URL: https://github.com/apache/incubator-hudi/pull/1126
 
 
   Same issue as in https://github.com/apache/incubator-hudi/pull/771;
   this time it is the `deleteFilesFunc` method in `org.apache.hudi.table.HoodieCopyOnWriteTable.java`.
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   Fix Error: java.lang.IllegalArgumentException: Can not create a Path from an empty string, which occurs when upserting a hudi table with non-partitioned data.
   
   ## Brief change log
   Change `HoodieCopyOnWriteTable` line 112 to use `FSUtils.getPartitionPath(Path basePath, String partitionPath)`.
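   A minimal sketch of the guard such a helper can apply for non-partitioned tables (illustrative only, not the exact Hudi implementation; the class name is made up):
   
   ```java
   import org.apache.hadoop.fs.Path;
   
   public class PartitionPathSketch {
     // On newer Hadoop versions, new Path(parent, "") throws
     // IllegalArgumentException("Can not create a Path from an empty string"),
     // so fall back to the base path when the partition path is empty,
     // which is exactly what happens for non-partitioned tables.
     public static Path getPartitionPath(Path basePath, String partitionPath) {
       return (partitionPath == null || partitionPath.isEmpty())
           ? basePath
           : new Path(basePath, partitionPath);
     }
   }
   ```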
   
   ## Verify this pull request
   This pull request is already covered by existing tests, such as 
*org.apache.hudi.table.TestCopyOnWriteTable*.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-311] : Support for AWS Database Migration Service in DeltaStreamer

2019-12-23 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 350b0ec  [HUDI-311] : Support for AWS Database Migration Service in 
DeltaStreamer
350b0ec is described below

commit 350b0ecb4d137411c6231a1568add585c6d7b7d5
Author: vinoth chandar 
AuthorDate: Sun Dec 22 23:33:35 2019 -0800

[HUDI-311] : Support for AWS Database Migration Service in DeltaStreamer

 - Add a transformer class that adds an `Op` field if it is not found in the input frame
 - Add a payload implementation, that issues deletes when Op=D
 - Remove Parquet as a top level source type, consolidate with RowSource
 - Made delta streamer work without a property file, simply using 
overridden cli options
 - Unit tests for transformer/payload classes
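
A minimal sketch of the "issue deletes when Op=D" idea described above, since the diff below is cut off before the method bodies (assumptions: the `Op` field name from the commit message and the `getInsertValue`/`combineAndGetUpdateValue` contract of `OverwriteWithLatestAvroPayload`; simplified, not the committed class):

```java
import org.apache.hudi.OverwriteWithLatestAvroPayload;
import org.apache.hudi.common.util.Option;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;

import java.io.IOException;

public class DmsDeletePayloadSketch extends OverwriteWithLatestAvroPayload {

  public DmsDeletePayloadSketch(GenericRecord record, Comparable orderingVal) {
    super(record, orderingVal);
  }

  // Returning an empty Option signals the write path to drop the record,
  // which turns an Op=D change record into a delete of the existing row.
  private Option<IndexedRecord> dropIfDelete(IndexedRecord value) {
    Object op = ((GenericRecord) value).get("Op");
    boolean isDelete = op != null && "D".equals(op.toString());
    return isDelete ? Option.empty() : Option.of(value);
  }

  @Override
  public Option<IndexedRecord> getInsertValue(Schema schema) throws IOException {
    Option<IndexedRecord> value = super.getInsertValue(schema);
    return value.isPresent() ? dropIfDelete(value.get()) : Option.empty();
  }

  @Override
  public Option<IndexedRecord> combineAndGetUpdateValue(IndexedRecord currentValue, Schema schema) throws IOException {
    // The incoming DMS record carries the full before/after image, so the stored
    // value can simply be replaced (or dropped when the record marks a delete).
    Option<IndexedRecord> value = super.getInsertValue(schema);
    return value.isPresent() ? dropIfDelete(value.get()) : Option.empty();
  }
}
```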
---
 .../common/util/DFSPropertiesConfiguration.java|  15 ++-
 .../org/apache/hudi/payload/AWSDmsAvroPayload.java |  68 +
 .../org/apache/hudi/utilities/UtilHelpers.java |  18 +++-
 .../deltastreamer/SourceFormatAdapter.java |  15 ---
 .../utilities/schema/RowBasedSchemaProvider.java   |   6 ++
 .../hudi/utilities/sources/ParquetDFSSource.java   |  20 ++--
 .../org/apache/hudi/utilities/sources/Source.java  |   2 +-
 .../AWSDmsTransformer.java}|  32 --
 .../TestAWSDatabaseMigrationServiceSource.java | 107 +
 .../apache/hudi/utilities/UtilitiesTestBase.java   |   1 -
 .../hudi/utilities/sources/TestDFSSource.java  |   2 +-
 11 files changed, 239 insertions(+), 47 deletions(-)

diff --git 
a/hudi-common/src/main/java/org/apache/hudi/common/util/DFSPropertiesConfiguration.java
 
b/hudi-common/src/main/java/org/apache/hudi/common/util/DFSPropertiesConfiguration.java
index 838d4b8..f535cac 100644
--- 
a/hudi-common/src/main/java/org/apache/hudi/common/util/DFSPropertiesConfiguration.java
+++ 
b/hudi-common/src/main/java/org/apache/hudi/common/util/DFSPropertiesConfiguration.java
@@ -60,6 +60,17 @@ public class DFSPropertiesConfiguration {
 visitFile(rootFile);
   }
 
+  public DFSPropertiesConfiguration(FileSystem fs, Path rootFile) {
+this(fs, rootFile, new TypedProperties());
+  }
+
+  public DFSPropertiesConfiguration() {
+this.fs = null;
+this.rootFile = null;
+this.props = new TypedProperties();
+this.visitedFiles = new HashSet<>();
+  }
+
   private String[] splitProperty(String line) {
 int ind = line.indexOf('=');
 String k = line.substring(0, ind).trim();
@@ -106,10 +117,6 @@ public class DFSPropertiesConfiguration {
 }
   }
 
-  public DFSPropertiesConfiguration(FileSystem fs, Path rootFile) {
-this(fs, rootFile, new TypedProperties());
-  }
-
   public TypedProperties getConfig() {
 return props;
   }
diff --git 
a/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java 
b/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java
new file mode 100644
index 000..09898ec
--- /dev/null
+++ b/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.payload;
+
+import org.apache.hudi.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.common.util.Option;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+
+import java.io.IOException;
+
+/**
+ * Provides support for seamlessly applying changes captured via Amazon 
Database Migration Service onto S3.
+ *
+ * Typically, we get the following pattern of full change records 
corresponding to DML against the
+ * source database
+ *
+ * - Full load records with no `Op` field
+ * - For inserts against the source table, records contain full after image 
with `Op=I`
+ * - For updates against the source table, records contain full after image 
with `Op=U`
+ * - For deletes against the source table, records contain full before image 
with `Op=D`
+ *
+ * This payload implementation will issue matching insert, delete, updates 
against the hudi 

[GitHub] [incubator-hudi] vinothchandar merged pull request #1123: [HUDI-311] : Support for AWS Database Migration Service in DeltaStreamer

2019-12-23 Thread GitBox
vinothchandar merged pull request #1123: [HUDI-311] : Support for AWS Database 
Migration Service in DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/1123
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #138

2019-12-23 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.21 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle

[jira] [Updated] (HUDI-453) Throw failed to archive commits error when writing data to MOR/COW table

2019-12-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-453:

Summary: Throw failed to archive commits error when writing data to MOR/COW 
table  (was: Throw failed to archive commits error when writing data to MOR 
table)

> Throw failed to archive commits error when writing data to MOR/COW table
> 
>
> Key: HUDI-453
> URL: https://issues.apache.org/jira/browse/HUDI-453
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Throw failed to archive commits error when writing data to MOR table
> {code:java}
> 19/12/23 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
> .commit file: 20191224004558.clean.requested
> java.io.IOException: Not an Avro data file
> at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
> at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
> at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at 
> org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
> at 
> org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
> at 
> org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-453) Throw failed to archive commits error when writing data to MOR table

2019-12-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-453:

Description: 
Throw failed to archive commits error when writing data to MOR table
{code:java}
19/12/23 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
.commit file: 20191224004558.clean.requested
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
{code}

  was:
Throw failed to archive commits error when writing data to MOR table
{code:java}
19/12/24 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
.commit file: 20191224004558.clean.requested
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 

[jira] [Created] (HUDI-465) Make Hive Sync via Spark painless

2019-12-23 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-465:
---

 Summary: Make Hive Sync via Spark painless
 Key: HUDI-465
 URL: https://issues.apache.org/jira/browse/HUDI-465
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
  Components: Hive Integration, Spark datasource, Usability
Reporter: Vinoth Chandar


Currently, we require many configs to be passed in for the Hive sync. This has 
to be simplified, and the experience should be close to how a regular 
spark.write.parquet registers into Hive. 
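
For illustration, a sketch of the gap (hedged: the Hudi option keys below are only indicative of this era's DataSourceWriteOptions and may differ by release; table names and paths are made up, and `df` stands for any Dataset<Row>):

{code:java}
// Plain Spark: one call and the table is registered in the metastore.
df.write().format("parquet").saveAsTable("db.events");

// Hudi today: the hive-sync settings all have to be spelled out by hand.
df.write().format("org.apache.hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.partitionpath.field", "dt")
    .option("hoodie.datasource.hive_sync.enable", "true")
    .option("hoodie.datasource.hive_sync.database", "db")
    .option("hoodie.datasource.hive_sync.table", "events")
    .option("hoodie.datasource.hive_sync.partition_fields", "dt")
    .option("hoodie.datasource.hive_sync.jdbcurl", "jdbc:hive2://hiveserver:10000")
    .mode("append")
    .save("/tmp/hudi/events");
{code}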



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-377) Add Delete() support to HoodieDeltaStreamer

2019-12-23 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002592#comment-17002592
 ] 

Vinoth Chandar commented on HUDI-377:
-

We can iterate based on the feedback we get for the `_hoodie_delete_marker` approach?

> Add Delete() support to HoodieDeltaStreamer
> ---
>
> Key: HUDI-377
> URL: https://issues.apache.org/jira/browse/HUDI-377
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>   Original Estimate: 72h
>  Time Spent: 10m
>  Remaining Estimate: 71h 50m
>
> Add Delete() support to HoodieDeltaStreamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-233) Redo log statements using SLF4J

2019-12-23 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002591#comment-17002591
 ] 

Vinoth Chandar commented on HUDI-233:
-

I don't have the actual stack traces, but mostly you could run into issues with 
version conflicts or slf4j finding multiple bindings. 

 

In general, the approach I took when redoing all the bundles was to try and 
use the logging classes in the underlying system (spark, presto, hive, etc.) as 
much as possible, and only shade if it's absolutely needed. Maybe we can 
follow the same approach? 

> Redo log statements using SLF4J 
> 
>
> Key: HUDI-233
> URL: https://issues.apache.org/jira/browse/HUDI-233
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: newbie
>Affects Versions: 0.5.0
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
> Fix For: 0.5.2
>
>
> Currently we are not employing variable substitution aggressively in the 
> project, a la 
> {code:java}
> LogManager.getLogger(SomeName.class.getName()).info("Message: {}, Detail: 
> {}", message, detail);
> {code}
> This can improve performance since the string concatenation is deferrable to 
> when the logging is actually in effect.  
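
A small sketch of the difference the description is after (plain SLF4J API; the class and message are only examples):

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(LoggingSketch.class);

  public void handle(String message, String detail) {
    // Eager concatenation: the full String is built even when INFO is disabled.
    LOG.info("Message: " + message + ", Detail: " + detail);

    // Parameterized form: the placeholders are only substituted when INFO is
    // enabled, so the formatting cost is deferred until the statement actually logs.
    LOG.info("Message: {}, Detail: {}", message, detail);
  }
}
{code}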



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-452) Create a hudi-ci github organizations to support hudi CI via azure pipeline

2019-12-23 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-452.
-
Resolution: Done

> Create a hudi-ci github organizations to support hudi CI via azure pipeline
> ---
>
> Key: HUDI-452
> URL: https://issues.apache.org/jira/browse/HUDI-452
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Apache does not allow github integration with azure pipelines. So we should 
> create a new organization to integrate with azure pipelines like Flink has 
> done. https://github.com/flink-ci



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-452) Create a hudi CI github organizations to support hudi CI via azure pipeline

2019-12-23 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-452:
--
Summary: Create a hudi CI github organizations to support hudi CI via azure 
pipeline  (was: Create a hudi-ci github organizations to support hudi CI via 
azure pipeline)

> Create a hudi CI github organizations to support hudi CI via azure pipeline
> ---
>
> Key: HUDI-452
> URL: https://issues.apache.org/jira/browse/HUDI-452
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Apache does not allow github integration with azure pipelines. So we should 
> create a new organization to integrate with azure pipelines like Flink has 
> done. https://github.com/flink-ci



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-452) Create a hudi-ci github organizations to support hudi CI via azure pipeline

2019-12-23 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002589#comment-17002589
 ] 

Vinoth Chandar commented on HUDI-452:
-

Once you verify the org I created works, yes!

> Create a hudi-ci github organizations to support hudi CI via azure pipeline
> ---
>
> Key: HUDI-452
> URL: https://issues.apache.org/jira/browse/HUDI-452
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Apache does not allow github integration with azure pipelines. So we should 
> create a new organization to integrate with azure pipelines like Flink has 
> done. https://github.com/flink-ci



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-452) Create a hudi-ci github organizations to support hudi CI via azure pipeline

2019-12-23 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002587#comment-17002587
 ] 

vinoyang commented on HUDI-452:
---

[~vinoth] IMO, we can close this issue. WDYT?

> Create a hudi-ci github organizations to support hudi CI via azure pipeline
> ---
>
> Key: HUDI-452
> URL: https://issues.apache.org/jira/browse/HUDI-452
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Apache does not allow github integration with azure pipelines. So we should 
> create a new organization to integrate with azure pipelines like Flink has 
> done. https://github.com/flink-ci



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] cdmikechen commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

2019-12-23 Thread GitBox
cdmikechen commented on a change in pull request #1119: Fix: 
HoodieCommitMetadata only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r361044029
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
 long totalInsertRecordsWritten = 0;
 for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
   for (HoodieWriteStat stat : stats) {
-if (stat.getPrevCommit() != null && 
stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   @n3nash
   I think the processing logic of this method should be the same as that of 
`fetchTotalUpdateRecordsWritten`:
   ```java
   public long fetchTotalUpdateRecordsWritten() {
     long totalUpdateRecordsWritten = 0;
     for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
       for (HoodieWriteStat stat : stats) {
         totalUpdateRecordsWritten += stat.getNumUpdateWrites();
       }
     }
     return totalUpdateRecordsWritten;
   }
   ```
   Let me give you an example from my test. I first create a hudi dataset with 3 rows, 
then insert a row, and finally insert a row and delete a row. I print every 
method's return value and check the numbers.
   ```java
   for (HoodieInstant commit : commits) {
       HoodieCommitMetadata commitMetadata = HoodieCommitMetadata.fromBytes(
               timeline.getInstantDetails(commit).get(), HoodieCommitMetadata.class);

       JSONObject data = new JSONObject();
       data.put("timestamp", commit.getTimestamp());
       data.put("fetchTotalBytesWritten", commitMetadata.fetchTotalBytesWritten());
       data.put("fetchTotalFilesInsert", commitMetadata.fetchTotalFilesInsert());
       data.put("fetchTotalFilesUpdated", commitMetadata.fetchTotalFilesUpdated());
       data.put("fetchTotalPartitionsWritten", commitMetadata.fetchTotalPartitionsWritten());
       data.put("fetchTotalRecordsWritten", commitMetadata.fetchTotalRecordsWritten());
       data.put("fetchTotalUpdateRecordsWritten", commitMetadata.fetchTotalUpdateRecordsWritten());
       data.put("fetchTotalInsertRecordsWritten", commitMetadata.fetchTotalInsertRecordsWritten());
       data.put("fetchTotalWriteErrors", commitMetadata.fetchTotalWriteErrors());

       long totalDeleteRecordsWritten = 0;
       for (List<HoodieWriteStat> stats : commitMetadata.getPartitionToWriteStats().values()) {
           for (HoodieWriteStat stat : stats) {
               if (stat.getPrevCommit() != null) {
                   totalDeleteRecordsWritten += stat.getNumDeletes();
               }
           }
       }
       data.put("fetchTotalDeleteRecordsWritten", totalDeleteRecordsWritten);

       datas.add(data);
   }
   log.debug("datas = {}", datas);
   ```
   If I use the old condition `(stat.getPrevCommit() != null && stat.getPrevCommit().equalsIgnoreCase("null"))`, it prints:
   ```json
   [
{
"fetchTotalBytesWritten": 435258,
"fetchTotalDeleteRecordsWritten": 1,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 0,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923172508"
},
{
"fetchTotalBytesWritten": 435227,
"fetchTotalDeleteRecordsWritten": 0,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 0,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923160439"
},
{
"fetchTotalBytesWritten": 435478,
"fetchTotalDeleteRecordsWritten": 0,
"fetchTotalFilesInsert": 1,
"fetchTotalFilesUpdated": 0,
"fetchTotalInsertRecordsWritten": 3,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 3,
"fetchTotalUpdateRecordsWritten": 0,
"fetchTotalWriteErrors": 0,
"timestamp": "20190923160340"
}
   ]
   ```
   If I use the modified condition `(stat.getPrevCommit() != null)`, it prints:
   ```json
   [
{
"fetchTotalBytesWritten": 435258,
"fetchTotalDeleteRecordsWritten": 1,
"fetchTotalFilesInsert": 0,
"fetchTotalFilesUpdated": 1,
"fetchTotalInsertRecordsWritten": 1,
"fetchTotalPartitionsWritten": 1,
"fetchTotalRecordsWritten": 4,
"fetchTotalUpdateRecordsWritten": 0,
  


[jira] [Updated] (HUDI-464) Move to using classifier "core" for hive-exec for hive 2.x

2019-12-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-464:

Labels: pull-request-available  (was: )

> Move to using classifier "core" for hive-exec for hive 2.x
> --
>
> Key: HUDI-464
> URL: https://issues.apache.org/jira/browse/HUDI-464
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] modi95 opened a new pull request #1125: [HUDI-464] Use Hive Exec Core

2019-12-23 Thread GitBox
modi95 opened a new pull request #1125: [HUDI-464] Use Hive Exec Core
URL: https://github.com/apache/incubator-hudi/pull/1125
 
 
   ## What is the purpose of the pull request
   
   HUDI depends on `hive-exec`, which contains a number of dependencies that 
have version conflicts with the dependencies required by Spark. Specifically, 
we are going to upgrade HUDI to `spark 2.4.4`, which depends on `avro 1.8.2`, 
whereas `hive-exec` brings in `avro 1.7.7`. 
   
   We have run into many issues arising from the way that `hive-exec` packages 
its dependencies. To address these issues, we are choosing to move to 
`hive-exec:core`, which does not include these dependencies. 
   
   ## Brief change log
   
   Change poms to use hive exec with the `core` classifier. 
   
   ## Verify this pull request
   
   Verified by ensuring spark and hive queries in 
https://hudi.apache.org/docker_demo.html#step-4-a-run-hive-queries continue to 
work as expected. 
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-464) Move to using classifier "core" for hive-exec for hive 2.x

2019-12-23 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002572#comment-17002572
 ] 

Nishith Agarwal commented on HUDI-464:
--

[~akmodi] Can you please claim this ticket and add the necessary details ?

> Move to using classifier "core" for hive-exec for hive 2.x
> --
>
> Key: HUDI-464
> URL: https://issues.apache.org/jira/browse/HUDI-464
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: Nishith Agarwal
>Priority: Major
> Fix For: 0.5.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-464) Move to using classifier "core" for hive-exec for hive 2.x

2019-12-23 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-464:


 Summary: Move to using classifier "core" for hive-exec for hive 2.x
 Key: HUDI-464
 URL: https://issues.apache.org/jira/browse/HUDI-464
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Hive Integration
Reporter: Nishith Agarwal
 Fix For: 0.5.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2019-12-23 Thread GitBox
n3nash commented on a change in pull request #1115: [HUDI-392] Introduce 
DIstributedTestDataSource to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#discussion_r361026575
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/common/HoodieTestDataGenerator.java
 ##
 @@ -114,20 +114,20 @@ public static void writePartitionMetadata(FileSystem fs, 
String[] partitionPaths
* Generates a new avro record of the above schema format, retaining the key 
if optionally provided.
*/
   public static TestRawTripPayload generateRandomValue(HoodieKey key, String 
commitTime) throws IOException {
-GenericRecord rec = generateGenericRecord(key.getRecordKey(), "rider-" + 
commitTime, "driver-" + commitTime, 0.0);
+GenericRecord rec = generateGenericRecord(key.getRecordKey(), "rider-" + 
commitTime, "driver-" + commitTime, 0);
 return new TestRawTripPayload(rec.toString(), key.getRecordKey(), 
key.getPartitionPath(), TRIP_EXAMPLE_SCHEMA);
   }
 
   /**
* Generates a new avro record of the above schema format, retaining the key 
if optionally provided.
*/
   public static HoodieAvroPayload generateAvroPayload(HoodieKey key, String 
commitTime) throws IOException {
-GenericRecord rec = generateGenericRecord(key.getRecordKey(), "rider-" + 
commitTime, "driver-" + commitTime, 0.0);
+GenericRecord rec = generateGenericRecord(key.getRecordKey(), "rider-" + 
commitTime, "driver-" + commitTime, 0);
 return new HoodieAvroPayload(Option.of(rec));
   }
 
   public static GenericRecord generateGenericRecord(String rowKey, String 
riderName, String driverName,
-  double timestamp) {
+  long timestamp) {
 
 Review comment:
   Okay, we can fix that, shouldn't be difficult


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated (b284091 -> 313fab5)

2019-12-23 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from b284091  [HUDI-386] Refactor hudi scala checkstyle rules (#1099)
 add 313fab5  [HUDI-444] Refactor the codes based on scala codestyle 
ReturnChecker rule (#1121)

No new revisions were added by this update.

Summary of changes:
 hudi-cli/src/main/scala/org/apache/hudi/cli/DedupeSparkJob.scala | 4 ++--
 hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala | 4 ++--
 style/scalastyle.xml | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)



[GitHub] [incubator-hudi] leesf merged pull request #1121: [HUDI-444] Refactor the codes based on scala codestyle NullChecker rule

2019-12-23 Thread GitBox
leesf merged pull request #1121: [HUDI-444] Refactor the codes based on scala 
codestyle NullChecker rule
URL: https://github.com/apache/incubator-hudi/pull/1121
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] TheApacheHudi closed pull request #1124: [MINOR] fix typo

2019-12-23 Thread GitBox
TheApacheHudi closed pull request #1124: [MINOR] fix typo
URL: https://github.com/apache/incubator-hudi/pull/1124
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Closed] (HUDI-456) Redo hudi-common log statements using SLF4J

2019-12-23 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-456.
--
Resolution: Duplicate

> Redo hudi-common log statements using SLF4J
> ---
>
> Key: HUDI-456
> URL: https://issues.apache.org/jira/browse/HUDI-456
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: leesf
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-463) Redo hudi-utilities log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-463:
--

 Summary: Redo hudi-utilities log statements using SLF4J
 Key: HUDI-463
 URL: https://issues.apache.org/jira/browse/HUDI-463
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-462) Redo hudi-timeline-service log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-462:
--

 Summary: Redo hudi-timeline-service log statements using SLF4J
 Key: HUDI-462
 URL: https://issues.apache.org/jira/browse/HUDI-462
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-460) Redo hudi-integ-test log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-460:
--

 Summary: Redo hudi-integ-test log statements using SLF4J
 Key: HUDI-460
 URL: https://issues.apache.org/jira/browse/HUDI-460
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-461) Redo hudi-spark log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-461:
--

 Summary: Redo hudi-spark log statements using SLF4J
 Key: HUDI-461
 URL: https://issues.apache.org/jira/browse/HUDI-461
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-459) Redo hudi-hive log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-459:
--

 Summary: Redo hudi-hive log statements using SLF4J
 Key: HUDI-459
 URL: https://issues.apache.org/jira/browse/HUDI-459
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-458) Redo hudi-hadoop-mr log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-458:
--

 Summary: Redo hudi-hadoop-mr log statements using SLF4J
 Key: HUDI-458
 URL: https://issues.apache.org/jira/browse/HUDI-458
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-457) Redo hudi-common log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-457:
--

 Summary: Redo hudi-common log statements using SLF4J
 Key: HUDI-457
 URL: https://issues.apache.org/jira/browse/HUDI-457
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-456) Redo hudi-common log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-456:
--

 Summary: Redo hudi-common log statements using SLF4J
 Key: HUDI-456
 URL: https://issues.apache.org/jira/browse/HUDI-456
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-455) Redo hudi-client log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-455:
--

 Summary: Redo hudi-client log statements using SLF4J
 Key: HUDI-455
 URL: https://issues.apache.org/jira/browse/HUDI-455
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-454) Redo hudi-cli log statements using SLF4J

2019-12-23 Thread leesf (Jira)
leesf created HUDI-454:
--

 Summary: Redo hudi-cli log statements using SLF4J
 Key: HUDI-454
 URL: https://issues.apache.org/jira/browse/HUDI-454
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: leesf






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] TheApacheHudi opened a new pull request #1124: [MINOR] fix typo

2019-12-23 Thread GitBox
TheApacheHudi opened a new pull request #1124: [MINOR] fix typo
URL: https://github.com/apache/incubator-hudi/pull/1124
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Fix typo*
   
   ## Brief change log
   
   *Fix typo*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-23 Thread GitBox
n3nash commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077#issuecomment-568569075
 
 
   @nbalajee Can we first fix the failing build ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-23 Thread GitBox
n3nash commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077#issuecomment-568567242
 
 
   @vinothchandar Yup, sounds good


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2019-12-23 Thread GitBox
n3nash commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-568566995
 
 
   @yanghua Can you check why the build is failing now ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2019-12-23 Thread GitBox
n3nash commented on a change in pull request #1115: [HUDI-392] Introduce 
DIstributedTestDataSource to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#discussion_r360987666
 
 

 ##
 File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/generator/DeltaGenerator.java
 ##
 @@ -108,6 +114,17 @@ public DeltaGenerator(DeltaConfig deltaOutputConfig, 
JavaSparkContext jsc, Spark
 return inputBatch;
   }
 
+  public JavaRDD generateUpsertsWithDistributedSource(Config 
operation) {
 
 Review comment:
   @yanghua Yes, we should refactor those parts. 
   
   For (5), what I mean is: when we perform 
distributedTestDataSource.fetchNext(Option.empty(), 1000), does it return a 
bunch of updates + inserts (or just inserts)?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1119: Fix: HoodieCommitMetadata only show first commit insert rows.

2019-12-23 Thread GitBox
n3nash commented on a change in pull request #1119: Fix: HoodieCommitMetadata 
only show first commit insert rows.
URL: https://github.com/apache/incubator-hudi/pull/1119#discussion_r360986944
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/HoodieCommitMetadata.java
 ##
 @@ -175,7 +175,9 @@ public long fetchTotalInsertRecordsWritten() {
 long totalInsertRecordsWritten = 0;
 for (List<HoodieWriteStat> stats : partitionToWriteStats.values()) {
   for (HoodieWriteStat stat : stats) {
-if (stat.getPrevCommit() != null && 
stat.getPrevCommit().equalsIgnoreCase("null")) {
 
 Review comment:
   For the first commit, the prevCommit value is actually "null", which is the 
reason why this check is here. 
   Can you inspect and see if stat.getPrevCommit() returns null in some cases? 
Then the above condition would not be satisfied.
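   
   A null-safe variant of that check, as a sketch only (the helper name is made up; whether this is the right fix depends on what `getPrevCommit()` actually holds for the first commit):
   
   ```java
   // True when a HoodieWriteStat belongs to the first write of its file, treating
   // both a real null and the literal string "null" as "no previous commit".
   static boolean isFirstWrite(HoodieWriteStat stat) {
     String prevCommit = stat.getPrevCommit();
     return prevCommit == null || "null".equalsIgnoreCase(prevCommit);
   }
   ```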


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2019-12-23 Thread GitBox
vinothchandar commented on issue #1115: [HUDI-392] Introduce 
DIstributedTestDataSource to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#issuecomment-568557318
 
 
   @yanghua IIUC, @bvaradar actually uses it to run a test job that generates 
random data on the cluster. So, maybe leave it in `hoodie-utilities` so that 
the bundle also has it. It's, in general, a nice way to start running 
deltastreamer with some fake data.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-453) Throw failed to archive commits error when writing data to MOR table

2019-12-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-453:

Status: In Progress  (was: Open)

> Throw failed to archive commits error when writing data to MOR table
> 
>
> Key: HUDI-453
> URL: https://issues.apache.org/jira/browse/HUDI-453
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Throw failed to archive commits error when writing data to MOR table
> {code:java}
> 19/12/24 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
> .commit file: 20191224004558.clean.requested
> java.io.IOException: Not an Avro data file
> at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
> at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
> at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
> at 
> org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
> at 
> org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
> at 
> org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
> at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
> at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
> at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
> at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
> at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-453) Throw failed to archive commits error when writing data to MOR table

2019-12-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-453:

Description: 
Throw failed to archive commits error when writing data to MOR table
{code:java}
19/12/24 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
.commit file: 20191224004558.clean.requested
java.io.IOException: Not an Avro data file
at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
at org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
{code}

  was:
Throw failed to archive commits error when writing data to MOR table

```

19/12/24 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
.commit file: 20191224004558.clean.requested

java.io.IOException: Not an Avro data file

 at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)

 at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)

 at 
org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)

 at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)

 at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)

 at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)

 at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)

 at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)

 at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)

 at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)

 at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)

 at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)

 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)

 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)

 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)

 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)

 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)

 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)

 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

 at 

[jira] [Assigned] (HUDI-453) Throw failed to archive commits error when writing data to MOR table

2019-12-23 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken reassigned HUDI-453:
---

Assignee: lamber-ken

> Throw failed to archive commits error when writing data to MOR table
> 
>
> Key: HUDI-453
> URL: https://issues.apache.org/jira/browse/HUDI-453
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Throw failed to archive commits error when writing data to MOR table
> ```
> 19/12/24 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
> .commit file: 20191224004558.clean.requested
> java.io.IOException: Not an Avro data file
>  at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)
>  at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)
>  at 
> org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)
>  at 
> org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)
>  at 
> org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)
>  at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)
>  at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)
>  at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)
>  at 
> org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)
>  at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-453) Throw failed to archive commits error when writing data to MOR table

2019-12-23 Thread lamber-ken (Jira)
lamber-ken created HUDI-453:
---

 Summary: Throw failed to archive commits error when writing data 
to MOR table
 Key: HUDI-453
 URL: https://issues.apache.org/jira/browse/HUDI-453
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: lamber-ken


Throw failed to archive commits error when writing data to MOR table

```

19/12/24 01:30:48 ERROR HoodieCommitArchiveLog: Failed to archive commits, 
.commit file: 20191224004558.clean.requested

java.io.IOException: Not an Avro data file

 at org.apache.avro.file.DataFileReader.openReader(DataFileReader.java:50)

 at 
org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:147)

 at 
org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:88)

 at 
org.apache.hudi.io.HoodieCommitArchiveLog.convertToAvroRecord(HoodieCommitArchiveLog.java:294)

 at 
org.apache.hudi.io.HoodieCommitArchiveLog.archive(HoodieCommitArchiveLog.java:253)

 at 
org.apache.hudi.io.HoodieCommitArchiveLog.archiveIfRequired(HoodieCommitArchiveLog.java:122)

 at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:562)

 at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:523)

 at org.apache.hudi.HoodieWriteClient.commit(HoodieWriteClient.java:514)

 at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:159)

 at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)

 at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)

 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)

 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)

 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)

 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)

 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)

 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)

 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)

 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)

 at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)

 at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)

 at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)

 at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)

 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)

 at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)

 at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)

 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)

 at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:225)

```
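
For context, a minimal, hypothetical reproduction of the underlying failure (not Hudi code): Avro's DataFileReader rejects any file that does not start with the Avro magic bytes, so an empty or truncated .clean.requested file produces exactly this "Not an Avro data file" IOException.

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class NotAvroRepro {
  public static void main(String[] args) throws Exception {
    // An empty file has no Avro magic header, mimicking a zero-byte .clean.requested file.
    File empty = File.createTempFile("20191224004558.clean", ".requested");
    try {
      DataFileReader.openReader(empty, new GenericDatumReader<GenericRecord>());
    } catch (IOException e) {
      System.out.println(e.getMessage()); // prints: Not an Avro data file
    }
  }
}
```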



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-452) Create a hudi-ci github organizations to support hudi CI via azure pipeline

2019-12-23 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002380#comment-17002380
 ] 

Vinoth Chandar commented on HUDI-452:
-

we can create this using another organization we already have from the pre-ASF days. It's tied to an account that's shared with private@. Let me see if I can revive that and add you as an admin, etc.

 

> Create a hudi-ci github organizations to support hudi CI via azure pipeline
> ---
>
> Key: HUDI-452
> URL: https://issues.apache.org/jira/browse/HUDI-452
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Apache does not allow github integration with azure pipelines. So we should 
> create a new organizations to integrate with azure pipelines like Flink has 
> done. https://github.com/flink-ci



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nbalajee commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-23 Thread GitBox
nbalajee commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077#issuecomment-568491020
 
 
   > We already have a `BufferedFSInputStream` in the codebase. Can we see if 
this can be reused instead? We anyway just read bytes..
   > 
   > ```
   > SerializationUtils
   >   .deserialize(SpillableMapUtils.readBytesFromDisk(file, 
entry.getOffsetOfValue(), entry.getSizeOfValue()));
   > ```
   
   Thanks for your suggestion, Vinoth. BufferedFSInputStream allows forward seeks (the file pointer only makes forward progress) and limited backward seeks (back to a previously set mark on the stream). It does not support seeking to an arbitrary offset, which is what random access of records (the get() functionality) requires.
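   
   For context, a minimal sketch of the random-access read pattern in question, using plain java.io (illustrative only; the names are hypothetical, not Hudi's actual implementation, which keeps its own offset/size bookkeeping per entry):
   ```java
   import java.io.IOException;
   import java.io.RandomAccessFile;

   final class RandomAccessReadSketch {
     // Read `size` bytes starting at an arbitrary `offset` -- the kind of seek a
     // forward-only buffered stream cannot serve for a get()-style lookup.
     static byte[] readAt(RandomAccessFile file, long offset, int size) throws IOException {
       byte[] buf = new byte[size];
       file.seek(offset);     // jump directly to the value's byte offset
       file.readFully(buf);   // read exactly the stored value's length
       return buf;
     }
   }
   ```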

   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] yanghua commented on issue #1115: [HUDI-392] Introduce DIstributedTestDataSource to generate test data

2019-12-23 Thread GitBox
yanghua commented on issue #1115: [HUDI-392] Introduce 
DIstributedTestDataSource to generate test data
URL: https://github.com/apache/incubator-hudi/pull/1115#issuecomment-568455662
 
 
   Hi @vinothchandar, WDYT about `DIstributedTestDataSource`? It seems this class is not used anywhere; it is only exercised in `TestHoodieDeltaStreamer`. Can we move it into the `hudi-test-suite` module? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-452) Create a hudi-ci github organizations to support hudi CI via azure pipeline

2019-12-23 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-452:
--
Summary: Create a hudi-ci github organizations to support hudi CI via azure 
pipeline  (was: Create a hudi-ci github repository to support hudi CI via azure 
pipeline)

> Create a hudi-ci github organizations to support hudi CI via azure pipeline
> ---
>
> Key: HUDI-452
> URL: https://issues.apache.org/jira/browse/HUDI-452
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Apache does not allow github integration with azure pipelines. So we should 
> create a new repository to integrate with azure pipelines like Flink has 
> done. https://github.com/flink-ci



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-452) Create a hudi-ci github organizations to support hudi CI via azure pipeline

2019-12-23 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-452:
--
Description: Apache does not allow github integration with azure pipelines. 
So we should create a new organizations to integrate with azure pipelines like 
Flink has done. https://github.com/flink-ci  (was: Apache does not allow github 
integration with azure pipelines. So we should create a new repository to 
integrate with azure pipelines like Flink has done. https://github.com/flink-ci)

> Create a hudi-ci github organizations to support hudi CI via azure pipeline
> ---
>
> Key: HUDI-452
> URL: https://issues.apache.org/jira/browse/HUDI-452
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>
> Apache does not allow github integration with azure pipelines. So we should 
> create a new organizations to integrate with azure pipelines like Flink has 
> done. https://github.com/flink-ci



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-452) Create a hudi-ci github repository to support hudi CI via azure pipeline

2019-12-23 Thread vinoyang (Jira)
vinoyang created HUDI-452:
-

 Summary: Create a hudi-ci github repository to support hudi CI via 
azure pipeline
 Key: HUDI-452
 URL: https://issues.apache.org/jira/browse/HUDI-452
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
Reporter: vinoyang
Assignee: vinoyang


Apache does not allow github integration with azure pipelines. So we should 
create a new repository to integrate with azure pipelines like Flink has done. 
https://github.com/flink-ci



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] hddong commented on a change in pull request #1111: [HUDI-331]Fix java docs for all public apis in HoodieWriteClient

2019-12-23 Thread GitBox
hddong commented on a change in pull request #1111: [HUDI-331]Fix java docs for 
all public apis in HoodieWriteClient
URL: https://github.com/apache/incubator-hudi/pull/1111#discussion_r360811318
 
 

 ##
 File path: hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java
 ##
 @@ -509,13 +540,21 @@ private Partitioner getPartitioner(HoodieTable table, 
boolean isUpsert, Workload
 
   /**
* Commit changes performed at the given commitTime marker.
+   *
+   * @param commitTime Commit Time handle
+   * @param writeStatuses RDD of WriteStatus to inspect errors and counts
+   * @return {@code true} if commit is successful. {@code false} otherwise
*/
   public boolean commit(String commitTime, JavaRDD writeStatuses) 
{
 return commit(commitTime, writeStatuses, Option.empty());
   }
 
   /**
* Commit changes performed at the given commitTime marker.
+   * @param commitTime Commit Time handle
+   * @param writeStatuses RDD of WriteStatus to inspect errors and counts
 
 Review comment:
   > Again, not sure if we need to describe the data type in english for the 
parameters.. lets keep it high level?
   
   I wanted to keep the context consistent and fix them all.
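   
   For illustration, a hedged sketch of the more "high level" javadoc style being suggested, mirroring the method in the diff above (the wording is hypothetical, not the committed text):
   ```java
   /**
    * Commit changes performed at the given commitTime marker.
    *
    * @param commitTime    instant being committed
    * @param writeStatuses write outcomes to inspect for errors and counts
    * @return {@code true} if the commit succeeded, {@code false} otherwise
    */
   public boolean commit(String commitTime, JavaRDD<WriteStatus> writeStatuses) {
     return commit(commitTime, writeStatuses, Option.empty());
   }
   ```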


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services