[jira] [Closed] (HUDI-394) Provide a basic implementation of test suite

2019-12-16 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-394.
-
Resolution: Implemented

> Provide a basic implementation of test suite
> 
>
> Key: HUDI-394
> URL: https://issues.apache.org/jira/browse/HUDI-394
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It provides:
>  * Flexible schema payload generation
>  * Different types of workload generation, such as inserts, upserts, etc.
>  * Post process actions to perform validations
>  * Interoperability of the test suite with both HoodieWriteClient and 
> HoodieDeltaStreamer, so that both code paths can be tested
>  * Custom workload sequence generator
>  * Ability to perform parallel operations, such as upsert and compaction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-392) Introduce DistributedTestDataSource to generate test data

2019-12-16 Thread vinoyang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997254#comment-16997254
 ] 

vinoyang commented on HUDI-392:
---

Working on this issue. [~vinoth]  [~nishith29] FYI

> Introduce DistributedTestDataSource to generate test data
> -
>
> Key: HUDI-392
> URL: https://issues.apache.org/jira/browse/HUDI-392
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer

2019-12-16 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-251:
---

Assignee: Purushotham Pushpavanthar

> JDBC incremental load to HUDI with DeltaStreamer
> 
>
> Key: HUDI-251
> URL: https://issues.apache.org/jira/browse/HUDI-251
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Taher Koitawala
>Assignee: Purushotham Pushpavanthar
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Mirroring an RDBMS into HUDI is one of the most basic HUDI use cases, so 
> DeltaStreamer should provide built-in support for it.
> DeltaStreamer should accept something like a jdbc-source.properties file where 
> users can define the RDBMS connection properties along with a timestamp column 
> and a polling interval that expresses how frequently HUDI should check the 
> RDBMS data source for new inserts or updates.
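A hypothetical sketch of what such a jdbc-source.properties file could contain. Every property name below is illustrative only — none of these are actual Hudi configuration keys:

```properties
# Illustrative property names -- not actual Hudi configuration keys
jdbc.source.url=jdbc:mysql://db-host:3306/sales
jdbc.source.user=hudi
jdbc.source.password.file=/etc/hudi/jdbc.pass
jdbc.source.table=orders
# Column used to detect rows inserted/updated since the last checkpoint
jdbc.source.incremental.column=last_updated_ts
# How often to poll the RDBMS for new inserts or updates
jdbc.source.poll.interval.seconds=300
```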



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar closed issue #857: http://hudi.apache.org/comparison.html# should mention Iceberg and DeltaLake

2019-12-16 Thread GitBox
vinothchandar closed issue #857: http://hudi.apache.org/comparison.html# should 
mention Iceberg and DeltaLake
URL: https://github.com/apache/incubator-hudi/issues/857
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer

2019-12-16 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997317#comment-16997317
 ] 

Vinoth Chandar commented on HUDI-251:
-

[~pushpavanthar] you have the ticket now

> JDBC incremental load to HUDI with DeltaStreamer
> 
>
> Key: HUDI-251
> URL: https://issues.apache.org/jira/browse/HUDI-251
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Taher Koitawala
>Assignee: Purushotham Pushpavanthar
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Mirroring an RDBMS into HUDI is one of the most basic HUDI use cases, so 
> DeltaStreamer should provide built-in support for it.
> DeltaStreamer should accept something like a jdbc-source.properties file where 
> users can define the RDBMS connection properties along with a timestamp column 
> and a polling interval that expresses how frequently HUDI should check the 
> RDBMS data source for new inserts or updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nbalajee commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-16 Thread GitBox
nbalajee commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077#issuecomment-566294609
 
 
   > Is there a JIRA for this work?
   Sorry, I missed your comment. I've updated the header with the JIRA ID.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-335) Improvements to DiskBasedMap

2019-12-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-335:

Labels: Hoodie pull-request-available  (was: Hoodie)

> Improvements to DiskBasedMap
> 
>
> Key: HUDI-335
> URL: https://issues.apache.org/jira/browse/HUDI-335
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balajee Nagasubramaniam
>Priority: Major
>  Labels: Hoodie, pull-request-available
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2019-11-11 at 1.22.44 PM.png, Screen Shot 
> 2019-11-13 at 2.56.53 PM.png
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> DiskBasedMap is used by ExternalSpillableMap for writing (K,V) pairs to a file,
> keeping the (K, fileMetadata) entries in memory, to reduce the footprint of the 
> record on disk.
> This change improves the performance of record get/read operations from 
> disk by using
> a BufferedInputStream to cache the data.
> Results from the POC are promising. Before the write performance improvement, 
> spilling/writing 1 million records (record size ~350 bytes) to the file took 
> about 104 seconds. 
> After the improvement, the same operation completes in under 5 seconds.
> Similarly, before the read performance improvement, reading 1 million records 
> (size ~350 bytes) from the spill file took about 23 seconds. After the 
> improvement, the same operation completes in under 4 seconds.
> {{without read/write performance improvements 
> 
> RecordsHandled:   1   totalTestTime:  3145writeTime:  1176
> readTime:   255
> RecordsHandled:   5   totalTestTime:  5775writeTime:  4187
> readTime:   1175
> RecordsHandled:   10  totalTestTime:  10570   writeTime:  7718
> readTime:   2203
> RecordsHandled:   50  totalTestTime:  59723   writeTime:  45618   
> readTime:   11093
> RecordsHandled:   100 totalTestTime:  120022  writeTime:  87918   
> readTime:   22355
> RecordsHandled:   200 totalTestTime:  258627  writeTime:  187185  
> readTime:   56431}}
> {{With write improvement:
> RecordsHandled:   1   totalTestTime:  2013writeTime:  700 
> readTime:   503
> RecordsHandled:   5   totalTestTime:  2525writeTime:  390 
> readTime:   1247
> RecordsHandled:   10  totalTestTime:  3583writeTime:  464 
> readTime:   2352
> RecordsHandled:   50  totalTestTime:  22934   writeTime:  3731
> readTime:   15778
> RecordsHandled:   100 totalTestTime:  42415   writeTime:  4816
> readTime:   30332
> RecordsHandled:   200 totalTestTime:  74158   writeTime:  10192   
> readTime:   53195}}
> {{With read improvements:
> RecordsHandled:   1   totalTestTime:  2473writeTime:  1562
> readTime:   87
> RecordsHandled:   5   totalTestTime:  6169writeTime:  5151
> readTime:   438
> RecordsHandled:   10  totalTestTime:  9967writeTime:  8636
> readTime:   252
> RecordsHandled:   50  totalTestTime:  50889   writeTime:  46766   
> readTime:   1014
> RecordsHandled:   100 totalTestTime:  114482  writeTime:  104353  
> readTime:   3776
> RecordsHandled:   200 totalTestTime:  239251  writeTime:  219041  
> readTime:   8127}}
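The buffering approach described above can be illustrated with a small standalone sketch. This is not the Hudi code — just a demonstration of how wrapping the streams in BufferedOutputStream/BufferedInputStream lets most per-record reads and writes be served from an in-memory buffer instead of issuing a disk access per record:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferedSpillSketch {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("spill", ".bin");
        byte[] record = new byte[350]; // ~350-byte records, as in the benchmark above

        // Write path: BufferedOutputStream batches many small record writes
        // into one underlying disk write per buffer flush (8 KB by default).
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(file.toFile()))) {
            for (int i = 0; i < 1000; i++) {
                record[0] = (byte) i;
                out.write(record);
            }
        }

        // Read path: BufferedInputStream reads ahead, so most readFully()
        // calls are served from memory rather than from the disk.
        long sum = 0;
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file.toFile())))) {
            byte[] buf = new byte[350];
            for (int i = 0; i < 1000; i++) {
                in.readFully(buf);
                sum += buf[0] & 0xFF; // checksum over the first byte of each record
            }
        }
        Files.delete(file);
        System.out.println(sum);
    }
}
```

The actual PR goes further, using a BufferedRandomAccessFile (adapted from Apache Cassandra) so that random-access reads also benefit from the buffer.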



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] nbalajee closed pull request #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-16 Thread GitBox
nbalajee closed pull request #1077: [HUDI-335] : Improvements to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nbalajee opened a new pull request #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-16 Thread GitBox
nbalajee opened a new pull request #1077: [HUDI-335] : Improvements to 
DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077
 
 
   ## What is the purpose of the pull request
   DiskBasedMap is used by ExternalSpillableMap for writing (K,V) pairs to a file,
 keeping the (K, fileMetadata) entries in memory, to reduce the footprint of the 
record on disk.
   
 This change improves the performance of the record get/read (random 
read/sequential read) and put/write operations from/to disk, by introducing a 
data buffer/cache.
   
 Before the performance improvement:
   RecordsHandled:  1   totalTestTime:  3145writeTime:  1176
readTime:   255
   RecordsHandled:  5   totalTestTime:  5775writeTime:  4187
readTime:   1175
   RecordsHandled:  10  totalTestTime:  10570   writeTime:  7718
readTime:   2203
   RecordsHandled:  50  totalTestTime:  59723   writeTime:  45618   
readTime:   11093
   RecordsHandled:  100 totalTestTime:  120022  writeTime:  87918   
readTime:   22355
   RecordsHandled:  200 totalTestTime:  258627  writeTime:  187185  
readTime:   56431
   
 After the improvement:
   RecordsHandled: 1 totalTestTime: 1551 writeTime: 531 seqReadTime: 122 
randReadTime: 125
   RecordsHandled: 5 totalTestTime: 1371 writeTime: 420 seqReadTime: 179 
randReadTime: 250
   RecordsHandled: 10 totalTestTime: 1895 writeTime: 535 seqReadTime: 181 
randReadTime: 512
   RecordsHandled: 50 totalTestTime: 8838 writeTime: 2031 seqReadTime: 1128 
randReadTime: 2580
   RecordsHandled: 100 totalTestTime: 16147 writeTime: 4059 seqReadTime: 
1634 randReadTime: 5293
   RecordsHandled: 200 totalTestTime: 34090 writeTime: 8337 seqReadTime: 
3163 randReadTime: 10694
   
   
   ## Brief change log
   
   - Using BufferedRandomAccessFile instead of RandomAccessFile, in read path.
   - Using BufferedOutputStream in the write path. 
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as 
   TestDiskBasedMap:testSimpleInsert
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit.
   https://issues.apache.org/jira/browse/HUDI-335
   
- [x] Commit message is descriptive of the change

- [x] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nbalajee commented on a change in pull request #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-16 Thread GitBox
nbalajee commented on a change in pull request #1077: [HUDI-335] : Improvements 
to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077#discussion_r358524668
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/BufferedRandomAccessFile.java
 ##
 @@ -0,0 +1,344 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.common.util;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.util.Arrays;
+
+import org.apache.log4j.Logger;
+
+/**
+ * A BufferedRandomAccessFile is like a
+ * RandomAccessFile, but it uses a private buffer so that most
+ * operations do not require a disk access.
+ * 
+ *
+ * Note: The operations on this class are unmonitored. Also, the correct
 
 Review comment:
   Added a note that this file is adopted from 
org.apache.cassandra.io.BufferedRandomAccessFile


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nbalajee commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-16 Thread GitBox
nbalajee commented on issue #1077: [HUDI-335] : Improvements to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077#issuecomment-566301347
 
 
   > Is there a JIRA for this work?
   https://issues.apache.org/jira/browse/HUDI-335


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2019-12-16 Thread GitBox
n3nash commented on issue #1100: [HUDI-289] Implement a test suite to support 
long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#issuecomment-566306717
 
 
   @yanghua great, thanks. Can you please take a pass at the current PR and 
leave any comments that we should address? Also, let's start running the test 
suite (I'm doing that too) to see if it works as expected over a longer duration.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] wojustme commented on issue #1104: [HUDI-404] fix the error of compiling project.

2019-12-16 Thread GitBox
wojustme commented on issue #1104: [HUDI-404] fix the error of compiling 
project.
URL: https://github.com/apache/incubator-hudi/pull/1104#issuecomment-566319717
 
 
   @lamber-ken @leesf 
   I am sorry for replying so late. 
   However, it does not work in my development environment.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] lamber-ken commented on issue #1104: [HUDI-404] fix the error of compiling project.

2019-12-16 Thread GitBox
lamber-ken commented on issue #1104: [HUDI-404] fix the error of compiling 
project.
URL: https://github.com/apache/incubator-hudi/pull/1104#issuecomment-566323965
 
 
   > @lamber-ken @leesf
   > I am sorry for replying so late.
   > However, it does not work in my development environment.
   
   No problem. Could you share the versions of Scala, Java, Maven, and your operating system?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-331) Fix java docs for all public apis (HoodieWriteClient)

2019-12-16 Thread hong dongdong (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997780#comment-16997780
 ] 

hong dongdong commented on HUDI-331:


I will work on it. [~xleesf]

> Fix java docs for all public apis (HoodieWriteClient)
> -
>
> Key: HUDI-331
> URL: https://issues.apache.org/jira/browse/HUDI-331
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: newbie
> Fix For: 0.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Some public apis in HoodieWriteClient need to be fixed with sufficient info. 
> Creating this ticket to get it fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-415) HoodieSparkSqlWriter Commit time not representing the Spark job starting time

2019-12-16 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-415:
---

 Summary: HoodieSparkSqlWriter Commit time not representing the 
Spark job starting time
 Key: HUDI-415
 URL: https://issues.apache.org/jira/browse/HUDI-415
 Project: Apache Hudi (incubating)
  Issue Type: Bug
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


Hudi records the commit time after the first action complete. If there is a 
heavy transformation before isEmpty(), then the commit time could be inaccurate.
{code:java}
if (hoodieRecords.isEmpty()) {
  log.info("new batch has no new records, skipping...")
  return (true, common.util.Option.empty())
}
commitTime = client.startCommit()
writeStatuses = DataSourceUtils.doWriteOperation(client, hoodieRecords,
  commitTime, operation)
{code}
For example, I start the Spark job at 20190101, but *isEmpty()* ran for 2 
hours, so the commit time in the .hoodie folder will be 201901010200. If I 
use that commit time to ingest data starting from 201901010200 (from HDFS, not 
using DeltaStreamer), then I will miss 2 hours of data.

Is this setup intended? Can we move the commit time before *isEmpty()*?
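A minimal sketch of the reordering proposed here, using hypothetical stand-ins rather than the real HoodieSparkSqlWriter/HoodieWriteClient API — the point is simply that the commit time is captured before the potentially slow isEmpty() check:

```java
import java.util.Collections;
import java.util.List;

public class CommitTimeOrderSketch {
    // Stand-in for client.startCommit(); returns the commit timestamp.
    static String startCommit() { return "20190101000000"; }

    // Proposed ordering: capture the commit time up front, before isEmpty()
    // (which may trigger hours of upstream transformation), so the timestamp
    // reflects the job start rather than the post-isEmpty() time.
    static String writeBatch(List<String> hoodieRecords) {
        String commitTime = startCommit();
        if (hoodieRecords.isEmpty()) {
            return null; // no new records: skip the write
        }
        // doWriteOperation(client, hoodieRecords, commitTime, operation) would go here.
        return commitTime;
    }

    public static void main(String[] args) {
        System.out.println(writeBatch(List.of("record-1")));
        System.out.println(writeBatch(Collections.emptyList()));
    }
}
```

One tradeoff to weigh: starting the commit first means an empty batch leaves behind a requested commit that would need to be aborted or cleaned up.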



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] leesf opened a new pull request #1107: [MINOR] add committers info

2019-12-16 Thread GitBox
leesf opened a new pull request #1107: [MINOR] add committers info
URL: https://github.com/apache/incubator-hudi/pull/1107
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   add committers info.
   
   ## Brief change log
   
   add committers info.
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-331) Fix java docs for all public apis (HoodieWriteClient)

2019-12-16 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997800#comment-16997800
 ] 

leesf commented on HUDI-331:


[~hongdongdong] Thanks.

> Fix java docs for all public apis (HoodieWriteClient)
> -
>
> Key: HUDI-331
> URL: https://issues.apache.org/jira/browse/HUDI-331
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: sivabalan narayanan
>Priority: Major
>  Labels: newbie
> Fix For: 0.5.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Some public apis in HoodieWriteClient need to be fixed with sufficient info. 
> Creating this ticket to get it fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1100: [HUDI-289] Implement a test suite to support long running test for Hudi writing and querying end-end

2019-12-16 Thread GitBox
yanghua commented on a change in pull request #1100: [HUDI-289] Implement a 
test suite to support long running test for Hudi writing and querying end-end
URL: https://github.com/apache/incubator-hudi/pull/1100#discussion_r358569570
 
 

 ##
 File path: 
hudi-test-suite/src/main/java/org/apache/hudi/testsuite/dag/WorkflowDagGenerator.java
 ##
 @@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.testsuite.dag;
+
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.testsuite.configuration.DeltaConfig.Config;
+import org.apache.hudi.testsuite.dag.nodes.DagNode;
+import org.apache.hudi.testsuite.dag.nodes.HiveQueryNode;
+import org.apache.hudi.testsuite.dag.nodes.InsertNode;
+import org.apache.hudi.testsuite.dag.nodes.UpsertNode;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * An example of how to generate a workflow dag programmatically. This is also 
used as the default workflow dag if
+ * none is provided.
+ */
+public class WorkflowDagGenerator {
 
 Review comment:
   Since this is an example generator, shall we rename the class to 
`SimpleWorkflowDagGenerator` or `DefaultWorkflowDagGenerator`? WDYT? @n3nash 
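The class under discussion builds a workflow DAG programmatically. A self-contained sketch of that idea, using stub node types rather than the actual hudi-test-suite classes (an insert, then an upsert, then a query for validation):

```java
import java.util.ArrayList;
import java.util.List;

public class WorkflowDagSketch {
    // Minimal stand-in for the test suite's DagNode: a named operation with children.
    static class DagNode {
        final String name;
        final List<DagNode> children = new ArrayList<>();
        DagNode(String name) { this.name = name; }
        DagNode addChild(DagNode child) { children.add(child); return this; }
    }

    // Build a default-style workflow: insert -> upsert -> hive-query.
    static DagNode buildDag() {
        DagNode insert = new DagNode("insert");
        DagNode upsert = new DagNode("upsert");
        DagNode query = new DagNode("hive-query");
        insert.addChild(upsert);
        upsert.addChild(query);
        return insert;
    }

    // Walk the DAG depth-first, printing the execution order.
    static void run(DagNode node) {
        System.out.println(node.name);
        for (DagNode child : node.children) {
            run(child);
        }
    }

    public static void main(String[] args) {
        run(buildDag());
    }
}
```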


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #131

2019-12-16 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.21 KB...]
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 'HUDI_home=0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle[j

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1077: [HUDI-335] : Improvements to DiskbasedMap

2019-12-16 Thread GitBox
vinothchandar commented on a change in pull request #1077: [HUDI-335] : 
Improvements to DiskbasedMap
URL: https://github.com/apache/incubator-hudi/pull/1077#discussion_r358592257
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/BufferedRandomAccessFile.java
 ##
 @@ -0,0 +1,350 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package com.uber.hoodie.common.util;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+import java.io.RandomAccessFile;
+import java.util.Arrays;
+
+import org.apache.log4j.Logger;
+
+/**
+ * This product includes code from Apache Casendra.
 
 Review comment:
   typo: Apache Cassandra


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1105: [HUDI-405] Fix sync no hive partition at first time

2019-12-16 Thread GitBox
vinothchandar commented on a change in pull request #1105: [HUDI-405] Fix sync 
no hive partition at first time
URL: https://github.com/apache/incubator-hudi/pull/1105#discussion_r358599122
 
 

 ##
 File path: hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
 ##
 @@ -632,24 +632,33 @@ public void close() {
 if (!lastCommitTimeSynced.isPresent()) {
   LOG.info("Last commit time synced is not known, listing all partitions 
in " + syncConfig.basePath + ",FS :" + fs);
   try {
-return FSUtils.getAllPartitionPaths(fs, syncConfig.basePath, 
syncConfig.assumeDatePartitioning);
+
+List<String> fsPartitions = FSUtils.getAllPartitionPaths(fs, 
syncConfig.basePath, syncConfig.assumeDatePartitioning);
+List<String> tlPartitions = findPartitionsAfter("0");
 
 Review comment:
   Is there a better way to get the only commit that may be present? Let's try 
to avoid hardcoding `0`.
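One hypothetical way to avoid the `"0"` sentinel, sketched here with a stubbed timeline rather than the real Hudi timeline API: derive the lower bound from the earliest instant actually present on the timeline, and treat an empty timeline as "nothing to list".

```java
import java.util.Collections;
import java.util.List;
import java.util.Optional;

public class EarliestInstantSketch {
    // Stub of a commit timeline: a sorted list of instant timestamps.
    static class Timeline {
        private final List<String> instants;
        Timeline(List<String> instants) { this.instants = instants; }
        Optional<String> firstInstant() {
            return instants.isEmpty() ? Optional.empty() : Optional.of(instants.get(0));
        }
    }

    // Instead of findPartitionsAfter("0"), ask the timeline for its earliest
    // instant; Optional.empty() signals that no commits exist yet.
    static Optional<String> partitionScanLowerBound(Timeline timeline) {
        return timeline.firstInstant();
    }

    public static void main(String[] args) {
        Timeline t = new Timeline(List.of("20191216010101", "20191216020202"));
        System.out.println(partitionScanLowerBound(t).orElse("none"));
        System.out.println(partitionScanLowerBound(new Timeline(Collections.emptyList())).orElse("none"));
    }
}
```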


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-259) Hadoop 3 support for Hudi writing

2019-12-16 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997899#comment-16997899
 ] 

Vinoth Chandar commented on HUDI-259:
-

I believe we will get some eyes on this after the holidays :)

> Hadoop 3 support for Hudi writing
> -
>
> Key: HUDI-259
> URL: https://issues.apache.org/jira/browse/HUDI-259
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> Sample issues
>  
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568] 
> [https://github.com/apache/incubator-hudi/issues/898]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] leesf merged pull request #1107: [MINOR] add committers info

2019-12-16 Thread GitBox
leesf merged pull request #1107: [MINOR] add committers info
URL: https://github.com/apache/incubator-hudi/pull/1107
 
 
   




[incubator-hudi] branch asf-site updated: [MINOR] add committers info (#1107)

2019-12-16 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new e3bce1f  [MINOR] add committers info (#1107)
e3bce1f is described below

commit e3bce1f38576a502a00c3f4ef8b26b281462e5a6
Author: leesf <490081...@qq.com>
AuthorDate: Tue Dec 17 13:34:32 2019 +0800

[MINOR] add committers info (#1107)
---
 docs/community.cn.md | 20 +++-
 docs/community.md| 20 +++-
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/docs/community.cn.md b/docs/community.cn.md
index aba54b2..853cfce 100644
--- a/docs/community.cn.md
+++ b/docs/community.cn.md
@@ -49,6 +49,24 @@ Committers are chosen by a majority vote of the Apache Hudi [PMC](https://www.ap
  - Great citizenship in helping with all peripheral (but very critical) work like site maintenance, wiki/jira cleanups and so on.
  - Proven commitment to the project by way of upholding all agreed upon processes, conventions and principles of the community.
 
+### The Committers
+
+| Image | Name | Role | Apache ID |
+| ---- | ---- | ---- | ---- |
+| <img src="https://avatars.githubusercontent.com/alunarbeach" width="100" height="100" alt="alunarbeach" align=center /> | [Anbu Cheeralan](https://github.com/alunarbeach) | PPMC, Committer | anchee |
+| <img src="https://avatars.githubusercontent.com/bhasudha" width="100" height="100" alt="bhasudha" align=center /> | [Bhavani Sudha](https://github.com/bhasudha) | Committer | bhavanisudha |
+| <img src="https://avatars.githubusercontent.com/bvaradar" width="100" height="100" alt="bvaradar" align=center /> | [Balaji Varadarajan](https://github.com/bvaradar) | PPMC, Committer | vbalaji |
+| <img src="https://avatars.githubusercontent.com/kishoreg" width="100" height="100" alt="kishoreg" align=center /> | [Kishore Gopalakrishna](https://github.com/kishoreg) | PPMC, Committer | kishoreg |
+| <img src="https://avatars.githubusercontent.com/leesf" width="100" height="100" alt="leesf" align=center /> | [Shaofeng Li](https://github.com/leesf) | Committer | leesf |
+| <img src="https://avatars.githubusercontent.com/lresende" width="100" height="100" alt="lresende" align=center /> | [Luciano Resende](https://github.com/lresende) | PPMC, Committer | lresende |
+| <img src="https://avatars.githubusercontent.com/n3nash" width="100" height="100" alt="n3nash" align=center /> | [Nishith Agarwal](https://github.com/n3nash) | PPMC, Committer | nagarwal |
+| <img src="https://avatars.githubusercontent.com/prasannarajaperumal" width="100" height="100" alt="prasannarajaperumal" align=center /> | [Prasanna Rajaperumal](https://github.com/prasannarajaperumal) | PPMC, Committer | prasanna |
+| <img src="https://avatars.githubusercontent.com/smarthi" width="100" height="100" alt="smarthi" align=center /> | [Suneel Marthi](https://github.com/smarthi) | PPMC, Committer | smarthi |
+| <img src="https://avatars.githubusercontent.com/tweise" width="100" height="100" alt="tweise" align=center /> | [Thomas Weise](https://github.com/tweise) | PPMC, Committer | thw |
+| <img src="https://avatars.githubusercontent.com/vinothchandar" width="100" height="100" alt="vinothchandar" align=center /> | [vinoth chandar](https://github.com/vinothchandar) | PPMC, Committer | vinoth |
+| <img src="https://avatars.githubusercontent.com/yanghua" width="100" height="100" alt="yanghua" align=center /> | [vinoyang](https://github.com/yanghua) | Committer | vinoyang |
+| <img src="https://avatars.githubusercontent.com/zqureshi" width="100" height="100" alt="zqureshi" align=center /> | [Zeeshan Qureshi](https://github.com/zqureshi) | PPMC, Committer | zqureshi |
+
 
 ### Code Contributions
 
@@ -61,4 +79,4 @@ It's useful to obtain few accounts to be able to effectively contribute to Hudi.
  
  - Github account is needed to send pull requests to Hudi
  - Sign-up/in to the Apache [JIRA](https://issues.apache.org/jira). Then please email the dev mailing list with your username, asking to be added as a contributor to the project. This enables you to assign/be-assigned tickets and comment on them. 
- - Sign-up/in to the Apache [cWiki](https://cwiki.apache.org/confluence/signup.action), to be able to contribute to the wiki pages/HIPs. 
\ No newline at end of file
+ - Sign-up/in to the Apache [cWiki](https://cwiki.apache.org/confluence/signup.action)

[GitHub] [incubator-hudi] leesf commented on issue #1095: [HUDI-210] Implement prometheus metrics reporter

2019-12-16 Thread GitBox
leesf commented on issue #1095: [HUDI-210] Implement prometheus metrics reporter
URL: https://github.com/apache/incubator-hudi/pull/1095#issuecomment-566389322
 
 
   @lamber-ken @XuQianJin-Stars. Is it ready?




[GitHub] [incubator-hudi] lamber-ken commented on issue #1105: [HUDI-405] Fix sync no hive partition at first time

2019-12-16 Thread GitBox
lamber-ken commented on issue #1105: [HUDI-405] Fix sync no hive partition at 
first time
URL: https://github.com/apache/incubator-hudi/pull/1105#issuecomment-566390544
 
 
   > Don't follow why the partitions are not visible after the commit? Can we 
first layout the root cause for that?
   
   ### Why the first sync can't get the partitions
   
   On the first sync, the `lastCommitTimeSynced` of the target table is not present, so HoodieHiveClient gets all partition paths via `FSUtils.getAllPartitionPaths`. If `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` is set to `true`, the utility only matches `basePath + /*/*/*`, but the actual partition is `basePath + /yyyy-MM-dd`. 
   
   
![image](https://user-images.githubusercontent.com/20113411/70967797-5cc72f00-20d2-11ea-8004-6d910879d1ac.png)
   
   ### Two ways to solve this problem
   1, Set `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` to `false`. After that, HoodieHiveClient will get all folder partitions; for details, see `FSUtils#getAllPartitionPaths`.
   
   2, If the user customizes the partition extractor and HiveSyncTool syncs no partition at the first commit, we can get the partition info from `HoodieTimeline`, just like the code I modified.
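
   The glob mismatch described above can be sketched outside Hudi with plain pattern matching; the paths and pattern below are illustrative stand-ins, not Hudi's actual listing code:

```python
# Minimal sketch (not Hudi code) of why a date-partitioning glob misses
# flat "yyyy-MM-dd" folders; the paths and pattern are illustrative only.
import fnmatch

partitions = ["2019/12/16", "2019-12-16"]  # nested vs. flat layout
date_glob = "*/*/*"  # roughly what assumeDatePartitioning matches

matched = [p for p in partitions if fnmatch.fnmatch(p, date_glob)]
# only the nested "2019/12/16" layout matches; the flat folder is skipped
```

   This is why flipping the assume-date-partitioning flag (option 1 above) changes which folders are discovered.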
   
   
   
   




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1105: [HUDI-405] Fix sync no hive partition at first time

2019-12-16 Thread GitBox
lamber-ken edited a comment on issue #1105: [HUDI-405] Fix sync no hive 
partition at first time
URL: https://github.com/apache/incubator-hudi/pull/1105#issuecomment-566390544
 
 
   > Don't follow why the partitions are not visible after the commit? Can we 
first layout the root cause for that?
   
   ### Why the first sync can't get the partitions
   
   On the first sync, the `lastCommitTimeSynced` of the target table is not present, so HoodieHiveClient gets all partition paths via `FSUtils.getAllPartitionPaths`. If `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` is set to `true`, the utility only matches `basePath + /*/*/*`, but the actual partition is `basePath + /yyyy-MM-dd`. 
   
   
![image](https://user-images.githubusercontent.com/20113411/70967797-5cc72f00-20d2-11ea-8004-6d910879d1ac.png)
   
   ### Two ways to solve this problem
   1, Set `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` to `false`. After that, HoodieHiveClient will get all folder partitions; for details, see `FSUtils#getAllPartitionPaths`.
   
   2, If the user customizes the partition extractor and HiveSyncTool syncs no partition at the first commit, we can get the partition info from `HoodieTimeline`, just like the code I modified.
   
   
   
   




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1105: [HUDI-405] Fix sync no hive partition at first time

2019-12-16 Thread GitBox
lamber-ken commented on a change in pull request #1105: [HUDI-405] Fix sync no 
hive partition at first time
URL: https://github.com/apache/incubator-hudi/pull/1105#discussion_r358611033
 
 

 ##
 File path: hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
 ##
 @@ -632,24 +632,33 @@ public void close() {
     if (!lastCommitTimeSynced.isPresent()) {
       LOG.info("Last commit time synced is not known, listing all partitions in " + syncConfig.basePath + ",FS :" + fs);
       try {
-        return FSUtils.getAllPartitionPaths(fs, syncConfig.basePath, syncConfig.assumeDatePartitioning);
+
+        List<String> fsPartitions = FSUtils.getAllPartitionPaths(fs, syncConfig.basePath, syncConfig.assumeDatePartitioning);
+        List<String> tlPartitions = findPartitionsAfter("0");
 
 Review comment:
   > Is there a better way to get the only commit that may be present? lets try 
to avoid hardcoding `0`
   
   Thanks, let me think about it.




[GitHub] [incubator-hudi] lamber-ken edited a comment on issue #1105: [HUDI-405] Fix sync no hive partition at first time

2019-12-16 Thread GitBox
lamber-ken edited a comment on issue #1105: [HUDI-405] Fix sync no hive 
partition at first time
URL: https://github.com/apache/incubator-hudi/pull/1105#issuecomment-566390544
 
 
   > Don't follow why the partitions are not visible after the commit? Can we 
first layout the root cause for that?
   
   ### Why the first sync can't get the partitions
   
   On the first sync, the `lastCommitTimeSynced` of the target table is not present, so HoodieHiveClient gets all partition paths via `FSUtils.getAllPartitionPaths`. If `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` is set to `true`, the utility only matches `basePath + /*/*/*`, but the actual partition is `basePath + /yyyy-MM-dd`. 
   
   
![image](https://user-images.githubusercontent.com/20113411/70967797-5cc72f00-20d2-11ea-8004-6d910879d1ac.png)
   
   ### Two ways to solve this problem
   1, Set `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` to `false`. After that, HoodieHiveClient will get all folder partitions; for details, see `FSUtils#getAllPartitionPaths`.
   
   2, If the user customizes the partition extractor and HiveSyncTool syncs no partition at the first commit, we can get the partition info from `HoodieTimeline`, just like the code I modified.
   
   IMO, the second solution guarantees that whether `HIVE_ASSUME_DATE_PARTITION_OPT_KEY` is true or not, we can sync the partitions on the first sync.
   
   




[GitHub] [incubator-hudi] lamber-ken commented on a change in pull request #1105: [HUDI-405] Fix sync no hive partition at first time

2019-12-16 Thread GitBox
lamber-ken commented on a change in pull request #1105: [HUDI-405] Fix sync no 
hive partition at first time
URL: https://github.com/apache/incubator-hudi/pull/1105#discussion_r358614326
 
 

 ##
 File path: hudi-hive/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
 ##
 @@ -632,24 +632,33 @@ public void close() {
     if (!lastCommitTimeSynced.isPresent()) {
       LOG.info("Last commit time synced is not known, listing all partitions in " + syncConfig.basePath + ",FS :" + fs);
       try {
-        return FSUtils.getAllPartitionPaths(fs, syncConfig.basePath, syncConfig.assumeDatePartitioning);
+
+        List<String> fsPartitions = FSUtils.getAllPartitionPaths(fs, syncConfig.basePath, syncConfig.assumeDatePartitioning);
+        List<String> tlPartitions = findPartitionsAfter("0");
 
 Review comment:
   My idea is to extract a public method that gets all the instants after `startTs`. 
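
   A language-neutral way to see the idea: derive the start from the earliest instant on the timeline rather than a sentinel value. The commit timestamps, partitions, and helper below are made-up illustrations, not Hudi's `HoodieTimeline` API:

```python
# Hypothetical sketch: find partitions touched since the earliest commit,
# instead of passing a hardcoded "0" start time. All data here is invented
# for illustration; it is not Hudi code.
commits = {
    "20191216093000": ["2019-12-16"],
    "20191217100000": ["2019-12-17"],
}

def partitions_after(start_ts):
    # collect partitions written by commits at or after start_ts
    return sorted({p for ts, parts in commits.items() if ts >= start_ts for p in parts})

earliest = min(commits)  # use the first instant itself, no magic "0"
all_partitions = partitions_after(earliest)
```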




[jira] [Commented] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer

2019-12-16 Thread Purushotham Pushpavanthar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997914#comment-16997914
 ] 

Purushotham Pushpavanthar commented on HUDI-251:


Thanks. [~vinoth], how do I create an RFC? I'm working with my colleagues 
([~rushi1988] and [~inabdul]) on this feature, and we've come up with 
requirements and a design. We would like to share it with the community for 
comments before we proceed with development.

> JDBC incremental load to HUDI with DeltaStreamer
> 
>
> Key: HUDI-251
> URL: https://issues.apache.org/jira/browse/HUDI-251
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Taher Koitawala
>Assignee: Purushotham Pushpavanthar
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, 
> for such use cases, DeltaStreamer should provide inbuilt support.
> DeltaSteamer should accept something like jdbc-source.properties where users 
> can define the RDBMS connection properties along with a timestamp column and 
> an interval which allows users to express how frequently HUDI should check 
> with RDBMS data source for new inserts or updates.





[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1108: [MINOR] Add slack invite icon on README

2019-12-16 Thread GitBox
lamber-ken opened a new pull request #1108: [MINOR] Add slack invite icon on 
README
URL: https://github.com/apache/incubator-hudi/pull/1108
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Add slack invite icon on README
   
   ## Brief change log
   
 - Add slack invite icon on README
   
   ## Verify this pull request
   
   This pull request is doc update without any test coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #976: [HUDI-106] Adding support for DynamicBloomFilter

2019-12-16 Thread GitBox
nsivabalan commented on a change in pull request #976: [HUDI-106] Adding 
support for DynamicBloomFilter
URL: https://github.com/apache/incubator-hudi/pull/976#discussion_r358621679
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/bloom/filter/TestInternalDynamicBloomFilter.java
 ##
 @@ -0,0 +1,57 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bloom.filter;
+
+import org.apache.hadoop.util.hash.Hash;
+import org.junit.Test;
+
+/**
+ * Unit tests {@link InternalDynamicBloomFilter} for size bounding.
+ */
+public class TestInternalDynamicBloomFilter {
+
+  @Test
+  public void testBoundedSize() {
+
+int[] batchSizes = {1000, 1, 1, 10, 10, 1};
+int indexForMaxGrowth = 3;
+int maxSize = batchSizes[0] * 100;
+BloomFilter filter = new HoodieDynamicBoundedBloomFilter(batchSizes[0], 
0.01, Hash.MURMUR_HASH, maxSize);
+int index = 0;
+int lastKnownBloomSize = 0;
+while (index < batchSizes.length) {
+  for (int i = 0; i < batchSizes[index]; i++) {
+String key = 
org.apache.commons.lang.RandomStringUtils.randomAlphanumeric(50);
+filter.add(key);
+  }
+
+  String serString = filter.serializeToString();
+  if (index != 0) {
+int curLength = serString.length();
+if (index > indexForMaxGrowth) {
+  assert curLength == lastKnownBloomSize;
 
 Review comment:
   Fixed the assert. But I didn't feel a need for a parametrized test, since we are only testing the dynamic filter for boundedness once the threshold is met. 
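
   The boundedness being asserted can be illustrated with a toy stand-in. The class below uses plain sets instead of real bloom bitsets, and every name in it is hypothetical rather than Hudi's:

```python
# Toy sketch of the size-bounding behavior under test: a "dynamic" filter
# adds a new sub-filter once the current one fills up, but stops growing
# past a cap and keeps reusing the last sub-filter. Sets stand in for
# bloom bitsets; names are hypothetical, not Hudi's.
class BoundedDynamicFilter:
    def __init__(self, per_filter_capacity, max_filters):
        self.per_filter_capacity = per_filter_capacity
        self.max_filters = max_filters
        self.filters = [set()]

    def add(self, key):
        last_full = len(self.filters[-1]) >= self.per_filter_capacity
        if last_full and len(self.filters) < self.max_filters:
            self.filters.append(set())  # grow while under the cap
        self.filters[-1].add(key)       # past the cap, reuse the last sub-filter

f = BoundedDynamicFilter(per_filter_capacity=100, max_filters=3)
for i in range(1000):
    f.add(f"key-{i}")
# the sub-filter count stays bounded at max_filters even after 1000 inserts
```

   The trade-off mirrored here is the one in the test: once the growth index is passed, the serialized size stops increasing (at the cost of a rising false-positive rate).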




[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-16 Thread GitBox
nsivabalan commented on a change in pull request #1091: [HUDI-389] Fixing Index 
look up to return right partitions for a given key along with fileId with 
Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r358621923
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/TestHoodieClientOnCopyOnWriteStorage.java
 ##
 @@ -336,7 +335,54 @@ public void testDeletes() throws Exception {
   }
 
   /**
-   * Test scenario of new file-group getting added during upsert().
+   * Test update of a record to different partition with Global Index
+   */
+  @Test
+  public void testUpsertToDiffPartitionGlobaIndex() throws Exception {
 
 Review comment:
   Have fixed one of the tests for GlobalIndex as well. 




[GitHub] [incubator-hudi] nsivabalan commented on issue #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-16 Thread GitBox
nsivabalan commented on issue #1091: [HUDI-389] Fixing Index look up to return 
right partitions for a given key along with fileId with Global Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#issuecomment-566403318
 
 
   @vinothchandar: I have addressed all feedback so far; waiting for a few 
clarifications from you. 




[jira] [Commented] (HUDI-259) Hadoop 3 support for Hudi writing

2019-12-16 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997935#comment-16997935
 ] 

Yanjia Gary Li commented on HUDI-259:
-

I am already using Hadoop 3 with Spark 2.4. So far so good :P 

I built Hudi with *mvn clean install -DskipTests -DskipITs*

Not an ideal way, but I haven't seen any problems on the cluster yet. 

 

> Hadoop 3 support for Hudi writing
> -
>
> Key: HUDI-259
> URL: https://issues.apache.org/jira/browse/HUDI-259
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Pratyaksh Sharma
>Priority: Major
>
> Sample issues
>  
> [https://github.com/apache/incubator-hudi/issues/735]
> [https://github.com/apache/incubator-hudi/issues/877#issuecomment-528433568] 
> [https://github.com/apache/incubator-hudi/issues/898]
>  





[GitHub] [incubator-hudi] leesf merged pull request #1108: [MINOR] Add slack invite icon on README

2019-12-16 Thread GitBox
leesf merged pull request #1108: [MINOR] Add slack invite icon on README
URL: https://github.com/apache/incubator-hudi/pull/1108
 
 
   




[GitHub] [incubator-hudi] lamber-ken commented on issue #1095: [HUDI-210] Implement prometheus metrics reporter

2019-12-16 Thread GitBox
lamber-ken commented on issue #1095: [HUDI-210] Implement prometheus metrics 
reporter
URL: https://github.com/apache/incubator-hudi/pull/1095#issuecomment-566414155
 
 
   > @lamber-ken @XuQianJin-Stars. Is it ready?
   
   Not yet. 




[incubator-hudi] branch master updated (9a1f698 -> 7498ca7)

2019-12-16 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 9a1f698  [HUDI-308] Avoid Renames for tracking state transitions of 
all actions on dataset
 add 7498ca7  [MINOR] Add slack invite icon in README (#1108)

No new revisions were added by this update.

Summary of changes:
 README.md | 1 +
 1 file changed, 1 insertion(+)



[GitHub] [incubator-hudi] ezhux opened a new pull request #1109: [HUDI-238] - Migrating to Scala 2.12

2019-12-16 Thread GitBox
ezhux opened a new pull request #1109: [HUDI-238] - Migrating to Scala 2.12
URL: https://github.com/apache/incubator-hudi/pull/1109
 
 
   Summary:
   - Migrating to Scala 2.12
   - Migrating to spark-streaming-kafka-0-10
   - Adapting Kafka and Zookeeper
   
   This includes some changes that are in 
https://github.com/apache/incubator-hudi/pull/1005 as well, but that PR is not 
merged yet.
   
   The main goal of this PR is to migrate to Scala 2.12. Along the way I had to 
upgrade the spark-streaming-kafka version from 0.8 to 0.10, because there is no 
spark-streaming-kafka-0.8 version for Scala 2.12.
   
   spark-streaming-kafka-0.10 uses Kafka 2.0 with the new Consumer API. In 
order to test it, I had to adapt some test classes from Spark core.
   
   Unit and integration tests do pass, but I haven't checked what has to be 
done in order to test with Docker. This is my first PR, so feel free to correct 
and advise.
   
   
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.




[jira] [Updated] (HUDI-238) Make separate release for hudi spark/scala based packages for scala 2.12

2019-12-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-238:

Labels: pull-request-available  (was: )

> Make separate release for hudi spark/scala based packages for scala 2.12 
> -
>
> Key: HUDI-238
> URL: https://issues.apache.org/jira/browse/HUDI-238
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: asf-migration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> [https://github.com/apache/incubator-hudi/issues/881#issuecomment-528700749]
> Suspects: 
> h3. Hudi utilities package 
> bringing in spark-streaming-kafka-0.8* 
> {code:java}
> [INFO] Scanning for projects...
> [INFO] 
> [INFO] ---< org.apache.hudi:hudi-utilities 
> >---
> [INFO] Building hudi-utilities 0.5.0-SNAPSHOT
> [INFO] [ jar 
> ]-
> [INFO] 
> [INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) @ hudi-utilities 
> ---
> [INFO] org.apache.hudi:hudi-utilities:jar:0.5.0-SNAPSHOT
> [INFO] ...
> [INFO] +- org.apache.hudi:hudi-client:jar:0.5.0-SNAPSHOT:compile
>...
> [INFO] 
> [INFO] +- org.apache.hudi:hudi-spark:jar:0.5.0-SNAPSHOT:compile
> [INFO] |  \- org.scala-lang:scala-library:jar:2.11.8:compile
> [INFO] +- log4j:log4j:jar:1.2.17:compile
>...
> [INFO] +- org.apache.spark:spark-core_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.avro:avro-mapred:jar:hadoop2:1.7.7:provided
> [INFO] |  |  +- org.apache.avro:avro-ipc:jar:1.7.7:provided
> [INFO] |  |  \- org.apache.avro:avro-ipc:jar:tests:1.7.7:provided
> [INFO] |  +- com.twitter:chill_2.11:jar:0.8.0:provided
> [INFO] |  +- com.twitter:chill-java:jar:0.8.0:provided
> [INFO] |  +- org.apache.xbean:xbean-asm5-shaded:jar:4.4:provided
> [INFO] |  +- org.apache.spark:spark-launcher_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-common_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-network-shuffle_2.11:jar:2.1.0:provided
> [INFO] |  +- org.apache.spark:spark-unsafe_2.11:jar:2.1.0:provided
> [INFO] |  +- net.java.dev.jets3t:jets3t:jar:0.7.1:provided
> [INFO] |  +- org.apache.curator:curator-recipes:jar:2.4.0:provided
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.5:provided
> [INFO] |  +- org.apache.commons:commons-math3:jar:3.4.1:provided
> [INFO] |  +- com.google.code.findbugs:jsr305:jar:1.3.9:provided
> [INFO] |  +- org.slf4j:slf4j-api:jar:1.7.16:compile
> [INFO] |  +- org.slf4j:jul-to-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:jcl-over-slf4j:jar:1.7.16:provided
> [INFO] |  +- org.slf4j:slf4j-log4j12:jar:1.7.16:compile
> [INFO] |  +- com.ning:compress-lzf:jar:1.0.3:provided
> [INFO] |  +- org.xerial.snappy:snappy-java:jar:1.1.2.6:compile
> [INFO] |  +- net.jpountz.lz4:lz4:jar:1.3.0:compile
> [INFO] |  +- org.roaringbitmap:RoaringBitmap:jar:0.5.11:provided
> [INFO] |  +- commons-net:commons-net:jar:2.2:provided
>
> [INFO] +- org.apache.spark:spark-sql_2.11:jar:2.1.0:provided
> [INFO] |  +- com.univocity:univocity-parsers:jar:2.2.1:provided
> [INFO] |  +- org.apache.spark:spark-sketch_2.11:jar:2.1.0:provided
> [INFO] |  \- org.apache.spark:spark-catalyst_2.11:jar:2.1.0:provided
> [INFO] | +- org.codehaus.janino:janino:jar:3.0.0:provided
> [INFO] | +- org.codehaus.janino:commons-compiler:jar:3.0.0:provided
> [INFO] | \- org.antlr:antlr4-runtime:jar:4.5.3:provided
> [INFO] +- com.databricks:spark-avro_2.11:jar:4.0.0:provided
> [INFO] +- org.apache.spark:spark-streaming_2.11:jar:2.1.0:compile
> [INFO] +- org.apache.spark:spark-streaming-kafka-0-8_2.11:jar:2.1.0:compile
> [INFO] |  \- org.apache.kafka:kafka_2.11:jar:0.8.2.1:compile
> [INFO] | +- org.scala-lang.modules:scala-xml_2.11:jar:1.0.2:compile
> [INFO] | +- 
> org.scala-lang.modules:scala-parser-combinators_2.11:jar:1.0.2:compile
> [INFO] | \- org.apache.kafka:kafka-clients:jar:0.8.2.1:compile
> [INFO] +- io.dropwizard.metrics:metrics-core:jar:4.0.2:compile
> [INFO] +- org.antlr:stringtemplate:jar:4.0.2:compile
> [INFO] |  \- org.antlr:antlr-runtime:jar:3.3:compile
> [INFO] +- com.beust:jcommander:jar:1.72:compile
> [INFO] +- com.twitter:bijection-avro_2.11:jar:0.9.2:compile
> [INFO] |  \- com.twitter:bijection-core_2.11:jar:0.9.2:compile
> [INFO] +- io.confluent:kafka-avro-serializer:jar:3.0.0:compile
> [INFO] +- io.confluent:common-config:jar:3.0.0:compile
> [INFO] +- io.confluent:common-utils:jar:3.0.0:compile
> [INFO] |  \- com.101tec:zkclient:jar:0.5:compile
> [INFO] +- io.confluent:kafka-schema-registry-client:jar:3.0.0:compile
> [INFO] \- org.mockito:mockito-all:jar:1.10.19:test
> [INFO] 
> 

[jira] [Created] (HUDI-416) Improve hint information for Cli

2019-12-16 Thread hong dongdong (Jira)
hong dongdong created HUDI-416:
--

 Summary: Improve hint information for Cli
 Key: HUDI-416
 URL: https://issues.apache.org/jira/browse/HUDI-416
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: CLI
Reporter: hong dongdong


Right now, the CLI always gives this error message:
{code:java}
Command 'desc' was found but is not currently available (type 'help' then ENTER to learn about this command)
{code}
which is confusing to the user. We can give a clearer hint, such as:
{code:java}
Command failed java.lang.NullPointerException: There is no hudi dataset. Please use connect command to set dataset first
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-416) Improve hint information for Cli

2019-12-16 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-416:

Labels: pull-request-available  (was: )

> Improve hint information for Cli
> 
>
> Key: HUDI-416
> URL: https://issues.apache.org/jira/browse/HUDI-416
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: CLI
>Reporter: hong dongdong
>Priority: Minor
>  Labels: pull-request-available
>
> Right now, the CLI always gives this error message:
> {code:java}
> Command 'desc' was found but is not currently available (type 'help' then ENTER to learn about this command)
> {code}
> which is confusing to the user. We can give a clearer hint, such as:
> {code:java}
> Command failed java.lang.NullPointerException: There is no hudi dataset. Please use connect command to set dataset first
> {code}





[GitHub] [incubator-hudi] leesf commented on a change in pull request #1091: [HUDI-389] Fixing Index look up to return right partitions for a given key along with fileId with Global Bloom

2019-12-16 Thread GitBox
leesf commented on a change in pull request #1091: [HUDI-389] Fixing Index look 
up to return right partitions for a given key along with fileId with Global 
Bloom
URL: https://github.com/apache/incubator-hudi/pull/1091#discussion_r358643404
 
 

 ##
 File path: hudi-client/src/test/java/org/apache/hudi/index/bloom/TestHoodieGlobalBloomIndex.java
 ##
 @@ -267,12 +267,20 @@ public void testTagLocation() throws Exception {
 for (HoodieRecord record : taggedRecordRDD.collect()) {
   if (record.getRecordKey().equals("000")) {
 
assertTrue(record.getCurrentLocation().getFileId().equals(FSUtils.getFileId(filename0)));
+System.out.println("Record data " + record.getData().toString() + " rowChage 1 " + rowChange1.toString());
 
 Review comment:
   remove `System.out.println`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong opened a new pull request #1110: [HUDI-416]improve hint information for cli

2019-12-16 Thread GitBox
hddong opened a new pull request #1110: [HUDI-416]improve hint information for 
cli
URL: https://github.com/apache/incubator-hudi/pull/1110
 
 
   ## What is the purpose of the pull request
   
   Right now, the CLI always gives this error message: `Command 'desc' was found but is not currently available (type 'help' then ENTER to learn about this command)`, which is confusing to the user. We can give a clearer hint, such as: `Command failed java.lang.NullPointerException: There is no hudi dataset. Please use connect command to set dataset first`
   
   ## Brief change log
   
 - *use `exception` instead of `CliAvailabilityIndicator`*
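
   The change described in this PR (failing inside the command with a clear, actionable message instead of relying on a shell availability indicator) could look roughly like the following minimal sketch. This is a hypothetical illustration, not the actual Hudi CLI code; the class name `DescCommandSketch` and field `connectedDatasetPath` are invented for the example.

```java
// Hypothetical sketch: instead of letting the shell report the generic
// "Command 'desc' was found but is not currently available", the command
// itself checks for a connected dataset and throws a clear hint.
public class DescCommandSketch {

    // Set by a (hypothetical) "connect" command; null until then.
    static String connectedDatasetPath = null;

    static String desc() {
        if (connectedDatasetPath == null) {
            // Clear, actionable message instead of the generic shell error.
            throw new NullPointerException(
                    "There is no hudi dataset. Please use connect command to set dataset first");
        }
        return "Describing dataset at " + connectedDatasetPath;
    }

    public static void main(String[] args) {
        try {
            desc();
        } catch (NullPointerException e) {
            // Mirrors the proposed CLI output format:
            // Command failed java.lang.NullPointerException: There is no hudi dataset. ...
            System.out.println("Command failed " + e.getClass().getName() + ": " + e.getMessage());
        }
        connectedDatasetPath = "/tmp/hudi_trips";
        System.out.println(desc());
    }
}
```

   The design point is simply that a thrown exception surfaces *why* the command cannot run, whereas an availability indicator only tells the user that it cannot.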
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.

