[GitHub] [incubator-hudi] codecov-io commented on issue #1485: [HUDI-756] Organize Cleaning Action execution into a single package in hudi-client

2020-04-03 Thread GitBox
codecov-io commented on issue #1485: [HUDI-756] Organize Cleaning Action 
execution into a single package in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1485#issuecomment-608981000
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1485?src=pr&el=h1) 
Report
   > Merging 
[#1485](https://codecov.io/gh/apache/incubator-hudi/pull/1485?src=pr&el=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/575d87cf7d6f0f743cb7cec6520d80e6fcc3e20a&el=desc)
 will **increase** coverage by `0.05%`.
   > The diff coverage is `87.61%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1485/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1485?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1485      +/-   ##
   ============================================
   + Coverage     71.48%   71.54%   +0.05%     
     Complexity      261      261              
   ============================================
     Files           334      336       +2     
     Lines         15731    15744      +13     
     Branches       1613     1610       -3     
   ============================================
   + Hits          11246    11264      +18     
   + Misses         3762     3759       -3     
   + Partials        723      721       -2     
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1485?src=pr&el=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[...a/org/apache/hudi/client/AbstractHoodieClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllQ2xpZW50LmphdmE=)
 | `78.37% <ø> (+2.06%)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/common/util/CleanerUtils.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3V0aWwvQ2xlYW5lclV0aWxzLmphdmE=)
 | `95.65% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...java/org/apache/hudi/client/HoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVdyaXRlQ2xpZW50LmphdmE=)
 | `69.23% <64.00%> (-0.55%)` | `0.00 <0.00> (ø)` | |
   | 
[...he/hudi/table/action/clean/PartitionCleanStat.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NsZWFuL1BhcnRpdGlvbkNsZWFuU3RhdC5qYXZh)
 | `86.36% <86.36%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...e/hudi/table/action/clean/CleanActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2NsZWFuL0NsZWFuQWN0aW9uRXhlY3V0b3IuamF2YQ==)
 | `89.31% <89.31%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[.../apache/hudi/client/AbstractHoodieWriteClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0Fic3RyYWN0SG9vZGllV3JpdGVDbGllbnQuamF2YQ==)
 | `74.15% <100.00%> (-0.29%)` | `0.00 <0.00> (ø)` | |
   | 
[.../java/org/apache/hudi/client/HoodieReadClient.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpZW50L0hvb2RpZVJlYWRDbGllbnQuamF2YQ==)
 | `100.00% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/table/HoodieCommitArchiveLog.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29tbWl0QXJjaGl2ZUxvZy5qYXZh)
 | `75.00% <100.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/table/HoodieCopyOnWriteTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllQ29weU9uV3JpdGVUYWJsZS5qYXZh)
 | `89.28% <100.00%> (-0.66%)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/table/HoodieMergeOnReadTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllTWVyZ2VPblJlYWRUYWJsZS5qYXZh)
 | `83.12% <100.00%> (-2.50%)` | `0.00 <0.00> (ø)` | |
   | ... and [12 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1485/diff?src=pr&el=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1485?src=pr&el=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1485?src=pr&el=footer).

[jira] [Updated] (HUDI-761) Organize Rollback/Savepoint/Restore action implementation under a single package

2020-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-761:

Status: Open  (was: New)

> Organize Rollback/Savepoint/Restore action implementation under a single 
> package
> 
>
> Key: HUDI-761
> URL: https://issues.apache.org/jira/browse/HUDI-761
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-761) Organize Rollback/Savepoint/Restore action implementation under a single package

2020-04-03 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-761:
---

 Summary: Organize Rollback/Savepoint/Restore action implementation 
under a single package
 Key: HUDI-761
 URL: https://issues.apache.org/jira/browse/HUDI-761
 Project: Apache Hudi (incubating)
  Issue Type: Sub-task
  Components: Code Cleanup, Writer Core
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 0.6.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-760) Remove Rolling Stat management from Hudi Writer

2020-04-03 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-760:
---

 Summary: Remove Rolling Stat management from Hudi Writer
 Key: HUDI-760
 URL: https://issues.apache.org/jira/browse/HUDI-760
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.6.0


The current implementation of rolling stats is not scalable. Since Consolidated 
Metadata will eventually be implemented, we can have one design that manages 
file-level stats too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #237

2020-04-03 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.35 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
or

[jira] [Updated] (HUDI-759) Integrate checkpoint provider

2020-04-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-759:

Status: In Progress  (was: Open)

> Integrate checkpoint provider
> -
>
> Key: HUDI-759
> URL: https://issues.apache.org/jira/browse/HUDI-759
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-759) Integrate checkpoint provider

2020-04-03 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-759:
---

 Summary: Integrate checkpoint provider
 Key: HUDI-759
 URL: https://issues.apache.org/jira/browse/HUDI-759
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li
 Fix For: 0.6.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-759) Integrate checkpoint provider

2020-04-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-759:

Status: Open  (was: New)

> Integrate checkpoint provider
> -
>
> Key: HUDI-759
> URL: https://issues.apache.org/jira/browse/HUDI-759
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-644) checkpoint generator tool for delta streamer

2020-04-03 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-644.
-
Resolution: Fixed

> checkpoint generator tool for delta streamer
> 
>
> Key: HUDI-644
> URL: https://issues.apache.org/jira/browse/HUDI-644
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This ticket is to resolve the following problem:
> The user has finished the initial load and written to a Hudi table.
> The user would like to migrate to Delta Streamer.
> The user needs a tool to provide the checkpoint for the Delta Streamer in the 
> first run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1485: [HUDI-756] Organize Cleaning Action execution into a single package in hudi-client

2020-04-03 Thread GitBox
vinothchandar commented on issue #1485: [HUDI-756] Organize Cleaning Action 
execution into a single package in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1485#issuecomment-608962470
 
 
   cc @yanghua @hddong: if we do more of these, then the multi-engine work will 
be easier, in the sense that the HoodieWriteClient and Table classes have 
straightforward code and you don't have to care as much about all the action 
execution code.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
malanb5 commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to 
namenode
URL: https://github.com/apache/incubator-hudi/issues/1483#issuecomment-608962328
 
 
   Yep that did it!  Thanks for the help @lamber-ken and @bhasudha 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 closed issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
malanb5 closed issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode
URL: https://github.com/apache/incubator-hudi/issues/1483
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-756) Organize Cleaning Action execution into a single package in hudi-client

2020-04-03 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-756:

Labels: pull-request-available  (was: )

> Organize Cleaning Action execution into a single package in hudi-client
> ---
>
> Key: HUDI-756
> URL: https://issues.apache.org/jira/browse/HUDI-756
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar opened a new pull request #1485: [HUDI-756] Organize Cleaning Action execution into a single package in hudi-client

2020-04-03 Thread GitBox
vinothchandar opened a new pull request #1485: [HUDI-756] Organize Cleaning 
Action execution into a single package in hudi-client
URL: https://github.com/apache/incubator-hudi/pull/1485
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Beginning work to clean up code spread across hudi-client.
   
   ## Brief change log
   
   
- Introduced a thin abstraction, ActionExecutor, that all actions will 
implement (see the sketch after this list)
- Pulled cleaning code from table, writeclient into a single package
- CleanHelper is now CleanPlanner, HoodieCleanClient is no longer around
- Minor refactor of HoodieTable factory method
- HoodieTable.create() methods with and without metaclient passed in
- HoodieTable constructor now does not do a redundant instantiation
- Fixed existing unit tests to work at the HoodieWriteClient level
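   
   For readers skimming the change log, here is a minimal sketch of what a thin ActionExecutor abstraction could look like. Only the name ActionExecutor comes from this PR; the package, generic parameter, and constructor below are illustrative assumptions, not the actual signatures in the diff.
   
   ```java
   // Illustrative sketch only; not the exact code in this PR.
   package org.apache.hudi.table.action;
   
   import org.apache.hudi.config.HoodieWriteConfig;
   import org.apache.hudi.table.HoodieTable;
   
   public abstract class ActionExecutor<R> {
   
     protected final HoodieWriteConfig config;
     protected final HoodieTable table;
     protected final String instantTime;
   
     public ActionExecutor(HoodieWriteConfig config, HoodieTable table, String instantTime) {
       this.config = config;
       this.table = table;
       this.instantTime = instantTime;
     }
   
     // Each action (clean, rollback, savepoint, ...) supplies its own execute().
     public abstract R execute();
   }
   ```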
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-756) Organize Cleaning Action execution into a single package in hudi-client

2020-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-756:

Summary: Organize Cleaning Action execution into a single package in 
hudi-client  (was: Organize Action execution logic into a nicer class hierarchy 
in hudi-client)

> Organize Cleaning Action execution into a single package in hudi-client
> ---
>
> Key: HUDI-756
> URL: https://issues.apache.org/jira/browse/HUDI-756
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-677) Abstract/Refactor all transaction management logic into a set of classes

2020-04-03 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17075015#comment-17075015
 ] 

Vinoth Chandar commented on HUDI-677:
-

After spending some time, I have added a more accurate description.. 

For the sake of multi engine support, I think the real issue would be to first 
clean up the calling code, i.e. the various action executions that interact with 
the timeline.. I created HUDI-756 to do this for cleaning.. Will follow up with 
other actions as well..

We can park this one until those are done.. 

> Abstract/Refactor all transaction management logic into a set of classes 
> -
>
> Key: HUDI-677
> URL: https://issues.apache.org/jira/browse/HUDI-677
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Hudi's timeline management code sits in the HoodieActiveTimeline and 
> HoodieDefaultTimeline classes, taking each action through four stages: 
> REQUESTED, INFLIGHT, COMPLETED, INVALID.
> For the sake of better readability and maintenance, we should look into 
> reimplementing these as a state machine. 
> Note that this is better done after organizing the action execution classes 
> (as in HUDI-756) in hudi-client
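
To make the state-machine idea above concrete, here is a small hypothetical sketch. The four states are quoted from the issue; the transition rules are illustrative assumptions, not Hudi's actual timeline semantics.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the state machine suggested in HUDI-677; the
// transitions below are illustrative, not Hudi's actual timeline rules.
enum InstantState {
  REQUESTED, INFLIGHT, COMPLETED, INVALID;

  private static final Map<InstantState, Set<InstantState>> LEGAL =
      new EnumMap<>(InstantState.class);

  static {
    LEGAL.put(REQUESTED, EnumSet.of(INFLIGHT, INVALID));
    LEGAL.put(INFLIGHT, EnumSet.of(COMPLETED, INVALID));
    LEGAL.put(COMPLETED, EnumSet.noneOf(InstantState.class));
    LEGAL.put(INVALID, EnumSet.noneOf(InstantState.class));
  }

  // Centralizes transition checks instead of scattering them across timeline classes.
  public InstantState transitionTo(InstantState next) {
    if (!LEGAL.get(this).contains(next)) {
      throw new IllegalStateException(this + " -> " + next + " is not a legal transition");
    }
    return next;
  }
}
```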



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
lamber-ken commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect 
to namenode
URL: https://github.com/apache/incubator-hudi/issues/1483#issuecomment-608961344
 
 
   If so, it may be caused by a port conflict. Can you stop the local Hadoop and 
try again if possible? :)
   
   
![image](https://user-images.githubusercontent.com/20113411/78417175-2f42c900-7662-11ea-95bf-0bb51a40face.png)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-677) Abstract/Refactor all transaction management logic into a set of classes

2020-04-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-677:

Description: 
Hudi's timeline management code sits in the HoodieActiveTimeline and 
HoodieDefaultTimeline classes, taking each action through four stages: 
REQUESTED, INFLIGHT, COMPLETED, INVALID.

For the sake of better readability and maintenance, we should look into 
reimplementing these as a state machine. 

Note that this is better done after organizing the action execution classes (as 
in HUDI-756) in hudi-client

  was:
Over time, a lot of the core transaction management code has been split across 
various files in hudi-client.. We want to clean this up and present a nice 
interface.. 

Some notes and thoughts and suggestions..  

 


> Abstract/Refactor all transaction management logic into a set of classes 
> -
>
> Key: HUDI-677
> URL: https://issues.apache.org/jira/browse/HUDI-677
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.0
>
>
> Hudi's timeline management code sits in the HoodieActiveTimeline and 
> HoodieDefaultTimeline classes, taking each action through four stages: 
> REQUESTED, INFLIGHT, COMPLETED, INVALID.
> For the sake of better readability and maintenance, we should look into 
> reimplementing these as a state machine. 
> Note that this is better done after organizing the action execution classes 
> (as in HUDI-756) in hudi-client



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] malanb5 commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
malanb5 commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to 
namenode
URL: https://github.com/apache/incubator-hudi/issues/1483#issuecomment-608960731
 
 
   I didn't include it in my original description, but I also have a local 
install of mysql, hive, hadoop, and spark.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
malanb5 commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to 
namenode
URL: https://github.com/apache/incubator-hudi/issues/1483#issuecomment-608960468
 
 
   @bhasudha
   Here's the output to nmap:
   ```
   sudo nmap -sT -sU 127.0.0.1
   Starting Nmap 7.80 ( https://nmap.org ) at 2020-04-03 19:42 PDT
   Nmap scan report for localhost (127.0.0.1)
   Host is up (0.00026s latency).
   Not shown: 1958 closed ports, 30 filtered ports
   PORT STATE SERVICE
   22/tcp   open  ssh
   445/tcp  open  microsoft-ds
   3031/tcp open  eppc
   3283/tcp open  netassistant
   3306/tcp open  mysql
   5900/tcp open  vnc
   8888/tcp open  sun-answerbook
   88/udp   open|filtered kerberos-sec
   137/udp  open|filtered netbios-ns
   138/udp  open|filtered netbios-dgm
   3283/udp open  netassistant
   5353/udp open  zeroconf
   ```
   
   Here's what's in my /etc/hosts 
   ```
   ##
   # Host Database
   #
   # localhost is used to configure the loopback interface
   # when the system is booting.  Do not change this entry.
   ##
   127.0.0.1        localhost
   255.255.255.255  broadcasthost
   ::1 localhost
   
   127.0.0.1 adhoc-1
   127.0.0.1 adhoc-2
   127.0.0.1 namenode
   127.0.0.1 datanode1
   127.0.0.1 hiveserver
   127.0.0.1 hivemetastore
   127.0.0.1 kafkabroker
   127.0.0.1 sparkmaster
   127.0.0.1 zookeeper
   
   
   # Added by Docker Desktop
   # To allow the same kube context to work on the host and the container:
   127.0.0.1 kubernetes.docker.internal
   # End of section
   
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-04-03 Thread GitBox
umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] 
Bootstrap Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459#discussion_r403394637
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
 ##
 @@ -111,6 +112,10 @@ public static String makeDataFileName(String instantTime, 
String writeToken, Str
 return String.format("%s_%s_%s.parquet", fileId, writeToken, instantTime);
   }
 
+  public static String makeBootstrapIndexFileName(String instantTime, String 
fileId, String fileType) {
+return String.format("%s_%s_%s.%s", fileId, "1-0-1", instantTime, 
fileType);
 
 Review comment:
   Just for my understanding, is there a reason behind `1-0-1` here?
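   
   For context, the format string quoted in the diff produces names like the following; a tiny worked example with made-up values:
   
   ```java
   // Worked example of the quoted format string; fileId and instantTime are hypothetical.
   String fileId = "a1b2c3d4";
   String instantTime = "20200403185734";
   String name = String.format("%s_%s_%s.%s", fileId, "1-0-1", instantTime, "hfile");
   // name -> "a1b2c3d4_1-0-1_20200403185734.hfile"
   ```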


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-04-03 Thread GitBox
umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] 
Bootstrap Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459#discussion_r403362949
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -240,4 +244,27 @@ public static String decompress(byte[] bytes) {
   throw new HoodieIOException("IOException while decompressing text", e);
 }
   }
+
+  /**
+   * Generate a reader schema off the provided writeSchema, to just project 
out the provided columns.
+   */
+  public static Schema generateProjectionSchema(Schema originalSchema, 
List<String> fieldNames) {
+    Map<String, Schema.Field> schemaFieldsMap = originalSchema.getFields().stream()
+        .map(r -> Pair.of(r.name().toLowerCase(), 
r)).collect(Collectors.toMap(Pair::getLeft, Pair::getRight));
+    List<Schema.Field> projectedFields = new ArrayList<>();
+    for (String fn : fieldNames) {
+      Schema.Field field = schemaFieldsMap.get(fn.toLowerCase());
 
 Review comment:
   I think we can avoid forming `schemaFieldsMap` here. We can directly get the 
field from the original schema using `originalSchema.getField()`, and I don't 
think there will be any performance impact, because Avro internally maintains a 
similar map to find the field instead of doing an `O(n)` scan.
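   
   A minimal sketch of that suggestion, with one caveat worth flagging: `Schema.getField()` is case-sensitive, whereas the map in the diff lowercases names, so this version assumes callers pass exact field names.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import org.apache.avro.Schema;
   
   // Sketch of the reviewer's suggestion: look fields up directly on the schema.
   // Assumes exact-case field names (getField() is case-sensitive).
   public static Schema generateProjectionSchema(Schema originalSchema, List<String> fieldNames) {
     List<Schema.Field> projectedFields = new ArrayList<>();
     for (String fn : fieldNames) {
       Schema.Field field = originalSchema.getField(fn); // Avro keeps its own name index
       if (field == null) {
         throw new IllegalArgumentException("Field " + fn + " not found in " + originalSchema.getFullName());
       }
       // Field objects cannot be reused across schemas, so copy each one.
       projectedFields.add(new Schema.Field(field.name(), field.schema(), field.doc(), field.defaultVal()));
     }
     Schema projected = Schema.createRecord(originalSchema.getName() + "_projected",
         originalSchema.getDoc(), originalSchema.getNamespace(), originalSchema.isError());
     projected.setFields(projectedFields);
     return projected;
   }
   ```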


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-04-03 Thread GitBox
umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] 
Bootstrap Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459#discussion_r403408572
 
 

 ##
 File path: hudi-common/src/main/avro/HoodieBootstrapIndexInfo.avsc
 ##
 @@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+{
+   "namespace":"org.apache.hudi.avro.model",
+   "type":"record",
+   "name":"BootstrapIndexInfo",
+   "fields":[
+  {
+"name":"version",
 
 Review comment:
   I don't really see `version` being used anywhere in code, but present in all 
the model files. What's the reason behind it ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-04-03 Thread GitBox
umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] 
Bootstrap Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459#discussion_r403391721
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -375,6 +395,7 @@ public static HoodieTableMetaClient 
initTableAndGetMetaClient(Configuration hado
   fs.mkdirs(auxiliaryFolder);
 }
 
+initializeBootstrapDirsIfNotExists(hadoopConf, basePath, fs);
 
 Review comment:
   Seems like this can be optional, depending on whether or not the user 
chooses to bootstrap. It's okay if that work is going to be part of a separate 
PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-04-03 Thread GitBox
umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] 
Bootstrap Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459#discussion_r403400178
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/BootstrapSourceFileMapping.java
 ##
 @@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.model;
+
+import java.io.Serializable;
+import java.util.Objects;
+import org.apache.hudi.avro.model.HoodieFileStatus;
+
+public class BootstrapSourceFileMapping implements Serializable, 
Comparable<BootstrapSourceFileMapping> {
+
+  private final String sourceBasePath;
+  private final String sourcePartitionPath;
+  private final String hudiPartitionPath;
+  private final HoodieFileStatus sourceFileStatus;
+  private final String hudiFileId;
+
+  public BootstrapSourceFileMapping(String sourceBasePath, String 
sourcePartitionPath,
+  String hudiPartitionPath, HoodieFileStatus sourceFileStatus, String 
hudiFileId) {
+this.sourceBasePath = sourceBasePath;
+this.sourcePartitionPath = sourcePartitionPath;
+this.hudiPartitionPath = hudiPartitionPath;
+this.sourceFileStatus = sourceFileStatus;
+this.hudiFileId = hudiFileId;
+  }
+
+  @Override
+  public String toString() {
+return "BootstrapSourceFileMapping{"
++ "sourceBasePath='" + sourceBasePath + '\''
++ ", sourcePartitionPath='" + sourcePartitionPath + '\''
++ ", hudiPartitionPath='" + hudiPartitionPath + '\''
++ ", sourceFileStatus='" + sourceFileStatus + '\''
++ ", hudiFileId='" + hudiFileId + '\''
++ '}';
 
 Review comment:
   [Suggestion] You may want to use: 
http://commons.apache.org/proper/commons-lang/javadocs/api-release/org/apache/commons/lang3/builder/ToStringBuilder.html
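   
   For illustration, applying that suggestion to the `toString()` above might look like this (a sketch, assuming commons-lang3 is on the classpath):
   
   ```java
   import org.apache.commons.lang3.builder.ToStringBuilder;
   import org.apache.commons.lang3.builder.ToStringStyle;
   
   // Sketch of the suggested ToStringBuilder-based toString().
   @Override
   public String toString() {
     return new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
         .append("sourceBasePath", sourceBasePath)
         .append("sourcePartitionPath", sourcePartitionPath)
         .append("hudiPartitionPath", hudiPartitionPath)
         .append("sourceFileStatus", sourceFileStatus)
         .append("hudiFileId", hudiFileId)
         .toString();
   }
   ```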


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-04-03 Thread GitBox
umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] 
Bootstrap Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459#discussion_r403406705
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/BootstrapIndex.java
 ##
 @@ -0,0 +1,494 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.bootstrap;
+
+import java.io.IOException;
+import java.io.Serializable;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hbase.CellUtil;
+import org.apache.hadoop.hbase.HConstants;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.io.hfile.HFileScanner;
+import org.apache.hadoop.hbase.util.Bytes;
+import org.apache.hudi.avro.model.BootstrapIndexInfo;
+import org.apache.hudi.avro.model.BootstrapPartitionMetadata;
+import org.apache.hudi.avro.model.BootstrapSourceFilePartitionInfo;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.BootstrapSourceFileMapping;
+import org.apache.hudi.common.model.HoodieFileGroupId;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+/**
+ * Maintains mapping from hudi file id (which contains skeleton file) to 
external base file.
+ * It maintains 2 physical indices.
+ *  (a) At partition granularity to lookup all indices for each partition.
+ *  (b) At file-group granularity to lookup bootstrap mapping for an 
individual file-group.
+ *
+ * This implementation uses HFile as the physical storage of the index. For the 
initial run, bootstrap
+ * mapping for the entire dataset resides in a single file but care has been 
taken in naming
+ * the index files in the same way as Hudi data files so that we can reuse 
file-system abstraction
+ * on these index files to manage multiple file-groups.
+ */
+
+public class BootstrapIndex implements Serializable, AutoCloseable {
+
+  private static final Logger LOG = LogManager.getLogger(BootstrapIndex.class);
+
+  public static final String BOOTSTRAP_INDEX_FILE_ID = 
"-----0";
+
+  // Used as naming extensions.
+  public static final String BOOTSTRAP_INDEX_FILE_TYPE = "hfile";
+
+  // Additional Metadata written to HFiles.
+  public static final byte[] INDEX_INFO_KEY = Bytes.toBytes("INDEX_INFO");
+
+  private final HoodieTableMetaClient metaClient;
+  // Base Path of external files.
+  private final String sourceBasePath;
+  // Well Known Paths for indices
+  private final String indexByPartitionPath;
+  private final String indexByFileIdPath;
+  // Flag to identify whether the Bootstrap Index is empty or not
+  private final boolean isBootstrapped;
+
+  // Index Readers
+  private transient HFile.Reader indexByPartitionReader;
+  private transient HFile.Reader indexByFileIdReader;
+
+  // Bootstrap Index Info
+  private transient BootstrapIndexInfo bootstrapIndexInfo;
+
+  public BootstrapIndex(HoodieTableMetaClient metaClient) {
+this.metaClient = metaClient;
+
+try {
+  metaClient.initializeBootstrapDirsIfNotExists();
+  Path indexByPartitionPath = getIndexByPartitionPath(metaClient);
+  Path indexByFilePath = getIndexByFileIdPath(metaClient);
+  if (metaClient.getFs().exists(indexByPartitionPath) && 
metaCli

[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-04-03 Thread GitBox
umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] 
Bootstrap Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459#discussion_r403394224
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
 ##
 @@ -207,6 +213,20 @@ public String getMetaAuxiliaryPath() {
 return basePath + File.separator + AUXILIARYFOLDER_NAME;
   }
 
+  /**
+   * @return Bootstrap Index By Partition Folder
+   */
+  public String getBootstrapIndexByPartitionFolderName() {
+return getMetaAuxiliaryPath() + File.separator + 
BOOTSTRAP_INDEX_BY_PARTITION_FOLDER_NAME;
+  }
+
+  /**
+   * @return Bootstrap Index By Hudi File Id Folder
+   */
+  public String getBootstrapIndexByFileIdFolderNameFolderName() {
+return getMetaAuxiliaryPath() + File.separator + 
BOOTSTRAP_INDEX_BY_FILE_ID_FOLDER_NAME;
+  }
 
 Review comment:
   `BOOTSTRAP_INDEX_BY_PARTITION_FOLDER_NAME` and 
`BOOTSTRAP_INDEX_BY_FILE_ID_FOLDER_NAME` both already contain 
`AUXILIARYFOLDER_NAME` in them. Here we are again prefixing the auxiliary path 
while returning, so I think this would prefix the path twice with the aux 
folder. Seems like a bug.
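   
   If that reading is right, a minimal sketch of the implied fix (assuming the two constants are already paths relative to the table base path and embed the aux folder) would drop the extra prefix:
   
   ```java
   // Sketch of the implied fix: prefix with basePath only, since the constant
   // already embeds the auxiliary folder. Assumes the constant is relative to basePath.
   public String getBootstrapIndexByPartitionFolderName() {
     return basePath + File.separator + BOOTSTRAP_INDEX_BY_PARTITION_FOLDER_NAME;
   }
   ```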


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
bhasudha commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to 
namenode
URL: https://github.com/apache/incubator-hudi/issues/1483#issuecomment-608957566
 
 
   Hi @malanb5, I am not able to reproduce this. I am using:
   
   MacOS: 10.15.3
   Docker: version 19.03.8
   
   Have you already confirmed that there are no other ssh tunnels already 
running, and that the following settings are added to /etc/hosts?
   ```
   127.0.0.1 adhoc-1
   127.0.0.1 adhoc-2
   127.0.0.1 namenode
   127.0.0.1 datanode1
   127.0.0.1 hiveserver
   127.0.0.1 hivemetastore
   127.0.0.1 kafkabroker
   127.0.0.1 sparkmaster
   127.0.0.1 zookeeper
   ```
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar merged pull request #1453: HUDI-644 kafka connect checkpoint provider

2020-04-03 Thread GitBox
vinothchandar merged pull request #1453: HUDI-644 kafka connect checkpoint 
provider
URL: https://github.com/apache/incubator-hudi/pull/1453
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: HUDI-644 kafka connect checkpoint provider (#1453)

2020-04-03 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 575d87c  HUDI-644 kafka connect checkpoint provider (#1453)
575d87c is described below

commit 575d87cf7d6f0f743cb7cec6520d80e6fcc3e20a
Author: YanJia-Gary-Li 
AuthorDate: Fri Apr 3 18:57:34 2020 -0700

HUDI-644 kafka connect checkpoint provider (#1453)
---
 .../checkpointing/InitialCheckPointProvider.java   |  31 +
 .../checkpointing/KafkaConnectHdfsProvider.java| 152 +
 .../TestKafkaConnectHdfsProvider.java  |  94 +
 3 files changed, 277 insertions(+)

diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/checkpointing/InitialCheckPointProvider.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/checkpointing/InitialCheckPointProvider.java
new file mode 100644
index 000..741b05c
--- /dev/null
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/checkpointing/InitialCheckPointProvider.java
@@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.checkpointing;
+
+import org.apache.hudi.exception.HoodieException;
+
+/**
+ * Provide the initial checkpoint for delta streamer.
+ */
+public interface InitialCheckPointProvider {
+  /**
+   * Get checkpoint string recognizable for delta streamer.
+   */
+  String getCheckpoint() throws HoodieException;
+}
diff --git 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/checkpointing/KafkaConnectHdfsProvider.java
 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/checkpointing/KafkaConnectHdfsProvider.java
new file mode 100644
index 000..f464f68
--- /dev/null
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/checkpointing/KafkaConnectHdfsProvider.java
@@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.checkpointing;
+
+import org.apache.hudi.exception.HoodieException;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.fs.PathFilter;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+/**
+ * Generate checkpoint from Kafka-Connect-HDFS managed data set.
+ * Documentation: 
https://docs.confluent.io/current/connect/kafka-connect-hdfs/index.html
+ */
+public class KafkaConnectHdfsProvider implements InitialCheckPointProvider {
+  private final Path path;
+  private final FileSystem fs;
+
+  private static final String FILENAME_SEPARATOR = "[\\+\\.]";
+
+  public KafkaConnectHdfsProvider(final Path basePath, final FileSystem 
fileSystem) {
+this.path = basePath;
+this.fs = fileSystem;
+  }
+
+  /**
+   * PathFilter for Kafka-Connect-HDFS.
+   * Directory format: /partition1=xxx/partition2=xxx
+   * File format: topic+partition+lowerOffset+upperOffset.file
+   */
+  public static class KafkaConnectPathFilter implements PathFilter {
+private static final Pattern DIRECTORY_PATTERN = Pattern.compile(".*=.*");
+private static final Pattern PATTERN =
+Pattern.compile("[a-zA-Z0-9\\._\\-]+\\+\\d+\\+\\d+\\+\\d+(.\\w+)?");
+
+@Override
+public boolean accept(

[GitHub] [incubator-hudi] garyli1019 commented on issue #1453: HUDI-644 kafka connect checkpoint provider

2020-04-03 Thread GitBox
garyli1019 commented on issue #1453: HUDI-644 kafka connect checkpoint provider
URL: https://github.com/apache/incubator-hudi/pull/1453#issuecomment-608930288
 
 
   @vinothchandar Thanks for reviewing. Yes, I will follow up with a separate PR


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1453: HUDI-644 kafka connect checkpoint provider

2020-04-03 Thread GitBox
vinothchandar commented on issue #1453: HUDI-644 kafka connect checkpoint 
provider
URL: https://github.com/apache/incubator-hudi/pull/1453#issuecomment-608925701
 
 
   @garyli1019 missed it.. apologies 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1484: [HUDI-316] : Hbase qps repartition writestatus

2020-04-03 Thread GitBox
vinothchandar commented on a change in pull request #1484: [HUDI-316] : Hbase 
qps repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#discussion_r403398611
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/RateLimiter.java
 ##
 @@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.util;
+
+import java.util.concurrent.TimeUnit;
+import javax.annotation.Nullable;
+import javax.annotation.concurrent.ThreadSafe;
+
+/*
+ * Note: Based on RateLimiter implementation in Google/Guava.
 
 Review comment:
   Could we please avoid using this? Should not be too hard to roll our own.. 
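   
   (For context, rolling our own need not be much code. A minimal sketch of a window-based permits-per-second limiter; the class and method names are illustrative, not the implementation that eventually landed:)
   
   ```java
   import java.util.concurrent.TimeUnit;
   
   public class SimpleRateLimiter {
     private final int maxPermitsPerSecond;
     private long permitsIssuedThisSecond;
     private long windowStartNanos;
   
     public SimpleRateLimiter(int maxPermitsPerSecond) {
       this.maxPermitsPerSecond = maxPermitsPerSecond;
       this.windowStartNanos = System.nanoTime();
     }
   
     /** Blocks until the requested number of permits can be issued. */
     public synchronized void acquire(int permits) throws InterruptedException {
       while (permits > 0) {
         long now = System.nanoTime();
         if (now - windowStartNanos >= TimeUnit.SECONDS.toNanos(1)) {
           windowStartNanos = now;          // a new one-second window begins
           permitsIssuedThisSecond = 0;
         }
         long available = maxPermitsPerSecond - permitsIssuedThisSecond;
         if (available > 0) {
           long granted = Math.min(available, permits);
           permitsIssuedThisSecond += granted;
           permits -= granted;
         } else {
           // Sleep only for the remainder of the window, which is what makes this
           // cheaper than a fixed Thread.sleep(100) regardless of time already spent.
           long remaining = TimeUnit.SECONDS.toNanos(1) - (now - windowStartNanos);
           TimeUnit.NANOSECONDS.sleep(Math.max(remaining, 0));
         }
       }
     }
   }
   ```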
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #1479: [HUDI-758] Modify Integration test to include incremental queries on MOR tables

2020-04-03 Thread GitBox
vinothchandar commented on issue #1479: [HUDI-758] Modify Integration test to 
include incremental queries on MOR tables
URL: https://github.com/apache/incubator-hudi/pull/1479#issuecomment-608922301
 
 
   @bhasudha could you please take a look? this is relevant to what you are fixing 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-03 Thread GitBox
vinothchandar commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403397923
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -186,6 +168,96 @@ public void testSimpleInsertAndUpdate() throws Exception {
 }
   }
 
+  // test incremental read does not go past compaction instant for RO views
+  // For RT views, incremental read can go past compaction
+  @Test
+  public void testIncrementalReadsWithCompaction() throws Exception {
 
 Review comment:
   Let's make it clear that this is an incremental read on RO.. rename to 
`testIncrementalROReadsWithCompaction()`? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-03 Thread GitBox
vinothchandar commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403398166
 
 

 ##
 File path: 
hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
 ##
 @@ -1311,4 +1383,111 @@ private void assertNoWriteErrors(List<WriteStatus> statuses) {
       assertFalse("Errors found in write of " + status.getFileId(), status.hasErrors());
     }
   }
+  
+  private FileStatus[] insertAndGetFilePaths(List<HoodieRecord> records, HoodieWriteClient client,
 
 Review comment:
   I am surprised by the amount of test helper code you needed to write yourself.. Should we be using code from some existing helpers, or should these helpers move somewhere else so they are beneficial overall? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction

2020-04-03 Thread GitBox
vinothchandar commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403397635
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##
 @@ -118,6 +119,34 @@
 return returns.toArray(new FileStatus[returns.size()]);
   }
 
+  /**
+   * Filter any specific instants that we do not want to process.
+   * example timeline:
+   *
+   * t0 -> create bucket1.parquet
+   * t1 -> create and append updates bucket1.log
+   * t2 -> request compaction
+   * t3 -> create bucket2.parquet
+   *
+   * if compaction at t2 takes a long time, incremental readers on RO tables can move to t3 and would skip updates in t1
+   *
+   * To work around this problem, we want to stop returning data belonging to commits > t2.
+   * After compaction is complete, incremental readers would see updates in t2, t3, and so on.
+   */
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) {
+    Option<HoodieInstant> pendingCompactionInstant = timeline.filterPendingCompactionTimeline().firstInstant();
+    if (pendingCompactionInstant.isPresent()) {
 
 Review comment:
   This seems like the crux of the change, and most of the other code is improving tests etc. If so, this seems like a reasonable interim solution to me... Although we should really encourage users to do incremental pull out of the RTInputFormat ... 
   
   The core problem of "data loss" brought up in this issue feels like a mismatched expectation, really :) 
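   
   For readers following the quoted snippet (which the mail archiver truncates), a hedged sketch of how the filtering could complete. The use of findInstantsBefore is an assumption about the rest of the method, not a quote from the PR:
   
   ```java
   protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) {
     Option<HoodieInstant> pendingCompactionInstant = timeline.filterPendingCompactionTimeline().firstInstant();
     if (pendingCompactionInstant.isPresent()) {
       // Hide everything at or after the earliest pending compaction (t2 in the javadoc's
       // example), so RO incremental readers cannot skip past the updates logged at t1.
       return (HoodieDefaultTimeline) timeline.findInstantsBefore(pendingCompactionInstant.get().getTimestamp());
     }
     return timeline;
   }
   ```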


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (HUDI-49) Handle non-backwards compatible or smaller schema appended to the log file during updates #337

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-49?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-49:
--

Assignee: Prashant Wason

> Handle non-backwards compatible or smaller schema appended to the log file 
> during updates #337
> --
>
> Key: HUDI-49
> URL: https://issues.apache.org/jira/browse/HUDI-49
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Major
>
> https://github.com/uber/hudi/issues/337



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-49) Handle non-backwards compatible or smaller schema appended to the log file during updates #337

2020-04-03 Thread Prashant Wason (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-49?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074977#comment-17074977
 ] 

Prashant Wason commented on HUDI-49:


Being fixed by HUDI-741

> Handle non-backwards compatible or smaller schema appended to the log file 
> during updates #337
> --
>
> Key: HUDI-49
> URL: https://issues.apache.org/jira/browse/HUDI-49
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Storage Management, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Major
>
> https://github.com/uber/hudi/issues/337



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-757) Add a command to hudi-cli to export commit metadata

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-757:
---

Assignee: Prashant Wason

> Add a command to hudi-cli to export commit metadata
> ---
>
> Key: HUDI-757
> URL: https://issues.apache.org/jira/browse/HUDI-757
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 4h
>  Time Spent: 10m
>  Remaining Estimate: 3h 50m
>
> HUDI stores commit-related information in files within the .hoodie directory. 
> Each commit / deltacommit / rollback / etc. creates one or more files. To 
> prevent a large number of files, older files are consolidated together and 
> moved into a commit archive which has multiple such files written together 
> using the format of HUDI Log files.
> During debugging of issues or for development of new features, it may be 
> required to refer to the metadata of older commits / cleanups / rollbacks. 
> There is no simple way to get these from a production setup, especially from 
> the archive files.
> This enhancement provides a hudi-cli command which allows exporting metadata 
> from HUDI commit archives.
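
A hypothetical session with the resulting command (the option names match the ExportCommand source quoted later in this digest; the prompt and export folder are illustrative):

```
hudi->export instants --localFolder /tmp/hudi-instants --actions clean,commit,deltacommit --limit 10
```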



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-563) Improve unit test coverage for org.apache.hudi.common.table.log.block.HoodieAvroDataBlock

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason closed HUDI-563.
---
Resolution: Abandoned

> Improve unit test coverage for 
> org.apache.hudi.common.table.log.block.HoodieAvroDataBlock
> -
>
> Key: HUDI-563
> URL: https://issues.apache.org/jira/browse/HUDI-563
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-565) Improve unit test coverage for org.apache.hudi.WriteStatus

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason closed HUDI-565.
---
Resolution: Abandoned

> Improve unit test coverage for org.apache.hudi.WriteStatus
> --
>
> Key: HUDI-565
> URL: https://issues.apache.org/jira/browse/HUDI-565
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-92) Include custom names for spark HUDI spark DAG stages for easier understanding

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-92?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-92:
--

Assignee: Prashant Wason

> Include custom names for spark HUDI spark DAG stages for easier understanding
> -
>
> Key: HUDI-92
> URL: https://issues.apache.org/jira/browse/HUDI-92
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: newbie, Usability
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-566) Improve unit test coverage for org.apache.hudi.common.table.HoodieTimeline

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason resolved HUDI-566.
-
Resolution: Fixed

> Improve unit test coverage for org.apache.hudi.common.table.HoodieTimeline
> --
>
> Key: HUDI-566
> URL: https://issues.apache.org/jira/browse/HUDI-566
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-565) Improve unit test coverage for org.apache.hudi.WriteStatus

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-565:
---

Assignee: Prashant Wason

> Improve unit test coverage for org.apache.hudi.WriteStatus
> --
>
> Key: HUDI-565
> URL: https://issues.apache.org/jira/browse/HUDI-565
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-563) Improve unit test coverage for org.apache.hudi.common.table.log.block.HoodieAvroDataBlock

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-563:
---

Assignee: Prashant Wason

> Improve unit test coverage for 
> org.apache.hudi.common.table.log.block.HoodieAvroDataBlock
> -
>
> Key: HUDI-563
> URL: https://issues.apache.org/jira/browse/HUDI-563
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-564) Improve unit test coverage for org.apache.hudi.common.table.log.HoodieLogFormatVersion

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-564:
---

Assignee: Prashant Wason

> Improve unit test coverage for 
> org.apache.hudi.common.table.log.HoodieLogFormatVersion
> --
>
> Key: HUDI-564
> URL: https://issues.apache.org/jira/browse/HUDI-564
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.5.2
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-567) Improve unit test coverage for org.apache.hudi.common.table.HoodieActiveTimeline

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-567:
---

Assignee: Prashant Wason

> Improve unit test coverage for 
> org.apache.hudi.common.table.HoodieActiveTimeline
> 
>
> Key: HUDI-567
> URL: https://issues.apache.org/jira/browse/HUDI-567
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-566) Improve unit test coverage for org.apache.hudi.common.table.HoodieTimeline

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-566:
---

Assignee: Prashant Wason

> Improve unit test coverage for org.apache.hudi.common.table.HoodieTimeline
> --
>
> Key: HUDI-566
> URL: https://issues.apache.org/jira/browse/HUDI-566
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-671) Improve unit test coverage for org.apache.hudi.index.hbase.HbaseIndex

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-671:
---

Assignee: Prashant Wason

> Improve unit test coverage for org.apache.hudi.index.hbase.HbaseIndex
> -
>
> Key: HUDI-671
> URL: https://issues.apache.org/jira/browse/HUDI-671
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 4h
>  Time Spent: 20m
>  Remaining Estimate: 3h 40m
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-670) Improve unit test coverage for org.apache.hudi.common.util.collection.DiskBasedMap

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-670:
---

Assignee: Prashant Wason

> Improve unit test coverage for 
> org.apache.hudi.common.util.collection.DiskBasedMap
> --
>
> Key: HUDI-670
> URL: https://issues.apache.org/jira/browse/HUDI-670
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>   Original Estimate: 2h
>  Time Spent: 20m
>  Remaining Estimate: 1h 40m
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-668) Improve unit test coverage org.apache.hudi.metrics.Metrics

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-668:
---

Assignee: Prashant Wason

> Improve unit test coverage org.apache.hudi.metrics.Metrics
> --
>
> Key: HUDI-668
> URL: https://issues.apache.org/jira/browse/HUDI-668
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>   Original Estimate: 4h
>  Time Spent: 20m
>  Remaining Estimate: 3h 40m
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-748) Add exclusions for code coverage reports

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-748:

Status: Open  (was: New)

> Add exclusions for code coverage reports
> 
>
> Key: HUDI-748
> URL: https://issues.apache.org/jira/browse/HUDI-748
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Time Spent: 20m
>  Remaining Estimate: 1h 40m
>
> The HUDI project uses codecov for code coverage reports generated from the unit 
> test runs. Currently, 67% of the code is covered by unit tests.
> There are sections of code which are not good candidates for unit tests, 
> either because they are migration classes, tools, one-off usage code, simple 
> POJO classes, etc. Such code can be excluded from code coverage reports so 
> that the reported coverage number more accurately reflects the real 
> coverage. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-748) Add exclusions for code coverage reports

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-748:
---

Assignee: Prashant Wason

> Add exclusions for code coverage reports
> 
>
> Key: HUDI-748
> URL: https://issues.apache.org/jira/browse/HUDI-748
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Time Spent: 20m
>  Remaining Estimate: 1h 40m
>
> The HUDI project uses codecov for code coverage reports generated from the unit 
> test runs. Currently, 67% of the code is covered by unit tests.
> There are sections of code which are not good candidates for unit tests, 
> either because they are migration classes, tools, one-off usage code, simple 
> POJO classes, etc. Such code can be excluded from code coverage reports so 
> that the reported coverage number more accurately reflects the real 
> coverage. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-748) Add exclusions for code coverage reports

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-748:

Status: In Progress  (was: Open)

> Add exclusions for code coverage reports
> 
>
> Key: HUDI-748
> URL: https://issues.apache.org/jira/browse/HUDI-748
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Time Spent: 20m
>  Remaining Estimate: 1h 40m
>
> The HUDI project uses codecov for code coverage reports generated from the unit 
> test runs. Currently, 67% of the code is covered by unit tests.
> There are sections of code which are not good candidates for unit tests, 
> either because they are migration classes, tools, one-off usage code, simple 
> POJO classes, etc. Such code can be excluded from code coverage reports so 
> that the reported coverage number more accurately reflects the real 
> coverage. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-748) Add exclusions for code coverage reports

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason resolved HUDI-748.
-
Resolution: Fixed

> Add exclusions for code coverage reports
> 
>
> Key: HUDI-748
> URL: https://issues.apache.org/jira/browse/HUDI-748
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Time Spent: 20m
>  Remaining Estimate: 1h 40m
>
> The HUDI project uses codecov for code coverage reports generated from the unit 
> test runs. Currently, 67% of the code is covered by unit tests.
> There are sections of code which are not good candidates for unit tests, 
> either because they are migration classes, tools, one-off usage code, simple 
> POJO classes, etc. Such code can be excluded from code coverage reports so 
> that the reported coverage number more accurately reflects the real 
> coverage. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-717) Fix HudiHiveClient for Hive 2.x

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason resolved HUDI-717.
-
Resolution: Fixed

> Fix HudiHiveClient for Hive 2.x
> ---
>
> Key: HUDI-717
> URL: https://issues.apache.org/jira/browse/HUDI-717
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 4h
>  Time Spent: 20m
>  Remaining Estimate: 3h 40m
>
> When using the HiveDriver mode in HudiHiveClient, Hive 2.x DDL operations 
> like ALTER may fail. This is because Hive 2.x doesn't like `db`.`table_name` 
> for operations.
> There are two ways to fix this:
> 1. Precede all DDL statements by "USE <database>;"
> 2. Set the name of the database in the SessionState created for the Driver.
>  
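
The second option is what the fix merged later in this digest does; a minimal sketch of that approach (SessionState and syncConfig.databaseName are taken from the HoodieHiveClient diff below; the surrounding code is illustrative):

```java
// Scope the Hive session to the sync database so DDL such as ALTER TABLE
// no longer needs the `db`.`table_name` form that Hive 2.x rejects.
SessionState ss = SessionState.start(configuration);
ss.setCurrentDatabase(syncConfig.databaseName);
org.apache.hadoop.hive.ql.Driver hiveDriver = new org.apache.hadoop.hive.ql.Driver(configuration);
```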



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-717) Fix HudiHiveClient for Hive 2.x

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason reassigned HUDI-717:
---

Assignee: Prashant Wason

> Fix HudiHiveClient for Hive 2.x
> ---
>
> Key: HUDI-717
> URL: https://issues.apache.org/jira/browse/HUDI-717
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Assignee: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 4h
>  Time Spent: 20m
>  Remaining Estimate: 3h 40m
>
> When using the HiveDriver mode in HudiHiveClient, Hive 2.x DDL operations 
> like ALTER may fail. This is because Hive 2.x doesn't like `db`.`table_name` 
> for operations.
> There are two ways to fix this:
> 1. Precede all DDL statements by "USE <database>;"
> 2. Set the name of the database in the SessionState created for the Driver.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-717) Fix HudiHiveClient for Hive 2.x

2020-04-03 Thread Prashant Wason (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Prashant Wason updated HUDI-717:

Status: Open  (was: New)

> Fix HudiHiveClient for Hive 2.x
> ---
>
> Key: HUDI-717
> URL: https://issues.apache.org/jira/browse/HUDI-717
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Prashant Wason
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 4h
>  Time Spent: 20m
>  Remaining Estimate: 3h 40m
>
> When using the HiveDriver mode in HudiHiveClient, Hive 2.x DDL operations 
> like ALTER may fail. This is because Hive 2.x doesn't like `db`.`table_name` 
> for operations.
> There are two ways to fix this:
> 1. Precede all DDL statements by "USE <database>;"
> 2. Set the name of the database in the SessionState created for the Driver.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated: [HUDI-748] Adding .codecov.yml to set exclusions for code coverage reports. (#1468)

2020-04-03 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new deb95ad  [HUDI-748] Adding .codecov.yml to set exclusions for code 
coverage reports. (#1468)
deb95ad is described below

commit deb95ad9962bd40dc3edcc66281e8d333f32bab3
Author: Prashant Wason 
AuthorDate: Fri Apr 3 16:25:01 2020 -0700

[HUDI-748] Adding .codecov.yml to set exclusions for code coverage reports. 
(#1468)
---
 .codecov.yml | 44 
 1 file changed, 44 insertions(+)

diff --git a/.codecov.yml b/.codecov.yml
new file mode 100644
index 000..ddd5c70
--- /dev/null
+++ b/.codecov.yml
@@ -0,0 +1,44 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# For more configuration details:
+# https://docs.codecov.io/docs/codecov-yaml
+
+# Check if this file is valid by running in bash:
+# curl -X POST --data-binary @.codecov.yml https://codecov.io/validate
+
+# Ignoring Paths
+# --
+# which folders/files to ignore
+ignore:
+  - "hudi-common/src/main/java/org/apache/hudi/avro/model/*"
+  - "hudi-common/src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java"
+  - "hudi-common/src/main/java/org/apache/hudi/common/HoodieJsonPayload"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactionAdminTool.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotCopier.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieWithTimelineServer.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/UpgradePayloadFromUberToApache.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/perf/TimelineServerPerf.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HDFSParquetImporter.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/HiveIncrementalPuller.java"
+  - "hudi-utilities/src/main/java/org/apache/hudi/utilities/adhoc/UpgradePayloadFromUberToApache.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/JmxMetricsReporter.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/JmxReporterServer.java"
+  - "hudi-client/src/main/java/org/apache/hudi/metrics/MetricsGraphiteReporter.java"
+  - "hudi-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/HoodieInputFormat.java"
+  - "hudi-hadoop-mr/src/main/java/com/uber/hoodie/hadoop/realtime/HoodieRealtimeInputFormat.java"
+



[GitHub] [incubator-hudi] bvaradar merged pull request #1468: [HUDI-748] Adding .codecov.yml to set exclusions for code coverage reports.

2020-04-03 Thread GitBox
bvaradar merged pull request #1468: [HUDI-748] Adding .codecov.yml to set 
exclusions for code coverage reports.
URL: https://github.com/apache/incubator-hudi/pull/1468
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar merged pull request #1416: [HUDI-717] Fixed usage of HiveDriver for DDL statements for Hive 2.x

2020-04-03 Thread GitBox
bvaradar merged pull request #1416: [HUDI-717] Fixed usage of HiveDriver for 
DDL statements for Hive 2.x
URL: https://github.com/apache/incubator-hudi/pull/1416
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [HUDI-717] Fixed usage of HiveDriver for DDL statements. (#1416)

2020-04-03 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 6808559  [HUDI-717] Fixed usage of HiveDriver for DDL statements. 
(#1416)
6808559 is described below

commit 6808559b018366b4bc6d47b40dbbe362f48f65d7
Author: Prashant Wason 
AuthorDate: Fri Apr 3 16:23:05 2020 -0700

[HUDI-717] Fixed usage of HiveDriver for DDL statements. (#1416)

When using HiveDriver mode in HudiHiveClient, Hive 2.x DDL operations like 
ALTER PARTITION may fail. This is because Hive 2.x doesn't like 
`db`.`table_name` for operations. In this fix, we set the name of the database 
in the SessionState created for the Driver.
---
 .../org/apache/hudi/hive/HoodieHiveClient.java |  4 +-
 .../org/apache/hudi/hive/TestHiveSyncTool.java | 91 +-
 .../test/java/org/apache/hudi/hive/TestUtil.java   | 16 ++--
 3 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java b/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
index 1bfbe20..55a4968 100644
--- a/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
+++ b/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HoodieHiveClient.java
@@ -198,7 +198,8 @@ public class HoodieHiveClient {
 for (String partition : partitions) {
   String partitionClause = getPartitionClause(partition);
   Path partitionPath = FSUtils.getPartitionPath(syncConfig.basePath, partition);
-  String fullPartitionPath = partitionPath.toUri().getScheme().equals(StorageSchemes.HDFS.getScheme())
+  String partitionScheme = partitionPath.toUri().getScheme();
+  String fullPartitionPath = StorageSchemes.HDFS.getScheme().equals(partitionScheme)
   ? FSUtils.getDFSFullPartitionPath(fs, partitionPath) : partitionPath.toString();
   String changePartition =
   alterTable + " PARTITION (" + partitionClause + ") SET LOCATION '" + fullPartitionPath + "'";
 @@ -505,6 +506,7 @@ public class HoodieHiveClient {
 try {
   final long startTime = System.currentTimeMillis();
   ss = SessionState.start(configuration);
+  ss.setCurrentDatabase(syncConfig.databaseName);
   hiveDriver = new org.apache.hadoop.hive.ql.Driver(configuration);
   final long endTime = System.currentTimeMillis();
   LOG.info(String.format("Time taken to start SessionState and create Driver: %s ms", (endTime - startTime)));
diff --git a/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java b/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
index f804219..449c7f3 100644
--- a/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
+++ b/hudi-hive-sync/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
 @@ -168,6 +168,47 @@ public class TestHiveSyncTool {
 
         hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
     assertEquals("The last commit that was synced should be updated in the TBLPROPERTIES", instantTime,
         hiveClient.getLastCommitTimeSynced(TestUtil.hiveSyncConfig.tableName).get());
+
+    // Adding of new partitions
+    List<String> newPartition = Arrays.asList("2050/01/01");
+    hiveClient.addPartitionsToTable(TestUtil.hiveSyncConfig.tableName, Arrays.asList());
+    assertEquals("No new partition should be added", 5,
+        hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
+    hiveClient.addPartitionsToTable(TestUtil.hiveSyncConfig.tableName, newPartition);
+    assertEquals("New partition should be added", 6,
+        hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
+
+    // Update partitions
+    hiveClient.updatePartitionsToTable(TestUtil.hiveSyncConfig.tableName, Arrays.asList());
+    assertEquals("Partition count should remain the same", 6,
+        hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
+    hiveClient.updatePartitionsToTable(TestUtil.hiveSyncConfig.tableName, newPartition);
+    assertEquals("Partition count should remain the same", 6,
+        hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName).size());
+
+    // Alter partitions
+    // Manually change a hive partition location to check if the sync will detect
+    // it and generate a partition update event for it.
+    hiveClient.updateHiveSQL("ALTER TABLE `" + TestUtil.hiveSyncConfig.tableName
+        + "` PARTITION (`datestr`='2050-01-01') SET LOCATION '/some/new/location'");
+
+    hiveClient = new HoodieHiveClient(TestUtil.hiveSyncConfig, TestUtil.getHiveConf(), TestUtil.fileSystem);
+    List<Partition> hivePartitions = hiveClient.scanTablePartitions(TestUtil.hiveSyncConfig.tableName);
+    List<String> writtenPartitionsSince = hiveClient.getPartitionsWrittenToSin

[GitHub] [incubator-hudi] prashantwason commented on issue #1468: [HUDI-748] Adding .codecov.yml to set exclusions for code coverage reports.

2020-04-03 Thread GitBox
prashantwason commented on issue #1468: [HUDI-748] Adding .codecov.yml to set 
exclusions for code coverage reports.
URL: https://github.com/apache/incubator-hudi/pull/1468#issuecomment-608792528
 
 
   I have removed the following from the exclusions as requested. Please take a look again.
   
   hudi-common/src/main/java/org/apache/hudi/common/model/*
   hudi-common/src/main/java/org/apache/hudi/common/table/timeline/dto/*


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1468: [HUDI-748] Adding .codecov.yml to set exclusions for code coverage reports.

2020-04-03 Thread GitBox
prashantwason commented on a change in pull request #1468: [HUDI-748] Adding 
.codecov.yml to set exclusions for code coverage reports.
URL: https://github.com/apache/incubator-hudi/pull/1468#discussion_r403378353
 
 

 ##
 File path: .codecov.yml
 ##
 @@ -0,0 +1,46 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# For more configuration details:
+# https://docs.codecov.io/docs/codecov-yaml
+
+# Check if this file is valid by running in bash:
+# curl -X POST --data-binary @.codecov.yml https://codecov.io/validate
+
+# Ignoring Paths
+# --
+# which folders/files to ignore
+ignore:
+  - "hudi-common/src/main/java/org/apache/hudi/avro/model/*"
+  - "hudi-common/src/main/java/org/apache/hudi/common/model/*"
 
 Review comment:
   No issues. I will remove the hand-written model classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on issue #1476: [HUDI-757] Added hudi-cli command to export metadata of Instants.

2020-04-03 Thread GitBox
prashantwason commented on issue #1476: [HUDI-757] Added hudi-cli command to 
export metadata of Instants.
URL: https://github.com/apache/incubator-hudi/pull/1476#issuecomment-608777353
 
 
   I have addressed all your review comments. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on a change in pull request #1476: [HUDI-757] Added hudi-cli command to export metadata of Instants.

2020-04-03 Thread GitBox
prashantwason commented on a change in pull request #1476: [HUDI-757] Added 
hudi-cli command to export metadata of Instants.
URL: https://github.com/apache/incubator-hudi/pull/1476#discussion_r403376483
 
 

 ##
 File path: 
hudi-cli/src/main/java/org/apache/hudi/cli/commands/ExportCommand.java
 ##
 @@ -0,0 +1,152 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.avro.model.HoodieArchivedMetaEntry;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.table.log.HoodieLogFormat;
+import org.apache.hudi.common.table.log.HoodieLogFormat.Reader;
+import org.apache.hudi.common.table.log.block.HoodieAvroDataBlock;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.springframework.shell.core.CommandMarker;
+import org.springframework.shell.core.annotation.CliCommand;
+import org.springframework.shell.core.annotation.CliOption;
+import org.springframework.stereotype.Component;
+
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * CLI command to export various information from a HUDI dataset.
+ */
+@Component
+public class ExportCommand implements CommandMarker {
+
+  @CliCommand(value = "export instants", help = "Export Instants and their metadata from the Timeline")
+  public String showArchivedCommits(
+      @CliOption(key = {"limit"}, help = "Limit Instants", unspecifiedDefaultValue = "-1") final Integer limit,
+      @CliOption(key = {"actions"}, help = "Comma separated list of Instant actions to export",
+          unspecifiedDefaultValue = "clean,commit,deltacommit,rollback,savepoint,restore") final String filter,
+      @CliOption(key = {"desc"}, help = "Ordering", unspecifiedDefaultValue = "false") final boolean descending,
+      @CliOption(key = {"localFolder"}, help = "Local Folder to export to", mandatory = true) String localFolder)
+      throws IOException {
+
+    final String basePath = HoodieCLI.getTableMetaClient().getBasePath();
+    final Path archivePath = new Path(basePath + "/.hoodie/.commits_.archive*");
+    final Path metaPath = new Path(basePath + "/.hoodie/");
+    final Set<String> actionSet = new HashSet<>(Arrays.asList(filter.split(",")));
+    int numExports = limit == -1 ? Integer.MAX_VALUE : limit;
+    int numCopied = 0;
+
+    if (!new File(localFolder).isDirectory()) {
+      throw new RuntimeException(localFolder + " is not a valid local directory");
+    }
+
+    // The non-archived instants are of the format ... We only
+    // want the completed ones which do not have the requested/inflight suffix.
+    FileStatus[] statuses = FSUtils.getFs(basePath, HoodieCLI.conf).listStatus(metaPath);
+    List<FileStatus> nonArchivedStatuses = Arrays.stream(statuses).filter(f -> {
 
 Review comment:
   Good idea. It even makes the filtering easier.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] Bootstrap Index using HFile and File System View Changes with unit-test

2020-04-03 Thread GitBox
umehrot2 commented on a change in pull request #1459: [HUDI-418] [HUDI-421] 
Bootstrap Index using HFile and File System View Changes with unit-test
URL: https://github.com/apache/incubator-hudi/pull/1459#discussion_r403362814
 
 

 ##
 File path: hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java
 ##
 @@ -240,4 +244,27 @@ public static String decompress(byte[] bytes) {
   throw new HoodieIOException("IOException while decompressing text", e);
 }
   }
+
+  /**
+   * Generate a reader schema off the provided writeSchema, to just project out the provided columns.
+   */
+  public static Schema generateProjectionSchema(Schema originalSchema, List<String> fieldNames) {
+    Map<String, Schema.Field> schemaFieldsMap = originalSchema.getFields().stream()
+        .map(r -> Pair.of(r.name().toLowerCase(), r)).collect(Collectors.toMap(Pair::getLeft, Pair::getRight));
+    List<Schema.Field> projectedFields = new ArrayList<>();
+    for (String fn : fieldNames) {
+      Schema.Field field = schemaFieldsMap.get(fn.toLowerCase());
 
 Review comment:
   I think we can avoid forming `schemaFieldsMap` here. We can directly get the 
field from the original schema using `originalSchema.getField()`, and I don't think 
there will be any performance impact here because Avro internally maintains a 
Map similar to this one to find the field, instead of `O(n)`
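   
   A sketch of the suggested simplification (hedged: unlike the toLowerCase map it replaces, Schema.getField is case-sensitive, and the Field-copying constructor shown assumes an Avro version with the Object-based defaultVal API):
   
   ```java
   public static Schema generateProjectionSchema(Schema originalSchema, List<String> fieldNames) {
     List<Schema.Field> projectedFields = new ArrayList<>();
     for (String fn : fieldNames) {
       // Schema.getField is backed by an internal name -> field map, so this stays O(1) per field.
       Schema.Field field = originalSchema.getField(fn);
       if (field == null) {
         throw new HoodieException("Field " + fn + " not found in schema " + originalSchema);
       }
       // Field objects cannot be reused across schemas, so copy each one.
       projectedFields.add(new Schema.Field(field.name(), field.schema(), field.doc(), field.defaultVal()));
     }
     Schema projected = Schema.createRecord(originalSchema.getName() + "_projected",
         originalSchema.getDoc(), originalSchema.getNamespace(), originalSchema.isError());
     projected.setFields(projectedFields);
     return projected;
   }
   ```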


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-316) Improve performance of HbaseIndex puts by repartitioning WriteStatus and using rate limiter instead of sleep()

2020-04-03 Thread Venkatesh Rudraraju (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074888#comment-17074888
 ] 

Venkatesh Rudraraju commented on HUDI-316:
--

[https://github.com/apache/incubator-hudi/pull/1484]

> Improve performance of HbaseIndex puts by repartitioning WriteStatus and 
> using rate limiter instead of sleep()
> --
>
> Key: HUDI-316
> URL: https://issues.apache.org/jira/browse/HUDI-316
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Index
>Reporter: Venkatesh Rudraraju
>Assignee: Venkatesh Rudraraju
>Priority: Major
>
> * Repartition WriteStatus before index writes, in a way that WriteStatuses 
> with new records are not clubbed together.
>  * This repartition will improve parallelism for this hbase index operation.
>  * In HBaseIndex puts call, there is a sleep of 100 millis for each batch of 
> puts. This implementation assumes negligible time for puts, but for large 
> batches of puts it is inefficient.
>  * Using rate limiter will be efficient compared to sleep as it accounts for 
> the time taken for puts as well.
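
A sketch of the intended pattern in the puts loop (limiter and multiPutBatchSize are illustrative names, not quotes from the PR):

```java
// Before: fixed pause regardless of how long the puts themselves took.
// Thread.sleep(100);

// After: block only until the configured puts-per-second budget allows the
// next batch; time already spent inside htable.put(...) counts toward the window.
limiter.acquire(multiPutBatchSize);
htable.put(batchOfPuts);
```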



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] symfrog commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

2020-04-03 Thread GitBox
symfrog commented on issue #1480: [SUPPORT] Backwards Incompatible Schema 
Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608650572
 
 
   @vinothchandar yes, exactly, some schema evolution operations may also 
involve the splitting or merging of tables
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] v3nkatesh opened a new pull request #1484: Hbase qps repartition writestatus

2020-04-03 Thread GitBox
v3nkatesh opened a new pull request #1484: Hbase qps repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   This pull request optimizes hbase index write operations.
   
   ## Brief change log
 - *Replaces Thread.sleep() with RateLimiter to reduce wait time during 
hbase puts operation*
 - *Repartitions `WriteStatus with new records` to improve parallelism of 
hbase index operations*  
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
 - *Added tests to TestHbaseIndex to verify the repartition optimization*
  - *Also verified the change by running a job end to end*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-316) Improve performance of HbaseIndex puts by repartitioning WriteStatus and using rate limiter instead of sleep()

2020-04-03 Thread Venkatesh Rudraraju (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkatesh Rudraraju updated HUDI-316:
-
Description: 
* Repartition WriteStatus before index writes, in a way that WriteStatuses 
with new records are not clubbed together.
 * This repartition will improve parallelism for this hbase index operation.
 * In HBaseIndex puts call, there is a sleep of 100 millis for each batch of 
puts. This implementation assumes negligible time for puts, but for large 
batches of puts it is inefficient.
 * Using rate limiter will be efficient compared to sleep as it accounts for 
the time taken for puts as well.

  was:
* In HBaseIndex puts call, there is a sleep of 100 millis for each batch of 
puts. This implementation assumes negligible time for puts, but for large 
batches of puts it is inefficient.
 * Using rate limiter will be efficient compared to sleep as it accounts for 
the time taken for puts as well.


> Improve performance of HbaseIndex puts by repartitioning WriteStatus and 
> using rate limiter instead of sleep()
> --
>
> Key: HUDI-316
> URL: https://issues.apache.org/jira/browse/HUDI-316
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Index
>Reporter: Venkatesh Rudraraju
>Assignee: Venkatesh Rudraraju
>Priority: Major
>
> * Repartition WriteStatus before index writes, in a way that WriteStatuses 
> with new records are not clubbed together.
>  * This repartition will improve parallelism for this hbase index operation.
>  * In HBaseIndex puts call, there is a sleep of 100 millis for each batch of 
> puts. This implementation assumes negligible time for puts, but for large 
> batches of puts it is inefficient.
>  * Using rate limiter will be efficient compared to sleep as it accounts for 
> the time taken for puts as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-316) Improve performance of HbaseIndex puts by repartitioning WriteStatus and using rate limiter instead of sleep()

2020-04-03 Thread Venkatesh Rudraraju (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkatesh Rudraraju updated HUDI-316:
-
Summary: Improve performance of HbaseIndex puts by repartitioning 
WriteStatus and using rate limiter instead of sleep()  (was: Improve 
performance of HbaseIndex puts by using rate limiter instead of sleep())

> Improve performance of HbaseIndex puts by repartitioning WriteStatus and 
> using rate limiter instead of sleep()
> --
>
> Key: HUDI-316
> URL: https://issues.apache.org/jira/browse/HUDI-316
> Project: Apache Hudi (incubating)
>  Issue Type: Task
>  Components: Index
>Reporter: Venkatesh Rudraraju
>Assignee: Venkatesh Rudraraju
>Priority: Major
>
> * In HBaseIndex puts call, there is a sleep of 100 millis for each batch of 
> puts. This implementation assumes negligible time for puts, but for large 
> batches of puts it is inefficient.
>  * Using rate limiter will be efficient compared to sleep as it accounts for 
> the time taken for puts as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] ramachandranms commented on a change in pull request #1473: [HUDI-568] Improve unit test coverage

2020-04-03 Thread GitBox
ramachandranms commented on a change in pull request #1473: [HUDI-568] Improve 
unit test coverage
URL: https://github.com/apache/incubator-hudi/pull/1473#discussion_r403266155
 
 

 ##
 File path: 
hudi-common/src/test/java/org/apache/hudi/common/util/collection/TestRocksDBManager.java
 ##
 @@ -99,25 +104,119 @@ public void testRocksDBManager() {
 List> gotPayloads =
     dbManager.prefixSearch(family, prefix).collect(Collectors.toList());
 Integer expCount = countsMap.get(family).get(prefix);
+System.out.printf("%s,%s: %d, %d\n", prefix, family, expCount == null ? 0L : expCount.longValue(), gotPayloads.size());
 
 Review comment:
   This was just a debug statement, so I removed it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] malanb5 commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
malanb5 commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to 
namenode
URL: https://github.com/apache/incubator-hudi/issues/1483#issuecomment-608605564
 
 
   I'm a little bit of a noob to hudi and the hadoop ecosystem in general, so thank you for bearing with me.
   
   The docker compose yaml file indicates that the version number is 3.3. I found that many of the processes reference **apachehudi/hudi-hadoop_2.8.4-history:latest**, which I assume is the Hudi version being used in the demo?
   
   Here is the output of docker ps.
   ```
   CONTAINER IDIMAGE
  COMMAND  CREATED STATUS   
  PORTS 

  NAMES
   c6428f5ed413apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest 
  "entrypoint.sh worker"   7 seconds ago   Up Less than a second
  0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 5-50200/tcp, 
58042/tcp, 58088/tcp, 58188/tcp 
  presto-worker-1
   667e2ffc1cb1
apachehudi/hudi-hadoop_2.8.4-hive_2.3.3-sparkmaster_2.4.4:latest   
"entrypoint.sh /bin/…"   7 seconds ago   Up Less than a second  
0-1024/tcp, 4040/tcp, 5000-5100/tcp, 6066/tcp, 7000-7076/tcp, 
0.0.0.0:7077->7077/tcp, 7078-8079/tcp, 8081-10100/tcp, 5-50200/tcp, 
58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8080->8080/tcp   sparkmaster
   31f4f30bbc54apachehudi/hudi-hadoop_2.8.4-datanode:latest 
  "/bin/bash /entrypoi…"   12 seconds ago  Up 6 seconds (health: 
starting)0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-10100/tcp, 
5-50009/tcp, 0.0.0.0:50010->50010/tcp, 50011-50074/tcp, 50076-50200/tcp, 
58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:50075->50075/tcp datanode1
   abebc706e3f3apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest   
  "entrypoint.sh /bin/…"   12 seconds ago  Up 6 seconds 
  0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-/tcp, 10001-10100/tcp, 
5-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:1->1/tcp  
   hiveserver
   e1103ccb2097apachehudi/hudi-hadoop_2.8.4-prestobase_0.217:latest 
  "entrypoint.sh coord…"   12 seconds ago  Up 7 seconds 
  0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8089/tcp, 8091-10100/tcp, 
5-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:8090->8090/tcp
presto-coordinator-1
   59717fef35eaapachehudi/hudi-hadoop_2.8.4-history:latest  
  "/bin/bash /entrypoi…"   15 seconds ago  Up 11 seconds (health: 
starting)   0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8187/tcp, 8189-10100/tcp, 
5-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:58188->8188/tcp   
historyserver
   e60354b6d191apachehudi/hudi-hadoop_2.8.4-hive_2.3.3:latest   
  "entrypoint.sh /opt/…"   15 seconds ago  Up 11 seconds (health: 
starting)   0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-9082/tcp, 9084-10100/tcp, 
5-50200/tcp, 58042/tcp, 58088/tcp, 58188/tcp, 0.0.0.0:9083->9083/tcp
hivemetastore
   16c4cdce76cfapachehudi/hudi-hadoop_2.8.4-namenode:latest 
  "/bin/bash /entrypoi…"   18 seconds ago  Up 14 seconds (health: 
starting)   0-1024/tcp, 4040/tcp, 5000-5100/tcp, 7000-8019/tcp, 8021-10100/tcp, 
0.0.0.0:8020->8020/tcp, 5-50069/tcp, 50071-50200/tcp, 58042/tcp, 58088/tcp, 
58188/tcp, 0.0.0.0:50070->50070/tcp namenode
   b8811253a180bde2020/hive-metastore-postgresql:2.3.0  
  "/docker-entrypoint.…"   18 seconds ago  Up 14 seconds
  5432/tcp  

  hive-metastore-postgresql
   b0ef4a787039bitnami/zookeeper:3.4.12-r68 
  "/app-entrypoint.sh …"   18 seconds ago  Up 14 seconds
  2888/tcp, 0.0.0.0:2181->2181/tcp, 3888/tcp

  zookeeper
   0eafd90cb012bitnami/kafka:2.0.0  
  "/app-entrypoint.sh …"   18 seconds ago  Up 14 seconds
  0.0.0.0:9092->9092/tcp
   ```

[GitHub] [incubator-hudi] lamber-ken commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
lamber-ken commented on issue #1483: [SUPPORT] Docker Demo: Failed to Connect 
to namenode
URL: https://github.com/apache/incubator-hudi/issues/1483#issuecomment-608584549
 
 
   Thanks for reporting this issue. What version of Hudi are you using? Can you 
share the output of `docker ps`?




[GitHub] [incubator-hudi] malanb5 opened a new issue #1483: [SUPPORT] Docker Demo: Failed to Connect to namenode

2020-04-03 Thread GitBox
malanb5 opened a new issue #1483: [SUPPORT] Docker Demo: Failed to Connect to 
namenode
URL: https://github.com/apache/incubator-hudi/issues/1483
 
 
   **Describe the problem you faced**
   `Failed to connect to server: namenode/172.19.0.5:8020: try once and fail` 
when running the ./setup_demo.sh script.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.  Follow the setup per the Docker Demo
   2.  Run the script ./setup_demo.sh
   
   **Expected behavior**
   Connection to the namenode and the successful startup of Hudi.
   
   **Environment Description**
   MacOS: 10.15.4
   Docker: version 19.03.8, build afacb8b
   
   **Stacktrace**
   ```
   Creating network "compose_default" with the default driver
   Creating zookeeper ... done
   Creating namenode  ... done
   Creating kafkabroker   ... done
   Creating hive-metastore-postgresql ... done
   Creating hivemetastore ... done
   Creating historyserver ... done
   Creating datanode1 ... done
   Creating presto-coordinator-1  ... done
   Creating hiveserver... done
   Creating sparkmaster   ... done
   Creating presto-worker-1   ... done
   Creating spark-worker-1... done
   Creating adhoc-2   ... done
   Creating adhoc-1   ... done
   
   Copying spark default config and setting up configs
   20/04/03 17:48:13 WARN ipc.Client: Failed to connect to server: 
namenode/172.19.0.5:8020: try once and fail.
   java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at 
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at 
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at 
org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:685)
at 
org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:788)
at org.apache.hadoop.ipc.Client$Connection.access$3500(Client.java:410)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1550)
at org.apache.hadoop.ipc.Client.call(Client.java:1381)
at org.apache.hadoop.ipc.Client.call(Client.java:1345)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy10.getFileInfo(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:796)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
at com.sun.proxy.$Proxy11.getFileInfo(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1649)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1440)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1437)
at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)
at org.apache.hadoop.fs.Globber.doGlob(Globber.java:269)
at org.apache.hadoop.fs.Globber.glob(Globber.java:148)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1686)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:326)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
at 
org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
	at org.apache.hadoop.util.T
   ```

[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1473: [HUDI-568] Improve unit test coverage

2020-04-03 Thread GitBox
n3nash commented on a change in pull request #1473: [HUDI-568] Improve unit 
test coverage
URL: https://github.com/apache/incubator-hudi/pull/1473#discussion_r403190612
 
 

 ##
 File path: 
hudi-hadoop-mr/src/test/java/org/apache/hudi/hadoop/realtime/TestHoodieRealtimeFileSplit.java
 ##
 @@ -0,0 +1,129 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.realtime;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapred.FileSplit;
+import org.junit.Before;
+import org.junit.Test;
+import org.mockito.InOrder;
+import org.mockito.invocation.InvocationOnMock;
+import org.mockito.stubbing.Answer;
+
+import java.io.DataInput;
+import java.io.DataOutput;
+import java.io.IOException;
+import java.nio.charset.StandardCharsets;
+import java.util.Collections;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.mockito.AdditionalMatchers.aryEq;
+import static org.mockito.Matchers.any;
+import static org.mockito.Matchers.anyByte;
+import static org.mockito.Matchers.anyInt;
+import static org.mockito.Mockito.doAnswer;
+import static org.mockito.Mockito.doNothing;
+import static org.mockito.Mockito.eq;
+import static org.mockito.Mockito.inOrder;
+import static org.mockito.Mockito.mock;
+import static org.mockito.Mockito.times;
+import static org.mockito.Mockito.when;
+
+public class TestHoodieRealtimeFileSplit {
+
+  private HoodieRealtimeFileSplit split;
+  private String basePath = "/tmp";
 
 Review comment:
   The path is something you can keep in a temporary variable set up in a 
@BeforeClass or @Before method; it's best to use that, or, if you prefer, use 
the TemporaryFolder rule from JUnit (see the sketch below). We would like to 
avoid hard-coded temp paths used for testing.
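A minimal sketch of that suggestion, assuming JUnit 4 (which this test already imports); the class name and folder name below are illustrative, not the actual test:

```java
import org.junit.Before;
import org.junit.Rule;
import org.junit.rules.TemporaryFolder;

public class TestHoodieRealtimeFileSplitSketch {

  // JUnit creates a fresh directory per test and deletes it afterwards,
  // so no hard-coded "/tmp" path is needed.
  @Rule
  public TemporaryFolder tmp = new TemporaryFolder();

  private String basePath;

  @Before
  public void setUp() throws Exception {
    basePath = tmp.newFolder("hudi-split-test").getAbsolutePath();
  }
}
```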




[GitHub] [incubator-hudi] n3nash commented on a change in pull request #1473: [HUDI-568] Improve unit test coverage

2020-04-03 Thread GitBox
n3nash commented on a change in pull request #1473: [HUDI-568] Improve unit 
test coverage
URL: https://github.com/apache/incubator-hudi/pull/1473#discussion_r403189372
 
 

 ##
 File path: 
hudi-common/src/main/java/org/apache/hudi/common/util/collection/RocksDBDAO.java
 ##
 @@ -75,9 +75,6 @@ public RocksDBDAO(String basePath, String rocksDBBasePath) {
* Create RocksDB if not initialized.
*/
   private RocksDB getRocksDB() {
-if (null == rocksDB) {
 
 Review comment:
   Sounds good, thanks for the explanation.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1375: [SUPPORT] HoodieDeltaStreamer offset not handled correctly

2020-04-03 Thread GitBox
lamber-ken commented on issue #1375: [SUPPORT] HoodieDeltaStreamer offset not 
handled correctly
URL: https://github.com/apache/incubator-hudi/issues/1375#issuecomment-60859
 
 
   So, can we close this issue, @eigakow?




[GitHub] [incubator-hudi] vinothchandar commented on issue #1480: [SUPPORT] Backwards Incompatible Schema Evolution

2020-04-03 Thread GitBox
vinothchandar commented on issue #1480: [SUPPORT] Backwards Incompatible Schema 
Evolution
URL: https://github.com/apache/incubator-hudi/issues/1480#issuecomment-608529410
 
 
   >> we would like the instant timestamps to be the same in the new target 
tables after the transformation so that downstream clients can continue to use 
their existing instant values while performing incremental pull queries. 
   
   IIUC, the current initialization process hands you a single commit for the 
first ingest, but you basically want a physical copy of the old data as the 
new data, with just renamed fields/a new schema. In general, this may be worth 
adding support for in the new exporter tool, cc @xushiyan ... wdyt? Essentially, 
something that will preserve file names and just transform the data. 
   
   For now, even if you create those commit timeline files yourself in 
`.hoodie`, it may not work, since the metadata inside will point to files that 
no longer exist in the new table. Here's an approach that could work: writing 
a small program that will
   
   - First copy the `.hoodie` folder to new table location
   - Then list all files (directly using fs.listStatus()) and filter them such 
that their commit time < latest commit time in the `.hoodie` folder you copied 
above
   - Read all files out using AvroParquetReader to get RDD[GenericRecord] (if 
it's MOR, we need more work), do your schema adjusting to derive a new 
RDD[GenericRecord]
   - Write this out using HoodieAvroParquetWriter back into the same file 
names.. 
   
   Essentially, you will have the same file names and the same timeline (.hoodie) 
metadata, just with a different schema; see the sketch below.
   
   Let's also wait to hear from @xushiyan; maybe the exporter tool could be 
reused here.
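A rough, illustrative sketch of the read-and-rewrite step above, using the plain parquet-avro classes (AvroParquetReader/AvroParquetWriter) in place of Hudi's internal HoodieAvroParquetWriter; the remap helper is hypothetical, and a real implementation would also need to preserve Hudi's footer metadata and bloom filters:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;

public class SchemaRewriteSketch {

  // Rewrite one data file under the new schema, keeping the same file name
  // at the destination so the copied .hoodie timeline still lines up.
  static void rewriteFile(Path src, Path dst, Schema newSchema) throws Exception {
    try (ParquetReader<GenericRecord> reader =
             AvroParquetReader.<GenericRecord>builder(src).build();
         ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(dst)
                 .withSchema(newSchema)
                 .build()) {
      GenericRecord record;
      while ((record = reader.read()) != null) {
        writer.write(remap(record, newSchema));
      }
    }
  }

  // Hypothetical field remapping: copies values across by name; a real
  // rename would translate old field names to new ones here.
  static GenericRecord remap(GenericRecord old, Schema newSchema) {
    GenericRecord out = new GenericData.Record(newSchema);
    for (Schema.Field f : newSchema.getFields()) {
      out.put(f.name(), old.get(f.name()));
    }
    return out;
  }
}
```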
   
   
   
   
   
   
   




[GitHub] [incubator-hudi] venkee14 commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

2020-04-03 Thread GitBox
venkee14 commented on issue #1482: [SUPPORT] Deletion of records through 
deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608505242
 
 
   > @venkee14 : may I know what the new schema looks like? Did you update 
the schema explicitly?
   
   No, I did not change the schema explicitly.




[GitHub] [incubator-hudi] venkee14 commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

2020-04-03 Thread GitBox
venkee14 commented on issue #1482: [SUPPORT] Deletion of records through 
deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608504894
 
 
   @nsivabalan : My new schema looks like this:
   
   20/04/02 06:04:08 INFO DeltaSync: Registering Schema 
:[{"type":"record","name":"hoodie_source","namespace":"hoodie.source","fields": 

   
   
{"name":"updatedby_user","type":["string","null"]},{"name":"_hoodie_is_deleted","type":"boolean"},{"name":"partition_date","type":["string","null"]}]}]
   
   Let me know if you need the complete schema definition.




[GitHub] [incubator-hudi] nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

2020-04-03 Thread GitBox
nsivabalan commented on issue #1482: [SUPPORT] Deletion of records through 
deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608466335
 
 
   @venkee14 : may I know what the new schema looks like?




[GitHub] [incubator-hudi] nsivabalan edited a comment on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

2020-04-03 Thread GitBox
nsivabalan edited a comment on issue #1482: [SUPPORT] Deletion of records 
through deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608466335
 
 
   @venkee14 : may I know what the new schema looks like? Did you update the 
schema explicitly?




[GitHub] [incubator-hudi] bvaradar commented on issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

2020-04-03 Thread GitBox
bvaradar commented on issue #1482: [SUPPORT] Deletion of records through 
deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482#issuecomment-608448949
 
 
   @nsivabalan : can you take a look at this issue?




[GitHub] [incubator-hudi] venkee14 opened a new issue #1482: [SUPPORT] Deletion of records through deltaStreamer _hoodie_is_deleted flag does not work as expected

2020-04-03 Thread GitBox
venkee14 opened a new issue #1482: [SUPPORT] Deletion of records through 
deltaStreamer _hoodie_is_deleted flag does not work as expected
URL: https://github.com/apache/incubator-hudi/issues/1482
 
 
   I am trying to get "Deletion with HoodieDeltaStreamer" working for my 
existing dataset in Hudi. I am following 
https://cwiki.apache.org/confluence/display/HUDI/2020/01/15/Delete+support+in+Hudi
   My initial dataset exists without the "_hoodie_is_deleted" key; I am trying to 
upsert the records with this key set for all incoming records. My code:
   
   Dataset<Row> deletedRows = 
dataframe.filter(dataframe.col(this.deleteKey).equalTo(this.deleteValue));
   Dataset<Row> remainingRows = 
dataframe.filter(dataframe.col(this.deleteKey).notEqual(this.deleteValue));
   deletedRows = deletedRows.withColumn("_hoodie_is_deleted", lit(true));
   remainingRows = remainingRows.withColumn("_hoodie_is_deleted", lit(false));
   dataframe = deletedRows.union(remainingRows);
   
   I have noticed that the upsert runs fine when the record to be deleted is 
the only record in the parquet file, but when there are other records in the 
parquet file it fails with the error below:
   Null-value for required field: _hoodie_is_deleted
	at org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
   
   Would appreciate any help here.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Load the initial dataset without _hoodie_is_deleted in the schema.
   2. Pick a record from a parquet file that has multiple records.
   3. Delete this record by adding _hoodie_is_deleted: true; pass this flag 
for all incoming upserts.
   4. The upsert throws "Null-value for required field: _hoodie_is_deleted".
   
   It works when the record to be deleted is the only record in the parquet 
file.
   
   **Expected behavior**
   
   Only the single flagged record should be deleted from the parquet file; all 
other records should remain, and the upsert should not throw "Null-value for 
required field: _hoodie_is_deleted".
   
   **Environment Description**
   
   * Hudi version : 0.5.1
   
   * Spark version : 2.2
   
   * EMR Version: emr-5.28
   
   * Hive version : NA
   
   * Hadoop version : NA
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   StackTrace : 
   
   Caused by: org.apache.hudi.exception.HoodieException: 
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Null-value 
for required field: _hoodie_is_deleted
at 
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:143)
at 
org.apache.hudi.table.HoodieCopyOnWriteTable.handleUpdateInternal(HoodieCopyOnWriteTable.java:204)
... 32 more
   Caused by: java.util.concurrent.ExecutionException: 
java.lang.RuntimeException: Null-value for required field: _hoodie_is_deleted
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.execute(BoundedInMemoryExecutor.java:141)
... 33 more
   Caused by: java.lang.RuntimeException: Null-value for required field: 
_hoodie_is_deleted
at 
org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:194)
at 
org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
at 
org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
at 
org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
at 
org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:296)
at 
org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:434)
at 
org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:424)
at 
org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
at 
org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
   
   

