[GitHub] [incubator-hudi] xushiyan commented on issue #1504: [HUDI-780] Migrate test cases to Junit 5

2020-04-13 Thread GitBox
xushiyan commented on issue #1504: [HUDI-780] Migrate test cases to Junit 5
URL: https://github.com/apache/incubator-hudi/pull/1504#issuecomment-613241906
 
 
   @prashantwason Thanks for checking the report. I updated the branch, and as 
before, the report omitted the test cases covered by JUnit 5, resulting in a 
coverage decrease. I've created https://jira.apache.org/jira/browse/HUDI-792 
for investigation. @vinothchandar Let me check with the community for help.
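
   For background, here is a minimal, hypothetical sketch (not taken from this 
PR) of the kind of change the JUnit 4 to JUnit 5 migration involves: the 
annotations move to the `org.junit.jupiter.api` package and the optional 
assertion message moves to the last argument.

   ```java
   // Illustrative JUnit 4 -> JUnit 5 migration sketch; the class and values are made up.
   import org.junit.jupiter.api.BeforeEach; // was org.junit.Before / @Before
   import org.junit.jupiter.api.Test;       // was org.junit.Test

   import static org.junit.jupiter.api.Assertions.assertEquals; // was org.junit.Assert.assertEquals

   class ExampleMigratedTest {

     private int cleanInstants;

     @BeforeEach // was @Before in JUnit 4
     void init() {
       cleanInstants = 1;
     }

     @Test
     void showCleans() {
       // In JUnit 5 the optional failure message is the last argument, not the first.
       assertEquals(1, cleanInstants, "Loaded 1 clean and the count should match");
     }
   }
   ```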


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-792) Codecov does not report test cases ran by JUnit 5

2020-04-13 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-792:
---

 Summary: Codecov does not report test cases ran by JUnit 5
 Key: HUDI-792
 URL: https://issues.apache.org/jira/browse/HUDI-792
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Testing
Reporter: Raymond Xu
 Fix For: 0.6.0


The problem was detected in https://github.com/apache/incubator-hudi/pull/1504




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] codecov-io commented on issue #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-13 Thread GitBox
codecov-io commented on issue #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#issuecomment-613236143
 
 
   # 
[Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=h1) 
Report
   > Merging 
[#1500](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=desc) 
into 
[master](https://codecov.io/gh/apache/incubator-hudi/commit/4e5c8671ef3213ffa5c40f09aae27aacfa20f907=desc)
 will **increase** coverage by `0.52%`.
   > The diff coverage is `81.90%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/incubator-hudi/pull/1500/graphs/tree.svg?width=650=150=pr=VTTXabwbs2)](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #1500      +/-   ##
   ============================================
   + Coverage     71.66%   72.18%   +0.52%
   + Complexity      290      289       -1
   ============================================
     Files           338      372      +34
     Lines         15931    16278     +347
     Branches       1625     1638      +13
   ============================================
   + Hits          11417    11751     +334
   - Misses         3781     3793      +12
   - Partials        733      734       +1
   ```
   
   
   | [Impacted 
Files](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=tree) | 
Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | 
[.../apache/hudi/exception/HoodieRestoreException.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVJlc3RvcmVFeGNlcHRpb24uamF2YQ==)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...c/main/java/org/apache/hudi/table/HoodieTable.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvSG9vZGllVGFibGUuamF2YQ==)
 | `79.64% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...g/apache/hudi/table/action/BaseActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL0Jhc2VBY3Rpb25FeGVjdXRvci5qYXZh)
 | `100.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...ltacommit/BulkInsertDeltaCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2RlbHRhY29tbWl0L0J1bGtJbnNlcnREZWx0YUNvbW1pdEFjdGlvbkV4ZWN1dG9yLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...it/BulkInsertPreppedDeltaCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2RlbHRhY29tbWl0L0J1bGtJbnNlcnRQcmVwcGVkRGVsdGFDb21taXRBY3Rpb25FeGVjdXRvci5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...commit/InsertPreppedDeltaCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2RlbHRhY29tbWl0L0luc2VydFByZXBwZWREZWx0YUNvbW1pdEFjdGlvbkV4ZWN1dG9yLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...commit/UpsertPreppedDeltaCommitActionExecutor.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL2RlbHRhY29tbWl0L1Vwc2VydFByZXBwZWREZWx0YUNvbW1pdEFjdGlvbkV4ZWN1dG9yLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...che/hudi/table/action/rollback/RollbackHelper.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3JvbGxiYWNrL1JvbGxiYWNrSGVscGVyLmphdmE=)
 | `80.43% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...he/hudi/table/action/rollback/RollbackRequest.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jbGllbnQvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdGFibGUvYWN0aW9uL3JvbGxiYWNrL1JvbGxiYWNrUmVxdWVzdC5qYXZh)
 | `94.73% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...che/hudi/common/table/timeline/HoodieTimeline.java](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3RpbWVsaW5lL0hvb2RpZVRpbWVsaW5lLmphdmE=)
 | `100.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | ... and [102 
more](https://codecov.io/gh/apache/incubator-hudi/pull/1500/diff?src=pr=tree-more)
 | |
   
   --
   
   [Continue to review full report at 
Codecov](https://codecov.io/gh/apache/incubator-hudi/pull/1500?src=pr=continue).
   > **Legend** - [Click here to learn 
more](https://docs.codecov.io/docs/codecov-delta)
   > `Δ = 

[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-13 Thread GitBox
garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r407875123
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -293,6 +296,12 @@ public Operation convert(String value) throws 
ParameterException {
 @Parameter(names = {"--checkpoint"}, description = "Resume Delta Streamer 
from this checkpoint.")
 public String checkpoint = null;
 
+@Parameter(names = {"--initial-checkpoint-provider"}, description = 
"Generate check point for delta streamer "
++ "for the first run. This field will override the checkpoint of last 
commit using the checkpoint field. "
++ "Use this field only when switch source, for example, from DFS 
source to Kafka Source. Check the class "
++ "org.apache.hudi.utilities.checkpointing for details")
 
 Review comment:
   Changed the description to match with `--schemaprovider-class`




[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-13 Thread GitBox
garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r407872338
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -293,6 +296,12 @@ public Operation convert(String value) throws 
ParameterException {
 @Parameter(names = {"--checkpoint"}, description = "Resume Delta Streamer 
from this checkpoint.")
 public String checkpoint = null;
 
+@Parameter(names = {"--initial-checkpoint-provider"}, description = 
"Generate check point for delta streamer "
++ "for the first run. This field will override the checkpoint of last 
commit using the checkpoint field. "
++ "Use this field only when switch source, for example, from DFS 
source to Kafka Source. Check the class "
++ "org.apache.hudi.utilities.checkpointing for details")
 
 Review comment:
   Do you mean we should add `InitialCheckPointProvider` here or we should 
remove this description? 




[GitHub] [incubator-hudi] vinothchandar commented on issue #1504: [HUDI-780] Migrate test cases to Junit 5

2020-04-13 Thread GitBox
vinothchandar commented on issue #1504: [HUDI-780] Migrate test cases to Junit 5
URL: https://github.com/apache/incubator-hudi/pull/1504#issuecomment-613227516
 
 
   @prashantwason yeah seems like a good idea 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-13 Thread GitBox
vinothchandar commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r407868180
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -293,6 +296,12 @@ public Operation convert(String value) throws 
ParameterException {
 @Parameter(names = {"--checkpoint"}, description = "Resume Delta Streamer 
from this checkpoint.")
 public String checkpoint = null;
 
+@Parameter(names = {"--initial-checkpoint-provider"}, description = 
"Generate check point for delta streamer "
++ "for the first run. This field will override the checkpoint of last 
commit using the checkpoint field. "
++ "Use this field only when switch source, for example, from DFS 
source to Kafka Source. Check the class "
++ "org.apache.hudi.utilities.checkpointing for details")
 
 Review comment:
   `InitialCheckPointProvider` did you intend to write the name of the class 
here? 




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-13 Thread GitBox
vinothchandar commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r407867948
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -90,35 +90,33 @@
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc) throws 
IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()));
+jssc.hadoopConfiguration(), null);
   }
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, 
TypedProperties props) throws IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()), props);
+jssc.hadoopConfiguration(), props);
   }
 
-  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf,
- TypedProperties properties) throws IOException {
-this.cfg = cfg;
-this.deltaSyncService = new DeltaSyncService(cfg, jssc, fs, hiveConf, 
properties);
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
Configuration hiveConf) throws IOException {
+this(cfg, jssc, fs, hiveConf, null);
   }
 
-  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf) throws IOException {
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
Configuration hiveConf,
+ TypedProperties properties) throws IOException {
+if (cfg.initialCheckpointProvider != null && cfg.bootstrapFromPath != null 
&& cfg.checkpoint == null) {
+  InitialCheckPointProvider checkPointProvider =
+  
UtilHelpers.createInitialCheckpointProvider(cfg.initialCheckpointProvider, new 
Path(cfg.bootstrapFromPath), fs);
+  cfg.checkpoint = checkPointProvider.getCheckpoint();
 
 Review comment:
   okay.. lets revisit once we have bootstrap support.. cc @bvaradar as fyi 




[GitHub] [incubator-hudi] vinothchandar commented on issue #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-13 Thread GitBox
vinothchandar commented on issue #1512: [HUDI-763] Add 
hoodie.table.base.file.format option to hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1512#issuecomment-613225596
 
 
   @lamber-ken trying to understand why this should be needed.. can't we 
distinguish based on the input format for ORC, where we can create 
HoodieOrcInputFormat .. 




[jira] [Updated] (HUDI-409) Replace Log Magic header with a secure hash to avoid clashes with data

2020-04-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-409:

Fix Version/s: (was: 0.5.2)
   0.6.0

> Replace Log Magic header with a secure hash to avoid clashes with data
> --
>
> Key: HUDI-409
> URL: https://issues.apache.org/jira/browse/HUDI-409
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Ramachandran M S
>Priority: Major
> Fix For: 0.6.0
>
>






[jira] [Reopened] (HUDI-409) Replace Log Magic header with a secure hash to avoid clashes with data

2020-04-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-409:
-

Don't think this is fixed yet? 

> Replace Log Magic header with a secure hash to avoid clashes with data
> --
>
> Key: HUDI-409
> URL: https://issues.apache.org/jira/browse/HUDI-409
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Ramachandran M S
>Priority: Major
> Fix For: 0.5.2
>
>






Build failed in Jenkins: hudi-snapshot-deployment-0.5 #247

2020-04-13 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.40 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test for CleansCommand

2020-04-13 Thread GitBox
yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test 
for CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#discussion_r407840275
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestCleansCommand.java
 ##
 @@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.HoodieTableHeaderFields;
+import org.apache.hudi.cli.TableHeader;
+import org.apache.hudi.cli.common.HoodieTestCommitMetadataGenerator;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+
+import org.junit.Before;
+import org.junit.Test;
+import org.springframework.shell.core.CommandResult;
+
+import java.io.File;
+import java.io.IOException;
+import java.net.URL;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * Test Cases for {@link CleansCommand}.
+ */
+public class TestCleansCommand extends AbstractShellIntegrationTest {
+
+  private String tablePath;
+  private URL propsFilePath;
+
+  @Before
+  public void init() throws IOException {
+HoodieCLI.conf = jsc.hadoopConfiguration();
+
+String tableName = "test_table";
+tablePath = basePath + File.separator + tableName;
+propsFilePath = 
TestCleansCommand.class.getClassLoader().getResource("clean.properties");
+
+// Create table and connect
+new TableCommand().createTable(
+tablePath, tableName, HoodieTableType.COPY_ON_WRITE.name(),
+"", TimelineLayoutVersion.VERSION_1, 
"org.apache.hudi.common.model.HoodieAvroPayload");
+
+Configuration conf = HoodieCLI.conf;
+
+metaClient = HoodieCLI.getTableMetaClient();
+// Create four commits
+for (int i = 100; i < 104; i++) {
+  String timestamp = String.valueOf(i);
+  // Requested Compaction
+  
HoodieTestCommitMetadataGenerator.createCompactionAuxiliaryMetadata(tablePath,
+  new HoodieInstant(HoodieInstant.State.REQUESTED, 
HoodieTimeline.COMPACTION_ACTION, timestamp), conf);
+  // Inflight Compaction
+  
HoodieTestCommitMetadataGenerator.createCompactionAuxiliaryMetadata(tablePath,
+  new HoodieInstant(HoodieInstant.State.INFLIGHT, 
HoodieTimeline.COMPACTION_ACTION, timestamp), conf);
+  
HoodieTestCommitMetadataGenerator.createCommitFileWithMetadata(tablePath, 
timestamp, conf);
+}
+
+metaClient = HoodieTableMetaClient.reload(metaClient);
+// reload the timeline and get all the commits before archive
+metaClient.getActiveTimeline().reload();
+  }
+
+  /**
+   * Test case for show all cleans.
+   */
+  @Test
+  public void testShowCleans() throws Exception {
+// Check properties file exists.
+assertNotNull("Not found properties file", propsFilePath);
+
+// First, run clean
+new File(tablePath + File.separator + 
HoodieTestCommitMetadataGenerator.DEFAULT_FIRST_PARTITION_PATH
++ File.separator + 
HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE).createNewFile();
+SparkMain.clean(jsc, HoodieCLI.basePath, propsFilePath.getPath(), new 
ArrayList<>());
+assertEquals("Loaded 1 clean and the count should match", 1,
+
metaClient.getActiveTimeline().reload().getCleanerTimeline().getInstants().count());
+
+CommandResult cr = 

[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test for CleansCommand

2020-04-13 Thread GitBox
yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test 
for CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#discussion_r407838960
 
 

 ##
 File path: hudi-cli/src/test/resources/clean.properties
 ##
 @@ -0,0 +1,19 @@
+###
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+###
+hoodie.cleaner.incremental.mode=true
 
 Review comment:
   add an empty line before this line?




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test for CleansCommand

2020-04-13 Thread GitBox
yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test 
for CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#discussion_r407838801
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/integ/ITTestCleansCommand.java
 ##
 @@ -0,0 +1,101 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.integ;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.commands.TableCommand;
+import org.apache.hudi.cli.common.HoodieTestCommitMetadataGenerator;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.versioning.TimelineLayoutVersion;
+
+import org.junit.Before;
+import org.junit.Test;
+import org.springframework.shell.core.CommandResult;
+
+import java.io.File;
+import java.io.IOException;
+import java.net.URL;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotNull;
+import static org.junit.Assert.assertTrue;
+
+public class ITTestCleansCommand extends AbstractShellIntegrationTest {
+  private String tablePath;
+  private URL propsFilePath;
+
+  @Before
+  public void init() throws IOException {
+HoodieCLI.conf = jsc.hadoopConfiguration();
+
+String tableName = "test_table";
+tablePath = basePath + File.separator + tableName;
+propsFilePath = 
this.getClass().getClassLoader().getResource("clean.properties");
+
+// Create table and connect
+new TableCommand().createTable(
+tablePath, tableName, HoodieTableType.COPY_ON_WRITE.name(),
+"", TimelineLayoutVersion.VERSION_1, 
"org.apache.hudi.common.model.HoodieAvroPayload");
+
+Configuration conf = HoodieCLI.conf;
+
+metaClient = HoodieCLI.getTableMetaClient();
+// Create four commits
+for (int i = 100; i < 104; i++) {
+  String timestamp = String.valueOf(i);
+  // Requested Compaction
+  
HoodieTestCommitMetadataGenerator.createCompactionAuxiliaryMetadata(tablePath,
+  new HoodieInstant(HoodieInstant.State.REQUESTED, 
HoodieTimeline.COMPACTION_ACTION, timestamp), conf);
+  // Inflight Compaction
+  
HoodieTestCommitMetadataGenerator.createCompactionAuxiliaryMetadata(tablePath,
+  new HoodieInstant(HoodieInstant.State.INFLIGHT, 
HoodieTimeline.COMPACTION_ACTION, timestamp), conf);
+  
HoodieTestCommitMetadataGenerator.createCommitFileWithMetadata(tablePath, 
timestamp, conf);
+}
+  }
+
+  /**
+   * Test case for cleans run.
+   */
+  @Test
+  public void testRunClean() throws IOException {
+// First, there should none of clean instant.
+assertEquals(0, 
metaClient.getActiveTimeline().reload().getCleanerTimeline().getInstants().count());
+
+// Check properties file exists.
+assertNotNull("Not found properties file", propsFilePath);
+
+// Create partition metadata
+new File(tablePath + File.separator + 
HoodieTestCommitMetadataGenerator.DEFAULT_FIRST_PARTITION_PATH
++ File.separator + 
HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE).createNewFile();
+new File(tablePath + File.separator + 
HoodieTestCommitMetadataGenerator.DEFAULT_SECOND_PARTITION_PATH
++ File.separator + 
HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE).createNewFile();
+
+CommandResult cr = getShell().executeCommand("cleans run --sparkMaster 
local --propsFilePath " + propsFilePath.toString());
+assertTrue(cr.isSuccess());
+
+// After run clean, there should have 1 clean instant
+assertEquals("Loaded 1 clean and the count should match",1,
 
 Review comment:
   add a blank before the number `1`?



[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test for CleansCommand

2020-04-13 Thread GitBox
yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test 
for CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#discussion_r407839213
 
 

 ##
 File path: .travis.yml
 ##
 @@ -39,3 +39,9 @@ script:
   - scripts/run_travis_tests.sh $TEST_SUITE
 after_success:
   - bash <(curl -s https://codecov.io/bash)
+before_script:
+  - echo "=[ Download spark]="
+  - wget 
http://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz 
-O /tmp/spark-2.4.4.tgz
 
 Review comment:
   Can we extract the spark and Hadoop version number into a variable for 
upgrading purposes?




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test for CleansCommand

2020-04-13 Thread GitBox
yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test 
for CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#discussion_r407837053
 
 

 ##
 File path: hudi-cli/pom.xml
 ##
 @@ -122,6 +122,31 @@
   false
 
   
+  <plugin>
+    <groupId>org.apache.maven.plugins</groupId>
+    <artifactId>maven-failsafe-plugin</artifactId>
+    <version>2.22.0</version>
+    <configuration>
+      <includes>
+        <include>**/ITT*.java</include>
+      </includes>
+    </configuration>
+    <executions>
+      <execution>
+        <phase>integration-test</phase>
 
 Review comment:
   Are you sure we must introduce this `phase` in `hudi-cli` module?




[GitHub] [incubator-hudi] yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test for CleansCommand

2020-04-13 Thread GitBox
yanghua commented on a change in pull request #1449: [HUDI-698]Add unit test 
for CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#discussion_r407834527
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestCleansCommand.java
 ##
 @@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.TableHeader;
+import org.apache.hudi.cli.common.HoodieTestCommitMetadataGenerator;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+
+import org.junit.Before;
+import org.junit.Test;
+import org.springframework.shell.core.CommandResult;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * Test Cases for {@link CleansCommand}.
+ */
+public class TestCleansCommand extends AbstractShellIntegrationTest {
+
+  private String tablePath;
+  private String propsFilePath;
+
+  @Before
+  public void init() throws IOException {
+HoodieCLI.conf = jsc.hadoopConfiguration();
+
+String tableName = "test_table";
+tablePath = basePath + File.separator + tableName;
+propsFilePath = 
TestCleansCommand.class.getClassLoader().getResource("clean.properties").getPath();
 
 Review comment:
   Yes, if it's `null`, the test would fail. However, it's a good habit to 
check it.




[GitHub] [incubator-hudi] lamber-ken commented on issue #1509: [HUDI-525] lack of insert info in delta_commit inflight

2020-04-13 Thread GitBox
lamber-ken commented on issue #1509: [HUDI-525] lack of insert info in 
delta_commit inflight
URL: https://github.com/apache/incubator-hudi/pull/1509#issuecomment-613186819
 
 
   LGTM, can you please rebase and resolve the conflicts?




[GitHub] [incubator-hudi] hmatu commented on a change in pull request #1511: [HUDI-789]Adjust logic of upsert in HDFSParquetImporter

2020-04-13 Thread GitBox
hmatu commented on a change in pull request #1511: [HUDI-789]Adjust logic of 
upsert in HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/1511#discussion_r407815629
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HDFSParquetImporter.java
 ##
 @@ -100,6 +100,10 @@ public static void main(String[] args) {
 
   }
 
+  private boolean isUpsert() {
 
 Review comment:
   use boolean `flag` ?




[jira] [Commented] (HUDI-69) Support realtime view in Spark datasource #136

2020-04-13 Thread Yanjia Gary Li (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17082773#comment-17082773
 ] 

Yanjia Gary Li commented on HUDI-69:


After a closer look, I think Spark datasource support for the realtime table needs:
 * Refactoring HoodieRealtimeFormat and its related pieces (file split, record 
reader) to decouple the Hudi logic from MapredParquetInputFormat. I think we can 
maintain the Hudi file split and path filtering in a central place so that they 
can be adopted by different query engines. With bootstrap support, the file 
format maintenance could become more complicated, so I think this is essential.
 * Implementing an extension of Spark's ParquetInputFormat, or a custom data 
source reader, to handle the merge. 
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala]
 * Using DataSource V2 as the default data source.

Please let me know what you guys think. 

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> https://github.com/uber/hudi/issues/136





[GitHub] [incubator-hudi] garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-13 Thread GitBox
garyli1019 commented on issue #1486: [HUDI-759] Integrate checkpoint provider 
with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#issuecomment-613162710
 
 
   @vinothchandar Thanks for all the feedback! Very helpful! 
   Comments addressed. Please take a look.




[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-13 Thread GitBox
garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r407793452
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/checkpointing/InitialCheckPointProvider.java
 ##
 @@ -18,14 +18,38 @@
 
 package org.apache.hudi.utilities.checkpointing;
 
+import org.apache.hudi.common.config.TypedProperties;
 import org.apache.hudi.exception.HoodieException;
 
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+
 /**
  * Provide the initial checkpoint for delta streamer.
  */
-public interface InitialCheckPointProvider {
+public abstract class InitialCheckPointProvider {
+  protected transient Path path;
+  protected transient FileSystem fs;
+  protected transient TypedProperties props;
+
+  static class Config {
+private static String CHECKPOINT_PROVIDER_PATH_PROP = 
"hoodie.deltastreamer.checkpoint.provider.path";
+  }
+
+  public InitialCheckPointProvider(TypedProperties props) {
+this.props = props;
+this.path = new 
Path(props.getString(Config.CHECKPOINT_PROVIDER_PATH_PROP));
+  }
+
+  /**
+   * Initialize the class with the current filesystem.
+   *
+   * @param fileSystem
+   */
+  public abstract void init(FileSystem fileSystem);
 
 Review comment:
   Good point. Now it's not `hiveConf` any more :)




[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-13 Thread GitBox
garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r407793139
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -90,35 +90,33 @@
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc) throws 
IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()));
+jssc.hadoopConfiguration(), null);
   }
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, 
TypedProperties props) throws IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()), props);
+jssc.hadoopConfiguration(), props);
   }
 
-  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf,
- TypedProperties properties) throws IOException {
-this.cfg = cfg;
-this.deltaSyncService = new DeltaSyncService(cfg, jssc, fs, hiveConf, 
properties);
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
Configuration hiveConf) throws IOException {
+this(cfg, jssc, fs, hiveConf, null);
   }
 
-  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf) throws IOException {
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
Configuration hiveConf,
+ TypedProperties properties) throws IOException {
+if (cfg.initialCheckpointProvider != null && cfg.bootstrapFromPath != null 
&& cfg.checkpoint == null) {
+  InitialCheckPointProvider checkPointProvider =
+  
UtilHelpers.createInitialCheckpointProvider(cfg.initialCheckpointProvider, new 
Path(cfg.bootstrapFromPath), fs);
+  cfg.checkpoint = checkPointProvider.getCheckpoint();
 
 Review comment:
   I believe the initial checkpoint provider should only be used once, when the 
user wants to switch from one source to another. After that, the delta streamer 
should be able to get the checkpoint from the previous commit. We can improve 
this once the bootstrap is ready. At this point, I am not sure how to put 
everything together if we want a single step that handles everything.




[GitHub] [incubator-hudi] garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate checkpoint provider with delta streamer

2020-04-13 Thread GitBox
garyli1019 commented on a change in pull request #1486: [HUDI-759] Integrate 
checkpoint provider with delta streamer
URL: https://github.com/apache/incubator-hudi/pull/1486#discussion_r407791400
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java
 ##
 @@ -90,35 +90,33 @@
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc) throws 
IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()));
+jssc.hadoopConfiguration(), null);
   }
 
   public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, 
TypedProperties props) throws IOException {
 this(cfg, jssc, FSUtils.getFs(cfg.targetBasePath, 
jssc.hadoopConfiguration()),
-getDefaultHiveConf(jssc.hadoopConfiguration()), props);
+jssc.hadoopConfiguration(), props);
   }
 
-  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf,
- TypedProperties properties) throws IOException {
-this.cfg = cfg;
-this.deltaSyncService = new DeltaSyncService(cfg, jssc, fs, hiveConf, 
properties);
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
Configuration hiveConf) throws IOException {
+this(cfg, jssc, fs, hiveConf, null);
   }
 
-  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
HiveConf hiveConf) throws IOException {
+  public HoodieDeltaStreamer(Config cfg, JavaSparkContext jssc, FileSystem fs, 
Configuration hiveConf,
+ TypedProperties properties) throws IOException {
+if (cfg.initialCheckpointProvider != null && cfg.bootstrapFromPath != null 
&& cfg.checkpoint == null) {
 
 Review comment:
   yes https://issues.apache.org/jira/browse/HUDI-791




[jira] [Updated] (HUDI-791) Replace null by Option in Delta Streamer

2020-04-13 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-791:

Issue Type: Improvement  (was: New Feature)

> Replace null by Option in Delta Streamer
> 
>
> Key: HUDI-791
> URL: https://issues.apache.org/jira/browse/HUDI-791
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: DeltaStreamer, newbie
>Reporter: Yanjia Gary Li
>Priority: Minor
>
> There are a lot of nulls in Delta Streamer. It would be great if we could 
> replace those nulls with Option. 





[jira] [Created] (HUDI-791) Replace null by Option in Delta Streamer

2020-04-13 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-791:
---

 Summary: Replace null by Option in Delta Streamer
 Key: HUDI-791
 URL: https://issues.apache.org/jira/browse/HUDI-791
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
  Components: DeltaStreamer, newbie
Reporter: Yanjia Gary Li


There are a lot of nulls in Delta Streamer. It would be great if we could 
replace those nulls with Option. 
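
As a rough illustration of the pattern (hypothetical names; `java.util.Optional` 
is used here purely for illustration, while the Hudi codebase has its own 
`Option` utility):

```java
import java.util.Optional;

// Hypothetical sketch of replacing a nullable field with an Option-style type.
class CheckpointHolder {

  // Nullable style: every caller has to remember the null check.
  private String checkpoint; // may be null

  // Option style: the possible absence is explicit in the type.
  private Optional<String> checkpointOpt = Optional.empty();

  String resumeFrom() {
    // The "no checkpoint" case is handled explicitly instead of via a null check.
    return checkpointOpt.orElse("earliest");
  }
}
```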





[GitHub] [incubator-hudi] kwondw commented on a change in pull request #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-13 Thread GitBox
kwondw commented on a change in pull request #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#discussion_r407780799
 
 

 ##
 File path: 
hudi-spark/src/test/java/org/apache/hudi/table/NoOpBulkInsertPartitioner.java
 ##
 @@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.spark.api.java.JavaRDD;
+
+public class NoOpBulkInsertPartitioner
 
 Review comment:
   I could use a lambda to implement it; however, to test it, the class has to 
be loadable (created) via the reflection API (```Class.forName(clazzName)```) 
from its class name, and I wasn't sure how a class loader could load a lambda 
class by name.
   
   From my quick test, it doesn't work. Do you have any suggestions?
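
   A small, hypothetical sketch of the underlying issue (class names made up): a 
named class can be instantiated from its string name via reflection, but a 
lambda only gets a synthetic class name assigned at runtime, so there is no 
stable name to put in configuration.

   ```java
   import java.util.function.Supplier;

   // Hypothetical demo: named classes load by name; lambdas do not have a stable name.
   public class LambdaReflectionDemo {

     public static class NamedSupplier implements Supplier<String> {
       @Override
       public String get() {
         return "named";
       }
     }

     public static void main(String[] args) throws Exception {
       // A named class works: its binary name is known at compile time.
       Supplier<?> named = (Supplier<?>)
           Class.forName("LambdaReflectionDemo$NamedSupplier").getDeclaredConstructor().newInstance();
       System.out.println(named.get());

       // A lambda gets a synthetic runtime class name, e.g. "LambdaReflectionDemo$$Lambda$1/0x...",
       // which cannot be referenced from a config value.
       Supplier<String> lambda = () -> "lambda";
       System.out.println(lambda.getClass().getName());
     }
   }
   ```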
   
   




[GitHub] [incubator-hudi] kwondw commented on a change in pull request #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-13 Thread GitBox
kwondw commented on a change in pull request #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#discussion_r40827
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
 ##
 @@ -55,6 +55,7 @@
   private static final String DEFAULT_PARALLELISM = "1500";
   private static final String INSERT_PARALLELISM = 
"hoodie.insert.shuffle.parallelism";
   private static final String BULKINSERT_PARALLELISM = 
"hoodie.bulkinsert.shuffle.parallelism";
+  private static final String BULKINSERT_USER_DEFINED_PARTITIONER_CLASS = 
"hoodie.bulkinsert.user_defined.partitioner.class";
 
 Review comment:
   Sure, I will change it.




[GitHub] [incubator-hudi] kwondw commented on a change in pull request #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-13 Thread GitBox
kwondw commented on a change in pull request #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#discussion_r40801
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -152,6 +154,24 @@ public static KeyGenerator 
createKeyGenerator(TypedProperties props) throws IOEx
 }
   }
 
+  /**
+   * Create a UserDefinedBulkInsertPartitioner class via reflection,
+   * 
+   * if the class name of UserDefinedBulkInsertPartitioner is configured 
through the HoodieWriteConfig.
+   * @see HoodieWriteConfig#getUserDefinedBulkInsertPartitionerClass()
+   */
+  private static Option 
createUserDefinedBulkInsertPartitioner(HoodieWriteConfig config)
+  throws IOException {
+String bulkInsertPartitionerClass = 
config.getUserDefinedBulkInsertPartitionerClass();
+try {
+  return bulkInsertPartitionerClass == null || 
bulkInsertPartitionerClass.isEmpty()
 
 Review comment:
   Please let me update it.




[GitHub] [incubator-hudi] kwondw commented on a change in pull request #1500: [HUDI-772] Make UserDefinedBulkInsertPartitioner configurable for DataSource

2020-04-13 Thread GitBox
kwondw commented on a change in pull request #1500: [HUDI-772] Make 
UserDefinedBulkInsertPartitioner configurable for DataSource
URL: https://github.com/apache/incubator-hudi/pull/1500#discussion_r40746
 
 

 ##
 File path: hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java
 ##
 @@ -152,6 +154,24 @@ public static KeyGenerator 
createKeyGenerator(TypedProperties props) throws IOEx
 }
   }
 
+  /**
+   * Create a UserDefinedBulkInsertPartitioner class via reflection,
+   * 
+   * if the class name of UserDefinedBulkInsertPartitioner is configured 
through the HoodieWriteConfig.
+   * @see HoodieWriteConfig#getUserDefinedBulkInsertPartitionerClass()
+   */
+  private static Option 
createUserDefinedBulkInsertPartitioner(HoodieWriteConfig config)
+  throws IOException {
+String bulkInsertPartitionerClass = 
config.getUserDefinedBulkInsertPartitionerClass();
+try {
+  return bulkInsertPartitionerClass == null || 
bulkInsertPartitionerClass.isEmpty()
+  ? Option.empty() :
+  Option.of((UserDefinedBulkInsertPartitioner) 
ReflectionUtils.loadClass(bulkInsertPartitionerClass));
+} catch (Throwable e) {
+  throw new IOException("Could not create UserDefinedBulkInsertPartitioner 
class " + bulkInsertPartitionerClass, e);
 
 Review comment:
   While I was reading the code of 
[DataSourceUtils](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L58),
 I had the impression that it throws IOException for user configuration 
issues, while Hoodie*Exception is meant for Hudi-internal errors.
   
   For example, a failure to load the 
```hoodie.datasource.write.keygenerator.class``` class is thrown as IOException 
from 
[createKeyGenerator](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L151),
 and likewise ```hoodie.datasource.write.payload.class``` from 
[createPayload](https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L164),
 so the question is: when a user misconfigures a class name, should that be 
treated as a "hoodie" error or as a general configuration error (a user mistake)?
   I would prefer a dedicated user-configuration exception, but I wasn't sure of 
the intention behind IOException in the other configurations, so I followed the 
existing pattern.
   
   Please let me know if you think this should be a HoodieException; I can update 
it.
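   
   For illustration, a minimal sketch of the HoodieException variant being 
discussed (assuming HoodieException's `(String, Throwable)` constructor; the 
lookup itself mirrors the diff above):
   
   ```java
   // Sketch only: same reflection-based lookup as the diff above, but a
   // misconfigured class name surfaces as a HoodieException instead of IOException.
   private static Option<UserDefinedBulkInsertPartitioner> createUserDefinedBulkInsertPartitioner(
       HoodieWriteConfig config) {
     String partitionerClass = config.getUserDefinedBulkInsertPartitionerClass();
     try {
       return partitionerClass == null || partitionerClass.isEmpty()
           ? Option.empty()
           : Option.of((UserDefinedBulkInsertPartitioner) ReflectionUtils.loadClass(partitionerClass));
     } catch (Throwable e) {
       throw new HoodieException(
           "Could not create UserDefinedBulkInsertPartitioner class " + partitionerClass, e);
     }
   }
   ```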


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on issue #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-13 Thread GitBox
nsivabalan commented on issue #1402: [WIP][HUDI-407] Adding Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#issuecomment-613146147
 
 
   @lamber-ken : I have removed the job-description naming for now. Will add it 
in a different patch. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-13 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r407776044
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.utils.SparkConfigUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.PairFlatMapFunction;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+import static 
org.apache.hudi.index.HoodieIndexUtils.loadLatestDataFilesForAllPartitions;
+
+/**
+ * A simple index which reads interested fields(record key and partition path) 
from base files and
+ * joins with incoming records to find the tagged location.
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieIndex {
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  @Override
+  public JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
+ HoodieTable hoodieTable) {
+return writeStatusRDD;
+  }
+
+  @Override
+  public boolean rollbackCommit(String commitTime) {
+return true;
+  }
+
+  @Override
+  public boolean isGlobal() {
+return false;
+  }
+
+  @Override
+  public boolean canIndexLogFiles() {
+return false;
+  }
+
+  @Override
+  public boolean isImplicitWithStorage() {
+return true;
+  }
+
+  @Override
+  public JavaRDD> tagLocation(JavaRDD> 
recordRDD, JavaSparkContext jsc,
+  HoodieTable hoodieTable) {
+if (config.getSimpleIndexUseCaching()) {
+  
recordRDD.persist(SparkConfigUtils.getBloomIndexInputStorageLevel(config.getProps()));
+}
+
+JavaPairRDD incomingRecords = 
recordRDD.mapToPair(entry -> new Tuple2<>(entry.getKey(), entry));
+
+JavaPairRDD existingRecords = 
fetchRecordLocations(incomingRecords.keys(), jsc, hoodieTable);
+
+jsc.setJobGroup(this.getClass().getSimpleName(), "Tagging incoming records 
with record location");
+JavaRDD>>> untaggedRecordsRDD = 
incomingRecords.leftOuterJoin(existingRecords)
+.map(entry -> new Tuple2(entry._1, new Tuple2(entry._2._1, 
Option.ofNullable(entry._2._2.orNull();
+
+JavaRDD> taggedRecordRDD = untaggedRecordsRDD.map(entry -> 
getTaggedRecord(entry._2._1, entry._2._2));
+
+if (config.getSimpleIndexUseCaching()) {
+  recordRDD.unpersist(); // unpersist the input Record RDD
+}
+return taggedRecordRDD;
+  }
+
+  /**
+   * Returns an RDD mapping each HoodieKey with a partitionPath/fileID which 
contains it. Option.Empty if the key is not.
+   * found.
+   *
+   * @param hoodieKeys  keys to lookup
+   * @param jsc spark context
+   * @param hoodieTable hoodie table object
+   */
+  @Override
+  public JavaPairRDD>> 
fetchRecordLocation(JavaRDD hoodieKeys,
+   
   JavaSparkContext jsc, HoodieTable hoodieTable) {
+JavaPairRDD> incomingRecords =
+hoodieKeys.mapToPair(entry -> new Tuple2<>(entry, Option.empty()));
+
+JavaPairRDD existingRecords = 
fetchRecordLocations(hoodieKeys, jsc, hoodieTable);
+
+jsc.setJobGroup(this.getClass().getSimpleName(), "Joining existing records 
with incoming 

[GitHub] [incubator-hudi] nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding Simple Index

2020-04-13 Thread GitBox
nsivabalan commented on a change in pull request #1402: [WIP][HUDI-407] Adding 
Simple Index
URL: https://github.com/apache/incubator-hudi/pull/1402#discussion_r407772487
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/index/HoodieSimpleIndex.java
 ##
 @@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.utils.SparkConfigUtils;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.hadoop.fs.Path;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.api.java.function.PairFlatMapFunction;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+import scala.Tuple2;
+
+import static java.util.stream.Collectors.toList;
+import static 
org.apache.hudi.index.HoodieIndexUtils.loadLatestDataFilesForAllPartitions;
+
+/**
+ * A simple index which reads interested fields(record key and partition path) 
from base files and
+ * joins with incoming records to find the tagged location.
+ *
+ * @param 
+ */
+public class HoodieSimpleIndex extends 
HoodieIndex {
+
+  public HoodieSimpleIndex(HoodieWriteConfig config) {
+super(config);
+  }
+
+  @Override
+  public JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
+ HoodieTable hoodieTable) {
+return writeStatusRDD;
+  }
+
+  @Override
+  public boolean rollbackCommit(String commitTime) {
+return true;
+  }
+
+  @Override
+  public boolean isGlobal() {
+return false;
+  }
+
+  @Override
+  public boolean canIndexLogFiles() {
+return false;
+  }
+
+  @Override
+  public boolean isImplicitWithStorage() {
+return true;
+  }
+
+  @Override
+  public JavaRDD> tagLocation(JavaRDD> 
recordRDD, JavaSparkContext jsc,
+  HoodieTable hoodieTable) {
+if (config.getSimpleIndexUseCaching()) {
+  
recordRDD.persist(SparkConfigUtils.getBloomIndexInputStorageLevel(config.getProps()));
+}
+
+JavaPairRDD incomingRecords = 
recordRDD.mapToPair(entry -> new Tuple2<>(entry.getKey(), entry));
+
+JavaPairRDD existingRecords = 
fetchRecordLocations(incomingRecords.keys(), jsc, hoodieTable);
+
+jsc.setJobGroup(this.getClass().getSimpleName(), "Tagging incoming records 
with record location");
+JavaRDD>>> untaggedRecordsRDD = 
incomingRecords.leftOuterJoin(existingRecords)
+.map(entry -> new Tuple2(entry._1, new Tuple2(entry._2._1, 
Option.ofNullable(entry._2._2.orNull();
+
+JavaRDD> taggedRecordRDD = untaggedRecordsRDD.map(entry -> 
getTaggedRecord(entry._2._1, entry._2._2));
+
+if (config.getSimpleIndexUseCaching()) {
+  recordRDD.unpersist(); // unpersist the input Record RDD
+}
+return taggedRecordRDD;
+  }
+
+  /**
+   * Returns an RDD mapping each HoodieKey with a partitionPath/fileID which 
contains it. Option.Empty if the key is not.
+   * found.
+   *
+   * @param hoodieKeys  keys to lookup
+   * @param jsc spark context
+   * @param hoodieTable hoodie table object
+   */
+  @Override
+  public JavaPairRDD>> 
fetchRecordLocation(JavaRDD hoodieKeys,
+   
   JavaSparkContext jsc, HoodieTable hoodieTable) {
+JavaPairRDD> incomingRecords =
+hoodieKeys.mapToPair(entry -> new Tuple2<>(entry, Option.empty()));
+
+JavaPairRDD existingRecords = 
fetchRecordLocations(hoodieKeys, jsc, hoodieTable);
+
+jsc.setJobGroup(this.getClass().getSimpleName(), "Joining existing records 
with incoming 

[jira] [Updated] (HUDI-763) Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-763:

Labels: pull-request-available  (was: )

> Add hoodie.table.base.file.format option to hoodie.properties file
> --
>
> Key: HUDI-763
> URL: https://issues.apache.org/jira/browse/HUDI-763
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>  Labels: pull-request-available
>
> Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] lamber-ken opened a new pull request #1512: [HUDI-763] Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-13 Thread GitBox
lamber-ken opened a new pull request #1512: [HUDI-763] Add 
hoodie.table.base.file.format option to hoodie.properties file
URL: https://github.com/apache/incubator-hudi/pull/1512
 
 
   ## What is the purpose of the pull request
   
   In order to support multiple storage formats, store 
`hoodie.table.base.file.format` to hoodie.properties.
   
   - `hoodie.table.base.file.format`: default `PARQUET`
   
   ## Brief change log
   
 - Store `hoodie.table.base.file.format` to hoodie.properties.
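   
   Purely for illustration, a minimal sketch of persisting the new key (the real 
change presumably goes through HoodieTableConfig; `fs` and `metaPath` below are 
assumed to point at the table's metadata folder):
   
   ```java
   // Illustrative only: write the new key into hoodie.properties with plain
   // java.util.Properties; the default value is taken from this PR's description.
   Properties props = new Properties();
   props.setProperty("hoodie.table.base.file.format", "PARQUET");
   try (OutputStream out = fs.create(new Path(metaPath, "hoodie.properties"), true)) {
     props.store(out, "Table properties");
   }
   ```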
   
   ## Verify this pull request
   
   This pull request is a minimal rework.
   
   ## Committer checklist
   
- [X] Has a corresponding JIRA in PR title & commit

- [X] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-763) Add hoodie.table.base.file.format option to hoodie.properties file

2020-04-13 Thread lamber-ken (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

lamber-ken updated HUDI-763:

Summary: Add hoodie.table.base.file.format option to hoodie.properties file 
 (was: Add storage type option to hoodie.properties file)

> Add hoodie.table.base.file.format option to hoodie.properties file
> --
>
> Key: HUDI-763
> URL: https://issues.apache.org/jira/browse/HUDI-763
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: Storage Management
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Major
>
> Add an option like "hoodie.table.storage.type=ORC" to hoodie.properties file



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] prashantwason commented on issue #1504: [HUDI-780] Migrate test cases to Junit 5

2020-04-13 Thread GitBox
prashantwason commented on issue #1504: [HUDI-780] Migrate test cases to Junit 5
URL: https://github.com/apache/incubator-hudi/pull/1504#issuecomment-613071292
 
 
   Everything in the commit looks good, so there is no reason coverage should be 
low. I checked the coverage report and it shows about 1000 lines missing coverage. 
   
   Since codecov has been flaky in the past (during merges, updates, etc.), I 
suggest pushing the commit again so codecov re-runs (maybe from the head of 
master). If it again reports low coverage, we can investigate.  
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] prashantwason commented on issue #1476: [HUDI-757] Added hudi-cli command to export metadata of Instants.

2020-04-13 Thread GitBox
prashantwason commented on issue #1476: [HUDI-757] Added hudi-cli command to 
export metadata of Instants.
URL: https://github.com/apache/incubator-hudi/pull/1476#issuecomment-613049344
 
 
   Fixed the merge conflict and pushed again. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #1484: [HUDI-316] : Hbase qps repartition writestatus

2020-04-13 Thread GitBox
n3nash commented on issue #1484: [HUDI-316] : Hbase qps repartition writestatus
URL: https://github.com/apache/incubator-hudi/pull/1484#issuecomment-613027382
 
 
   @satishkotha can you please help review this?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-784) CorruptedLogFileException sometimes happens on GCS

2020-04-13 Thread Alexander Filipchik (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Filipchik updated HUDI-784:
-
Fix Version/s: 0.5.0

> CorruptedLogFileException sometimes happens on GCS
> --
>
> Key: HUDI-784
> URL: https://issues.apache.org/jira/browse/HUDI-784
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.5.0
>
>
> 768726 [Executor task launch worker-2] ERROR 
> org.apache.hudi.common.table.log.AbstractHoodieLogRecordScanner  - Got 
> exception when reading log file
> org.apache.hudi.exception.CorruptedLogFileException: HoodieLogFile{pathStr='
> [gs://.log.|gs://1_20200219014757.log.2]
> ', fileLen=0}could not be read. Did not find the magic bytes at the start of 
> the block
> at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.readMagic(HoodieLogFileReader.java:313)
>   at 
> org.apache.hudi.common.table.log.HoodieLogFileReader.hasNext(HoodieLogFileReader.java:295)
>   at 
> org.apache.hudi.common.table.log.HoodieLogFormatReader.hasNext(HoodieLogFormatReader.java:103)
>  
> I did extensive debugging and am still unclear on why it is happening. It might 
> be an issue with the GCS libraries themselves. The fix that is working:
>  
> In HoodieLogFileReader, made 
> {code:java}
> private final byte[] magicBuffer = new byte[6];
> {code}
> non-static. I'm not sure why it was static in the first place, as it 
> invites a race.
> Also in HoodieLogFileReader, added
> {code:java}
> fsDataInputStream.seek(0);
> {code}
> right after stream creation in the constructor.
>  
>  
>  
>  
>  
>  
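
(For illustration only, a minimal sketch of the workaround described above; the 
field and constructor shapes are assumed, not the actual HoodieLogFileReader 
signature.)

{code:java}
// Sketch of the described workaround: a per-instance (non-static) magic buffer
// plus an explicit seek(0) right after opening the stream. Names are illustrative.
public class LogFileReaderSketch {
  private final byte[] magicBuffer = new byte[6]; // non-static: no sharing across readers

  private final org.apache.hadoop.fs.FSDataInputStream inputStream;

  public LogFileReaderSketch(org.apache.hadoop.fs.FileSystem fs,
                             org.apache.hadoop.fs.Path logFile) throws java.io.IOException {
    this.inputStream = fs.open(logFile);
    this.inputStream.seek(0); // some object-store streams may not begin at offset 0
  }
}
{code}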



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] hddong commented on issue #1449: [HUDI-698]Add unit test for CleansCommand

2020-04-13 Thread GitBox
hddong commented on issue #1449: [HUDI-698]Add unit test for CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#issuecomment-612963742
 
 
   @yanghua Thanks for your review and suggestions; I have addressed them.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on a change in pull request #1449: [HUDI-698]Add unit test for CleansCommand

2020-04-13 Thread GitBox
hddong commented on a change in pull request #1449: [HUDI-698]Add unit test for 
CleansCommand
URL: https://github.com/apache/incubator-hudi/pull/1449#discussion_r407555357
 
 

 ##
 File path: 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestCleansCommand.java
 ##
 @@ -0,0 +1,166 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.cli.commands;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hudi.avro.model.HoodieCleanMetadata;
+import org.apache.hudi.cli.AbstractShellIntegrationTest;
+import org.apache.hudi.cli.HoodieCLI;
+import org.apache.hudi.cli.HoodiePrintHelper;
+import org.apache.hudi.cli.TableHeader;
+import org.apache.hudi.cli.common.HoodieTestCommitMetadataGenerator;
+import org.apache.hudi.common.model.HoodieCleaningPolicy;
+import org.apache.hudi.common.model.HoodiePartitionMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.table.timeline.TimelineMetadataUtils;
+
+import org.junit.Before;
+import org.junit.Test;
+import org.springframework.shell.core.CommandResult;
+
+import java.io.File;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * Test Cases for {@link CleansCommand}.
+ */
+public class TestCleansCommand extends AbstractShellIntegrationTest {
+
+  private String tablePath;
+  private String propsFilePath;
+
+  @Before
+  public void init() throws IOException {
+HoodieCLI.conf = jsc.hadoopConfiguration();
+
+String tableName = "test_table";
+tablePath = basePath + File.separator + tableName;
+propsFilePath = 
TestCleansCommand.class.getClassLoader().getResource("clean.properties").getPath();
 
 Review comment:
   > `getResource` may be nullable. Here, we may need to check null firstly.
   
   I have added a check in `Test`. But IMO the check is non-essential; the test 
will fail anyway if `getResource` is null.
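
   For reference, a minimal sketch of the explicit check (assuming a static 
import of org.junit.Assert.assertNotNull, consistent with the other JUnit 4 
asserts in this test):
   
   ```java
   // Fail fast with a clear message if the resource is missing from the test classpath.
   java.net.URL propsUrl = TestCleansCommand.class.getClassLoader().getResource("clean.properties");
   assertNotNull("clean.properties not found on test classpath", propsUrl);
   propsFilePath = propsUrl.getPath();
   ```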


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-790) Improve OverwriteWithLatestAvroPayload to consider value on disk as well before overwriting.

2020-04-13 Thread Bhavani Sudha (Jira)
Bhavani Sudha created HUDI-790:
--

 Summary: Improve OverwriteWithLatestAvroPayload to consider value 
on disk as well before overwriting.
 Key: HUDI-790
 URL: https://issues.apache.org/jira/browse/HUDI-790
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Writer Core
Reporter: Bhavani Sudha
Assignee: Bhavani Sudha






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2020-04-13 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-305:
--

Assignee: (was: Bhavani Sudha Saktheeswaran)

> Presto MOR "_rt" queries only reads base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Priority: Major
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> Our current understanding of this issue is that while the 
> HoodieParquetRealtimeInputFormat correctly generates the splits, the 
> RealtimeCompactedRecordReader record reader is not used, so it is not reading 
> the log file and only reads the base parquet file.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-305) Presto MOR "_rt" queries only reads base parquet file

2020-04-13 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reassigned HUDI-305:
--

Assignee: Bhavani Sudha

> Presto MOR "_rt" queries only reads base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Presto Integration
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Assignee: Bhavani Sudha
>Priority: Major
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> Our current understanding of this issue is that while the 
> HoodieParquetRealtimeInputFormat correctly generates the splits, the 
> RealtimeCompactedRecordReader record reader is not used, so it is not reading 
> the log file and only reads the base parquet file.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[incubator-hudi] branch master updated (17bf930 -> 661b0b3)

2020-04-13 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from 17bf930  [HUDI-770] Organize upsert/insert API implementation under a 
single package (#1495)
 add 661b0b3  [HUDI-761] Refactoring rollback and restore actions using the 
ActionExecutor abstraction (#1492)

No new revisions were added by this update.

Summary of changes:
 .../org/apache/hudi/cli/commands/SparkMain.java|   2 +-
 .../hudi/client/AbstractHoodieWriteClient.java | 118 +---
 .../org/apache/hudi/client/HoodieWriteClient.java  | 156 ++--
 ...kException.java => HoodieRestoreException.java} |   8 +-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  |  79 +---
 .../apache/hudi/table/HoodieMergeOnReadTable.java  | 198 ++---
 .../java/org/apache/hudi/table/HoodieTable.java|  39 ++--
 .../table/action/clean/CleanActionExecutor.java|   8 +-
 .../action/restore/BaseRestoreActionExecutor.java  | 111 
 .../restore/CopyOnWriteRestoreActionExecutor.java  |  57 ++
 .../restore/MergeOnReadRestoreActionExecutor.java  |  64 +++
 .../rollback/BaseRollbackActionExecutor.java   | 192 
 .../CopyOnWriteRollbackActionExecutor.java |  94 ++
 .../MergeOnReadRollbackActionExecutor.java}| 193 
 .../{ => action}/rollback/RollbackHelper.java  |   2 +-
 .../{ => action}/rollback/RollbackRequest.java |   2 +-
 .../org/apache/hudi/client/TestClientRollback.java |  16 +-
 .../java/org/apache/hudi/table/TestCleaner.java|   2 +-
 .../table/timeline/TimelineMetadataUtils.java  |  15 +-
 .../common/fs/inline/TestInLineFileSystem.java |   5 +-
 20 files changed, 680 insertions(+), 681 deletions(-)
 copy 
hudi-client/src/main/java/org/apache/hudi/exception/{HoodieRollbackException.java
 => HoodieRestoreException.java} (81%)
 create mode 100644 
hudi-client/src/main/java/org/apache/hudi/table/action/restore/BaseRestoreActionExecutor.java
 create mode 100644 
hudi-client/src/main/java/org/apache/hudi/table/action/restore/CopyOnWriteRestoreActionExecutor.java
 create mode 100644 
hudi-client/src/main/java/org/apache/hudi/table/action/restore/MergeOnReadRestoreActionExecutor.java
 create mode 100644 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/BaseRollbackActionExecutor.java
 create mode 100644 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/CopyOnWriteRollbackActionExecutor.java
 copy 
hudi-client/src/main/java/org/apache/hudi/table/{HoodieMergeOnReadTable.java => 
action/rollback/MergeOnReadRollbackActionExecutor.java} (59%)
 rename hudi-client/src/main/java/org/apache/hudi/table/{ => 
action}/rollback/RollbackHelper.java (99%)
 rename hudi-client/src/main/java/org/apache/hudi/table/{ => 
action}/rollback/RollbackRequest.java (98%)



[GitHub] [incubator-hudi] vinothchandar merged pull request #1492: [HUDI-761] Refactoring rollback and restore actions using the ActionExecutor abstraction

2020-04-13 Thread GitBox
vinothchandar merged pull request #1492: [HUDI-761] Refactoring rollback and 
restore actions using the ActionExecutor abstraction
URL: https://github.com/apache/incubator-hudi/pull/1492
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1511: [HUDI-789]Adjust logic of upsert in HDFSParquetImporter

2020-04-13 Thread GitBox
hddong commented on issue #1511: [HUDI-789]Adjust logic of upsert in 
HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/1511#issuecomment-612930401
 
 
   @hmatu Before this change, `upsert` was equivalent to `insert`, because the 
target path must not be present (it is deleted if present). After the change, we 
can do `upsert` on top of existing data.
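
   A rough sketch of the distinction (illustrative only, not the actual 
HDFSParquetImporter code; `command`, `fs`, and `targetPath` stand in for the 
importer's config):
   
   ```java
   // Illustrative: only a fresh insert wipes the target path; upsert keeps the
   // existing dataset and applies updates/inserts on top of it.
   if ("insert".equalsIgnoreCase(command)) {
     fs.delete(new Path(targetPath), true); // start from scratch
   }
   // for "upsert", existing data under targetPath is left in place
   ```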


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] dengziming commented on issue #1151: [WIP][HUDI-476] Add hudi-examples module

2020-04-13 Thread GitBox
dengziming commented on issue #1151: [WIP][HUDI-476] Add hudi-examples module
URL: https://github.com/apache/incubator-hudi/pull/1151#issuecomment-612881257
 
 
   Hello @vinothchandar, sorry for the late reply. I want to address some of 
your comments; here are the issues:
   1. I tried to make data prep part of the deltastreamer example itself and to 
provide sane defaults for input/output paths, but in the end my code turned out 
the same as `org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer#main`. It 
seems we don't need a separate deltastreamer example, because the 
HoodieDeltaStreamer class is already a complete deltastreamer example. Maybe we 
just need to add some simple examples rather than one very complete, unified 
example.
   2. I wrote a `run_hoodie_examples.sh` to run the examples and reuse the 
spark-bundle/utilities-bundle instead of building a fat jar, but the build 
process of hudi-utilities-bundle relocates `com.beust.jcommander.` to 
`org.apache.hudi.com.beust.jcommander.`. My example depends on 
`com.beust.jcommander.`, and the spark-shell failed, so should I also add a 
relocation to the pom.xml of hudi-examples?
   What do you think about these two problems?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hmatu commented on issue #1511: [HUDI-789]Adjust logic of upsert in HDFSParquetImporter

2020-04-13 Thread GitBox
hmatu commented on issue #1511: [HUDI-789]Adjust logic of upsert in 
HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/1511#issuecomment-612865257
 
 
   These changes don't make sense (`command` only supports `UPSERT`).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] hddong commented on issue #1174: [HUDI-96]: Implemented command line options instead of positional arguments for CLI commands

2020-04-13 Thread GitBox
hddong commented on issue #1174: [HUDI-96]: Implemented command line options 
instead of positional arguments for CLI commands
URL: https://github.com/apache/incubator-hudi/pull/1174#issuecomment-612798425
 
 
   @vinothchandar : as @pratyakshsharma said, it's not the same as this PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pratyakshsharma commented on issue #1174: [HUDI-96]: Implemented command line options instead of positional arguments for CLI commands

2020-04-13 Thread GitBox
pratyakshsharma commented on issue #1174: [HUDI-96]: Implemented command line 
options instead of positional arguments for CLI commands
URL: https://github.com/apache/incubator-hudi/pull/1174#issuecomment-612790506
 
 
   > @pratyakshsharma @n3nash this has been around for sometime.. does @hddong 
also have a PR for the same JIRA?
   
   @hddong had a PR to fix specifying sparkMaster and sparkMemory in one place. 
It involved some code refactoring, but it did not fix positional arguments. Here 
it is: https://github.com/apache/incubator-hudi/pull/1452
   
   Yesterday I rebased the branch with master and integration tests are 
breaking. I am working on fixing them. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (HUDI-789) Adjust logic of upsert in HDFSParquetImporter

2020-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-789:

Labels: pull-request-available  (was: )

> Adjust logic of upsert in HDFSParquetImporter
> -
>
> Key: HUDI-789
> URL: https://issues.apache.org/jira/browse/HUDI-789
> Project: Apache Hudi (incubating)
>  Issue Type: Bug
>  Components: Utilities
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
>
> In HDFSParquetImporter, upsert is equivalent to insert (remove old metadata, 
> then insert). But upsert means update and insert on old data. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] hddong opened a new pull request #1511: [HUDI-789]Adjust logic of upsert in HDFSParquetImporter

2020-04-13 Thread GitBox
hddong opened a new pull request #1511: [HUDI-789]Adjust logic of upsert in 
HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/1511
 
 
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *In `HDFSParquetImporter`, `upsert` is equivalent to `insert` (remove old 
metadata, then create and insert data). But `upsert` means update and insert on 
old data. *
   
   ## Brief change log
   
   *(for example:)*
 - *Adjust logic of upsert in HDFSParquetImporter*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-789) Adjust logic of upsert in HDFSParquetImporter

2020-04-13 Thread hong dongdong (Jira)
hong dongdong created HUDI-789:
--

 Summary: Adjust logic of upsert in HDFSParquetImporter
 Key: HUDI-789
 URL: https://issues.apache.org/jira/browse/HUDI-789
 Project: Apache Hudi (incubating)
  Issue Type: Bug
  Components: Utilities
Reporter: hong dongdong
Assignee: hong dongdong


In HDFSParquetImporter, upsert is equivalent to insert (remove old metadata, 
then insert). But upsert means update and insert on old data. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #1492: [HUDI-761] Refactoring rollback and restore actions using the ActionExecutor abstraction

2020-04-13 Thread GitBox
vinothchandar commented on issue #1492: [HUDI-761] Refactoring rollback and 
restore actions using the ActionExecutor abstraction
URL: https://github.com/apache/incubator-hudi/pull/1492#issuecomment-612778656
 
 
   Rebased again.. Will wait for CI to pass and merge. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #1492: [HUDI-761] Refactoring rollback and restore actions using the ActionExecutor abstraction

2020-04-13 Thread GitBox
vinothchandar commented on a change in pull request #1492: [HUDI-761] 
Refactoring rollback and restore actions using the ActionExecutor abstraction
URL: https://github.com/apache/incubator-hudi/pull/1492#discussion_r407346793
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/CopyOnWriteRollbackActionExecutor.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.rollback;
+
+import org.apache.hudi.common.HoodieRollbackStat;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class CopyOnWriteRollbackActionExecutor extends 
BaseRollbackActionExecutor {
+
+  private static final Logger LOG = 
LogManager.getLogger(CopyOnWriteRollbackActionExecutor.class);
+
+  public CopyOnWriteRollbackActionExecutor(JavaSparkContext jsc,
+   HoodieWriteConfig config,
+   HoodieTable table,
+   String instantTime,
+   HoodieInstant commitInstant,
+   boolean deleteInstants) {
+super(jsc, config, table, instantTime, commitInstant, deleteInstants);
+  }
+
+  public CopyOnWriteRollbackActionExecutor(JavaSparkContext jsc,
+   HoodieWriteConfig config,
+   HoodieTable table,
+   String instantTime,
+   HoodieInstant commitInstant,
+   boolean deleteInstants,
+   boolean skipTimelinePublish) {
+super(jsc, config, table, instantTime, commitInstant, deleteInstants, 
skipTimelinePublish);
+  }
+
+  @Override
+  protected List executeRollback() throws IOException {
+long startTime = System.currentTimeMillis();
+List stats = new ArrayList<>();
+HoodieActiveTimeline activeTimeline = table.getActiveTimeline();
+HoodieInstant resolvedInstant = instantToRollback;
+
+if (instantToRollback.isCompleted()) {
+  LOG.info("Unpublishing instant " + instantToRollback);
+  resolvedInstant = activeTimeline.revertToInflight(instantToRollback);
+}
+
+// For Requested State (like failure during index lookup), there is 
nothing to do rollback other than
+// deleting the timeline file
+if (!resolvedInstant.isRequested()) {
+  // delete all the data files for this commit
+  LOG.info("Clean out all parquet files generated for commit: " + 
resolvedInstant);
+  List rollbackRequests = 
generateRollbackRequests(resolvedInstant);
+
+  //TODO: We need to persist this as rollback workload and use it in case 
of partial failures
+  stats = new RollbackHelper(table.getMetaClient(), 
config).performRollback(jsc, resolvedInstant, rollbackRequests);
+}
+// Delete Inflight instant if enabled
+deleteInflightAndRequestedInstant(deleteInstants, activeTimeline, 
resolvedInstant);
+LOG.info("Time(in ms) taken to finish rollback " + 
(System.currentTimeMillis() - startTime));
+return stats;
+  }
+
+  private List generateRollbackRequests(HoodieInstant 
instantToRollback)
+  throws IOException {
+return FSUtils.getAllPartitionPaths(table.getMetaClient().getFs(), 
table.getMetaClient().getBasePath(),
+config.shouldAssumeDatePartitioning()).stream()
+.map(partitionPath -> 
RollbackRequest.createRollbackRequestWithDeleteDataAndLogFilesAction(partitionPath,
 instantToRollback))
 
 Review comment:
   it is.. 

[GitHub] [incubator-hudi] bvaradar commented on a change in pull request #1492: [HUDI-761] Refactoring rollback and restore actions using the ActionExecutor abstraction

2020-04-13 Thread GitBox
bvaradar commented on a change in pull request #1492: [HUDI-761] Refactoring 
rollback and restore actions using the ActionExecutor abstraction
URL: https://github.com/apache/incubator-hudi/pull/1492#discussion_r407342052
 
 

 ##
 File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/CopyOnWriteRollbackActionExecutor.java
 ##
 @@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.rollback;
+
+import org.apache.hudi.common.HoodieRollbackStat;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+import java.util.stream.Collectors;
+
+public class CopyOnWriteRollbackActionExecutor extends 
BaseRollbackActionExecutor {
+
+  private static final Logger LOG = 
LogManager.getLogger(CopyOnWriteRollbackActionExecutor.class);
+
+  public CopyOnWriteRollbackActionExecutor(JavaSparkContext jsc,
+   HoodieWriteConfig config,
+   HoodieTable table,
+   String instantTime,
+   HoodieInstant commitInstant,
+   boolean deleteInstants) {
+super(jsc, config, table, instantTime, commitInstant, deleteInstants);
+  }
+
+  public CopyOnWriteRollbackActionExecutor(JavaSparkContext jsc,
+   HoodieWriteConfig config,
+   HoodieTable table,
+   String instantTime,
+   HoodieInstant commitInstant,
+   boolean deleteInstants,
+   boolean skipTimelinePublish) {
+super(jsc, config, table, instantTime, commitInstant, deleteInstants, 
skipTimelinePublish);
+  }
+
+  @Override
+  protected List executeRollback() throws IOException {
+long startTime = System.currentTimeMillis();
+List stats = new ArrayList<>();
+HoodieActiveTimeline activeTimeline = table.getActiveTimeline();
+HoodieInstant resolvedInstant = instantToRollback;
+
+if (instantToRollback.isCompleted()) {
+  LOG.info("Unpublishing instant " + instantToRollback);
+  resolvedInstant = activeTimeline.revertToInflight(instantToRollback);
+}
+
+// For Requested State (like failure during index lookup), there is 
nothing to do rollback other than
+// deleting the timeline file
+if (!resolvedInstant.isRequested()) {
+  // delete all the data files for this commit
+  LOG.info("Clean out all parquet files generated for commit: " + 
resolvedInstant);
+  List rollbackRequests = 
generateRollbackRequests(resolvedInstant);
+
+  //TODO: We need to persist this as rollback workload and use it in case 
of partial failures
+  stats = new RollbackHelper(table.getMetaClient(), 
config).performRollback(jsc, resolvedInstant, rollbackRequests);
+}
+// Delete Inflight instant if enabled
+deleteInflightAndRequestedInstant(deleteInstants, activeTimeline, 
resolvedInstant);
+LOG.info("Time(in ms) taken to finish rollback " + 
(System.currentTimeMillis() - startTime));
+return stats;
+  }
+
+  private List generateRollbackRequests(HoodieInstant 
instantToRollback)
+  throws IOException {
+return FSUtils.getAllPartitionPaths(table.getMetaClient().getFs(), 
table.getMetaClient().getBasePath(),
+config.shouldAssumeDatePartitioning()).stream()
+.map(partitionPath -> 
RollbackRequest.createRollbackRequestWithDeleteDataAndLogFilesAction(partitionPath,
 instantToRollback))
 
 Review comment:
   This is not related to the current change, but I'm wondering why COW rollback 
needs to deal with log file rollback. 
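
   (For context, a hypothetical COW-only variant could limit each request to the 
base data files; `createRollbackRequestWithDeleteDataFilesOnlyAction` below is an 
assumed factory method, not a verified Hudi API:)
   
   ```java
   // Hypothetical sketch: restrict COW rollback requests to base (data) files only,
   // mirroring generateRollbackRequests above but with a data-files-only action.
   private List<RollbackRequest> generateCowRollbackRequests(HoodieInstant instantToRollback)
       throws IOException {
     return FSUtils.getAllPartitionPaths(table.getMetaClient().getFs(),
             table.getMetaClient().getBasePath(), config.shouldAssumeDatePartitioning())
         .stream()
         .map(partitionPath ->
             RollbackRequest.createRollbackRequestWithDeleteDataFilesOnlyAction(partitionPath, instantToRollback))
         .collect(Collectors.toList());
   }
   ```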

[GitHub] [incubator-hudi] liujianhuiouc commented on issue #1509: [HUDI-525] lack of insert info in delta_commit inflight

2020-04-13 Thread GitBox
liujianhuiouc commented on issue #1509: [HUDI-525] lack of insert info in 
delta_commit inflight
URL: https://github.com/apache/incubator-hudi/pull/1509#issuecomment-612771752
 
 
   Thank you for the reminder; that was my company email. I have changed it to 
my personal mailbox. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bvaradar merged pull request #1495: [HUDI-770] Organize upsert/insert API implementation under a single package

2020-04-13 Thread GitBox
bvaradar merged pull request #1495: [HUDI-770] Organize upsert/insert API 
implementation under a single package
URL: https://github.com/apache/incubator-hudi/pull/1495
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services