Build failed in Jenkins: hudi-snapshot-deployment-0.5 #354

2020-07-29 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.28 KB...]

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 'HUDI_home=0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${s

[jira] [Updated] (HUDI-875) Introduce a new pom module named hudi-common-sync

2020-07-29 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-875:

Priority: Blocker  (was: Major)

> Introduce a new pom module named hudi-common-sync
> -
>
> Key: HUDI-875
> URL: https://issues.apache.org/jira/browse/HUDI-875
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Blocker
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-875) Introduce a new pom module named hudi-common-sync

2020-07-29 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-875:

Status: Open  (was: New)

> Introduce a new pom module named hudi-common-sync
> -
>
> Key: HUDI-875
> URL: https://issues.apache.org/jira/browse/HUDI-875
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1138) Re-implement marker files via timeline server

2020-07-29 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-1138:


 Summary: Re-implement marker files via timeline server
 Key: HUDI-1138
 URL: https://issues.apache.org/jira/browse/HUDI-1138
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Vinoth Chandar


Even if one argues that RFC-15/consolidated metadata removes the need for 
deleting partial files written due to Spark task failures/stage retries, it 
will still leave extra files inside the table (and users will pay for them 
every month), so we need the marker mechanism to be able to delete these 
partial files. 

Here we explore whether we can improve the current marker file mechanism, which 
creates one marker file per data file written, by:

Delegating the createMarker() call to the driver/timeline server, and having it 
write marker metadata into a single file handle that is flushed for durability 
guarantees.

P.S.: I was tempted to think the Spark listener mechanism could help us deal with 
failed tasks, but it offers no guarantees; the writer job could die without 
deleting a partial file. In other words, it can improve things, but it cannot 
provide guarantees.
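
The following is a minimal, hypothetical sketch (not Hudi's actual API) of the 
delegation idea: executors ask the driver/timeline server to record a marker, and 
all markers for a commit instant are appended to a single file handle that is 
flushed for durability. Class and method names here are illustrative only.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Hypothetical driver-side marker writer: one consolidated marker file per
    // commit instant instead of one marker file per data file written.
    public class DriverSideMarkerWriter {

      private final BufferedWriter markerFile;

      public DriverSideMarkerWriter(Path markerDir, String instantTime) throws IOException {
        Files.createDirectories(markerDir);
        this.markerFile = Files.newBufferedWriter(
            markerDir.resolve(instantTime + ".markers"),
            StandardCharsets.UTF_8,
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
      }

      // In the real design this would be invoked via an RPC from executors to the
      // driver/timeline server; here it is a plain method call for illustration.
      public synchronized void createMarker(String partitionPath, String dataFileName)
          throws IOException {
        markerFile.write(partitionPath + "/" + dataFileName);
        markerFile.newLine();
        markerFile.flush(); // flush so partial files can be reconciled after a crash
      }

      public void close() throws IOException {
        markerFile.close();
      }

      public static void main(String[] args) throws IOException {
        DriverSideMarkerWriter writer =
            new DriverSideMarkerWriter(Paths.get("/tmp/hoodie/.temp"), "20200729120000");
        writer.createMarker("2020/07/29", "data-file-1.parquet");
        writer.close();
      }
    }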



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch hudi_test_suite_refactor updated (aa7c382 -> f651091)

2020-07-29 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard aa7c382  [HUDI-394] Provide a basic implementation of test suite
 add f651091  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (aa7c382)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (f651091)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 hudi-spark/src/test/java/TestComplexKeyGenerator.java | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[jira] [Created] (HUDI-1137) [Test Suite] Add option to configure different path selector

2020-07-29 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1137:
-

 Summary: [Test Suite] Add option to configure different path 
selector
 Key: HUDI-1137
 URL: https://issues.apache.org/jira/browse/HUDI-1137
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Testing
Reporter: Nishith Agarwal
Assignee: satish






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1136) Add back findInstantsAfterOrEquals to the HoodieTimeline class

2020-07-29 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1136:
-

 Summary: Add back findInstantsAfterOrEquals to the HoodieTimeline 
class
 Key: HUDI-1136
 URL: https://issues.apache.org/jira/browse/HUDI-1136
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Nishith Agarwal
Assignee: Prashant Wason






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1135) Make timeline server timeout settings configurable

2020-07-29 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1135:
-

 Summary: Make timeline server timeout settings configurable
 Key: HUDI-1135
 URL: https://issues.apache.org/jira/browse/HUDI-1135
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Nishith Agarwal
Assignee: Nishith Agarwal






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1134) Fix timer default to 5 mins

2020-07-29 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1134:
-

 Summary: Fix timer default to 5 mins
 Key: HUDI-1134
 URL: https://issues.apache.org/jira/browse/HUDI-1134
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Nishith Agarwal
Assignee: Nishith Agarwal


For the EmbeddedTimelineServer, the timeout is incorrectly configured.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1133) Tune buffer sizes for the disk-based external spillable map

2020-07-29 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1133:
-

 Summary: Tune buffer sizes for the disk-based external spillable map
 Key: HUDI-1133
 URL: https://issues.apache.org/jira/browse/HUDI-1133
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Nishith Agarwal
Assignee: Nishith Agarwal






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1132) Use hadoop shim for other file formats in HoodieHiveCombineInputFormat

2020-07-29 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1132:
-

 Summary: Use hadoop shim for other file formats in 
HoodieHiveCombineInputFormat
 Key: HUDI-1132
 URL: https://issues.apache.org/jira/browse/HUDI-1132
 Project: Apache Hudi
  Issue Type: Bug
  Components: Hive Integration
Reporter: Nishith Agarwal
Assignee: satish






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1131) Ability to test across different Hudi versions using Hudi test suite

2020-07-29 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1131:
-

 Summary: Ability to test across different Hudi versions using Hudi 
test suite
 Key: HUDI-1131
 URL: https://issues.apache.org/jira/browse/HUDI-1131
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Testing
Reporter: Nishith Agarwal
Assignee: Abhishek Modi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1130) Allow for schema evolution within DAG for hudi test suite

2020-07-29 Thread Nishith Agarwal (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nishith Agarwal reassigned HUDI-1130:
-

Assignee: Nishith Agarwal

> Allow for schema evolution within DAG for hudi test suite
> -
>
> Key: HUDI-1130
> URL: https://issues.apache.org/jira/browse/HUDI-1130
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1130) Allow for schema evolution within DAG for hudi test suite

2020-07-29 Thread Nishith Agarwal (Jira)
Nishith Agarwal created HUDI-1130:
-

 Summary: Allow for schema evolution within DAG for hudi test suite
 Key: HUDI-1130
 URL: https://issues.apache.org/jira/browse/HUDI-1130
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Testing
Reporter: Nishith Agarwal






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-983) Add Metrics section to asf-site

2020-07-29 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu resolved HUDI-983.
-
Resolution: Done

> Add Metrics section to asf-site
> ---
>
> Key: HUDI-983
> URL: https://issues.apache.org/jira/browse/HUDI-983
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Assignee: shenh062326
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 0.6.0
>
>
> Document the use of the metrics system in Hudi, including all supported 
> metrics reporters.
> See the example
> https://user-images.githubusercontent.com/20113411/83055820-f5e97100-a086-11ea-9ea3-52b342aca9d4.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-983) Add Metrics section to asf-site

2020-07-29 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-983:

Status: Open  (was: New)

> Add Metrics section to asf-site
> ---
>
> Key: HUDI-983
> URL: https://issues.apache.org/jira/browse/HUDI-983
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Assignee: shenh062326
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 0.6.0
>
>
> Document the use of the metrics system in Hudi, including all supported 
> metrics reporters.
> See the example
> https://user-images.githubusercontent.com/20113411/83055820-f5e97100-a086-11ea-9ea3-52b342aca9d4.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-983) Add Metrics section to asf-site

2020-07-29 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-983:

Status: In Progress  (was: Open)

> Add Metrics section to asf-site
> ---
>
> Key: HUDI-983
> URL: https://issues.apache.org/jira/browse/HUDI-983
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Assignee: shenh062326
>Priority: Minor
>  Labels: documentation, newbie
> Fix For: 0.6.0
>
>
> Document the use of the metrics system in Hudi, including all supported 
> metrics reporters.
> See the example
> https://user-images.githubusercontent.com/20113411/83055820-f5e97100-a086-11ea-9ea3-52b342aca9d4.png



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-194) Support for writing Iceberg metadata on Hoodie RO tables

2020-07-29 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-194:

Description: Basic idea here is to map Hudi WriteStatus objects into what 
Iceberg needs to maintain metadata.. Additionally, we need to enhance the 
WriteStat to collect range/null stats information on columns and feed that into 
Iceberg as well. We also will create a TableFileSystemView implementation based 
on Iceberg's metadata, so that it's another option in addition to 
listing/timeline server based implementations.  (was: Basic idea here is to map 
Hudi WriteStatus objects into what Iceberg needs to maintain metadata.. 
Additionally, we need to enhance the WriteStat to collect range/null stats 
information on columns and feed that into Iceberg as well..  

We also will create a TableFileSystemView implementation based on Iceberg's 
metadata, so that it's another option in addition to listing/timeline server 
based implementations. )

> Support for writing Iceberg metadata on Hoodie RO tables
> 
>
> Key: HUDI-194
> URL: https://issues.apache.org/jira/browse/HUDI-194
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>
> Basic idea here is to map Hudi WriteStatus objects into what Iceberg needs to 
> maintain metadata.. Additionally, we need to enhance the WriteStat to collect 
> range/null stats information on columns and feed that into Iceberg as well. 
> We also will create a TableFileSystemView implementation based on Iceberg's 
> metadata, so that it's another option in addition to listing/timeline server 
> based implementations.
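
A minimal sketch of the mapping described above, under stated assumptions: the 
WriteStat class below is a stand-in for the information Hudi's 
WriteStatus/HoodieWriteStat carries (file path, partition path, record count, 
file size), not Hudi's real API, and the Iceberg side uses the standard DataFiles 
builder and append API. Column range/null stats would additionally be passed via 
DataFiles.Builder#withMetrics.

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.DataFiles;
    import org.apache.iceberg.FileFormat;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Table;

    public class IcebergMetadataSync {

      // Stand-in for the information carried by Hudi's WriteStatus/HoodieWriteStat.
      public static class WriteStat {
        public final String path;          // data file written by Hudi
        public final String partitionPath; // e.g. "datestr=2020-07-29"
        public final long numWrites;       // records written to the file
        public final long fileSizeBytes;   // size of the base file

        public WriteStat(String path, String partitionPath, long numWrites, long fileSizeBytes) {
          this.path = path;
          this.partitionPath = partitionPath;
          this.numWrites = numWrites;
          this.fileSizeBytes = fileSizeBytes;
        }
      }

      // Translate one write stat into an Iceberg DataFile and append it to the
      // Iceberg table, so Iceberg's metadata tracks the files Hudi wrote.
      public static void appendToIceberg(Table icebergTable, PartitionSpec spec, WriteStat stat) {
        DataFile dataFile = DataFiles.builder(spec)
            .withPath(stat.path)
            .withPartitionPath(stat.partitionPath)
            .withFormat(FileFormat.PARQUET)
            .withRecordCount(stat.numWrites)
            .withFileSizeInBytes(stat.fileSizeBytes)
            .build();
        icebergTable.newAppend().appendFile(dataFile).commit();
      }
    }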



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-194) Support for writing Iceberg metadata on Hoodie RO tables

2020-07-29 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-194:

Description: 
Basic idea here is to map Hudi WriteStatus objects into what Iceberg needs to 
maintain metadata.. Additionally, we need to enhance the WriteStat to collect 
range/null stats information on columns and feed that into Iceberg as well..  

We also will create a TableFileSystemView implementation based on Iceberg's 
metadata, so that it's another option in addition to listing/timeline server 
based implementations. 

  was:Basic idea here is to map Hudi WriteStatus objects into what Iceberg 
needs to maintain metadata.. Additionally, we need to enhance the WriteStat to 
collect range/null stats information on columns and feed that into Iceberg as 
well..  


> Support for writing Iceberg metadata on Hoodie RO tables
> 
>
> Key: HUDI-194
> URL: https://issues.apache.org/jira/browse/HUDI-194
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prasanna Rajaperumal
>Priority: Major
>
> Basic idea here is to map Hudi WriteStatus objects into what Iceberg needs to 
> maintain metadata.. Additionally, we need to enhance the WriteStat to collect 
> range/null stats information on columns and feed that into Iceberg as well..  
> We also will create a TableFileSystemView implementation based on Iceberg's 
> metadata, so that it's another option in addition to listing/timeline server 
> based implementations. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-194) Support for writing Iceberg metadata on Hoodie RO tables

2020-07-29 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-194:
---

Assignee: (was: Prasanna Rajaperumal)

> Support for writing Iceberg metadata on Hoodie RO tables
> 
>
> Key: HUDI-194
> URL: https://issues.apache.org/jira/browse/HUDI-194
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Priority: Major
>
> Basic idea here is to map Hudi WriteStatus objects into what Iceberg needs to 
> maintain metadata.. Additionally, we need to enhance the WriteStat to collect 
> range/null stats information on columns and feed that into Iceberg as well..  
> We also will create a TableFileSystemView implementation based on Iceberg's 
> metadata, so that it's another option in addition to listing/timeline server 
> based implementations. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on a change in pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-29 Thread GitBox


nsivabalan commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r462460480



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/BaseRollbackActionExecutor.java
##
@@ -59,31 +61,31 @@
   protected final boolean useMarkerBasedStrategy;
 
   public BaseRollbackActionExecutor(JavaSparkContext jsc,
-HoodieWriteConfig config,
-HoodieTable table,
-String instantTime,
-HoodieInstant instantToRollback,
-boolean deleteInstants) {
+  HoodieWriteConfig config,

Review comment:
   No changes in this file, just formatting changes.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1122) Introduce a kafka implementation of hoodie write commit callback

2020-07-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1122:
-
Labels: pull-request-available  (was: )

> Introduce a kafka implementation of hoodie write commit callback 
> -
>
> Key: HUDI-1122
> URL: https://issues.apache.org/jira/browse/HUDI-1122
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Discussed 
> here:[https://lists.apache.org/thread.html/r2b29fa11ac06b9c93141afcde78ae84592a50123d92cf004c4a7e41b%40%3Cdev.hudi.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] Mathieu1124 commented on pull request #1886: [HUDI-1122]Introduce a kafka implementation of hoodie write commit ca…

2020-07-29 Thread GitBox


Mathieu1124 commented on pull request #1886:
URL: https://github.com/apache/hudi/pull/1886#issuecomment-665661859


   @yanghua please take a look when free



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] garyli1019 commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-29 Thread GitBox


garyli1019 commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-665798153


   @bvaradar Thanks for trying this out. `java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile` looks strange. I 
will try it out on my production today to see if I can reproduce. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] Mathieu1124 opened a new pull request #1886: [HUDI-1122]Introduce a kafka implementation of hoodie write commit ca…

2020-07-29 Thread GitBox


Mathieu1124 opened a new pull request #1886:
URL: https://github.com/apache/hudi/pull/1886


   …llback
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Introduce a kafka implementation of hoodie write commit callback*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch hudi_test_suite_refactor updated (d2b5125 -> aa7c382)

2020-07-29 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch hudi_test_suite_refactor
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard d2b5125  [HUDI-394] Provide a basic implementation of test suite
 add aa7c382  [HUDI-394] Provide a basic implementation of test suite

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (d2b5125)
\
 N -- N -- N   refs/heads/hudi_test_suite_refactor (aa7c382)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 .../hudi/integ/testsuite/reader/TestDFSHoodieDatasetInputReader.java| 2 +-
 .../src/test/java/org/apache/hudi/integ/testsuite/utils/TestUtils.java  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)



[jira] [Updated] (HUDI-1122) Introduce a kafka implementation of hoodie write commit callback

2020-07-29 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1122:
--
Description: Discussed 
here:[https://lists.apache.org/thread.html/r2b29fa11ac06b9c93141afcde78ae84592a50123d92cf004c4a7e41b%40%3Cdev.hudi.apache.org%3E]

> Introduce a kafka implementation of hoodie write commit callback 
> -
>
> Key: HUDI-1122
> URL: https://issues.apache.org/jira/browse/HUDI-1122
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
> Fix For: 0.6.0
>
>
> Discussed 
> here:[https://lists.apache.org/thread.html/r2b29fa11ac06b9c93141afcde78ae84592a50123d92cf004c4a7e41b%40%3Cdev.hudi.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1110) Support update partial fields

2020-07-29 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf reassigned HUDI-1110:
---

Assignee: leesf

> Support update partial fields
> -
>
> Key: HUDI-1110
> URL: https://issues.apache.org/jira/browse/HUDI-1110
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>
> Now Hudi only supports updating records with all fields present; however, 
> sometimes we may only want to update partial fields.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1122) Introduce a kafka implementation of hoodie write commit callback

2020-07-29 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1122:
--
Status: In Progress  (was: Open)

> Introduce a kafka implementation of hoodie write commit callback 
> -
>
> Key: HUDI-1122
> URL: https://issues.apache.org/jira/browse/HUDI-1122
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsuthar-lumiq commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

2020-07-29 Thread GitBox


nsuthar-lumiq commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-665450530


   @pratyakshsharma could you please share the documentation that has an 
example of composite key usage? We are not able to figure out how to use it. 
Also, does it support PySpark?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1122) Introduce a kafka implementation of hoodie write commit callback

2020-07-29 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1122:
--
Status: Open  (was: New)

> Introduce a kafka implementation of hoodie write commit callback 
> -
>
> Key: HUDI-1122
> URL: https://issues.apache.org/jira/browse/HUDI-1122
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] satishkotha edited a comment on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-07-29 Thread GitBox


satishkotha edited a comment on issue #1872:
URL: https://github.com/apache/hudi/issues/1872#issuecomment-665278060


   @luffyd were you able to figure out a workaround? If not, consider opening a 
JIRA. We think adding jitter and retries when Hudi calls S3 may help (feel 
free to open a pull request too).
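
A generic sketch of the jitter-and-retries idea (not Hudi's code): wrap a 
throttled S3 call in exponential backoff with full jitter, retrying up to a 
fixed number of attempts. The s3Call parameter is a placeholder for whatever 
request Hudi issues.

    import java.util.concurrent.Callable;
    import java.util.concurrent.ThreadLocalRandom;

    public final class RetryWithJitter {

      public static <T> T call(Callable<T> s3Call, int maxRetries, long baseDelayMs)
          throws Exception {
        for (int attempt = 0; ; attempt++) {
          try {
            return s3Call.call();
          } catch (Exception e) {           // in practice, retry only throttling/5xx errors
            if (attempt >= maxRetries) {
              throw e;                      // retries exhausted, surface the failure
            }
            long backoff = baseDelayMs * (1L << attempt);                    // exponential backoff
            long jitter = ThreadLocalRandom.current().nextLong(backoff + 1); // full jitter
            Thread.sleep(jitter);
          }
        }
      }
    }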



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on pull request #1768: [HUDI-1054][Peformance] Several performance fixes during finalizing writes

2020-07-29 Thread GitBox


umehrot2 commented on pull request #1768:
URL: https://github.com/apache/hudi/pull/1768#issuecomment-664741636


   > @umehrot2 just landed the changes I mentioned. Can we rework this PR and 
   > try again? We can make things parallel, i.e. working for S3 for now, and 
   > then we can adjust for HDFS later on. So we should be able to close the 
   > loop faster.
   > 
   > I do want to get this into 0.6.0, so please also let me know if you are 
   > unable to take a stab at this.
   
   Working on it @vinothchandar. There has been quite a refactoring it seems, 
which is making the re-basing tricky as now these functions are being called 
from places which do not even have `spark context`.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1149: [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-07-29 Thread GitBox


vinothchandar commented on pull request #1149:
URL: https://github.com/apache/hudi/pull/1149#issuecomment-665155548


   >For the default sort mode for bulk insert, shall we set it to a mode other 
than GLOBAL_SORT?
   
   yes. sg. we can retain existing behavior.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1876: [HUDI-242] Support for RFC-12/Bootstrapping of external datasets

2020-07-29 Thread GitBox


vinothchandar commented on pull request #1876:
URL: https://github.com/apache/hudi/pull/1876#issuecomment-665155023


   @yanghua on it. still trying to make the tests all pass with master 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on pull request #1848: [HUDI-69] Support Spark Datasource for MOR table - RDD approach

2020-07-29 Thread GitBox


bvaradar commented on pull request #1848:
URL: https://github.com/apache/hudi/pull/1848#issuecomment-665478130


   @garyli1019 : I took this patch and ran it in EMR (Spark-2.4.5-amzn-0). I 
got the following exceptions when loading an S3 dataset.
   
   I am using hudi-spark-bundle in the spark session.
   
   
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
   20/07/29 06:57:50 WARN Client: Neither spark.yarn.jars nor 
spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
   Spark context Web UI available at 
http://ip-172-31-33-232.us-east-2.compute.internal:4040
   Spark context available as 'sc' (master = yarn, app id = 
application_1595775804042_9837).
   Spark session available as 'spark'.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /___/ .__/\_,_/_/ /_/\_\   version 2.4.5-amzn-0
         /_/

   Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_252)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> val dfh = 
spark.read.format("hudi").load("s3a://hudi.streaming.perf/orders_stream_hudi_mor_4/*/*")
   java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.PartitionedFile.<init>(Lorg/apache/spark/sql/catalyst/InternalRow;Ljava/lang/String;JJ[Ljava/lang/String;)V
 at 
org.apache.hudi.MergeOnReadSnapshotRelation$$anonfun$4.apply(MergeOnReadSnapshotRelation.scala:144)
 at 
org.apache.hudi.MergeOnReadSnapshotRelation$$anonfun$4.apply(MergeOnReadSnapshotRelation.scala:141)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at scala.collection.Iterator$class.foreach(Iterator.scala:891)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
 at scala.collection.AbstractTraversable.map(Traversable.scala:104)
 at 
org.apache.hudi.MergeOnReadSnapshotRelation.buildFileIndex(MergeOnReadSnapshotRelation.scala:141)
 at 
org.apache.hudi.MergeOnReadSnapshotRelation.<init>(MergeOnReadSnapshotRelation.scala:75)
 at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:70)
 at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:50)
 at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
 at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
 ... 49 elided
   
   scala> 
   
   Have you seen this issue before ?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] NikhilSuthar edited a comment on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

2020-07-29 Thread GitBox


NikhilSuthar edited a comment on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-665452095


   @pratyakshsharma could you please share the documentation that has an 
example of composite key usage? We are not able to figure out how to use it, 
and does it also support PySpark?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on pull request #1881: Fix master compilation failure

2020-07-29 Thread GitBox


umehrot2 commented on pull request #1881:
URL: https://github.com/apache/hudi/pull/1881#issuecomment-664755684


   @bvaradar @vinothchandar Master is failing with a compilation issue. Minor 
fix.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on pull request #1871: [WIP] [HUDI-781] Introduce HoodieDataPrep for test preparation

2020-07-29 Thread GitBox


xushiyan commented on pull request #1871:
URL: https://github.com/apache/hudi/pull/1871#issuecomment-665437032


   Makes sense. Working on incorporating the feature of writing data to files.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 opened a new pull request #1881: Fix master compilation failure

2020-07-29 Thread GitBox


umehrot2 opened a new pull request #1881:
URL: https://github.com/apache/hudi/pull/1881


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Master has a compilation failure which is fixed by this PR.
   
   ## Brief change log
   
   
   ## Verify this pull request
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #1876: [HUDI-242] Support for RFC-12/Bootstrapping of external datasets

2020-07-29 Thread GitBox


yanghua commented on pull request #1876:
URL: https://github.com/apache/hudi/pull/1876#issuecomment-665071390


   @vinothchandar conflicts...



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar merged pull request #1882: [MINOR] Fixing default index parallelism for simple index

2020-07-29 Thread GitBox


bvaradar merged pull request #1882:
URL: https://github.com/apache/hudi/pull/1882


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-29 Thread GitBox


bvaradar commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-665120755


   @asheeshgarg : Yes, currently concurrent writes could interfere with one 
another as part of the automatic rollback process. We are revamping this in 0.6, 
which will allow parallel writing across partitions. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #1678: [HUDI-242] Metadata Bootstrap changes

2020-07-29 Thread GitBox


bvaradar commented on a change in pull request #1678:
URL: https://github.com/apache/hudi/pull/1678#discussion_r461755543



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/commit/MergeHelper.java
##
@@ -0,0 +1,204 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table.action.commit;
+
+import java.io.ByteArrayOutputStream;
+import org.apache.avro.generic.GenericDatumReader;
+import org.apache.avro.generic.GenericDatumWriter;
+import org.apache.avro.io.BinaryDecoder;
+import org.apache.avro.io.BinaryEncoder;
+import org.apache.avro.io.DecoderFactory;
+import org.apache.avro.io.EncoderFactory;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.client.utils.MergingParquetIterator;
+import org.apache.hudi.client.utils.ParquetReaderIterator;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.queue.BoundedInMemoryExecutor;
+import org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.execution.SparkBoundedInMemoryExecutor;
+import org.apache.hudi.io.HoodieMergeHandle;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.parquet.avro.AvroParquetReader;
+import org.apache.parquet.avro.AvroReadSupport;
+import org.apache.parquet.avro.AvroSchemaConverter;
+import org.apache.parquet.hadoop.ParquetReader;
+import org.apache.parquet.schema.MessageType;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.conf.Configuration;
+
+import java.io.IOException;
+import java.util.Iterator;
+
+/**
+ * Helper to read records from previous version of parquet and run Merge.
+ */
+public class MergeHelper {
+
+  /**
+   * Read records from previous version of base file and merge.
+   * @param table Hoodie Table
+   * @param upsertHandle Merge Handle
+   * @param <T> HoodieRecordPayload type
+   * @throws IOException in case of error
+   */
+  public static <T extends HoodieRecordPayload<T>> void runMerge(HoodieTable<T> table,
+      HoodieMergeHandle<T> upsertHandle) throws IOException {
+    final boolean externalchemaTransformation = table.getConfig().shouldUseExternalSchemaTransformation();
+    Configuration configForHudiFile = new Configuration(table.getHadoopConf());
+    HoodieBaseFile baseFile = upsertHandle.getPrevBaseFile();
+
+    final GenericDatumWriter<GenericRecord> gWriter;
+    final GenericDatumReader<GenericRecord> gReader;
+    if (externalchemaTransformation || baseFile.getExternalBaseFile().isPresent()) {
+      MessageType usedParquetSchema = ParquetUtils.readSchema(table.getHadoopConf(), upsertHandle.getOldFilePath());

Review comment:
   For reference





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua edited a comment on pull request #1880: [HUDI-1125] build framework to support structured streaming

2020-07-29 Thread GitBox


yanghua edited a comment on pull request #1880:
URL: https://github.com/apache/hudi/pull/1880#issuecomment-664710483


   @linshan-ma Thanks for your contribution. Two suggestions:
   
   1) It contains some irrelevant commits you should remove;
   2) Each PR must be completed and test-able before merging it into the 
codebase, otherwise, you can only provide a completed implementation.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pratyakshsharma commented on issue #1586: [SUPPORT] DMS with 2 key example

2020-07-29 Thread GitBox


pratyakshsharma commented on issue #1586:
URL: https://github.com/apache/hudi/issues/1586#issuecomment-665107888


   Will take a look at it. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha merged pull request #1883: [MINOR] change log.info to log.debug

2020-07-29 Thread GitBox


bhasudha merged pull request #1883:
URL: https://github.com/apache/hudi/pull/1883


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on issue #1845: [SUPPORT] Support for Schema evolution. Facing an error

2020-07-29 Thread GitBox


bvaradar commented on issue #1845:
URL: https://github.com/apache/hudi/issues/1845#issuecomment-665180775







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhedoubushishi commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

2020-07-29 Thread GitBox


zhedoubushishi commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-665192084


   > @zhedoubushishi There is a conflict file. Can you please fix it?
   
   Sure. Done.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on a change in pull request #1704: [HUDI-115] Enhance OverwriteWithLatestAvroPayload to also respect ordering value of record in storage

2020-07-29 Thread GitBox


bhasudha commented on a change in pull request #1704:
URL: https://github.com/apache/hudi/pull/1704#discussion_r461120936



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
##
@@ -83,4 +86,37 @@ private boolean isDeleteRecord(GenericRecord genericRecord) {
 Object deleteMarker = genericRecord.get("_hoodie_is_deleted");
 return (deleteMarker instanceof Boolean && (boolean) deleteMarker);
   }
+
+  @Override
+  public Option combineAndGetUpdateValue(IndexedRecord 
currentValue, Schema schema, Map props) throws IOException {
+if (recordBytes.length == 0) {
+  return Option.empty();
+}
+GenericRecord incomingRecord = bytesToAvro(recordBytes, schema);
+/*
+ * Combining strategy here returns currentValue on disk if incoming record 
is older.
+ * The incoming record can be either a delete (sent as an upsert with 
_hoodie_is_deleted set to true)
+ * or an insert/update record. In any case, if it is older than the record 
in disk, the currentValue
+ * in disk is returned (to be rewritten with new commit time).
+ *
+ * NOTE: Deletes sent via EmptyHoodieRecordPayload and/or Delete operation 
type do not hit this code path

Review comment:
   created a Jira issue to track this separately - 
https://issues.apache.org/jira/browse/HUDI-1127





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar merged pull request #1881: Fix master compilation failure

2020-07-29 Thread GitBox


bvaradar merged pull request #1881:
URL: https://github.com/apache/hudi/pull/1881


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yihua commented on pull request #1149: [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-07-29 Thread GitBox


yihua commented on pull request #1149:
URL: https://github.com/apache/hudi/pull/1149#issuecomment-664834387


   @vinothchandar @nsivabalan this PR is ready for another review.
   
   I fixed the failing tests.  I also simplified the bulk insert logic 
regarding different sort modes.  Besides, I added more javadocs and cleaned up 
the code style.
   
   For the default sort mode for bulk insert, shall we set it to a mode other 
than GLOBAL_SORT?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on pull request #1881: Fix master compilation failure

2020-07-29 Thread GitBox


bvaradar commented on pull request #1881:
URL: https://github.com/apache/hudi/pull/1881#issuecomment-664795679


   @umehrot2 : Landing this. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] linshan-ma commented on pull request #1880: [HUDI-1125] build framework to support structured streaming

2020-07-29 Thread GitBox


linshan-ma commented on pull request #1880:
URL: https://github.com/apache/hudi/pull/1880#issuecomment-664762563


   > @linshan-ma Thanks for your contribution. Two suggestions:
   > 
   > 1. It contains some irrelevant commits you should remove;
   > 2. Each PR must be completed and test-able before merging it into the 
codebase, otherwise, you can only provide a completed implementation.
   
   @yanghua Thank you for your advice. 1) I checked; I will remove the 
irrelevant code. 2) I have tested the code; are you asking me to submit a test 
class? 3) The code for the framework build is complete; the JIRA [HUDI-1126] is 
another sub-task to implement the details.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-29 Thread GitBox


vinothchandar commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r461265990



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/CopyOnWriteRollbackActionExecutor.java
##
@@ -100,8 +108,13 @@ public CopyOnWriteRollbackActionExecutor(JavaSparkContext 
jsc,
   }
 
   @Override
-  protected List executeRollbackUsingFileListing(HoodieInstant instantToRollback) {
+  protected List executeRollbackUsingFileListing(HoodieInstant instantToRollback, boolean doDelete) {
     List rollbackRequests = generateRollbackRequestsByListing();
-    return new ListingBasedRollbackHelper(table.getMetaClient(), config).performRollback(jsc, instantToRollback, rollbackRequests);
+    ListingBasedRollbackHelper listingBasedRollbackHelper = new ListingBasedRollbackHelper(table.getMetaClient(), config);
+    if(doDelete) {
+      return listingBasedRollbackHelper.performRollback(jsc, instantToRollback, rollbackRequests);

Review comment:
   as discussed, we can just call the `collectRollbackStats` directly, 
assuming listing based rollback strategy.

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java
##
@@ -68,34 +69,38 @@ public ListingBasedRollbackHelper(HoodieTableMetaClient 
metaClient, HoodieWriteC
* Performs all rollback actions that we have collected in parallel.
*/
   public List performRollback(JavaSparkContext jsc, 
HoodieInstant instantToRollback, List 
rollbackRequests) {
-SerializablePathFilter filter = (path) -> {

Review comment:
   ack

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java
##
@@ -130,39 +137,55 @@ public ListingBasedRollbackHelper(HoodieTableMetaClient 
metaClient, HoodieWriteC
   1L
   );
   return new Tuple2<>(rollbackRequest.getPartitionPath(),
-  
HoodieRollbackStat.newBuilder().withPartitionPath(rollbackRequest.getPartitionPath())
-  
.withRollbackBlockAppendResults(filesToNumBlocksRollback).build());
+  
HoodieRollbackStat.newBuilder().withPartitionPath(rollbackRequest.getPartitionPath())
+  
.withRollbackBlockAppendResults(filesToNumBlocksRollback).build());
 }
 default:
   throw new IllegalStateException("Unknown Rollback action " + 
rollbackRequest);
   }
-}).reduceByKey(RollbackUtils::mergeRollbackStat).map(Tuple2::_2).collect();
+});
   }
 
 
-
   /**
* Common method used for cleaning out base files under a partition path 
during rollback of a set of commits.
*/
-  private Map deleteCleanedFiles(HoodieTableMetaClient 
metaClient, HoodieWriteConfig config,
-  String partitionPath, 
PathFilter filter) throws IOException {
+  private Map deleteBaseAndLogFiles(HoodieTableMetaClient 
metaClient, HoodieWriteConfig config,

Review comment:
it's hard. But wouldn't MERGE work for both in terms of actually 
performing a correct rollback? 

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/ListingBasedRollbackHelper.java
##
@@ -130,39 +137,55 @@ public ListingBasedRollbackHelper(HoodieTableMetaClient 
metaClient, HoodieWriteC
   1L
   );
   return new Tuple2<>(rollbackRequest.getPartitionPath(),
-  
HoodieRollbackStat.newBuilder().withPartitionPath(rollbackRequest.getPartitionPath())
-  
.withRollbackBlockAppendResults(filesToNumBlocksRollback).build());
+  
HoodieRollbackStat.newBuilder().withPartitionPath(rollbackRequest.getPartitionPath())
+  
.withRollbackBlockAppendResults(filesToNumBlocksRollback).build());
 }
 default:
   throw new IllegalStateException("Unknown Rollback action " + 
rollbackRequest);
   }
-}).reduceByKey(RollbackUtils::mergeRollbackStat).map(Tuple2::_2).collect();
+});
   }
 
 
-
   /**
* Common method used for cleaning out base files under a partition path 
during rollback of a set of commits.
*/
-  private Map deleteCleanedFiles(HoodieTableMetaClient 
metaClient, HoodieWriteConfig config,
-  String partitionPath, 
PathFilter filter) throws IOException {
+  private Map deleteBaseAndLogFiles(HoodieTableMetaClient 
metaClient, HoodieWriteConfig config,
+  String commit, String partitionPath, boolean doDelete) throws 
IOException {
 LOG.info("Cleaning path " + partitionPath);
+String basefileExtension = 
metaClient.getTableConfig().getBaseFileFormat().getFileExtension();
+SerializablePathFilter filter = (path) -> {
+  if (path.toString().endsWith(basefileExtension)) {
+String fileCommitTime = FSUtils.getCommitTime(path.getName());
+return commit.equ

[GitHub] [hudi] nsivabalan commented on a change in pull request #1149: [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-07-29 Thread GitBox


nsivabalan commented on a change in pull request #1149:
URL: https://github.com/apache/hudi/pull/1149#discussion_r461966620



##
File path: 
hudi-client/src/main/java/org/apache/hudi/execution/bulkinsert/RDDPartitionSortPartitioner.java
##
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+
+import org.apache.spark.api.java.JavaRDD;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+import scala.Tuple2;
+
+/**
+ * A built-in partitioner that does local sorting for each RDD partition
+ * after coalesce for bulk insert operation, corresponding to the
+ * {@code BulkInsertSortMode.PARTITION_SORT} mode.
+ *
+ * @param  HoodieRecordPayload type
+ */
+public class RDDPartitionSortPartitioner
+extends BulkInsertInternalPartitioner {
+
+  @Override
+  public JavaRDD> repartitionRecords(JavaRDD> 
records,
+ int 
outputSparkPartitions) {
+return records.coalesce(outputSparkPartitions)
+.mapToPair(record ->
+new Tuple2<>(
+new StringBuilder()
+.append(record.getPartitionPath())
+.append("+")
+.append(record.getRecordKey())
+.toString(), record))
+.mapPartitions(partition -> {
+  // Sort locally in partition
+  List>> recordList = new ArrayList<>();
+  for (; partition.hasNext(); ) {
+recordList.add(partition.next());
+  }
+  Collections.sort(recordList, (o1, o2) -> o1._1.compareTo(o2._1));

Review comment:
   will sync up offline with you. interested to know the difference between repartitionAndSortWithinPartitions and the current approach. The current approach will bring all records into memory, right?
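   For reference, a minimal sketch of the repartitionAndSortWithinPartitions alternative, where the sort is done by the shuffle machinery (which can spill to disk) instead of materializing each partition into an in-memory list. The key construction mirrors the PR; the choice of HashPartitioner here is an assumption, not the PR's approach:

   import org.apache.hudi.common.model.HoodieRecord;
   import org.apache.hudi.common.model.HoodieRecordPayload;

   import org.apache.spark.HashPartitioner;
   import org.apache.spark.api.java.JavaRDD;

   import scala.Tuple2;

   public class RepartitionSortSketch {
     public static <T extends HoodieRecordPayload> JavaRDD<HoodieRecord<T>> sortWithinPartitions(
         JavaRDD<HoodieRecord<T>> records, int outputSparkPartitions) {
       return records
           .mapToPair(record -> new Tuple2<>(record.getPartitionPath() + "+" + record.getRecordKey(), record))
           // sorting happens during the shuffle and can spill, so a partition never has to fit in memory at once
           .repartitionAndSortWithinPartitions(new HashPartitioner(outputSparkPartitions))
           .map(Tuple2::_2);
     }
   }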

##
File path: 
hudi-client/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitioner.java
##
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.execution.bulkinsert;
+
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.table.UserDefinedBulkInsertPartitioner;
+
+/**
+ * Built-in partitioner to repartition input records into at least expected 
number of
+ * output spark partitions for bulk insert operation.
+ *
+ * @param  HoodieRecordPayload type
+ */
+public abstract class BulkInsertInternalPartitioner implements
+UserDefinedBulkInsertPartitioner {
+
+  public static BulkInsertInternalPartitioner get(BulkInsertSortMode sortMode) 
{

Review comment:
   this looks like a factory. did you consider naming it with a "factory" suffix, or something like that?
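   A minimal sketch of the suggested naming, keeping the same enum-based dispatch but behind a class with a "Factory" suffix (only PARTITION_SORT is shown, since that is the mode referenced in this PR; other modes would be dispatched the same way):

   public final class BulkInsertInternalPartitionerFactory {

     public static <T extends HoodieRecordPayload> BulkInsertInternalPartitioner<T> get(BulkInsertSortMode sortMode) {
       switch (sortMode) {
         case PARTITION_SORT:
           return new RDDPartitionSortPartitioner<>();
         // other sort modes from the PR would map to their partitioners here
         default:
           throw new HoodieException("Unknown bulk insert sort mode: " + sortMode);
       }
     }
   }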





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #1149: [HUDI-472] Introduce configurations and new modes of sorting for bulk_insert

2020-07-29 Thread GitBox


vinothchandar commented on a change in pull request #1149:
URL: https://github.com/apache/hudi/pull/1149#discussion_r461888109



##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/testutils/HoodieTestDataGenerator.java
##
@@ -161,6 +161,17 @@ public static void writePartitionMetadata(FileSystem fs, 
String[] partitionPaths
 }
   }
 
+  public static List newHoodieRecords(int n, String time) throws 
Exception {

Review comment:
   can we not use the existing methods in the data generator to write these 
tests?
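   For instance, something along these lines, assuming the test only needs a batch of records for a given commit time (a sketch using the generator's existing generateInserts; the commit time literal is made up):

   HoodieTestDataGenerator dataGen = new HoodieTestDataGenerator();
   // n records against an existing instant time, instead of a bespoke newHoodieRecords(n, time) helper
   List<HoodieRecord> records = dataGen.generateInserts("20200729120000", n);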





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on pull request #1884: [HUDI-995] Use Transformations, Assertions and SchemaTestUtil

2020-07-29 Thread GitBox


xushiyan commented on pull request #1884:
URL: https://github.com/apache/hudi/pull/1884#issuecomment-665350793


   @yanghua @vinothchandar This is another set of incremental changes for 
testutils re-organization.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf merged pull request #1879: [DOC][HUDI-1123] add doc for user defined metrics reporter

2020-07-29 Thread GitBox


leesf merged pull request #1879:
URL: https://github.com/apache/hudi/pull/1879


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsuthar-lumiq removed a comment on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

2020-07-29 Thread GitBox


nsuthar-lumiq removed a comment on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-665450530


   @pratyakshsharma could you please share the documentation that has an example of composite key usage? We are not sure how to use it, and also, does it support PySpark?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] NikhilSuthar commented on pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

2020-07-29 Thread GitBox


NikhilSuthar commented on pull request #1816:
URL: https://github.com/apache/hudi/pull/1816#issuecomment-665452095


   @pratyakshsharma could you please share the documentation that has an example of composite key usage? We are not sure how to use it, and also, does it support PySpark?
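   Until the docs land, a rough sketch of how a composite key is typically configured (the option keys are standard Hudi datasource write options and are plain strings, so the same values can be passed from PySpark via an options dict; the field names and path below are made up):

   df.write().format("org.apache.hudi")
       // composite record key: multiple fields handled by the ComplexKeyGenerator
       .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.ComplexKeyGenerator")
       .option("hoodie.datasource.write.recordkey.field", "order_id,line_item_id")   // hypothetical fields
       .option("hoodie.datasource.write.partitionpath.field", "order_date")          // hypothetical field
       .option("hoodie.table.name", "orders")
       .mode("append")
       .save("s3://bucket/hudi/orders");                                             // hypothetical path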



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on pull request #1880: [HUDI-1125] build framework to support structured streaming

2020-07-29 Thread GitBox


yanghua commented on pull request #1880:
URL: https://github.com/apache/hudi/pull/1880#issuecomment-664710483







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] reenarosid opened a new issue #1885: [SUPPORT] MISSING RECORDS

2020-07-29 Thread GitBox


reenarosid opened a new issue #1885:
URL: https://github.com/apache/hudi/issues/1885


   
   Issue: I made a huge insert into a Hudi table, but only about a tenth of the records were inserted.
   In addition, the dataset is partitionless.
   I also made sure that de-duplication was set to false (I know it is false by default; I set it explicitly just to be sure).
   Below is the set of commands that I executed.
   
   
   df = spark.read.parquet(PATH+"/*")
   # took 2000 records of the dataset
   df1 = df.limit(2000)
   # took 1000 of the above and inserted it first, then tried appending the rest (ensuring duplicates)
   set1 = df1.limit(1000)

   First insert was set1, then I tried inserting df1 (a superset of set1).
   
   hudi_options = {
 'hoodie.table.name': HUDI_TABLE_NAME,
 'hoodie.datasource.write.recordkey.field': 'f1',
"hoodie.datasource.write.insert.drop.duplicates":"false",
 'hoodie.datasource.write.table.name': HUDI_TABLE_NAME,
 'hoodie.datasource.write.operation': 'insert',
 'hoodie.datasource.write.precombine.field': 'f1',
 'hoodie.upsert.shuffle.parallelism': 1,
 'hoodie.insert.shuffle.parallelism': 1,
 "hoodie.cleaner.policy" : "KEEP_LATEST_FILE_VERSIONS",
 'hoodie.datasource.': 'COPY_ON_WRITE', #'COPY_ON_WRITE',MERGE_ON_READ
 "hoodie.cleaner.commits.retained": "1",
 "hoodie.cleaner.fileversions.retained": "1",
 "hoodie.parquet.min.file.size":6221225472,
   }
   
   
   set1.write.format("org.apache.hudi"). \
 options(**hudi_options). \
 mode("overwrite"). \
 save(HUDI_PATH)
   
   --- second insertion -
   hudi_options = {
 'hoodie.table.name': HUDI_TABLE_NAME,
 'hoodie.datasource.write.recordkey.field': 'f1',
"hoodie.datasource.write.insert.drop.duplicates":"false",
 'hoodie.datasource.write.table.name': HUDI_TABLE_NAME,
 'hoodie.datasource.write.operation': 'upsert',
 'hoodie.datasource.write.precombine.field': 'f2',
 'hoodie.upsert.shuffle.parallelism': 1,
 'hoodie.insert.shuffle.parallelism': 1,
 "hoodie.cleaner.policy" : "KEEP_LATEST_FILE_VERSIONS",
 'hoodie.datasource.': 'COPY_ON_WRITE', #'COPY_ON_WRITE',MERGE_ON_READ
 "hoodie.cleaner.commits.retained": "1",
 "hoodie.cleaner.fileversions.retained": "1",
 "hoodie.parquet.min.file.size":6221225472,
   }
   
   df1.write.format("org.apache.hudi"). \
 options(**hudi_options). \
 mode("append"). \
 save(HUDI_PATH)
   
   
   But when I look at the count, I see that only some of the records were inserted (1043 instead of 3000 in my case).
   Field f1 contains duplicates in my data source.
   
   
   
   
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] asheeshgarg commented on issue #1825: [SUPPORT] Compaction of parquet and meta file

2020-07-29 Thread GitBox


asheeshgarg commented on issue #1825:
URL: https://github.com/apache/hudi/issues/1825#issuecomment-665040109


   @bvaradar so even if I change the partitioning such that I have a different partition per day for different datasets, so that only one write happens per partition, is it still going to be an issue in 0.5.3?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha commented on issue #1872: [SUPPORT]Getting 503s from S3 during upserts

2020-07-29 Thread GitBox


satishkotha commented on issue #1872:
URL: https://github.com/apache/hudi/issues/1872#issuecomment-665278060


   @luffyd were you able to figure out a workaround? If not, consider opening a JIRA. We think adding jitter and retries when Hudi calls S3 may help (feel free to open a pull request).
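   As a sketch of the kind of retry-with-jitter wrapper meant here (generic Java, not an existing Hudi utility; the call being retried and the tuning values are placeholders):

   import java.util.concurrent.ThreadLocalRandom;
   import java.util.function.Supplier;

   public final class RetryWithJitter {

     /** Retries op with exponential backoff and full jitter, e.g. around an S3 listing/read that can return 503s. */
     public static <T> T call(Supplier<T> op, int maxAttempts, long baseDelayMs) throws InterruptedException {
       RuntimeException last = null;
       for (int attempt = 0; attempt < maxAttempts; attempt++) {
         try {
           return op.get();
         } catch (RuntimeException e) {                                     // e.g. a 503 SlowDown from the S3 client
           last = e;
           long cap = baseDelayMs * (1L << attempt);                        // exponential backoff cap
           Thread.sleep(ThreadLocalRandom.current().nextLong(cap + 1));     // full jitter in [0, cap]
         }
       }
       throw last;
     }
   }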



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar commented on a change in pull request #1853: [HUDI-1072] Add replace metadata file to timeline

2020-07-29 Thread GitBox


bvaradar commented on a change in pull request #1853:
URL: https://github.com/apache/hudi/pull/1853#discussion_r461124646



##
File path: hudi-common/src/main/avro/HoodieReplaceMetadata.avsc
##
@@ -0,0 +1,44 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+ /*
+  * Note that all 'replace' instants are read for every query
+  * So it is important to keep this small. Please be careful
+  * before tracking additional information in this file.
+  * This will be used for 'insert_overwrite' (RFC-18) and also 'clustering' 
(RFC-19)
+  */
+{"namespace": "org.apache.hudi.avro.model",
+ "type": "record",
+ "name": "HoodieReplaceMetadata",
+ "fields": [
+ {"name": "totalFilesReplaced", "type": "int"},
+ {"name": "command", "type": "string"},
+ {"name": "partitionMetadata", "type": {

Review comment:
   High-level question, to make sure we are all on the same page: is this metadata enough to achieve clustering? Do you foresee any changes that need to happen to this metadata to support clustering? The PR mentions that this is for both clustering and overwrite, hence the question.

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
##
@@ -126,6 +129,13 @@
*/
   HoodieTimeline getCommitsAndCompactionTimeline();
 
+  /**
+   * Timeline to just include replace instants that have valid 
(commit/deltacommit) actions.
+   *
+   * @return
+   */
+  HoodieTimeline getCompletedAndReplaceTimeline();

Review comment:
   Does the returned timeline contain only replace instants? The naming is confusing to me. How about getValidReplaceTimeline, or anything else that reflects the intention?

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
##
@@ -57,7 +58,7 @@
 
   String[] VALID_ACTIONS_IN_TIMELINE = {COMMIT_ACTION, DELTA_COMMIT_ACTION,
   CLEAN_ACTION, SAVEPOINT_ACTION, RESTORE_ACTION, ROLLBACK_ACTION,
-  COMPACTION_ACTION};
+  COMPACTION_ACTION, REPLACE_ACTION};

Review comment:
   same comment as above

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieActiveTimeline.java
##
@@ -65,7 +65,8 @@
   COMMIT_EXTENSION, INFLIGHT_COMMIT_EXTENSION, REQUESTED_COMMIT_EXTENSION, 
DELTA_COMMIT_EXTENSION,
   INFLIGHT_DELTA_COMMIT_EXTENSION, REQUESTED_DELTA_COMMIT_EXTENSION, 
SAVEPOINT_EXTENSION,
   INFLIGHT_SAVEPOINT_EXTENSION, CLEAN_EXTENSION, 
REQUESTED_CLEAN_EXTENSION, INFLIGHT_CLEAN_EXTENSION,
-  INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, 
INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION));
+  INFLIGHT_COMPACTION_EXTENSION, REQUESTED_COMPACTION_EXTENSION, 
INFLIGHT_RESTORE_EXTENSION, RESTORE_EXTENSION,

Review comment:
   It's better to avoid this for rollout purposes. If this PR lands before the next one and a release is cut, then we need to worry about the ordering of the rollout between readers and writers.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1858: [WIP] [1014] Part 1: Adding Upgrade or downgrade infra

2020-07-29 Thread GitBox


nsivabalan commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r461213552



##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/BaseRollbackActionExecutor.java
##
@@ -159,24 +161,32 @@ private void rollBackIndex() {
 LOG.info("Index rolled back for commits " + instantToRollback);
   }
 
-  public List doRollbackAndGetStats() {
-final String instantTimeToRollback = instantToRollback.getTimestamp();
-final boolean isPendingCompaction = 
Objects.equals(HoodieTimeline.COMPACTION_ACTION, instantToRollback.getAction())
-&& !instantToRollback.isCompleted();
-validateSavepointRollbacks();
-if (!isPendingCompaction) {
-  validateRollbackCommitSequence();
-}
-
-try {
-  List stats = executeRollback();
-  LOG.info("Rolled back inflight instant " + instantTimeToRollback);
+  public List mayBeRollbackAndGetStats(boolean doDelete) {
+if(doDelete) {
+  final String instantTimeToRollback = instantToRollback.getTimestamp();
+  final boolean isPendingCompaction = 
Objects.equals(HoodieTimeline.COMPACTION_ACTION, instantToRollback.getAction())
+  && !instantToRollback.isCompleted();
+  validateSavepointRollbacks();
   if (!isPendingCompaction) {
-rollBackIndex();
+validateRollbackCommitSequence();
+  }
+
+  try {
+List stats = executeRollback(doDelete);
+LOG.info("Rolled back inflight instant " + instantTimeToRollback);
+if (!isPendingCompaction) {
+  rollBackIndex();
+}
+return stats;
+  } catch (IOException e) {
+throw new HoodieIOException("Unable to execute rollback ", e);
+  }
+} else{
+  try {
+return executeRollback(doDelete);

Review comment:
   this is the else part where we just collect stats. 

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/BaseRollbackActionExecutor.java
##
@@ -159,24 +161,32 @@ private void rollBackIndex() {
 LOG.info("Index rolled back for commits " + instantToRollback);
   }
 
-  public List doRollbackAndGetStats() {
-final String instantTimeToRollback = instantToRollback.getTimestamp();
-final boolean isPendingCompaction = 
Objects.equals(HoodieTimeline.COMPACTION_ACTION, instantToRollback.getAction())
-&& !instantToRollback.isCompleted();
-validateSavepointRollbacks();
-if (!isPendingCompaction) {
-  validateRollbackCommitSequence();
-}
-
-try {
-  List stats = executeRollback();
-  LOG.info("Rolled back inflight instant " + instantTimeToRollback);
+  public List mayBeRollbackAndGetStats(boolean doDelete) {

Review comment:
   @vinothchandar : I have added a flag here to indicate whether the delete has to be done or just the stats need to be collected. Since I don't want to duplicate code, I tried my best to re-use. If you can think of any other way, lmk.

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/UpgradeDowngradeHelper.java
##
@@ -0,0 +1,202 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.table;
+
+import org.apache.hudi.common.HoodieRollbackStat;
+import org.apache.hudi.common.table.HoodieTableConfig;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.HoodieTableVersion;
+import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.table.timeline.HoodieTimeline;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.exception.HoodieRollbackException;
+import org.apache.hudi.io.IOType;
+import org.apache.hudi.table.action.rollback.CopyOnWriteRollbackActionExecutor;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileUtil;
+import org.apache.hadoop.fs.Path;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+
+imp

[GitHub] [hudi] xushiyan commented on a change in pull request #1884: [HUDI-995] Use Transformations, Assertions and SchemaTestUtil

2020-07-29 Thread GitBox


xushiyan commented on a change in pull request #1884:
URL: https://github.com/apache/hudi/pull/1884#discussion_r462017090



##
File path: 
hudi-spark/src/test/java/org/apache/hudi/testutils/DataSourceTestUtils.java
##
@@ -1,71 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *  http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.hudi.testutils;
-
-import org.apache.hudi.common.model.HoodieKey;
-import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.common.model.HoodieRecordPayload;
-import org.apache.hudi.common.testutils.RawTripTestPayload;
-import org.apache.hudi.common.util.Option;
-import org.apache.hudi.table.UserDefinedBulkInsertPartitioner;
-
-import org.apache.spark.api.java.JavaRDD;
-
-import java.io.IOException;
-import java.util.List;
-import java.util.stream.Collectors;
-
-/**
- * Test utils for data source tests.
- */
-public class DataSourceTestUtils {

Review comment:
   all methods moved to `Transformations.java`

##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/testutils/Transformations.java
##
@@ -0,0 +1,100 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.common.testutils;
+
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Set;
+import java.util.stream.Collectors;
+
+/**
+ * Common transformations in test cases.
+ */
+public final class Transformations {
+
+  public static  List flatten(Iterator> iteratorOfLists) {
+List flattened = new ArrayList<>();
+iteratorOfLists.forEachRemaining(flattened::addAll);
+return flattened;
+  }
+
+  public static  Iterator flattenAsIterator(Iterator> 
iteratorOfLists) {
+return flatten(iteratorOfLists).iterator();
+  }
+
+  public static Set recordsToRecordKeySet(List records) {
+return 
records.stream().map(HoodieRecord::getRecordKey).collect(Collectors.toSet());
+  }
+
+  public static List recordsToHoodieKeys(List 
records) {
+return 
records.stream().map(HoodieRecord::getKey).collect(Collectors.toList());
+  }
+
+  public static List hoodieKeysToStrings(List keys) {
+return keys.stream()
+.map(hr -> "{\"_row_key\":\"" + hr.getRecordKey() + 
"\",\"partition\":\"" + hr.getPartitionPath() + "\"}")
+.collect(Collectors.toList());
+  }
+
+  public static List recordsToStrings(List records) {
+return 
records.stream().map(Transformations::recordToString).filter(Option::isPresent).map(Option::get)
+.collect(Collectors.toList());
+  }
+
+  public static Option recordToString(HoodieRecord record) {
+try {
+  String str = ((RawTripTestPayload) record.getData()).getJsonData();
+  str = "{" + str.substring(str.indexOf("\"timestamp\":"));
+  // Remove the last } bracket
+  str = str.substring(0, str.length() - 1);
+  return Option.of(str + ", \"partition\": \"" + record.getPartitionPath() 
+ "\"}");
+} catch (IOException e) {
+  return Option.empty();
+}
+  }
+
+  /**
+   * Pseudorandom: select even indices first, then select odd ones.
+   */
+  public static  List randomSelect(List items, int n) {

Review comment:
   IMO real randomness is not needed for the caller's scenario

##
File path: 
hudi-common/src/test/java/org/apache/hudi/common/table/ti

[GitHub] [hudi] yanghua commented on pull request #1870: [HUDI-808] Support cleaning bootstrap source data

2020-07-29 Thread GitBox


yanghua commented on pull request #1870:
URL: https://github.com/apache/hudi/pull/1870#issuecomment-664711400


   @zhedoubushishi There is a file with merge conflicts. Can you please fix it?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] umehrot2 commented on issue #1847: [SUPPORT] querying MoR tables on S3 becomes slow with number of files growing

2020-07-29 Thread GitBox


umehrot2 commented on issue #1847:
URL: https://github.com/apache/hudi/issues/1847#issuecomment-664709899


   @zuyanton yes, `fs.s3.cse.enabled` is required for client-side encryption to kick in. I wonder why you still have the `fs.s3.cse..kms.keyId` there. Also, you don't use the EmrFS consistent view, right?
   
   At this point I don't see a reason for `getLen()` taking time, since, as @bvaradar mentioned, it's just cached when the FileStatus is created. However, I would still suggest trying it after removing the unnecessary EmrFS configurations that you have. Another thing I would like you to do is enable EmrFS debug logs, by going to `/etc/spark/conf/log4j.properties` and adding an entry with `DEBUG` log level for the `com.amazon.ws.emr.hadoop.fs` namespace. This should give more information if any S3 calls are being made during that ~100 ms window. If it does not reveal anything, I will try to work with you internally to reproduce the issue.
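   For reference, the kind of entry meant here (log4j 1.x syntax; the exact EMR file location may vary by release):

   log4j.logger.com.amazon.ws.emr.hadoop.fs=DEBUG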



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto edited a comment on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-29 Thread GitBox


rubenssoto edited a comment on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-665432999







This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan opened a new pull request #1884: [HUDI-995] Use Transformations, Assertions and SchemaTestUtil

2020-07-29 Thread GitBox


xushiyan opened a new pull request #1884:
URL: https://github.com/apache/hudi/pull/1884


   There are testutils functions scattered around in different modules (client, 
spark, common) for common transformations like HoodieRecords to HoodieKeys. 
This is to organize them in `Transformations.java` for ease of discovery and 
use.
   
   ## Changes
   
   - Consolidate transform functions for tests in Transformations.java
   - Consolidate assertion functions for tests in Assertions.java
   - Make use of SchemaTestUtil for loading schema from resource
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1871: [WIP] [HUDI-781] Introduce HoodieDataPrep for test preparation

2020-07-29 Thread GitBox


vinothchandar commented on pull request #1871:
URL: https://github.com/apache/hudi/pull/1871#issuecomment-665363288


   On the DataPrep, I was wondering whether we should call it `HoodieTestTable` or something similar, since it is just creating some files, not really writing data. Also, our code will probably generate 1-2 commits in most places, so should it include helpers like fakeTestTable.generateTwoCommits() etc.? The idea here is to make tests shorter. I am fine with just beginning by standardizing, though.
   We should probably have another `HoodieTestTableWithData`, which actually generates a test table using the regular writeClient APIs, does sanity asserts, and actually writes data into these files.
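   Purely as a sketch of the shape being proposed (every class and method name here is hypothetical), the fixture could read something like:

   // hypothetical fluent test fixture; it only lays out files/instants on storage, no data is written
   HoodieTestTable testTable = HoodieTestTable.of(metaClient)
       .addCommit("001").withBaseFilesInPartition("2020/07/29", "f1", "f2")
       .addCommit("002").withBaseFilesInPartition("2020/07/29", "f3");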



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on a change in pull request #1819: [HUDI-1058] Make delete marker configurable

2020-07-29 Thread GitBox


shenh062326 commented on a change in pull request #1819:
URL: https://github.com/apache/hudi/pull/1819#discussion_r461266911



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/model/OverwriteWithLatestAvroPayload.java
##
@@ -36,6 +36,8 @@
 public class OverwriteWithLatestAvroPayload extends BaseAvroPayload
 implements HoodieRecordPayload {
 
+  private String deletedField = "_hoodie_is_deleted";

Review comment:
   Thanks for your comments, I will address them.

##
File path: 
hudi-spark/src/test/scala/org/apache/hudi/functional/HoodieSparkSqlWriterSuite.scala
##
@@ -100,6 +100,53 @@ class HoodieSparkSqlWriterSuite extends FunSuite with 
Matchers {
 }
   }
 
+  test("test OverwriteWithLatestAvroPayload with user defined delete field") {
+val session = SparkSession.builder()
+  .appName("test_append_mode")
+  .master("local[2]")
+  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+  .getOrCreate()
+val path = java.nio.file.Files.createTempDirectory("hoodie_test_path1")
+
+try {
+  val sqlContext = session.sqlContext
+  val hoodieFooTableName = "hoodie_foo_tbl"
+
+  val keyField = "id"
+  val deleteField = "delete_field"
+
+  //create a new table
+  val fooTableModifier = Map("path" -> path.toAbsolutePath.toString,
+HoodieWriteConfig.TABLE_NAME -> hoodieFooTableName,
+"hoodie.insert.shuffle.parallelism" -> "2",
+"hoodie.upsert.shuffle.parallelism" -> "2",
+DELETE_FIELD_OPT_KEY -> deleteField,
+RECORDKEY_FIELD_OPT_KEY -> keyField)
+  val fooTableParams = 
HoodieSparkSqlWriter.parametersWithWriteDefaults(fooTableModifier)
+
+  val dataFrame = session.createDataFrame(Seq(
+(12, "ming", 20.23, "2018-01-01T13:51:39.340396Z", false),

Review comment:
   The records here don't need to scale; I will change the Seq to contain only one element.

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/TestDeltaStreamerWithOverwriteLatestAvroPayload.java
##
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.utilities.functional;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.fs.Path;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.testutils.HoodieTestDataGenerator;
+import org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer;
+import org.apache.hudi.utilities.sources.ParquetDFSSource;
+import org.apache.hudi.utilities.testutils.UtilitiesTestBase;
+import org.junit.jupiter.api.BeforeAll;
+import org.junit.jupiter.api.Test;
+
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
+public class TestDeltaStreamerWithOverwriteLatestAvroPayload extends 
UtilitiesTestBase {
+  private static String PARQUET_SOURCE_ROOT;
+  private static final String PROPS_FILENAME_TEST_PARQUET = 
"test-parquet-dfs-source.properties";
+
+  @BeforeAll
+  public static void initClass() throws Exception {
+UtilitiesTestBase.initClass(true);
+PARQUET_SOURCE_ROOT = dfsBasePath + "/parquetFiles";
+
+// prepare the configs.
+
UtilitiesTestBase.Helpers.copyToDFS("delta-streamer-config/base.properties", 
dfs, dfsBasePath + "/base.properties");
+
UtilitiesTestBase.Helpers.copyToDFS("delta-streamer-config/sql-transformer.properties",
 dfs,
+dfsBasePath + "/sql-transformer.properties");
+UtilitiesTestBase.Helpers.copyToDFS("delta-streamer-config/source.avsc", 
dfs, dfsBasePath + "/source.avsc");
+
UtilitiesTestBase.Helpers.copyToDFS("delta-streamer-config/source-flattened.avsc",
 dfs, dfsBasePath + "/source-flattened.avsc");
+UtilitiesTestBase.Helpers.copyToDFS("delta-streamer-config/target.avsc", 
dfs, dfsBasePath + "/target.avsc");
+  }
+
+  private static List genericRecords(int n, boolean 
isDeleteRecord, int instantTime) {
+return IntStream.range(0, n).boxed().map(i -> {
+  String partitionPath = "partitionPath1";
+  HoodieKey key = new HoodieKey("id_" + i

[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-07-29 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r461108736



##
File path: hudi-client/pom.xml
##
@@ -102,6 +102,12 @@
   spark-sql_${scala.binary.version}
 
 
+
+

Review comment:
   we have moved to the spark-avro module from Apache Spark, IIRC. we should not use com.databricks anymore, right?

##
File path: 
hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+/**
+ * Hoodie merge handle which writes records (new inserts or updates) sorted by 
their key.
+ *
+ * The implementation performs a merge-sort by comparing the key of the record 
being written to the list of
+ * keys in newRecordKeys (sorted in-memory).
+ */
+public class HoodieSortedMergeHandle extends 
HoodieMergeHandle {
+
+  private Queue newRecordKeysSorted = new PriorityQueue<>();

Review comment:
   for now, I guess it's okay to assume the records will fit into memory? eventually we need to make this sorting spillable (e.g. using RocksDB) for the RFC-08 indexing work

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
##
@@ -0,0 +1,159 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.common.table.log.block;
+
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.HoodieLogFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.io.storage.HoodieHFileReader;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import org.apache.avro.Schema;
+import org.apache.avro.Schema.Field;
+import org.apache.avro.generic.IndexedRecord;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FSDataOutputStream;
+import org.apache.hadoop.hbase.KeyValue;
+import org.apache.hadoop.hbase.io.compress.Compression;
+import org.apache.hadoop.hbase.io.hfile.CacheConfig;
+import org.apache.hadoop.hbase.io.hfile.HFile;
+import org.apache.hadoop.hbase.io.hfile.HFileContext;
+import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
+import org.apache.hadoop.hbase.util.Pair;
+
+import java.io.ByteArrayOutputStream;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.TreeMap;
+import java.util.stream.Collectors;
+
+import javax.annotation.Nonnull;
+
+/**
+ * HoodieHFileDataBlock contains a list of records stored inside an HFile 
format. It is used with the HFile
+ * base file format.
+ */
+public c

[GitHub] [hudi] yanghua merged pull request #1774: [HUDI-703]Add unit test for HoodieSyncCommand

2020-07-29 Thread GitBox


yanghua merged pull request #1774:
URL: https://github.com/apache/hudi/pull/1774


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #1878: [SUPPORT] Spark Structured Streaming To Hudi Sink Datasource taking much longer

2020-07-29 Thread GitBox


rubenssoto commented on issue #1878:
URL: https://github.com/apache/hudi/issues/1878#issuecomment-665432999


   Hi bvaradar, how are you? I hope you are doing fine!
   
   I have a new case, which is a little more important to me; the problem is almost the same. I adopted the strategy of first batch-loading all data with an insert operation and, after that, getting the latest data with structured streaming.
   
   To answer your question, all my tables have a PK with integer ids, and normally they are auto-increment. Does Hudi already order data by PK in an insert operation? In my first batch I am sorting the data by date; is that necessary?
   
   I think I have the CoW problem that you described. I have an order table with my clients' orders; every minute new orders arrive, and my clients can grade an order at any point in time. For example, a streaming batch could contain a grade for an order that was made last month.
   
   This table is very small today; the Hudi dataset has 15 files of 500 MB each. I didn't partition the table because a daily partition would be small, and partitioning by month doesn't seem to make sense.
   My streaming is running right now, but Hudi rewrites all 15 files on every streaming batch. My data is small, so it's fine for now, but I think it is not efficient, and when the data grows it could become a problem.
   
   I will use AWS Athena to query all my tables, and this specific order table may be delayed by up to 15 minutes. I saw that Athena only queries the read-optimized view of MoR tables; how could MoR help me in this case?
   
   The last question: in an insert operation, how can I control the file size?
   
   Thank you for your time!
   
   Some images of my streaming:
   https://user-images.githubusercontent.com/36298331/88758874-ea2a3180-d13f-11ea-914c-268135f002f9.png
   https://user-images.githubusercontent.com/36298331/88758879-ebf3f500-d13f-11ea-9f13-0e731940b605.png
   https://user-images.githubusercontent.com/36298331/88758885-ee564f00-d13f-11ea-802b-c896de02ded7.png
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] sbernauer commented on issue #1845: [SUPPORT] Support for Schema evolution. Facing an error

2020-07-29 Thread GitBox


sbernauer commented on issue #1845:
URL: https://github.com/apache/hudi/issues/1845#issuecomment-664807718


   Is there anything we can do further to resolve this issue?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan opened a new pull request #1882: [MINOR] Fixing default index parallelism for simple index

2020-07-29 Thread GitBox


nsivabalan opened a new pull request #1882:
URL: https://github.com/apache/hudi/pull/1882


   ## What is the purpose of the pull request
   
   Fixing default value for simple index parallelism
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1880: [HUDI-1125] build framework to support structured streaming

2020-07-29 Thread GitBox


vinothchandar commented on pull request #1880:
URL: https://github.com/apache/hudi/pull/1880#issuecomment-665288297


   This is a good addition.
   +1 on @yanghua's comments on adding tests and completeness of the feature.
   
   Can we implement this such that users can do `readStream()` using commit times? This is a very desired feature on Spark.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1125) build framework to support structured streaming

2020-07-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1125:
-
Labels: pull-request-available  (was: )

> build framework  to support   structured streaming 
> ---
>
> Key: HUDI-1125
> URL: https://issues.apache.org/jira/browse/HUDI-1125
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: linshan-ma
>Assignee: linshan-ma
>Priority: Major
>  Labels: pull-request-available
>
> build framework to support structured streaming
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bhasudha opened a new pull request #1883: [MINOR] change log.info to log.debug

2020-07-29 Thread GitBox


bhasudha opened a new pull request #1883:
URL: https://github.com/apache/hudi/pull/1883


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   CI jobs sometimes fail with an error suggesting the job log exceeded its length limit. This usually means there is some profuse logging that could be trimmed. This is a minor PR that fixes one such logging call introduced in https://github.com/apache/hudi/pull/1503.
   
   
   ## Brief change log
   
   Change logging level.
   
   ## Verify this pull request
   
   [x] This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org