Build failed in Jenkins: hudi-snapshot-deployment-0.5 #370

2020-08-14 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.56 KB...]
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.1-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.1-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.1-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.1-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.1-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${scala.binary.version}:[unknown-version],
 

 line 27, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effec

[GitHub] [hudi] nsivabalan commented on issue #1962: [SUPPORT] Unable to filter hudi table in hive on partition column

2020-08-14 Thread GitBox


nsivabalan commented on issue #1962:
URL: https://github.com/apache/hudi/issues/1962#issuecomment-674302417


   @bhasudha / @bvaradar: do you folks have any pointers here? It looks like the 
input format is not getting set. 
   ```
   Error: Could not open client transport for any of the Server URI's in 
ZooKeeper: Failed to open new session: java.lang.IllegalArgumentException: 
Cannot modify hive.input.format at runtime. It is not in list of params that 
are allowed to be modified at runtime (state=08S01,code=0)
   ```
   Would that be the issue here? 
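   If the runtime restriction above is the blocker, a possible workaround (a sketch 
only, not verified against this setup) is to make the setting static or whitelist it 
in hive-site.xml; the input format value below is purely illustrative:
   ```xml
   <!-- hive-site.xml sketch: illustrative values only -->
   <property>
     <name>hive.input.format</name>
     <value>org.apache.hadoop.hive.ql.io.HiveInputFormat</value>
   </property>
   <property>
     <!-- lets sessions run "set hive.input.format=..." despite the runtime restriction -->
     <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
     <value>hive\.input\.format</value>
   </property>
   ```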



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch release-0.6.0 created (now efb3025)

2020-08-14 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a change to branch release-0.6.0
in repository https://gitbox.apache.org/repos/asf/hudi.git.


  at efb3025  Create release branch for version 0.6.0.

This branch includes the following new commits:

 new efb3025  Create release branch for version 0.6.0.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.




[hudi] 01/01: Create release branch for version 0.6.0.

2020-08-14 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch release-0.6.0
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit efb3025003cd2c1c22e64ec7580450b14f96a190
Author: Bhavani Sudha Saktheeswaran 
AuthorDate: Fri Aug 14 12:56:03 2020 -0700

Create release branch for version 0.6.0.
---
 docker/hoodie/hadoop/base/pom.xml | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml | 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml| 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml| 2 +-
 docker/hoodie/hadoop/namenode/pom.xml | 2 +-
 docker/hoodie/hadoop/pom.xml  | 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml   | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml   | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml   | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml  | 2 +-
 hudi-cli/pom.xml  | 2 +-
 hudi-client/pom.xml   | 2 +-
 hudi-common/pom.xml   | 2 +-
 hudi-examples/pom.xml | 2 +-
 hudi-hadoop-mr/pom.xml| 2 +-
 hudi-integ-test/pom.xml   | 2 +-
 hudi-spark/pom.xml| 2 +-
 hudi-sync/hudi-dla-sync/pom.xml   | 2 +-
 hudi-sync/hudi-hive-sync/pom.xml  | 2 +-
 hudi-sync/hudi-sync-common/pom.xml| 2 +-
 hudi-sync/pom.xml | 2 +-
 hudi-timeline-service/pom.xml | 2 +-
 hudi-utilities/pom.xml| 2 +-
 packaging/hudi-hadoop-mr-bundle/pom.xml   | 2 +-
 packaging/hudi-hive-sync-bundle/pom.xml   | 2 +-
 packaging/hudi-integ-test-bundle/pom.xml  | 2 +-
 packaging/hudi-presto-bundle/pom.xml  | 2 +-
 packaging/hudi-spark-bundle/pom.xml   | 2 +-
 packaging/hudi-timeline-server-bundle/pom.xml | 2 +-
 packaging/hudi-utilities-bundle/pom.xml   | 2 +-
 pom.xml   | 2 +-
 32 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/docker/hoodie/hadoop/base/pom.xml 
b/docker/hoodie/hadoop/base/pom.xml
index 55205ee..d5b1020 100644
--- a/docker/hoodie/hadoop/base/pom.xml
+++ b/docker/hoodie/hadoop/base/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.0-rc1
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/datanode/pom.xml 
b/docker/hoodie/hadoop/datanode/pom.xml
index e8c95f9..98575bc 100644
--- a/docker/hoodie/hadoop/datanode/pom.xml
+++ b/docker/hoodie/hadoop/datanode/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.0-rc1
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/historyserver/pom.xml 
b/docker/hoodie/hadoop/historyserver/pom.xml
index 725cdcf..5ecf21b 100644
--- a/docker/hoodie/hadoop/historyserver/pom.xml
+++ b/docker/hoodie/hadoop/historyserver/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.0-rc1
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/hive_base/pom.xml 
b/docker/hoodie/hadoop/hive_base/pom.xml
index 399f7b7..9dda8f3 100644
--- a/docker/hoodie/hadoop/hive_base/pom.xml
+++ b/docker/hoodie/hadoop/hive_base/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.0-rc1
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/namenode/pom.xml 
b/docker/hoodie/hadoop/namenode/pom.xml
index 4ec1f9a..dcbe4be 100644
--- a/docker/hoodie/hadoop/namenode/pom.xml
+++ b/docker/hoodie/hadoop/namenode/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.0-rc1
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/pom.xml b/docker/hoodie/hadoop/pom.xml
index bedd3b4..97ed1dc 100644
--- a/docker/hoodie/hadoop/pom.xml
+++ b/docker/hoodie/hadoop/pom.xml
@@ -19,7 +19,7 @@
   
 hudi
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.0-rc1
 ../../../pom.xml
   
   4.0.0
diff --git a/docker/hoodie/hadoop/prestobase/pom.xml 
b/docker/hoodie/hadoop/prestobase/pom.xml
index 2ba319c..14b6e4c 100644
--- a/docker/hoodie/hadoop/prestobase/pom.xml
+++ b/docker/hoodie/hadoop/prestobase/pom.xml
@@ -22,7 +22,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.0-rc1
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/spark_base/pom.xml 
b/docker/hoodie/hadoop/spark_base/pom.xml
index 6385305..c2d2302 100644
--- a/docker/hoodie/hadoop/spark_base/pom.xml
+++ b/docker/hoodie/hadoop/spark_base/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.0-rc1
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/sparkadhoc/pom.xml 
b/docker/hoodie/hadoop/sparkadhoc/pom.xml
index c1babf4..fb7bab1 100644
--- a/docker/hoodie/hadoop/sparkadhoc/pom.xml
+++ b/docker/hoodie/hadoop/sparkadhoc/pom

[hudi] branch master updated: Moving to 0.6.1-SNAPSHOT on master branch.

2020-08-14 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 4226d75  Moving to 0.6.1-SNAPSHOT on master branch.
4226d75 is described below

commit 4226d7514400d86761e39639e9554809b72b627c
Author: Bhavani Sudha Saktheeswaran 
AuthorDate: Fri Aug 14 12:54:15 2020 -0700

Moving to 0.6.1-SNAPSHOT on master branch.
---
 docker/hoodie/hadoop/base/pom.xml | 2 +-
 docker/hoodie/hadoop/datanode/pom.xml | 2 +-
 docker/hoodie/hadoop/historyserver/pom.xml| 2 +-
 docker/hoodie/hadoop/hive_base/pom.xml| 2 +-
 docker/hoodie/hadoop/namenode/pom.xml | 2 +-
 docker/hoodie/hadoop/pom.xml  | 2 +-
 docker/hoodie/hadoop/prestobase/pom.xml   | 2 +-
 docker/hoodie/hadoop/spark_base/pom.xml   | 2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml   | 2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml  | 2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml  | 2 +-
 hudi-cli/pom.xml  | 2 +-
 hudi-client/pom.xml   | 2 +-
 hudi-common/pom.xml   | 2 +-
 hudi-examples/pom.xml | 2 +-
 hudi-hadoop-mr/pom.xml| 2 +-
 hudi-integ-test/pom.xml   | 2 +-
 hudi-spark/pom.xml| 2 +-
 hudi-sync/hudi-dla-sync/pom.xml   | 2 +-
 hudi-sync/hudi-hive-sync/pom.xml  | 2 +-
 hudi-sync/hudi-sync-common/pom.xml| 2 +-
 hudi-sync/pom.xml | 2 +-
 hudi-timeline-service/pom.xml | 2 +-
 hudi-utilities/pom.xml| 2 +-
 packaging/hudi-hadoop-mr-bundle/pom.xml   | 2 +-
 packaging/hudi-hive-sync-bundle/pom.xml   | 2 +-
 packaging/hudi-integ-test-bundle/pom.xml  | 2 +-
 packaging/hudi-presto-bundle/pom.xml  | 2 +-
 packaging/hudi-spark-bundle/pom.xml   | 2 +-
 packaging/hudi-timeline-server-bundle/pom.xml | 2 +-
 packaging/hudi-utilities-bundle/pom.xml   | 2 +-
 pom.xml   | 2 +-
 32 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/docker/hoodie/hadoop/base/pom.xml 
b/docker/hoodie/hadoop/base/pom.xml
index 55205ee..459379d 100644
--- a/docker/hoodie/hadoop/base/pom.xml
+++ b/docker/hoodie/hadoop/base/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.1-SNAPSHOT
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/datanode/pom.xml 
b/docker/hoodie/hadoop/datanode/pom.xml
index e8c95f9..f7406a1 100644
--- a/docker/hoodie/hadoop/datanode/pom.xml
+++ b/docker/hoodie/hadoop/datanode/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.1-SNAPSHOT
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/historyserver/pom.xml 
b/docker/hoodie/hadoop/historyserver/pom.xml
index 725cdcf..da90fa0 100644
--- a/docker/hoodie/hadoop/historyserver/pom.xml
+++ b/docker/hoodie/hadoop/historyserver/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.1-SNAPSHOT
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/hive_base/pom.xml 
b/docker/hoodie/hadoop/hive_base/pom.xml
index 399f7b7..220483e 100644
--- a/docker/hoodie/hadoop/hive_base/pom.xml
+++ b/docker/hoodie/hadoop/hive_base/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.1-SNAPSHOT
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/namenode/pom.xml 
b/docker/hoodie/hadoop/namenode/pom.xml
index 4ec1f9a..6e1dfd2 100644
--- a/docker/hoodie/hadoop/namenode/pom.xml
+++ b/docker/hoodie/hadoop/namenode/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.1-SNAPSHOT
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/pom.xml b/docker/hoodie/hadoop/pom.xml
index bedd3b4..efb1153 100644
--- a/docker/hoodie/hadoop/pom.xml
+++ b/docker/hoodie/hadoop/pom.xml
@@ -19,7 +19,7 @@
   
 hudi
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.1-SNAPSHOT
 ../../../pom.xml
   
   4.0.0
diff --git a/docker/hoodie/hadoop/prestobase/pom.xml 
b/docker/hoodie/hadoop/prestobase/pom.xml
index 2ba319c..5f3cd4c 100644
--- a/docker/hoodie/hadoop/prestobase/pom.xml
+++ b/docker/hoodie/hadoop/prestobase/pom.xml
@@ -22,7 +22,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.1-SNAPSHOT
   
   4.0.0
   pom
diff --git a/docker/hoodie/hadoop/spark_base/pom.xml 
b/docker/hoodie/hadoop/spark_base/pom.xml
index 6385305..98ad8c9 100644
--- a/docker/hoodie/hadoop/spark_base/pom.xml
+++ b/docker/hoodie/hadoop/spark_base/pom.xml
@@ -19,7 +19,7 @@
   
 hudi-hadoop-docker
 org.apache.hudi
-0.6.0-SNAPSHOT
+0.6.1-SNAPSHOT
   
   4.0.0
   pom
diff --git a/docke

[jira] [Updated] (HUDI-1031) Document how to set job scheduling configs for Async compaction

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1031:

Priority: Major  (was: Blocker)

> Document how to set job scheduling configs for Async compaction 
> 
>
> Key: HUDI-1031
> URL: https://issues.apache.org/jira/browse/HUDI-1031
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
>  
> In case of DeltaStreamer, Spark job scheduling configs are set automatically. 
> As the configs need to be set before the Spark context is initialized, this is not 
> fully automated for Structured Streaming 
> [https://spark.apache.org/docs/latest/job-scheduling.html]
> We need to document how to set job scheduling configs for Spark Structured 
> Streaming.
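
As an illustration only (not the documented guidance this ticket asks for), the 
scheduling configs can be supplied when the SparkSession is built, i.e. before the 
context exists; the pool name and allocation file path below are assumptions:

{code:scala}
// Sketch: supply Spark job scheduling configs before the SparkContext is created.
// The pool name and fairscheduler.xml path are illustrative assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("structured-streaming-with-async-compaction")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  .getOrCreate()

// Individual streaming/compaction threads can then be pinned to a pool:
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "compaction")
{code}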



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-428) Web documentation for explaining how to bootstrap

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-428:
---
Priority: Major  (was: Blocker)

> Web documentation for explaining how to bootstrap 
> --
>
> Key: HUDI-428
> URL: https://issues.apache.org/jira/browse/HUDI-428
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Need to provide examples (demo) to document bootstrapping



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-882) Update documentation with new configs for 0.6.0 release

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-882:
---
Priority: Major  (was: Blocker)

> Update documentation with new configs for 0.6.0 release
> ---
>
> Key: HUDI-882
> URL: https://issues.apache.org/jira/browse/HUDI-882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Umbrella ticket to track new configurations that need to be added to the docs 
> page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1145) Debug if Insert operation calls upsert in case of small file handling path.

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1145:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Debug if Insert operation calls upsert in case of small file handling path.
> ---
>
> Key: HUDI-1145
> URL: https://issues.apache.org/jira/browse/HUDI-1145
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.1
>
>
> INSERT operations may be triggering UPSERT internally in the Merging process 
> when dealing with small files. This surfaced out of a Slack thread. Need to 
> confirm if this is indeed happening. If yes, this needs to be fixed such 
> that the MERGE HANDLE should not call upsert and instead let conflicting 
> records into the file if it is an INSERT operation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1118) Cleanup rollback files residing in .hoodie folder

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1118:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Cleanup rollback files residing in .hoodie folder
> -
>
> Key: HUDI-1118
> URL: https://issues.apache.org/jira/browse/HUDI-1118
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Hudi Archiving takes care of archiving all metadata files in .hoodie folder 
> except rollback files. Rollback metadata also needs to be cleaned up in the same 
> way as the others.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1172) Use OverwriteWithLatestAvroPayload as default payload class everywhere

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1172:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Use OverwriteWithLatestAvroPayload as default payload class everywhere
> --
>
> Key: HUDI-1172
> URL: https://issues.apache.org/jira/browse/HUDI-1172
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.1
>
>
> We are using HoodieAvroPayload and OverwriteWithLatestAvroPayload as defaults 
> in different use-cases (DeltaStreamer, Spark DataSource, CLI). It is easier 
> to reason about and fix this when we club it with the upgrade/downgrade work. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1112) Blog on Tracking Hudi Data along transaction time and business time

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1112:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Blog on Tracking Hudi Data along transaction time and business time
> ---
>
> Key: HUDI-1112
> URL: https://issues.apache.org/jira/browse/HUDI-1112
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: Vinoth Chandar
>Assignee: Sandeep Maji
>Priority: Major
> Fix For: 0.6.1
>
>
> https://github.com/apache/hudi/issues/1705



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1079) Cannot upsert on schema with Array of Record with single field

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1079:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Cannot upsert on schema with Array of Record with single field
> --
>
> Key: HUDI-1079
> URL: https://issues.apache.org/jira/browse/HUDI-1079
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.5.3
> Environment: spark 2.4.4, local 
>Reporter: Adrian Tanase
>Priority: Major
> Fix For: 0.6.1
>
>
> I am trying to trigger upserts on a table that has an array field with 
> records of just one field.
>  Here is the code to reproduce:
> {code:scala}
>   val spark = SparkSession.builder()
>   .master("local[1]")
>   .appName("SparkByExamples.com")
>   .config("spark.serializer", 
> "org.apache.spark.serializer.KryoSerializer")
>   .getOrCreate();
>   // https://sparkbyexamples.com/spark/spark-dataframe-array-of-struct/
>   val arrayStructData = Seq(
> Row("James",List(Row("Java","XX",120),Row("Scala","XA",300))),
> Row("Michael",List(Row("Java","XY",200),Row("Scala","XB",500))),
> Row("Robert",List(Row("Java","XZ",400),Row("Scala","XC",250))),
> Row("Washington",null)
>   )
>   val arrayStructSchema = new StructType()
>   .add("name",StringType)
>   .add("booksIntersted",ArrayType(
> new StructType()
>   .add("bookName",StringType)
> //  .add("author",StringType)
> //  .add("pages",IntegerType)
>   ))
> val df = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
> {code}
> Running insert following by upsert will fail:
> {code:scala}
>   df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, "COPY_ON_WRITE")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Overwrite)
>   .save(basePath)
>   df.write
>   .format("hudi")
>   .options(getQuickstartWriteConfigs)
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "name")
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "name")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .mode(Append)
>   .save(basePath)
> {code}
> If I create the books record with all the fields (at least 2), it works as 
> expected.
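
For illustration, the workaround described above (giving the nested record at least 
two fields) is sketched below; it is derived from the reproduction schema, not a fix:

{code:scala}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Sketch of the workaround: a second field in the nested record avoids the
// single-field group that trips the Avro/Parquet converter.
val arrayStructSchemaWorkaround = new StructType()
  .add("name", StringType)
  .add("booksIntersted", ArrayType(
    new StructType()
      .add("bookName", StringType)
      .add("author", StringType) // keeping at least two fields avoids the ClassCastException
  ))
{code}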
> The relevant part of the exception is this:
> {noformat}
> Caused by: java.lang.ClassCastException: required binary bookName (UTF8) is 
> not a groupCaused by: java.lang.ClassCastException: required binary bookName 
> (UTF8) is not a group at 
> org.apache.parquet.schema.Type.asGroupType(Type.java:207) at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:279)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:232)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.access$100(AvroRecordConverter.java:78)
>  at 
> org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter$ElementConverter.(AvroRecordConverter.java:536)
>  at 
> org.apache.parquet.avro.AvroRecordConverter$AvroCollectionConverter.(AvroRecordConverter.java:486)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.newConverter(AvroRecordConverter.java:289)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:141)
>  at 
> org.apache.parquet.avro.AvroRecordConverter.(AvroRecordConverter.java:95)
>  at 
> org.apache.parquet.avro.AvroRecordMaterializer.(AvroRecordMaterializer.java:33)
>  at 
> org.apache.parquet.avro.AvroReadSupport.prepareForRead(AvroReadSupport.java:138)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:183)
>  at 
> org.apache.parquet.hadoop.ParquetReader.initReader(ParquetReader.java:156) at 
> org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:135) at 
> org.apache.hudi.client.utils.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
>  at 
> org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ... 4 
> more{noformat}
> Another way to test is by changing the generated data in the tips to just the 
> amount (dropping the currency on the tips_history field); tests will start 
> failing:
>  
> [https://github.com/apa

[jira] [Updated] (HUDI-1033) Remove redundant CLI tests

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1033:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Remove redundant CLI tests 
> ---
>
> Key: HUDI-1033
> URL: https://issues.apache.org/jira/browse/HUDI-1033
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.6.1
>
>
> There are some tests like ITTestRepairsCommand vs TestRepairsCommand, 
> ITTestCleanerCommand vs TestCleanerCommand. Please consolidate if they are 
> redundant.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1055) Ensure hardcoded storage type ".parquet" is removed from tests

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1055:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Ensure hardcoded storage type ".parquet" is removed from tests
> --
>
> Key: HUDI-1055
> URL: https://issues.apache.org/jira/browse/HUDI-1055
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Prashant Wason
>Priority: Major
> Fix For: 0.6.1
>
>
>  Follow up : https://github.com/apache/hudi/pull/1687#issuecomment-649754943



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1036) HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1036:

Fix Version/s: (was: 0.6.0)
   0.6.1

> HoodieCombineHiveInputFormat not picking up HoodieRealtimeFileSplit
> ---
>
> Key: HUDI-1036
> URL: https://issues.apache.org/jira/browse/HUDI-1036
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Bhavani Sudha
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.6.1
>
>
> Opening this Jira based on the GitHub issue reported here - 
> [https://github.com/apache/hudi/issues/1735]. When hive.input.format = 
> org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, it is not able to 
> create a HoodieRealtimeFileSplit for querying the _rt table. Please see the GitHub 
> issue for more details.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1038) Adding perf benchmark using jmh to Hudi

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1038:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Adding perf benchmark using jmh to Hudi
> ---
>
> Key: HUDI-1038
> URL: https://issues.apache.org/jira/browse/HUDI-1038
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Performance
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-483) Fix unit test for Archiving to reflect empty instant files for requested commit/deltacommits

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-483:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Fix unit test for Archiving to reflect empty instant files for requested 
> commit/deltacommits
> 
>
> Key: HUDI-483
> URL: https://issues.apache.org/jira/browse/HUDI-483
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: Balaji Varadarajan
>Assignee: lamber-ken
>Priority: Minor
> Fix For: 0.6.1
>
>
> This came up during review:
> [https://github.com/apache/incubator-hudi/pull/1128#discussion_r361734393]
> HoodieTestDataGenerator.createCommitFile() creates requested files with 
> proper commit metadata for test data generation. It needs to create an empty 
> file to reflect reality.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1193) Upgrade http dependent version

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1193:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Upgrade http dependent version
> --
>
> Key: HUDI-1193
> URL: https://issues.apache.org/jira/browse/HUDI-1193
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Upgrade the http dependency versions.
>  
> Hudi currently uses
> <dependency>
>  <groupId>org.apache.httpcomponents</groupId>
>  <artifactId>httpcore</artifactId>
>  <version>4.3.2</version>
> </dependency>
> <dependency>
>  <groupId>org.apache.httpcomponents</groupId>
>  <artifactId>httpclient</artifactId>
>  <version>4.3.6</version>
> </dependency>
>  
> This will cause Hudi to be unable to write to OSS; OSS requires version 4.4.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-691) hoodie.*.consume.* should be set whitelist in hive-site.xml

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-691:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> hoodie.*.consume.* should be set whitelist in hive-site.xml
> ---
>
> Key: HUDI-691
> URL: https://issues.apache.org/jira/browse/HUDI-691
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs, newbie
>Reporter: Bhavani Sudha
>Assignee: GarudaGuo
>Priority: Minor
> Fix For: 0.6.1
>
>
> More details in this GH issue - 
> https://github.com/apache/incubator-hudi/issues/910
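
A possible hive-site.xml entry for this (a sketch; the exact regex may need tuning 
per deployment):

{code:xml}
<property>
  <!-- allow per-session "set hoodie.<table>.consume.*" in HiveServer2 -->
  <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
  <value>hoodie\..*</value>
</property>
{code}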



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1024) Document S3 related guide and tips

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1024:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Document S3 related guide and tips
> --
>
> Key: HUDI-1024
> URL: https://issues.apache.org/jira/browse/HUDI-1024
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Raymond Xu
>Priority: Minor
>  Labels: documentation
> Fix For: 0.6.1
>
>
> Create a section in docs website for Hudi on S3



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-778) Add code coverage badge to README file

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-778:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Add code coverage badge to README file
> --
>
> Key: HUDI-778
> URL: https://issues.apache.org/jira/browse/HUDI-778
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Ramachandran M S
>Assignee: Ramachandran M S
>Priority: Minor
>  Labels: pull-request-available, test
> Fix For: 0.6.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-977) deal the test module hudi-integ-test

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-977:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> deal the test module hudi-integ-test
> 
>
> Key: HUDI-977
> URL: https://issues.apache.org/jira/browse/HUDI-977
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: lichangfu
>Priority: Minor
> Fix For: 0.6.1
>
>
> When I build the project on Windows, the following error occurs. Could this 
> module be excluded?
>  
> [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.6.0:exec 
> (Setup HUDI_WS) on project hudi-integ-test: Command execution failed.: Cannot 
> run program "\bin\bash" (in directory 
> "D:\code-repository\github\hudi\hudi-integ-test"): CreateProcess error=2, 
> The system cannot find the file specified -> [Help 1]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1053) Make ComplexKeyGenerator also support non partitioned Hudi dataset

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1053:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Make ComplexKeyGenerator also support non partitioned Hudi dataset
> --
>
> Key: HUDI-1053
> URL: https://issues.apache.org/jira/browse/HUDI-1053
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Storage Management, Writer Core
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
> Fix For: 0.6.1
>
>
> Currently, when using ComplexKeyGenerator, a `default` partition is assumed. 
> Recently there has been interest in supporting non-partitioned Hudi datasets 
> that use ComplexKeyGenerator. This GitHub issue has context - 
> https://github.com/apache/hudi/issues/1747



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1081) Document AWS Hudi integration

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1081:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Document AWS Hudi integration
> -
>
> Key: HUDI-1081
> URL: https://issues.apache.org/jira/browse/HUDI-1081
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs, Usability
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Minor
>  Labels: documentation
> Fix For: 0.6.1
>
>
> Oftentimes AWS Hudi users seek documentation on setting up Hudi and 
> integrating Hive metastore and Glue configurations. This has been one of the 
> popular threads in Slack. It would serve well if documented.
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1103) Improve the code format of Delete data demo in Quick-Start Guide

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1103:

Component/s: Docs

> Improve the code format of Delete data demo in Quick-Start Guide
> 
>
> Key: HUDI-1103
> URL: https://issues.apache.org/jira/browse/HUDI-1103
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Docs
>Reporter: wangxianghu
>Assignee: Trevorzhang
>Priority: Minor
> Fix For: 0.6.0
>
>
> Currently, the delete data demo code is not runnable in spark-shell 
> {code:java}
> scala> val df = spark
> df: org.apache.spark.sql.SparkSession = 
> org.apache.spark.sql.SparkSession@74e7d97bscala>   .read
> :1: error: illegal start of definition
>   .read
>   ^scala>   .json(spark.sparkContext.parallelize(deletes, 2))
> :1: error: illegal start of definition
>   .json(spark.sparkContext.parallelize(deletes, 2))
>   ^
> {code}
> The dot should be at the end of the previous line, or a "\" should be added at the end.
>  
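
For reference, a spark-shell friendly form of the same snippet keeps the dots at the 
end of each line (a sketch of the formatting fix, assuming `deletes` comes from the 
earlier quickstart steps):

{code:scala}
// trailing dots keep the chained call as one expression in spark-shell
val df = spark.
  read.
  json(spark.sparkContext.parallelize(deletes, 2))
{code}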



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1139) Add support for JuiceFS

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1139:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Add support for JuiceFS
> ---
>
> Key: HUDI-1139
> URL: https://issues.apache.org/jira/browse/HUDI-1139
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: cadl
>Priority: Minor
> Fix For: 0.6.1
>
>
> JuiceFS is a POSIX-compatible shared filesystem based on object storage 
> services. It also provides a Hadoop FileSystem SDK, so we can access JuiceFS with the `jfs://` 
> scheme.
>  
> JuiceFS: [https://juicefs.com|https://juicefs.com/]
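
Purely as an illustration of the `jfs://` scheme (the volume name and path below are 
made up, and the JuiceFS Hadoop SDK is assumed to be on the classpath), a Hudi write 
would only differ in the base path:

{code:scala}
// Illustrative only: mirrors the quickstart write snippet, with a jfs:// base path
df.write.format("hudi").
  options(getQuickstartWriteConfigs).
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  mode(Overwrite).
  save("jfs://myjfs/warehouse/hudi_trips")
{code}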



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-251) JDBC incremental load to HUDI with DeltaStreamer

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-251:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> JDBC incremental load to HUDI with DeltaStreamer
> 
>
> Key: HUDI-251
> URL: https://issues.apache.org/jira/browse/HUDI-251
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: Taher Koitawala
>Assignee: Purushotham Pushpavanthar
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Mirroring RDBMS to HUDI is one of the most basic use cases of HUDI. Hence, 
> for such use cases, DeltaStreamer should provide inbuilt support.
> DeltaStreamer should accept something like jdbc-source.properties where users 
> can define the RDBMS connection properties along with a timestamp column and 
> an interval, which allows users to express how frequently Hudi should check 
> the RDBMS data source for new inserts or updates.
> Details are documented in RFC-14
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+14+%3A+JDBC+incremental+puller
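
Purely as a sketch of what such a file could look like (the property keys below are 
hypothetical; the real names are defined in RFC-14 and the linked PR):

{noformat}
# hypothetical jdbc-source.properties sketch (key names are illustrative)
hoodie.deltastreamer.jdbc.url=jdbc:mysql://host:3306/sales
hoodie.deltastreamer.jdbc.user=hudi
hoodie.deltastreamer.jdbc.password.file=/secrets/jdbc.pass
hoodie.deltastreamer.jdbc.table.name=orders
hoodie.deltastreamer.jdbc.table.incr.column.name=updated_at
hoodie.deltastreamer.jdbc.incr.pull=true
{noformat}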



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-623) Remove UpgradePayloadFromUberToApache

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-623:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Remove UpgradePayloadFromUberToApache
> -
>
> Key: HUDI-623
> URL: https://issues.apache.org/jira/browse/HUDI-623
> Project: Apache Hudi
>  Issue Type: Wish
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Trivial
> Fix For: 0.6.1
>
>
> {{UpgradePayloadFromUberToApache}} was used to convert the package names from the 
> pattern {{com.uber.hoodie}} to {{org.apache.hudi}}. It's a one-shot job. 
> Since we have done this work, IMO, we can remove this class.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1010:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Yanjia Gary Li
>Assignee: Nishith Agarwal
>Priority: Major
>  Labels: help-wanted
> Fix For: 0.6.1
>
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, which could be some resources 
> not being properly released during cleanup. The memory consumption was 
> accumulating over time and led to the Travis CI failure. 
> Using the IntelliJ memory analysis tool, we can see the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PR: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1007) When earliestOffsets is greater than checkpoint, Hudi will not be able to successfully consume data

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1007:

Fix Version/s: (was: 0.6.0)
   0.6.1

> When earliestOffsets is greater than checkpoint, Hudi will not be able to 
> successfully consume data
> ---
>
> Key: HUDI-1007
> URL: https://issues.apache.org/jira/browse/HUDI-1007
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Using DeltaStreamer to consume Kafka:
>  when earliestOffsets is greater than the checkpoint, Hudi will not be able to 
> successfully consume data.
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen#checkupValidOffsets
> boolean checkpointOffsetReseter = checkpointOffsets.entrySet().stream()
>  .anyMatch(offset -> offset.getValue() < 
> earliestOffsets.get(offset.getKey()));
> return checkpointOffsetReseter ? earliestOffsets : checkpointOffsets;
> Kafka data is continuously generated, which means that some data will 
> continue to expire.
>  When earliestOffsets is greater than checkpoint, earliestOffsets will be 
> taken. But by that moment, some of that data has already expired. In the end, consumption fails. 
> This process is an endless cycle. I can understand that this design may be meant to 
> avoid data loss, but it leads to this situation. I want to fix 
> this problem and would like to hear your opinions.  
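
One possible direction (a sketch only, not the project's decided fix) is to clamp 
each partition's checkpoint to that partition's earliest available offset, instead of 
resetting every partition when any single checkpoint has expired:

{code:scala}
import org.apache.kafka.common.TopicPartition

// Sketch: assumes both maps are keyed by TopicPartition with Long offsets.
def resolveOffsets(checkpointOffsets: Map[TopicPartition, Long],
                   earliestOffsets: Map[TopicPartition, Long]): Map[TopicPartition, Long] =
  checkpointOffsets.map { case (tp, checkpointed) =>
    // never ask Kafka for an offset that retention has already deleted
    tp -> math.max(checkpointed, earliestOffsets.getOrElse(tp, checkpointed))
  }
{code}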



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] rubenssoto commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

2020-08-14 Thread GitBox


rubenssoto commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-674230980


   This is my write function:
   
   def write_hudi_dataset(spark_data_frame, write_folder_path, hudi_options, write_mode):
       spark_data_frame \
           .write \
           .options(**hudi_options) \
           .mode(write_mode) \
           .format('hudi') \
           .save(write_folder_path)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-978) Specify version information for each component separately

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-978:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Specify version information for each component  separately
> --
>
> Key: HUDI-978
> URL: https://issues.apache.org/jira/browse/HUDI-978
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-984) Support Hive 1.x out of box

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-984:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Support Hive 1.x out of box
> ---
>
> Key: HUDI-984
> URL: https://issues.apache.org/jira/browse/HUDI-984
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Hive Integration
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> With 0.5.0, Hudi is using Hive 2.x as part of its compile-time dependencies and 
> works with Hive 2.x servers out of the box.
> We need similar support for Hive 1.x as it is still being used.
> 1. Hive 1.x servers can run queries against Hudi tables
> 2. Hive Sync must happen successfully between Hudi tables and Hive 1.x 
> services
>  
> Important Note: Hive 1.x has 2 classes of versions:
>  # pre 1.2.0
>  # 1.2.0 and later
> We had earlier found out that those 2 classes are unfortunately not compatible with each 
> other. The CDH version of Hive used to be pre 1.2.0. We need to 
> look at the feasibility, cost and impact of supporting one or more of these 
> classes.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-974) Fields out of order in MOR mode when using Hive

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-974.

Resolution: Fixed

Looks like this is resolved from the PR status. Feel free to re-open if needed.

> Fields out of order in MOR mode when using Hive
> ---
>
> Key: HUDI-974
> URL: https://issues.apache.org/jira/browse/HUDI-974
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: leesf
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-05-28-21-06-02-396.png, 
> image-2020-05-28-21-07-30-803.png
>
>
> When querying MOR hudi dataset via hive
> hive table:
> CREATE EXTERNAL TABLE `unknown_rt`(
>  `_hoodie_commit_time` string,
>  `_hoodie_commit_seqno` string,
>  `_hoodie_record_key` string,
>  `_hoodie_partition_path` string,
>  `_hoodie_file_name` string,
>  `age` bigint,
>  `name` string,
>  `sex` string,
>  `ts` bigint)
>  PARTITIONED BY (
>  `location` string)
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>  STORED AS INPUTFORMAT
>  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>  LOCATION
>  'file:/Users/sflee/personal/backup_demo'
>  TBLPROPERTIES (
>  'last_commit_time_sync'='20200528153331',
>  'transient_lastDdlTime'='1590650733')
>  
> sql:
> set hoodie.realtime.merge.skip = true;
> select sex, name, age from unknown_rt;
> result:
> !image-2020-05-28-21-06-02-396.png!
> the fields are out of order when setting hoodie.realtime.merge.skip = true;
> sql:
> set hoodie.realtime.merge.skip = false;
> select sex, name, age from unknown_rt
> !image-2020-05-28-21-07-30-803.png!
> query result is ok when setting hoodie.realtime.merge.skip = false;
> After debugging, I found that Hudi uses getWriterSchema in 
> RealtimeUnmergedRecordReader instead of getHiveSchema; we need to fix it.
>  
> cc [~vbalaji]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-974) Fields out of order in MOR mode when using Hive

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-974:
---
Status: In Progress  (was: Open)

> Fields out of order in MOR mode when using Hive
> ---
>
> Key: HUDI-974
> URL: https://issues.apache.org/jira/browse/HUDI-974
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: leesf
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-05-28-21-06-02-396.png, 
> image-2020-05-28-21-07-30-803.png
>
>
> When querying MOR hudi dataset via hive
> hive table:
> CREATE EXTERNAL TABLE `unknown_rt`(
>  `_hoodie_commit_time` string,
>  `_hoodie_commit_seqno` string,
>  `_hoodie_record_key` string,
>  `_hoodie_partition_path` string,
>  `_hoodie_file_name` string,
>  `age` bigint,
>  `name` string,
>  `sex` string,
>  `ts` bigint)
>  PARTITIONED BY (
>  `location` string)
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>  STORED AS INPUTFORMAT
>  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>  LOCATION
>  'file:/Users/sflee/personal/backup_demo'
>  TBLPROPERTIES (
>  'last_commit_time_sync'='20200528153331',
>  'transient_lastDdlTime'='1590650733')
>  
> sql:
> set hoodie.realtime.merge.skip = true;
> select sex, name, age from unknown_rt;
> result:
> !image-2020-05-28-21-06-02-396.png!
> the fields are out of order when setting hoodie.realtime.merge.skip = true;
> sql:
> set hoodie.realtime.merge.skip = false;
> select sex, name, age from unknown_rt
> !image-2020-05-28-21-07-30-803.png!
> query result is ok when setting hoodie.realtime.merge.skip = false;
> After debugging, I found that Hudi uses getWriterSchema in 
> RealtimeUnmergedRecordReader instead of getHiveSchema; we need to fix it.
>  
> cc [~vbalaji]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-974) Fields out of order in MOR mode when using Hive

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-974:
---
Status: Open  (was: New)

> Fields out of order in MOR mode when using Hive
> ---
>
> Key: HUDI-974
> URL: https://issues.apache.org/jira/browse/HUDI-974
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: leesf
>Assignee: liwei
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: image-2020-05-28-21-06-02-396.png, 
> image-2020-05-28-21-07-30-803.png
>
>
> When querying MOR hudi dataset via hive
> hive table:
> CREATE EXTERNAL TABLE `unknown_rt`(
>  `_hoodie_commit_time` string,
>  `_hoodie_commit_seqno` string,
>  `_hoodie_record_key` string,
>  `_hoodie_partition_path` string,
>  `_hoodie_file_name` string,
>  `age` bigint,
>  `name` string,
>  `sex` string,
>  `ts` bigint)
>  PARTITIONED BY (
>  `location` string)
>  ROW FORMAT SERDE
>  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>  STORED AS INPUTFORMAT
>  'org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat'
>  OUTPUTFORMAT
>  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>  LOCATION
>  'file:/Users/sflee/personal/backup_demo'
>  TBLPROPERTIES (
>  'last_commit_time_sync'='20200528153331',
>  'transient_lastDdlTime'='1590650733')
>  
> sql:
> set hoodie.realtime.merge.skip = true;
> select sex, name, age from unknown_rt;
> result:
> !image-2020-05-28-21-06-02-396.png!
> the fields are out of order when setting hoodie.realtime.merge.skip = true;
> sql:
> set hoodie.realtime.merge.skip = false;
> select sex, name, age from unknown_rt
> !image-2020-05-28-21-07-30-803.png!
> query result is ok when setting hoodie.realtime.merge.skip = false;
> After debugging, I found that Hudi uses getWriterSchema in 
> RealtimeUnmergedRecordReader instead of getHiveSchema; we need to fix it.
>  
> cc [~vbalaji]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-944) Support more complete concurrency control when writing data

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-944:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Support more complete  concurrency control when writing data
> 
>
> Key: HUDI-944
> URL: https://issues.apache.org/jira/browse/HUDI-944
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.1
>
>
> Now Hudi just supports write/compaction concurrency control. But some scenarios 
> need write concurrency control, such as two Spark jobs with different data 
> sources that need to write to the same Hudi table.
> I have a two-step proposal:
> 1. First step: support write concurrency control on different partitions.
>  But right now, when two clients write data to different partitions, they will hit these 
> errors:
> a. Rolling back commits failed
> b. instant version already exists
> {code:java}
>  [2020-05-25 21:20:34,732] INFO Checking for file exists 
> ?/tmp/HudiDLATestPartition/.hoodie/20200525212031.clean.inflight 
> (org.apache.hudi.common.table.timeline.HoodieActiveTimeline)
>  Exception in thread "main" org.apache.hudi.exception.HoodieIOException: 
> Failed to create file /tmp/HudiDLATestPartition/.hoodie/20200525212031.clean
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.createImmutableFileInPath(HoodieActiveTimeline.java:437)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:327)
>  at 
> org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionCleanInflightToComplete(HoodieActiveTimeline.java:290)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:183)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:142)
>  at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  {code}
> c. the two clients' archiving conflicts
> d. the read client meets "Unable to infer schema for Parquet. It must be 
> specified manually.;"
> 2. Second step: support insert/upsert/compaction concurrency control on 
> different isolation levels such as Serializable and WriteSerializable.
> Hudi can design a mechanism to check the conflict in 
> AbstractHoodieWriteClient.commit()
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-945) Cleanup spillable map files eagerly as part of close

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-945:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Cleanup spillable map files eagerly as part of close
> 
>
> Key: HUDI-945
> URL: https://issues.apache.org/jira/browse/HUDI-945
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Currently, files used by the external spillable map are deleted only on exit. For 
> spark-streaming/DeltaStreamer continuous-mode cases which run several 
> iterations, it is better to eagerly delete files when closing the handles using 
> them. 
> We need to eagerly delete the files on following cases:
>  # MergeHandle
>  # HoodieMergedLogRecordScanner
>  # SpillableMapBasedFileSystemView



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-922) [UMBRELLA] Transfer out of the Incubator

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-922:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> [UMBRELLA] Transfer out of the Incubator
> 
>
> Key: HUDI-922
> URL: https://issues.apache.org/jira/browse/HUDI-922
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Release & Administrative
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
> Fix For: 0.6.1
>
>
> Umbrella task for all work needed to complete the graduation process out of 
> the incubator.
> https://incubator.apache.org/guides/transferring.html 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-940) Audit bad/dangling configs and code

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-940:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Audit bad/dangling configs and code 
> 
>
> Key: HUDI-940
> URL: https://issues.apache.org/jira/browse/HUDI-940
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Motivation: avoid bad configs like the one fixed in 
> [https://github.com/apache/hudi/pull/1654].
> We need to take a pass over the code to remove dead/bad configs and code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-916) Add support for multiple date/time formats in TimestampBasedKeyGenerator

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-916.

Resolution: Fixed

Looks like PR is merged and this is resolved. Feel free to re-open if needed.

> Add support for multiple date/time formats in TimestampBasedKeyGenerator
> 
>
> Key: HUDI-916
> URL: https://issues.apache.org/jira/browse/HUDI-916
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, newbie
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently TimestampBasedKeyGenerator supports only one input date/time format 
> when creating custom partition paths using timestamp-based logic. We need to 
> support multiple input formats there. 
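A rough sketch of what multi-format support could look like (illustrative only;
the format list and class name are made up, not the actual
TimestampBasedKeyGenerator change): try each configured input format and use the
first one that parses.

{code:java}
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;

class MultiFormatTimestampSketch {
  static final List<String> INPUT_FORMATS =
      Arrays.asList("yyyy-MM-dd HH:mm:ss", "yyyy/MM/dd", "yyyyMMdd");

  static Date parse(String value) {
    for (String format : INPUT_FORMATS) {
      try {
        return new SimpleDateFormat(format).parse(value);
      } catch (ParseException e) {
        // fall through and try the next configured format
      }
    }
    throw new IllegalArgumentException("Unparseable timestamp: " + value);
  }
}
{code}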



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-916) Add support for multiple date/time formats in TimestampBasedKeyGenerator

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-916:
---
Status: In Progress  (was: Open)

> Add support for multiple date/time formats in TimestampBasedKeyGenerator
> 
>
> Key: HUDI-916
> URL: https://issues.apache.org/jira/browse/HUDI-916
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer, newbie
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently TimestampBasedKeyGenerator supports only one input date/time format 
> when creating custom partition paths using timestamp-based logic. We need to 
> support multiple input formats there. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-908) Consistent test data generation with data type coverage

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-908.

Resolution: Fixed

From the PR it looks like this is resolved. Feel free to re-open against the 
0.6.1 version if needed.

> Consistent test data generation with data type coverage
> ---
>
> Key: HUDI-908
> URL: https://issues.apache.org/jira/browse/HUDI-908
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Testing
>Reporter: Vinoth Chandar
>Assignee: shenh062326
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>
> Let's clean up HoodieTestDataGenerator in the process. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-908) Consistent test data generation with data type coverage

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-908:
---
Status: In Progress  (was: Open)

> Consistent test data generation with data type coverage
> ---
>
> Key: HUDI-908
> URL: https://issues.apache.org/jira/browse/HUDI-908
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Testing
>Reporter: Vinoth Chandar
>Assignee: shenh062326
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>
> Let's clean up HoodieTestDataGenerator in the process. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-908) Consistent test data generation with data type coverage

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-908:
---
Status: Open  (was: New)

> Consistent test data generation with data type coverage
> ---
>
> Key: HUDI-908
> URL: https://issues.apache.org/jira/browse/HUDI-908
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Testing
>Reporter: Vinoth Chandar
>Assignee: shenh062326
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>
> Let's clean up HoodieTestDataGenerator in the process. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-907) Test Presto mor query support changes in HDFS Env

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-907:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Test Presto mor query support changes in HDFS Env
> -
>
> Key: HUDI-907
> URL: https://issues.apache.org/jira/browse/HUDI-907
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Presto Integration
>Reporter: Bhavani Sudha
>Assignee: Bhavani Sudha
>Priority: Major
> Fix For: 0.6.1
>
>
> Test presto integration for HDFS environment as well in addition to S3.
>  
> Blockers faced so far
> [~bdscheller] I tried to apply your presto patch to test mor queries on 
> Presto. The way I set it up was to create a docker image from your presto 
> patch and use that image in the hudi local docker environment. I observed a 
> couple of issues there:
>  * I got NoClassDefFoundError for these classes:
>  ** org/apache/parquet/avro/AvroSchemaConverter
>  ** org/apache/parquet/hadoop/ParquetFileReader
>  ** org/apache/parquet/io/InputFile
>  ** org/apache/parquet/format/TypeDefinedOrder
> I was able to get around the first three errors by shading org.apache.parquet 
> inside hudi-presto-bundle and changing presto-hive to depend on the 
> hudi-presto-bundle. However, for the last one shading didn't help because it's 
> already a Thrift-generated class. I am wondering if you also ran into similar 
> issues while testing S3.  
> Could you please elaborate on your test setup so we can do a similar thing for 
> HDFS as well? If we need to add more changes to hudi-presto-bundle, we would 
> need to prioritize that for the 0.5.3 release asap.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-242) [RFC-12] Support Efficient bootstrap of large parquet datasets to Hudi

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-242:
---
Fix Version/s: (was: 0.6.0)

> [RFC-12] Support Efficient bootstrap of large parquet datasets to Hudi
> --
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
>  Support Efficient bootstrap of large parquet tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-901) Bug Bash 0.6.0 Tracking Ticket

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-901:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Bug Bash 0.6.0 Tracking Ticket
> --
>
> Key: HUDI-901
> URL: https://issues.apache.org/jira/browse/HUDI-901
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.1
>
>
> This is a tracking ticket for all bug bash 0.6.0 tickets. 
> We have done our best to assign tickets to those who might have good context 
> and to those who volunteered for the bug bash. The cursory assignment is just 
> to help you out and by no means forces you to work on it. If you feel you 
> can't work on it, please unassign yourself, or you could swap with someone 
> here. 
> All tickets are labelled with "bug-bash-0.6.0". If anyone wants to pitch in 
> with any of the work you have or are currently doing, feel free to add the 
> label, but don't remove it from existing ones. 
> Some tickets are support ones, which might need follow-up 
> questions/clarifications with the reporter of the ticket. For those, try to 
> start working right away so we can drive to completion by the end of day 10. 
> We are looking to time it for 10 days, planning to wrap up by the 27th of May. 
> Again, we totally understand that some tickets may not be completed by then 
> for various reasons: support questions, not being able to repro locally, env 
> mismatch, being swamped with PR work, or not having cycles during these 10 
> days, etc. Let's try our best to take these to completion.
> We are all ears for any questions or clarification. Please respond here in 
> Jira or you could send an email to our mailing list in bug bash thread.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-307) Dataframe written with Date,Timestamp, Decimal is read with same types

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-307:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Dataframe written with Date,Timestamp, Decimal is read with same types
> --
>
> Key: HUDI-307
> URL: https://issues.apache.org/jira/browse/HUDI-307
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Spark Integration
>Reporter: Cosmin Iordache
>Assignee: Udit Mehrotra
>Priority: Minor
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.1
>
>
> Small test for a COW table to check the persistence of Date, Timestamp, and 
> Decimal types.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-897) hudi support log append scenario with better write and asynchronous compaction

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-897:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> hudi support log append scenario with better write and asynchronous compaction
> --
>
> Key: HUDI-897
> URL: https://issues.apache.org/jira/browse/HUDI-897
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Compaction, Performance
>Reporter: liwei
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.1
>
> Attachments: image-2020-05-14-19-51-37-938.png, 
> image-2020-05-14-20-14-59-429.png
>
>
> 1. Scenario
> The business scenarios of the data lake mainly include analysis of databases, 
> logs, and files.
> !image-2020-05-14-20-14-59-429.png|width=444,height=286!
> Databricks Delta Lake also aims at these three scenarios. [1]
>  
> 2. Hudi's current situation
> At present, hudi supports well the scenario where database CDC is 
> incrementally written to hudi, and bulk-loading files into hudi is also in 
> progress. However, there is no good native support for log scenarios 
> (requiring high-throughput writes, no updates or deletes, and focusing on 
> small-file handling); one can write through inserts without deduplication, but 
> they will still merge on the write side.
>  * In copy-on-write mode, with "hoodie.parquet.small.file.limit" at 100MB, 
> every small batch costs some time for merging, which reduces write throughput.
>  * This scenario is not well suited to merge on read.
>  * The actual scenario only needs to write parquet in batches at write time, 
> and then provide compaction afterwards (similar to Delta Lake).
> 3. What we can do
>   
>  1. On the write side, just write every batch to a parquet file based on the 
> snapshot mechanism; merging is on by default, and users can turn off the auto 
> merge for more write throughput (see the sketch below). 
> 2. hudi can support asynchronously merging small parquet files, like Databricks 
> Delta Lake's OPTIMIZE command. [2] 
>  
> [1] [https://databricks.com/product/delta-lake-on-databricks]
> [2] [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
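A rough sketch of the write-side tuning mentioned in item 1 above (illustrative
only; the table and field names are made up, while the option keys are the
standard Hudi datasource write options, and setting
"hoodie.parquet.small.file.limit" to 0 turns off small-file handling on ingest,
leaving file sizing to a later compaction/clustering pass):

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

class LogAppendWriteSketch {
  // Write one micro-batch as plain inserts, skipping the small-file merge.
  static void writeBatch(Dataset<Row> batch, String basePath) {
    batch.write()
        .format("org.apache.hudi")
        .option("hoodie.table.name", "log_events")                  // illustrative
        .option("hoodie.datasource.write.operation", "insert")
        .option("hoodie.datasource.write.recordkey.field", "uuid")  // illustrative
        .option("hoodie.datasource.write.partitionpath.field", "dt")
        .option("hoodie.parquet.small.file.limit", "0")             // skip small-file merge on write
        .mode(SaveMode.Append)
        .save(basePath);
  }
}
{code}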



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-857) Overhaul unit-tests for Cleaner and Rollbacks

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-857:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Overhaul unit-tests for Cleaner and Rollbacks
> -
>
> Key: HUDI-857
> URL: https://issues.apache.org/jira/browse/HUDI-857
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: help-requested
> Fix For: 0.6.1
>
>
> Unit tests for these components do not clearly test their functionality. 
> Instead, some of them seem to have been written just to pass with the initial 
> code. We would need to overhaul these tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-847) Umbrella ticket for tuning default configs for 0.6.0

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-847:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Umbrella ticket for tuning default configs for 0.6.0
> 
>
> Key: HUDI-847
> URL: https://issues.apache.org/jira/browse/HUDI-847
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Utilities, Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> This is for vetting, tuning default configurations in Hoodie for 0.6.0 release



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-849) Turn on incremental Syncing by default for DeltaStreamer and spark streaming cases

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-849:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Turn on incremental Syncing by default for DeltaStreamer and spark streaming 
> cases
> --
>
> Key: HUDI-849
> URL: https://issues.apache.org/jira/browse/HUDI-849
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer, Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-801) Add a way to postprocess schema after it is loaded from the schema provider

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-801:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Add a way to postprocess schema after it is loaded from the schema provider
> ---
>
> Key: HUDI-801
> URL: https://issues.apache.org/jira/browse/HUDI-801
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Sometimes it is necessary to postprocess schemas after they are fetched from 
> external sources. Some examples of postprocessing:
>  * make sure all the defaults are set correctly, and update the schema if not.
>  * insert marker columns into records with no fields (not writable as parquet)
>  * ...
> It would be great to have a way to plug in custom post processors.
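A minimal sketch of what a pluggable post processor could look like (the
interface name and wiring are hypothetical, not an existing Hudi API; only
org.apache.avro.Schema is assumed):

{code:java}
import java.util.List;
import org.apache.avro.Schema;

interface SchemaPostProcessor {
  Schema process(Schema fetched);
}

class SchemaPostProcessingSketch {
  // Apply a user-configured chain of processors to the schema returned by the
  // schema provider, e.g. to fix defaults or add marker columns.
  static Schema apply(Schema fetched, List<SchemaPostProcessor> processors) {
    Schema result = fetched;
    for (SchemaPostProcessor processor : processors) {
      result = processor.process(result);
    }
    return result;
  }
}
{code}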



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-826) Spark to avro schema in 0.6 incompatible with 0.5 for fixed types

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-826:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Spark to avro schema in 0.6 incompatible with 0.5 for fixed types
> -
>
> Key: HUDI-826
> URL: https://issues.apache.org/jira/browse/HUDI-826
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
> Fix For: 0.6.1
>
>
> Let's say we had some dataset created with the SQL transformer using a query: 
> {code:java}
> // select CAST(bla AS DECIMAL(20, 9)) bla
> {code}
> In 0.5 the spark->avro converter (Databricks) would generate something like:
>  
> {code:java}
> // {
>   "name": "bla",
>   "type": [
> "string", 
> "null"
>   ]
> },
> {code}
> in 0.6 (Spark):
>  
>  
> {code:java}
> // {
>   "name": "bla",
>   "type": [
> {
>   "type": "fixed",
>   "name": "order_subtotal",
>   "namespace": "",
>   "size": 16,
>   "logicalType": "decimal",
>   "precision": 38,
>   "scale": 17
> }, "null"
>   ]
> },
> {code}
> The types are very different in that case. During the merge the reader would 
> fail with:
> {code:java}
> //at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
>   at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
>   at 
> org.apache.hudi.utilities.TestCss.testParquetWithSchema(TestCss.java:270)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
>   at 
> com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
>   at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at java.lang.System.arraycopy(Native Method)
>   at 
> org.apache.avro.generic.GenericData.createFixed(GenericData.java:1168)
>   at 
> org.apache.parquet.avro.AvroConverters$FieldFixedConverter.convert(AvroConverters.java:310
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-783.

Resolution: Fixed

Based on the PRs, it looks like this is done. Please feel free to open newer 
issues if there is any work needed!

> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> hudi-pyspark packages should implement HUDI data source API for Apache Spark 
> using which HUDI files can be read as DataFrame and write to any Hadoop 
> supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-783) Add official python support to create hudi datasets using pyspark

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-783:
---
Status: In Progress  (was: Open)

> Add official python support to create hudi datasets using pyspark
> -
>
> Key: HUDI-783
> URL: https://issues.apache.org/jira/browse/HUDI-783
> Project: Apache Hudi
>  Issue Type: Wish
>  Components: Utilities
>Reporter: Vinoth Govindarajan
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: features, pull-request-available
> Fix For: 0.6.0
>
>
> *Goal:*
>  As a pyspark user, I would like to read/write hudi datasets using pyspark.
> There are several components to achieve this goal.
>  # Create a hudi-pyspark package that users can import and start 
> reading/writing hudi datasets.
>  # Explain how to read/write hudi datasets using pyspark in a blog 
> post/documentation.
>  # Add the hudi-pyspark module to the hudi demo docker along with the 
> instructions.
>  # Make the package available as part of the [spark packages 
> index|https://spark-packages.org/] and [python package 
> index|https://pypi.org/]
> hudi-pyspark packages should implement HUDI data source API for Apache Spark 
> using which HUDI files can be read as DataFrame and write to any Hadoop 
> supported file system.
> Usage pattern after we launch this feature should be something like this:
> Install the package using:
> {code:java}
> pip install hudi-pyspark{code}
> or
> Include hudi-pyspark package in your Spark Applications using:
> spark-shell, pyspark, or spark-submit
> {code:java}
> > $SPARK_HOME/bin/spark-shell --packages 
> > org.apache.hudi:hudi-pyspark_2.11:0.5.2{code}
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-774) Spark to Avro converter incorrectly generates optional fields

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-774:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Spark to Avro converter incorrectly generates optional fields
> -
>
> Key: HUDI-774
> URL: https://issues.apache.org/jira/browse/HUDI-774
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Alexander Filipchik
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> I think https://issues.apache.org/jira/browse/SPARK-28008 is a good 
> description of what is happening.
>  
> It can cause a situation where the schema in the MOR log files is incompatible 
> with the schema produced by RowBasedSchemaProvider, so compactions will stall.
>  
> I have a fix which is a bit hacky -> postprocess the schema produced by the 
> converter and
> 1) Make sure unions with null types have those null types at position 0
> 2) Make sure they have default values set to null
> I couldn't find a way to do a clean fix, as some of the problematic classes 
> are from Hive and called from Spark.
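For illustration, a small sketch of the field shape the post-processing aims for,
using Avro's SchemaBuilder (assumed available; this is not the actual fix): the
null branch sits at position 0 of the union and the field default is null.

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

class NullableFieldSketch {
  static Schema recordWithNullableField() {
    return SchemaBuilder.record("Example").fields()
        // union of ["null", "string"] with a null default
        .name("bla").type().unionOf().nullType().and().stringType().endUnion().nullDefault()
        .endRecord();
  }
}
{code}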



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-747) Implement Rollback like API in HoodieWriteClient which can revert all actions

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-747:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Implement Rollback like API in HoodieWriteClient which can revert all actions 
> --
>
> Key: HUDI-747
> URL: https://issues.apache.org/jira/browse/HUDI-747
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Related to HUDI-716 and PR-1432.
> The PR addresses the specific issue of deleting orphaned inflight/requested 
> clean actions left by older versions of Hudi. 
> Currently rollback only reverts commit and delta-commit operations. We can 
> introduce a new API which will consistently revert all pending actions: clean, 
> compact, commit and delta-commit. Currently, we don't roll back clean; instead 
> we expect future clean operations to finish up pending cleans first. By having 
> this new API (rollbackPendingActions), we can let users consistently revert 
> any pending actions if they want.
>  
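A possible signature sketch for the API discussed above (hypothetical; no such
method exists yet):

{code:java}
interface RollbackPendingActionsSketch {
  /**
   * Consistently revert all pending actions (clean, compact, commit and
   * delta-commit) that are still inflight or requested, up to the given instant.
   */
  void rollbackPendingActions(String instantTime);
}
{code}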



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-718) java.lang.ClassCastException during upsert

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-718:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> java.lang.ClassCastException during upsert
> --
>
> Key: HUDI-718
> URL: https://issues.apache.org/jira/browse/HUDI-718
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: lamber-ken
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.1
>
> Attachments: image-2020-03-21-16-49-28-905.png
>
>
> The dataset was created using hudi 0.5 and we are now trying to migrate it to 
> the latest master. The table is written using SqlTransformer. Exception:
>  
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Failed to merge 
> old record into new file for key bla.bla from old file 
> gs://../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_0-35-1196_20200316234140.parquet
>  to new file 
> gs://.../2020/03/15/7b75931f-ff2f-4bf4-8949-5c437112be79-0_1-39-1506_20200317190948.parquet
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:246)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:433)
>  at 
> org.apache.hudi.table.HoodieCopyOnWriteTable$UpdateHandler.consumeOneRecord(HoodieCopyOnWriteTable.java:423)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryQueueConsumer.consume(BoundedInMemoryQueueConsumer.java:37)
>  at 
> org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$2(BoundedInMemoryExecutor.java:121)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  ... 3 more
> Caused by: java.lang.ClassCastException: org.apache.avro.util.Utf8 cannot be 
> cast to org.apache.avro.generic.GenericFixed
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:336)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:275)
>  at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)
>  at org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)
>  at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128)
>  at org.apache.parquet.hadoop.ParquetWriter.write(ParquetWriter.java:299)
>  at 
> org.apache.hudi.io.storage.HoodieParquetWriter.writeAvro(HoodieParquetWriter.java:103)
>  at org.apache.hudi.io.HoodieMergeHandle.write(HoodieMergeHandle.java:242)
>  ... 8 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-722) IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when writing parquet

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-722:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> IndexOutOfBoundsException in MessageColumnIORecordConsumer.addBinary when 
> writing parquet
> -
>
> Key: HUDI-722
> URL: https://issues.apache.org/jira/browse/HUDI-722
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Alexander Filipchik
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.1
>
>
> Some writes fail with java.lang.IndexOutOfBoundsException: Invalid array 
> range: X to X inside the MessageColumnIORecordConsumer.addBinary call.
> Specifically: getColumnWriter().write(value, r[currentLevel], 
> currentColumnIO.getDefinitionLevel());
> fails because the size of r is the same as the current level. What could be 
> causing it?
>  
> It gets executed via ParquetWriter.write(IndexedRecord). Library version: 
> 1.10.1. The Avro record is a very complex object (~2.5k columns, highly 
> nested, arrays of unions present).
> But what is surprising is that it fails to write the top-level field: 
> PrimitiveColumnIO _hoodie_commit_time r:0 d:1 [_hoodie_commit_time] which is 
> the first top-level field in the Avro record: {"_hoodie_commit_time": 
> "20200317215711", "_hoodie_commit_seqno": "20200317215711_0_650",



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-736) Simplify ReflectionUtils#getTopLevelClasses

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-736:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Simplify ReflectionUtils#getTopLevelClasses 
> 
>
> Key: HUDI-736
> URL: https://issues.apache.org/jira/browse/HUDI-736
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Suneel Marthi
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-637) Investigate slower hudi queries in S3 vs HDFS

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-637:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Investigate slower hudi queries in S3 vs HDFS
> -
>
> Key: HUDI-637
> URL: https://issues.apache.org/jira/browse/HUDI-637
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Performance
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> Hudi queries on S3 take abnormally long compared to hdfs. 
> S3 listing itself is not taking that long. 
> PERFORMANCE BUG:
> the metadata list performance is likely causing performance issues with hudi.
>  
> {{scala> stopwatch(\{ sql("SELECT * FROM 
> ap_invoices_all_compacted_s3").count})}}
> {{Elapsed time: 1m 55.078473113s 
>  res2: Long = }}
> {{}}
> {{scala> stopwatch(\{ sql("SELECT * FROM ap_invoices_all_compacted").count}) 
> – this is the exact same table in hdfs}}
> {{Elapsed time: 6.581217052s 
>  res3: Long = xxx}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-435) Make async compaction/cleaning extensible to new usages

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-435:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Make async compaction/cleaning extensible to new usages
> ---
>
> Key: HUDI-435
> URL: https://issues.apache.org/jira/browse/HUDI-435
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Compaction, Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.1
>
>
> Once the HFile-based index is available, the next step is to make compaction 
> extensible so it is available to all components.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-338) Reduce Hoodie commit/instant time granularity to millis from secs

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-338:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Reduce Hoodie commit/instant time granularity to millis from secs
> -
>
> Key: HUDI-338
> URL: https://issues.apache.org/jira/browse/HUDI-338
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Common Core
>Reporter: Nishith Agarwal
>Assignee: Nishith Agarwal
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-84) Benchmark write/read paths on Hudi vs non-Hudi datasets

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-84?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-84:
--
Fix Version/s: (was: 0.6.0)
   0.6.1

> Benchmark write/read paths on Hudi vs non-Hudi datasets
> ---
>
> Key: HUDI-84
> URL: https://issues.apache.org/jira/browse/HUDI-84
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Performance
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: realtime-data-lakes
> Fix For: 0.6.1
>
> Attachments: df-toRdd-write.pdf, df-write-stage.pdf
>
>
> * Index performance
>  * SparkSQL 
> (https://github.com/apache/incubator-hudi/issues/588#issuecomment-468055059)
>  * Query planning 
>  * Bulk_insert, log ingest
>  * upsert, database change log. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-304) Bring back spotless plugin

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-304:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Bring back spotless plugin 
> ---
>
> Key: HUDI-304
> URL: https://issues.apache.org/jira/browse/HUDI-304
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Code Cleanup, Testing
>Reporter: Balaji Varadarajan
>Assignee: leesf
>Priority: Major
>  Labels: bug-bash-0.6.0, help-wanted, pull-request-available
> Fix For: 0.6.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The spotless plugin has been turned off, as the eclipse style format it was 
> referencing was removed due to compliance reasons. 
> We use the google style eclipse format with some changes:
> 90c90
> < 
> ---
> > 
> 242c242
> <  value="100"/>
> ---
> >  > value="120"/>
>  
> The eclipse style sheet was originally obtained from 
> [https://github.com/google/styleguide], which carries a CC-BY 3.0 license that 
> is not compatible with source distribution (see 
> [https://www.apache.org/legal/resolved.html#cc-by]). 
>  
> We need to figure out a way to bring this back
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-83) Map Timestamp type in spark to corresponding Timestamp type in Hive during Hive sync

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-83?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-83:
--
Fix Version/s: (was: 0.6.0)
   0.6.1

> Map Timestamp type in spark to corresponding Timestamp type in Hive during 
> Hive sync
> 
>
> Key: HUDI-83
> URL: https://issues.apache.org/jira/browse/HUDI-83
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration, Usability
>Reporter: Vinoth Chandar
>Assignee: cdmikechen
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.1
>
>
> [https://github.com/apache/incubator-hudi/issues/543] & related issues 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-29) Patch to Hive-sync to enable stats on Hive tables #393

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-29?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-29:
--
Fix Version/s: (was: 0.6.0)
   0.6.1

> Patch to Hive-sync to enable stats on Hive tables #393
> --
>
> Key: HUDI-29
> URL: https://issues.apache.org/jira/browse/HUDI-29
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> https://github.com/uber/hudi/issues/393



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1174) Hudi changes for bootstrapped tables integration with Presto

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1174.
-
Resolution: Fixed

Resolving since the corresponding PR is merged. Please feel free to re-open if 
needed.

> Hudi changes for bootstrapped tables integration with Presto
> 
>
> Key: HUDI-1174
> URL: https://issues.apache.org/jira/browse/HUDI-1174
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Hudi changes for bootstrapped tables integration with Presto.
>  * Annotation *UseRecordReaderFromInputFormat* is required on 
> *HoodieParquetInputFormat* as well, because reading bootstrapped tables needs 
> to happen through the record reader to be able to perform the merge. On the 
> presto side, this annotation is already handled.
>  * We need to internally maintain *VIRTUAL_COLUMN_NAMES* (see the illustrative 
> sketch at the end of this description) because presto's internal hive version 
> *hive-apache-1.2.2* has *VirtualColumn* as a *class*, versus the one we depend 
> on in hudi, which is an *enum*. This results in the following error in presto:
>  
> {noformat}
> 2020-08-10T21:59:58.957Z ERROR remote-task-callback-2 
> com.facebook.presto.execution.StageExecutionStateMachine Stage execution 
> 20200810_215953_6_34kqg.1.0 failed
> java.lang.NoSuchFieldError: VIRTUAL_COLUMN_NAMES
>  at 
> org.apache.hudi.hadoop.HoodieParquetInputFormat.lambda$getRecordReader$2(HoodieParquetInputFormat.java:201)
>  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>  at 
> org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:203)
>  at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:253)
>  at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>  at 
> com.facebook.presto.hive.authentication.UserGroupInformationUtils.lambda$executeActionInDoAs$0(UserGroupInformationUtils.java:29)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1824)
>  at 
> com.facebook.presto.hive.authentication.UserGroupInformationUtils.executeActionInDoAs(UserGroupInformationUtils.java:27)
>  at 
> com.facebook.presto.hive.authentication.ImpersonatingHdfsAuthentication.doAs(ImpersonatingHdfsAuthentication.java:39)
>  at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:82)
>  at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:73)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:374)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:137)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:113)
>  at 
> com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:52)
> {noformat}
>  
>  * Dependency changes in *hudi-presto-bundle* to avoid runtime exceptions.
>  
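An illustrative sketch of the workaround in the second bullet (the exact set of
virtual column names below is assumed from Hive's standard virtual columns, not
taken from the patch): keep our own name list instead of referencing Hive's
VirtualColumn type, which differs between Hive versions.

{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class VirtualColumnNamesSketch {
  static final Set<String> VIRTUAL_COLUMN_NAMES = new HashSet<>(Arrays.asList(
      "INPUT__FILE__NAME", "BLOCK__OFFSET__INSIDE__FILE",
      "ROW__OFFSET__INSIDE__BLOCK", "RAW__DATA__SIZE", "ROW__ID", "GROUPING__ID"));

  static boolean isVirtualColumn(String columnName) {
    return VIRTUAL_COLUMN_NAMES.contains(columnName.toUpperCase());
  }
}
{code}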



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-242) [RFC-12] Support Efficient bootstrap of large parquet datasets to Hudi

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-242:
---
Fix Version/s: 0.6.1

> [RFC-12] Support Efficient bootstrap of large parquet datasets to Hudi
> --
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0, 0.6.1
>
>
>  Support Efficient bootstrap of large parquet tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1174) Hudi changes for bootstrapped tables integration with Presto

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1174:

Status: Open  (was: New)

> Hudi changes for bootstrapped tables integration with Presto
> 
>
> Key: HUDI-1174
> URL: https://issues.apache.org/jira/browse/HUDI-1174
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Hudi changes for bootstrapped tables integration with Presto.
>  * Annotation *UseRecordReaderFromInputFormat* is required on 
> *HoodieParquetInputFormat* as well, because reading bootstrapped tables needs 
> to happen through the record reader to be able to perform the merge. On the 
> presto side, this annotation is already handled.
>  * We need to internally maintain *VIRTUAL_COLUMN_NAMES* because presto's 
> internal hive version *hive-apache-1.2.2* has *VirtualColumn* as a *class*, 
> versus the one we depend on in hudi, which is an *enum*. This results in the 
> following error in presto:
>  
> {noformat}
> 2020-08-10T21:59:58.957Z ERROR remote-task-callback-2 
> com.facebook.presto.execution.StageExecutionStateMachine Stage execution 
> 20200810_215953_6_34kqg.1.0 failed
> java.lang.NoSuchFieldError: VIRTUAL_COLUMN_NAMES
>  at 
> org.apache.hudi.hadoop.HoodieParquetInputFormat.lambda$getRecordReader$2(HoodieParquetInputFormat.java:201)
>  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>  at 
> org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:203)
>  at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:253)
>  at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>  at 
> com.facebook.presto.hive.authentication.UserGroupInformationUtils.lambda$executeActionInDoAs$0(UserGroupInformationUtils.java:29)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1824)
>  at 
> com.facebook.presto.hive.authentication.UserGroupInformationUtils.executeActionInDoAs(UserGroupInformationUtils.java:27)
>  at 
> com.facebook.presto.hive.authentication.ImpersonatingHdfsAuthentication.doAs(ImpersonatingHdfsAuthentication.java:39)
>  at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:82)
>  at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:73)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:374)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:137)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:113)
>  at 
> com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:52)
> {noformat}
>  
>  * Dependency changes in *hudi-presto-bundle* to avoid runtime exceptions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1174) Hudi changes for bootstrapped tables integration with Presto

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1174:

Status: In Progress  (was: Open)

> Hudi changes for bootstrapped tables integration with Presto
> 
>
> Key: HUDI-1174
> URL: https://issues.apache.org/jira/browse/HUDI-1174
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Hudi changes for bootstrapped tables integration with Presto.
>  * Annotation *UseRecordReaderFromInputFormat* is required on 
> *HoodieParquetInputFormat* as well, because reading bootstrapped tables needs 
> to happen through the record reader to be able to perform the merge. On the 
> presto side, this annotation is already handled.
>  * We need to internally maintain *VIRTUAL_COLUMN_NAMES* because presto's 
> internal hive version *hive-apache-1.2.2* has *VirtualColumn* as a *class*, 
> versus the one we depend on in hudi, which is an *enum*. This results in the 
> following error in presto:
>  
> {noformat}
> 2020-08-10T21:59:58.957Z ERROR remote-task-callback-2 
> com.facebook.presto.execution.StageExecutionStateMachine Stage execution 
> 20200810_215953_6_34kqg.1.0 failed
> java.lang.NoSuchFieldError: VIRTUAL_COLUMN_NAMES
>  at 
> org.apache.hudi.hadoop.HoodieParquetInputFormat.lambda$getRecordReader$2(HoodieParquetInputFormat.java:201)
>  at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
>  at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
>  at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>  at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
>  at 
> org.apache.hudi.hadoop.HoodieParquetInputFormat.getRecordReader(HoodieParquetInputFormat.java:203)
>  at com.facebook.presto.hive.HiveUtil.createRecordReader(HiveUtil.java:253)
>  at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.lambda$createRecordCursor$0(GenericHiveRecordCursorProvider.java:74)
>  at 
> com.facebook.presto.hive.authentication.UserGroupInformationUtils.lambda$executeActionInDoAs$0(UserGroupInformationUtils.java:29)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1824)
>  at 
> com.facebook.presto.hive.authentication.UserGroupInformationUtils.executeActionInDoAs(UserGroupInformationUtils.java:27)
>  at 
> com.facebook.presto.hive.authentication.ImpersonatingHdfsAuthentication.doAs(ImpersonatingHdfsAuthentication.java:39)
>  at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:82)
>  at 
> com.facebook.presto.hive.GenericHiveRecordCursorProvider.createRecordCursor(GenericHiveRecordCursorProvider.java:73)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createHivePageSource(HivePageSourceProvider.java:374)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:137)
>  at 
> com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:113)
>  at 
> com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:52)
> {noformat}
>  
>  * Dependency changes in *hudi-presto-bundle* to avoid runtime exceptions.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-683) Prototype classes/abstractions to encapsule SparkContext and RDD

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-683:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Prototype classes/abstractions to encapsule SparkContext and RDD 
> -
>
> Key: HUDI-683
> URL: https://issues.apache.org/jira/browse/HUDI-683
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.6.1
>
>
> This issue is to track a prototype (after we knock off all the refactoring in 
> this ticket and HUDI-667, HUDI-43) and open a WIP PR for broader discussion.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-538) Restructuring hudi client module for multi engine support

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-538:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Restructuring hudi client module for multi engine support
> -
>
> Key: HUDI-538
> URL: https://issues.apache.org/jira/browse/HUDI-538
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: vinoyang
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.6.1
>
>
> Hudi is currently tightly coupled with the Spark framework, which makes 
> integration with other computing engines more difficult. We plan to decouple 
> it from Spark. This umbrella issue is used to track this work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-837) Fix AvroKafkaSource to use the latest schema for reading

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-837:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Fix AvroKafkaSource to use the latest schema for reading
> 
>
> Key: HUDI-837
> URL: https://issues.apache.org/jira/browse/HUDI-837
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.1
>
>
> Currently we specify KafkaAvroDeserializer as the value for 
> value.deserializer in AvroKafkaSource. This implies the published record is 
> read using the same schema with which it was written, even though the schema 
> may have evolved in between. As a result, messages in an incoming batch can 
> have different schemas, which has to be handled at the time of actually 
> writing records to parquet. 
> This Jira aims at providing an option to read all the messages with the same 
> schema by implementing a new custom deserializer class (see the sketch below). 
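For context, a configuration sketch (the custom deserializer class name is
hypothetical; the property keys and the Confluent KafkaAvroDeserializer class are
standard Kafka/Confluent names):

{code:java}
import java.util.Properties;

class KafkaDeserializerConfigSketch {
  static Properties consumerProps(boolean readWithLatestSchema) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("schema.registry.url", "http://localhost:8081");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", readWithLatestSchema
        ? "org.example.LatestSchemaAvroDeserializer"                // hypothetical custom class
        : "io.confluent.kafka.serializers.KafkaAvroDeserializer");  // current behavior
    return props;
  }
}
{code}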



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-766) Update Apache Hudi website with usage info about HoodieMultiTableDeltaStreamer

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-766.

Resolution: Fixed

> Update Apache Hudi website with usage info about HoodieMultiTableDeltaStreamer
> --
>
> Key: HUDI-766
> URL: https://issues.apache.org/jira/browse/HUDI-766
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs, docs-chinese
>Reporter: Balaji Varadarajan
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Relevant Section : 
> [https://hudi.apache.org/docs/writing_data.html#deltastreamer]
> Add high-level description about this tool 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-859) Improve documentation around key generators

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-859.

Resolution: Fixed

> Improve documentation around key generators
> ---
>
> Key: HUDI-859
> URL: https://issues.apache.org/jira/browse/HUDI-859
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>
> Proper documentation is required to help users understand what all key 
> generators are currently supported, how to use them etc. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-979) AWSDMSPayload delete handling with MOR

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-979:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> AWSDMSPayload delete handling with MOR
> --
>
> Key: HUDI-979
> URL: https://issues.apache.org/jira/browse/HUDI-979
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: liwei
>Priority: Major
> Fix For: 0.6.1
>
>
> [https://github.com/apache/hudi/issues/1549] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1059) Add test to verify partition path gets updated with global bloom

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-1059.
-
Resolution: Fixed

Looks like this is resolved with 
[https://github.com/apache/hudi/pull/1793/files#diff-0e4d95438438a372019699b7d92f27db.]
 Please re-open if needed.

> Add test to verify partition path gets updated with global bloom
> 
>
> Key: HUDI-1059
> URL: https://issues.apache.org/jira/browse/HUDI-1059
> Project: Apache Hudi
>  Issue Type: Test
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Add a test to verify that the partition path gets updated with global bloom, 
> if the respective config is set appropriately, when records are upserted with 
> a different partition path. 
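A sketch (not from the issue) of the configuration such a test would exercise; the flag name is the one I believe the global bloom index uses for relocating records across partitions, so treat it as an assumption:

{code:java}
import java.util.Properties;

public class GlobalBloomUpdatePartitionPathExample {
  public static void main(String[] args) {
    // Sketch: with GLOBAL_BLOOM, this flag (assumed to be
    // "hoodie.bloom.index.update.partition.path") controls whether an upsert
    // arriving with a different partition path deletes the old copy and
    // inserts the record under the new partition, instead of updating it in
    // place under the old one.
    Properties props = new Properties();
    props.setProperty("hoodie.index.type", "GLOBAL_BLOOM");
    props.setProperty("hoodie.bloom.index.update.partition.path", "true");
    props.forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
{code}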



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1120) Support spotless for scala

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1120:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Support spotless for scala
> --
>
> Key: HUDI-1120
> URL: https://issues.apache.org/jira/browse/HUDI-1120
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Code Cleanup
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1177) fix TimestampBasedKeyGenerator Task not serializableException

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1177:

Fix Version/s: (was: 0.6.0)
   0.6.1

> fix TimestampBasedKeyGenerator  Task not serializableException
> --
>
> Key: HUDI-1177
> URL: https://issues.apache.org/jira/browse/HUDI-1177
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] rubenssoto edited a comment on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

2020-08-14 Thread GitBox


rubenssoto edited a comment on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-674194952


   **Commit Files:** 
[Hudi.zip](https://github.com/apache/hudi/files/5076229/Hudi.zip)
   
   
   I think it is the commit files.
   
   https://user-images.githubusercontent.com/36298331/90279124-f3edad80-de3e-11ea-8c3c-76872a1738f9.png
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1192) Fix the failure of creating hive database

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1192:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Fix the failure of creating hive database
> -
>
> Key: HUDI-1192
> URL: https://issues.apache.org/jira/browse/HUDI-1192
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.6.0
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> {code:java}
> org.apache.hudi.hive.HoodieHiveSyncException: Failed in executing SQL create 
> database if not exists data_lake
>   at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:352)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:121)
>   at 
> org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:94)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncMeta(DeltaSync.java:510)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:425)
>   at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:244)
>   at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:579)
>   at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException No valid privileges
>  User lingqu does not have privileges for CREATEDATABASE
>  The required privileges: Server=server1->action=create->grantOption=false;
>   at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:266)
>   at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:252)
>   at 
> org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:309)
>   at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:250)
>   at 
> org.apache.hudi.hive.HoodieHiveClient.updateHiveSQL(HoodieHiveClient.java:350)
>   ... 10 more
> Caused by: org.apache.hive.service.cli.HiveSQLException: Error while 
> compiling statement: FAILED: SemanticException No valid privileges
>  User lingqu does not have privileges for CREATEDATABASE
>  The required privileges: Server=server1->action=create->grantOption=false;
>   at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:329)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:207)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:290)
>   at 
> org.apache.hive.service.cli.operation.Operation.run(Operation.java:260)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:505)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:491)
>   at 
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:295)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:507)
>   at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
>   at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   ... 3 more
> Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: No valid 
> privileges
>  User lingqu does not have privileges for CREATEDATABASE
>  The required privileges: Server=server1->action=create->grantOption=false;
>   at 
> org.apache.sentry.binding.hive.HiveAuthzBindingHook.postAnalyze(HiveAuthzBindingHook.java:371)
>   at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:600)
>   at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1425)
>   at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:1398)
>   at 
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:205)
>   ... 15 more
> Caused by: org.apache.hadoop.h

[GitHub] [hudi] rubenssoto commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

2020-08-14 Thread GitBox


rubenssoto commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-674194952


   [Hudi.zip](https://github.com/apache/hudi/files/5076229/Hudi.zip)
   
   
   I think it is the commit files.
   
   https://user-images.githubusercontent.com/36298331/90279124-f3edad80-de3e-11ea-8c3c-76872a1738f9.png
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-915) Partition Columns missing in files upserted after Metadata Bootstrap

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-915:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Partition Columns missing in files upserted after Metadata Bootstrap
> 
>
> Key: HUDI-915
> URL: https://issues.apache.org/jira/browse/HUDI-915
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Common Core
>Reporter: Udit Mehrotra
>Assignee: Udit Mehrotra
>Priority: Blocker
> Fix For: 0.6.1
>
>
> This issue happens when the source data is partitioned using _*hive-style 
> partitioning*_ which is also the default behavior of spark when it writes the 
> data. With this partitioning, the partition column/schema is never stored in 
> the files but instead retrieved on the fly from the file paths which have 
> partition folder in the form *_partition_key=partition_value_*.
> Now, during metadata bootstrap we store only the metadata columns in the hudi 
> table folder. Also the *bootstrap schema* we are computing directly reads 
> schema from the source data file which does not have the *partition column 
> schema* in it. Thus it is not complete.
> All this manifests as issues when we ultimately do *upserts* on these 
> bootstrapped files and they become fully bootstrapped. At upsert time the 
> schema evolves, because the upsert dataframe needs to have the partition 
> column in it to perform upserts. Thus the *upserted rows* have the correct 
> partition column value stored, while the other records, which are simply 
> copied over from the metadata bootstrap file, have the partition column 
> missing. As a result, we observe different behavior between 
> *bootstrapped* and *non-bootstrapped* tables.
> While this does not currently create issues with *Hive*, because it is able 
> to determine the partition columns from all the metadata it stores, it does 
> create a problem with other engines like *Spark*, where the partition columns 
> will show up as *null* when the upserted files are read.
> Thus, the proposal is to fix the following issues:
>  * When performing bootstrap, figure out the partition schema and store it in 
> the *bootstrap schema* in the commit metadata file. This would provide the 
> following benefits:
>  ** From a completeness perspective this is good, so that there are no 
> behavioral changes between bootstrapped and non-bootstrapped tables.
>  ** In the Spark bootstrap relation and incremental query relation, where we 
> need to figure out the latest schema, one can simply get the accurate schema 
> from the commit metadata file instead of having to determine whether or not 
> the partition column is present in the schema obtained from the metadata file 
> and, if not, figure out the partition schema every time and merge (which can 
> be expensive).
>  * When doing upsert on files that are metadata bootstrapped, the partition 
> column values should be correctly determined and copied to the upserted file 
> to avoid missing and null values.
>  ** Again this is consistent behavior with non-bootstrapped tables and even 
> though Hive seems to somehow handle this, we should consider other engines 
> like *Spark* where it cannot be automatically handled.
>  ** Without this it will be significantly more complicated to provide the 
> partition value on the read side in Spark, i.e. to determine every time 
> whether the partition value is null and somehow fill it in.
>  ** Once the table is fully bootstrapped at some point in the future, and the 
> bootstrap commit is, say, cleaned up and Spark querying happens through the 
> *parquet* datasource instead of the *new bootstrapped datasource*, the *parquet 
> datasource* will return null values wherever it finds missing partition 
> values. In that case, we have no control over the *parquet* datasource as it 
> is simply reading from the file. 
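To make the first proposal concrete, here is a small sketch (mine, not from the issue) of recovering partition column names and values from a hive-style partition path, which is the information that would be folded into the bootstrap schema and copied into upserted files:

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

public class HiveStylePartitionParser {

  // Parses a hive-style relative partition path such as
  // "event_date=2020-08-14/region=us-east-1" into an ordered map of
  // partition column -> value. Illustrative only.
  public static Map<String, String> parse(String relativePartitionPath) {
    Map<String, String> columns = new LinkedHashMap<>();
    for (String level : relativePartitionPath.split("/")) {
      int eq = level.indexOf('=');
      if (eq < 0) {
        // Not hive-style: the value is there but the column name is not recoverable.
        continue;
      }
      columns.put(level.substring(0, eq), level.substring(eq + 1));
    }
    return columns;
  }

  public static void main(String[] args) {
    System.out.println(parse("event_date=2020-08-14/region=us-east-1"));
  }
}
{code}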



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1021) [Bug] Unable to update bootstrapped table using rows from the written bootstrapped table

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1021:

Fix Version/s: (was: 0.6.0)
   0.6.1

> [Bug] Unable to update bootstrapped table using rows from the written 
> bootstrapped table
> 
>
> Key: HUDI-1021
> URL: https://issues.apache.org/jira/browse/HUDI-1021
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: bootstrap
>Reporter: Udit Mehrotra
>Assignee: Wenning Ding
>Priority: Blocker
> Fix For: 0.6.1
>
>
> Reproduction Steps:
>  
> {code:java}
> import spark.implicits._
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.DataSourceReadOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.hudi.HoodieDataSourceHelpers
> import org.apache.hudi.common.model.HoodieTableType
> import org.apache.spark.sql.SaveMode
> import org.apache.spark.sql.functions.lit  // needed for the lit() used below
> val sourcePath = 
> "s3://uditme-iad/hudi/tables/events/events_data_partitioned_non_null"
> val sourceDf = spark.read.parquet(sourcePath + "/*")
> var tableName = "events_data_partitioned_non_null_00"
> var tablePath = "s3://emr-users/uditme/hudi/tables/events/" + tableName
> sourceDf.write.format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, tableName)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>  .mode(SaveMode.Overwrite)
>  .save(tablePath)
> val readDf = spark.read.format("org.apache.hudi").load(tablePath + "/*")
> val updateDf = readDf.filter($"event_id" === "106")
>  .withColumn("event_name", lit("udit_event_106"))
>  
> updateDf.write.format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, tableName)
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.COW_STORAGE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>  .mode(SaveMode.Append)
>  .save(tablePath)
> {code}
>  
> Full Stack trace:
> {noformat}
> Caused by: org.apache.hudi.exception.HoodieUpsertException: Error upserting 
> bucketType UPDATE for partition :0
>  at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:276)
>  at 
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:102)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:875)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:359)
>  at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:357)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1181)
>  at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1155)
>  at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1090)
>  at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1155)
>  at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:881)
>  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:357)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:308)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Tas

[GitHub] [hudi] rubenssoto commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

2020-08-14 Thread GitBox


rubenssoto commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-674193769


   Hi @bhasudha ,
   
   Yes, this is a simple job: it reads parquet generated by AWS DMS and writes 
to Hudi.
   
   My spark submit:
   spark-submit --deploy-mode cluster --conf 
spark.dynamicAllocation.cachedExecutorIdleTimeout=60s --conf 
spark.dynamicAllocation.executorIdleTimeout=60s --conf 
spark.dynamicAllocation.maxExecutors=1 --conf 
spark.executor.memoryOverhead=2048 --conf spark.executor.cores=3 --conf 
spark.executor.memory=10g --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.sql.hive.convertMetastoreParquet=false --packages 
org.apache.hudi:hudi-spark-bundle_2.11:0.5.3,org.apache.spark:spark-avro_2.11:2.4.4
 --py-files python_modules main.py poc_available_history
   
   
   
   https://user-images.githubusercontent.com/36298331/90278335-afaddd80-de3d-11ea-8cf0-9fd168fa643a.png
   https://user-images.githubusercontent.com/36298331/90278349-b63c5500-de3d-11ea-91a9-02f3ae5bfc76.png
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-14 Thread GitBox


n3nash commented on pull request #1964:
URL: https://github.com/apache/hudi/pull/1964#issuecomment-674193349


   @bvaradar can you review this ?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-808) Support for cleaning source data

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha resolved HUDI-808.

Resolution: Fixed

Looks like the corresponding PR is merged. If there is anything left, feel free 
to re-open [~wenningd] [~uditme]

> Support for cleaning source data
> 
>
> Key: HUDI-808
> URL: https://issues.apache.org/jira/browse/HUDI-808
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is an important requirement from a GDPR perspective. When performing 
> deletion on a metadata-only bootstrapped partition, users should have the 
> ability to request cleanup of the original data from the source location, 
> because under this new bootstrapping mechanism the original data serves as 
> the data for the original commit in Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-808) Support for cleaning source data

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-808:
---
Status: Closed  (was: Patch Available)

> Support for cleaning source data
> 
>
> Key: HUDI-808
> URL: https://issues.apache.org/jira/browse/HUDI-808
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is an important requirement from a GDPR perspective. When performing 
> deletion on a metadata-only bootstrapped partition, users should have the 
> ability to request cleanup of the original data from the source location, 
> because under this new bootstrapping mechanism the original data serves as 
> the data for the original commit in Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-808) Support for cleaning source data

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha reopened HUDI-808:


> Support for cleaning source data
> 
>
> Key: HUDI-808
> URL: https://issues.apache.org/jira/browse/HUDI-808
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Udit Mehrotra
>Assignee: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is an important requirement from a GDPR perspective. When performing 
> deletion on a metadata-only bootstrapped partition, users should have the 
> ability to request cleanup of the original data from the source location, 
> because under this new bootstrapping mechanism the original data serves as 
> the data for the original commit in Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] n3nash commented on a change in pull request #1963: [HUDI-1188] Hbase index MOR tables records not being deduplicated

2020-08-14 Thread GitBox


n3nash commented on a change in pull request #1963:
URL: https://github.com/apache/hudi/pull/1963#discussion_r470768575



##
File path: hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
##
@@ -182,6 +182,7 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
 // 2) is less than the first commit ts in the timeline
 return !commitTimeline.empty()
 && (commitTimeline.containsInstant(new HoodieInstant(false, 
HoodieTimeline.COMMIT_ACTION, commitTs))

Review comment:
   Yes, @rmpifer please use the above API Vinoth mentioned to refactor this 
code so we can include both DELTA_COMMIT & COMMIT actions
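For illustration only (my own sketch of the intent, not necessarily the API Vinoth referred to), the check could accept both COMMIT and DELTA_COMMIT instants, since MOR tables write delta commits:

```java
import org.apache.hudi.common.table.timeline.HoodieInstant;
import org.apache.hudi.common.table.timeline.HoodieTimeline;

public class ValidCommitCheckSketch {

  // Sketch: a commit ts stored in HBase is treated as matching the timeline if
  // it corresponds to either a COMMIT or a DELTA_COMMIT instant.
  static boolean matchesCommitOrDeltaCommit(HoodieTimeline commitTimeline, String commitTs) {
    return !commitTimeline.empty()
        && (commitTimeline.containsInstant(
                new HoodieInstant(false, HoodieTimeline.COMMIT_ACTION, commitTs))
            || commitTimeline.containsInstant(
                new HoodieInstant(false, HoodieTimeline.DELTA_COMMIT_ACTION, commitTs)));
  }
}
```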





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Closed] (HUDI-1182) Stabilize CI

2020-08-14 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan closed HUDI-1182.


> Stabilize CI 
> -
>
> Key: HUDI-1182
> URL: https://issues.apache.org/jira/browse/HUDI-1182
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Documenting all failure cases here :
>  
> Integration Tests:
>  
> 1. ITTestHoodieSanity :  Command 
> ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, --hive-sync, 
> --table-path, hdfs://namenode/docker_hoodie_single_partition_key_cow_test, 
> --hive-url, jdbc:hive2://hiveserver:1, --table-type, COPY_ON_WRITE, 
> --hive-table, docker_hoodie_single_partition_key_cow_test]) 
> {code:java}
> 05:17:48.384 [main] ERROR HoodieJavaStreamingApp - Got error running app 
> java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: 
> Expecting 100 records, Got 50 at 
> java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_212] at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_212] at 
> HoodieJavaStreamingApp.run(HoodieJavaStreamingApp.java:193) 
> ~[test-classes/:?] at 
> HoodieJavaStreamingApp.main(HoodieJavaStreamingApp.java:126) 
> [test-classes/:?] Caused by: java.lang.IllegalArgumentException: Expecting 
> 100 records, Got 50 at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>  ~[hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar:0.6.0-SNAPSHOT] at 
> HoodieJavaStreamingApp.addInputAndValidateIngestion(HoodieJavaStreamingApp.java:352)
>  ~[test-classes/:?] at 
> HoodieJavaStreamingApp.lambda$run$1(HoodieJavaStreamingApp.java:186) 
> ~[test-classes/:?] at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_212] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_212] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]
> {code}
>  
>  Comments:  From the context, it looks like out of 100 records in the 1st batch, 
> only 50 records were updated but all 100 records were expected to be updated.
>  
> 2. ITTestHoodieSanity :   Command 
> ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, --hive-sync, 
> --table-path, hdfs://namenode/docker_hoodie_single_partition_key_mor_test, 
> --hive-url, jdbc:hive2://hiveserver:1, --table-type, MERGE_ON_READ, 
> --hive-table, docker_hoodie_single_partition_key_mor_test]) expected to 
> succeed. Exit (255) 
>  ==> expected: <0> but was: <255>
>  
> {code:java}
>  Instants :[[20200812012618__deltacommit__COMPLETED], 
> [20200812012629__deltacommit__COMPLETED]]
>  Instants :[[20200812012618__deltacommit__COMPLETED], 
> [20200812012629__deltacommit__COMPLETED]]
>  01:29:35.754 [main] ERROR HoodieJavaStreamingApp - Got error running app 
>  java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Timedout waiting for 3 commits to appear in 
>  hdfs://namenode/docker_hoodie_single_partition_key_mor_test
>  at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_212]
>  at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_212]
>  at HoodieJavaStreamingApp.run(HoodieJavaStreamingApp.java:185) 
> ~[test-classes/:?]
>  at HoodieJavaStreamingApp.main(HoodieJavaStreamingApp.java:118) 
> [test-classes/:?]
>  Caused by: java.lang.IllegalStateException: Timedout waiting for 3 commits 
> to appear in hdfs://namenode/docker_hoodie_
>  single_partition_key_mor_test
>  at HoodieJavaStreamingApp.waitTillNCommits(HoodieJavaStreamingApp.java:261) 
> ~[test-classes/:?]
>  at 
> HoodieJavaStreamingApp.addInputAndValidateIngestion(HoodieJavaStreamingApp.java:298)
>  ~[test-classes/:?]
>  at HoodieJavaStreamingApp.lambda$run$1(HoodieJavaStreamingApp.java:178) 
> ~[test-classes/:?]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_212]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_212]
>  at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]{code}
>  
> From the context: Compaction should have been scheduled and executed but that 
> did not happen here. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1182) Stabilize CI

2020-08-14 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1182:
-
Fix Version/s: 0.6.0

> Stabilize CI 
> -
>
> Key: HUDI-1182
> URL: https://issues.apache.org/jira/browse/HUDI-1182
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Documenting all failure cases here :
>  
> Integration Tests:
>  
> 1. ITTestHoodieSanity :  Command 
> ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, --hive-sync, 
> --table-path, hdfs://namenode/docker_hoodie_single_partition_key_cow_test, 
> --hive-url, jdbc:hive2://hiveserver:1, --table-type, COPY_ON_WRITE, 
> --hive-table, docker_hoodie_single_partition_key_cow_test]) 
> {code:java}
> 05:17:48.384 [main] ERROR HoodieJavaStreamingApp - Got error running app 
> java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: 
> Expecting 100 records, Got 50 at 
> java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_212] at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_212] at 
> HoodieJavaStreamingApp.run(HoodieJavaStreamingApp.java:193) 
> ~[test-classes/:?] at 
> HoodieJavaStreamingApp.main(HoodieJavaStreamingApp.java:126) 
> [test-classes/:?] Caused by: java.lang.IllegalArgumentException: Expecting 
> 100 records, Got 50 at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>  ~[hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar:0.6.0-SNAPSHOT] at 
> HoodieJavaStreamingApp.addInputAndValidateIngestion(HoodieJavaStreamingApp.java:352)
>  ~[test-classes/:?] at 
> HoodieJavaStreamingApp.lambda$run$1(HoodieJavaStreamingApp.java:186) 
> ~[test-classes/:?] at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_212] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_212] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]
> {code}
>  
>  Comments:  From the context, it looks like out of 100 records in the 1st batch, 
> only 50 records were updated but all 100 records were expected to be updated.
>  
> 2. ITTestHoodieSanity :   Command 
> ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, --hive-sync, 
> --table-path, hdfs://namenode/docker_hoodie_single_partition_key_mor_test, 
> --hive-url, jdbc:hive2://hiveserver:1, --table-type, MERGE_ON_READ, 
> --hive-table, docker_hoodie_single_partition_key_mor_test]) expected to 
> succeed. Exit (255) 
>  ==> expected: <0> but was: <255>
>  
> {code:java}
>  Instants :[[20200812012618__deltacommit__COMPLETED], 
> [20200812012629__deltacommit__COMPLETED]]
>  Instants :[[20200812012618__deltacommit__COMPLETED], 
> [20200812012629__deltacommit__COMPLETED]]
>  01:29:35.754 [main] ERROR HoodieJavaStreamingApp - Got error running app 
>  java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Timedout waiting for 3 commits to appear in 
>  hdfs://namenode/docker_hoodie_single_partition_key_mor_test
>  at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_212]
>  at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_212]
>  at HoodieJavaStreamingApp.run(HoodieJavaStreamingApp.java:185) 
> ~[test-classes/:?]
>  at HoodieJavaStreamingApp.main(HoodieJavaStreamingApp.java:118) 
> [test-classes/:?]
>  Caused by: java.lang.IllegalStateException: Timedout waiting for 3 commits 
> to appear in hdfs://namenode/docker_hoodie_
>  single_partition_key_mor_test
>  at HoodieJavaStreamingApp.waitTillNCommits(HoodieJavaStreamingApp.java:261) 
> ~[test-classes/:?]
>  at 
> HoodieJavaStreamingApp.addInputAndValidateIngestion(HoodieJavaStreamingApp.java:298)
>  ~[test-classes/:?]
>  at HoodieJavaStreamingApp.lambda$run$1(HoodieJavaStreamingApp.java:178) 
> ~[test-classes/:?]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_212]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_212]
>  at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]{code}
>  
> From the context: Compaction should have been scheduled and executed but that 
> did not happen here. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1182) Stabilize CI

2020-08-14 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan resolved HUDI-1182.
--
Resolution: Fixed

> Stabilize CI 
> -
>
> Key: HUDI-1182
> URL: https://issues.apache.org/jira/browse/HUDI-1182
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>
> Documenting all failure cases here :
>  
> Integration Tests:
>  
> 1. ITTestHoodieSanity :  Command 
> ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, --hive-sync, 
> --table-path, hdfs://namenode/docker_hoodie_single_partition_key_cow_test, 
> --hive-url, jdbc:hive2://hiveserver:1, --table-type, COPY_ON_WRITE, 
> --hive-table, docker_hoodie_single_partition_key_cow_test]) 
> {code:java}
> 05:17:48.384 [main] ERROR HoodieJavaStreamingApp - Got error running app 
> java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: 
> Expecting 100 records, Got 50 at 
> java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_212] at 
> java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_212] at 
> HoodieJavaStreamingApp.run(HoodieJavaStreamingApp.java:193) 
> ~[test-classes/:?] at 
> HoodieJavaStreamingApp.main(HoodieJavaStreamingApp.java:126) 
> [test-classes/:?] Caused by: java.lang.IllegalArgumentException: Expecting 
> 100 records, Got 50 at 
> org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>  ~[hudi-spark-bundle_2.11-0.6.0-SNAPSHOT.jar:0.6.0-SNAPSHOT] at 
> HoodieJavaStreamingApp.addInputAndValidateIngestion(HoodieJavaStreamingApp.java:352)
>  ~[test-classes/:?] at 
> HoodieJavaStreamingApp.lambda$run$1(HoodieJavaStreamingApp.java:186) 
> ~[test-classes/:?] at 
> java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212] at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_212] at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_212] at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]
> {code}
>  
>  Comments:  From the context, it looks like out of 100 records in the 1st batch, 
> only 50 records were updated but all 100 records were expected to be updated.
>  
> 2. ITTestHoodieSanity :   Command 
> ([/var/hoodie/ws/hudi-spark/run_hoodie_streaming_app.sh, --hive-sync, 
> --table-path, hdfs://namenode/docker_hoodie_single_partition_key_mor_test, 
> --hive-url, jdbc:hive2://hiveserver:1, --table-type, MERGE_ON_READ, 
> --hive-table, docker_hoodie_single_partition_key_mor_test]) expected to 
> succeed. Exit (255) 
>  ==> expected: <0> but was: <255>
>  
> {code:java}
>  Instants :[[20200812012618__deltacommit__COMPLETED], 
> [20200812012629__deltacommit__COMPLETED]]
>  Instants :[[20200812012618__deltacommit__COMPLETED], 
> [20200812012629__deltacommit__COMPLETED]]
>  01:29:35.754 [main] ERROR HoodieJavaStreamingApp - Got error running app 
>  java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> Timedout waiting for 3 commits to appear in 
>  hdfs://namenode/docker_hoodie_single_partition_key_mor_test
>  at java.util.concurrent.FutureTask.report(FutureTask.java:122) ~[?:1.8.0_212]
>  at java.util.concurrent.FutureTask.get(FutureTask.java:192) ~[?:1.8.0_212]
>  at HoodieJavaStreamingApp.run(HoodieJavaStreamingApp.java:185) 
> ~[test-classes/:?]
>  at HoodieJavaStreamingApp.main(HoodieJavaStreamingApp.java:118) 
> [test-classes/:?]
>  Caused by: java.lang.IllegalStateException: Timedout waiting for 3 commits 
> to appear in hdfs://namenode/docker_hoodie_
>  single_partition_key_mor_test
>  at HoodieJavaStreamingApp.waitTillNCommits(HoodieJavaStreamingApp.java:261) 
> ~[test-classes/:?]
>  at 
> HoodieJavaStreamingApp.addInputAndValidateIngestion(HoodieJavaStreamingApp.java:298)
>  ~[test-classes/:?]
>  at HoodieJavaStreamingApp.lambda$run$1(HoodieJavaStreamingApp.java:178) 
> ~[test-classes/:?]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_212]
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  ~[?:1.8.0_212]
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  ~[?:1.8.0_212]
>  at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_212]{code}
>  
> From the context: Compaction should have been scheduled and executed but that 
> did not happen here. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-920) Incremental view on MOR table using Spark Datasource

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-920:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Incremental view on MOR table using Spark Datasource
> 
>
> Key: HUDI-920
> URL: https://issues.apache.org/jira/browse/HUDI-920
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-686:
---
Fix Version/s: (was: 0.6.0)
   0.6.1

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.1
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-box experience for small workloads. 
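As a rough illustration of what avoiding memory caching can look like (my sketch, not the actual design; FileKeyStats and its bloom stand-in are hypothetical): for each incoming key, prune candidate files by their min/max key range and then consult a per-file bloom filter, without materializing a globally cached index.

{code:java}
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class BloomIndexV2Sketch {

  // Hypothetical per-file stats: min/max record key plus a bloom-filter stand-in.
  static final class FileKeyStats {
    final String fileId;
    final String minKey;
    final String maxKey;
    final BitSet bloomBits;

    FileKeyStats(String fileId, String minKey, String maxKey, BitSet bloomBits) {
      this.fileId = fileId;
      this.minKey = minKey;
      this.maxKey = maxKey;
      this.bloomBits = bloomBits;
    }

    boolean mightContain(String key) {
      return bloomBits.get((key.hashCode() & 0x7fffffff) % bloomBits.size());
    }
  }

  // Candidate files for one key, computed lazily per key (no cached global index).
  static List<String> candidateFiles(String recordKey, Iterable<FileKeyStats> filesInPartition) {
    List<String> candidates = new ArrayList<>();
    for (FileKeyStats stats : filesInPartition) {
      boolean inRange = recordKey.compareTo(stats.minKey) >= 0
          && recordKey.compareTo(stats.maxKey) <= 0;
      if (inRange && stats.mightContain(recordKey)) {
        candidates.add(stats.fileId); // still requires a real key lookup in the file
      }
    }
    return candidates;
  }
}
{code}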



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1108) Allow parallel listing of dataset partitions for various actions during write

2020-08-14 Thread Bhavani Sudha (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bhavani Sudha updated HUDI-1108:

Fix Version/s: (was: 0.6.0)
   0.6.1

> Allow parallel listing of dataset partitions for various actions during write
> -
>
> Key: HUDI-1108
> URL: https://issues.apache.org/jira/browse/HUDI-1108
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Assignee: Ryan Pifer
>Priority: Blocker
> Fix For: 0.6.1
>
>
> Currently we rely on FSUtils.getAllPartitionPaths to return all partitions of 
> a dataset. This implementation is slow on AWS S3 file systems. We need to 
> provide an option to allow the listing to be parallelized.
> GH Issue : [https://github.com/apache/hudi/issues/1837]
>  
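A minimal sketch of the idea (mine, not the actual change; assumes a Hadoop FileSystem handle and a two-level partition layout): list the first-level folders serially, then list their children in parallel instead of walking the whole tree serially.

{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelPartitionListing {

  // Sketch: parallelize the per-folder listing calls, which is where most of
  // the time goes on S3-like stores.
  static List<Path> listPartitions(FileSystem fs, Path basePath) throws IOException {
    FileStatus[] firstLevel = fs.listStatus(basePath);
    return Arrays.stream(firstLevel)
        .parallel()
        .filter(FileStatus::isDirectory)
        .flatMap(dir -> {
          try {
            return Arrays.stream(fs.listStatus(dir.getPath()))
                .filter(FileStatus::isDirectory)
                .map(FileStatus::getPath);
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        })
        .collect(Collectors.toList());
  }
}
{code}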



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

