[GitHub] [hudi] bvaradar opened a new pull request #1967: Fix Integration test flakiness in HoodieJavaStreamingApp

2020-08-13 Thread GitBox


bvaradar opened a new pull request #1967:
URL: https://github.com/apache/hudi/pull/1967


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1190:
-
Status: Closed  (was: Patch Available)

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1190:
-
Status: Patch Available  (was: In Progress)

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1013) Bulk Insert w/o converting to RDD

2020-08-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1013:
-
Status: Closed  (was: Patch Available)

> Bulk Insert w/o converting to RDD
> -
>
> Key: HUDI-1013
> URL: https://issues.apache.org/jira/browse/HUDI-1013
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Our bulk insert (in fact, all of our operations) converts the Dataset to an RDD 
> in HoodieSparkSqlWriter, and our HoodieClient deals with JavaRDDs. We are trying 
> to see if we can improve performance by avoiding the RDD conversion. We will 
> first get bulk insert working end to end, and then decide, after some 
> performance analysis, whether to do this for other operations too. 
>  
> At a high level, this is the idea (an illustrative sketch follows at the end of 
> this description):
> 1. The Dataset will be passed all the way from the Spark SQL writer to the 
> storage writer. We do not convert to HoodieRecord at any point. 
> 2. We need to use 
> [ParquetWriteSupport|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala]
> to write to Parquet as InternalRows.
> 3. With the Dataset, sort by partition path and record key, repartition by the 
> parallelism config, and do mapPartitions. Within mapPartitions, iterate through 
> the Rows, encode them to InternalRows, and write to Parquet using the write 
> support linked above. 
> We first wanted to check whether this strategy would actually improve 
> performance, so I did a quick hack of just the mapPartitions function in 
> HoodieSparkSqlWriter to see how the numbers look. Check the operation 
> "bulk_insert_direct_parquet_write_support" 
> [here|#diff-5317f4121df875e406876f9f0f012fac]. 
> These are the numbers I got: (1) is the existing Hudi bulk insert, which does 
> the conversion to JavaRDD; (2) is writing directly to Parquet in Spark (code 
> given below); (3) is the modified Hudi code, i.e. the operation 
> bulk_insert_direct_parquet_write_support.
>  
> |Variant|5M records, 100 parallelism, input size 2.5 GB|
> |(1) Original Hudi bulk insert (unmodified)|169 secs, output size 2.7 GB|
> |(2) Direct Parquet write in Spark|62 secs, output size 2.5 GB|
> |(3) Modified Hudi code (direct Parquet write support)|73 secs, output size 2.5 GB|
>  
> So, essentially, our existing bulk insert takes more than 2x the time of the 
> plain Parquet write. Our modified Hudi code (i.e. the operation 
> bulk_insert_direct_parquet_write_support) is close to the direct Parquet write 
> in Spark, which shows that our strategy should work. 
> // This is the direct Parquet write in Spark ((2) above).
> transformedDF.sort("partition", "key")
>   .coalesce(parallelism)
>   .write.format("parquet")
>   .partitionBy("partition")
>   .mode(saveMode)
>   .save(s"$outputPath/$format")
>  
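> 
> A minimal, illustrative sketch of the proposed (3) flow in Spark/Scala: it lays 
> the data out like snippet (2) above and then processes each partition's rows in 
> a single task. writeRowsAsParquet below is a hypothetical placeholder for the 
> ParquetWriteSupport-backed InternalRow write, not a Hudi or Spark API. 
> 
> import org.apache.spark.sql.{DataFrame, Row}
> 
> // Sketch only: same layout step as snippet (2), then a per-partition write
> // instead of converting the Dataset to a JavaRDD of HoodieRecords.
> def bulkInsertWithoutRddConversion(transformedDF: DataFrame, parallelism: Int): Unit = {
>   transformedDF
>     .sort("partition", "key")   // sort by partition path, then record key
>     .coalesce(parallelism)      // honor the bulk-insert parallelism config
>     .foreachPartition { (rows: Iterator[Row]) =>
>       // Hypothetical: encode each Row to an InternalRow and write it through a
>       // ParquetWriteSupport-backed writer (see the Spark class linked above).
>       writeRowsAsParquet(rows)
>     }
> }
> 
> // Placeholder so the sketch compiles; the real write logic would live here.
> def writeRowsAsParquet(rows: Iterator[Row]): Unit = rows.foreach(_ => ())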



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1190:
-
Status: In Progress  (was: Open)

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch asf-site updated: Travis CI build asf-site

2020-08-13 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new ba57c52  Travis CI build asf-site
ba57c52 is described below

commit ba57c5295e4dda7b51659de9d6f58d65274c3b6f
Author: CI 
AuthorDate: Fri Aug 14 06:29:36 2020 +

Travis CI build asf-site
---
 content/community.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/community.html b/content/community.html
index 9ec95e4..f1db5e9 100644
--- a/content/community.html
+++ b/content/community.html
@@ -232,7 +232,7 @@
 
 
   For quick pings & 1-1 chats
-  Join our https://join.slack.com/t/apache-hudi/signup";>slack 
group. In case your mail domain is not there in pre-approved list for 
joining slack group, please check out the https://github.com/apache/hudi/issues/143";>github issue
+  Join our https://join.slack.com/t/apache-hudi/shared_invite/enQtODYyNDAxNzc5MTg2LTE5OTBlYmVhYjM0N2ZhOTJjOWM4YzBmMWU2MjZjMGE4NDc5ZDFiOGQ2N2VkYTVkNzU3ZDQ4OTI1NmFmYWQ0NzE";>slack
 group. In case this does not work, please leave a comment on this https://github.com/apache/hudi/issues/143";>github issue
 
 
   For proposing large features, changes



[hudi] branch master updated: [HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to mark public APIs (#1965)

2020-08-13 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 9bde6d6  [HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod 
annotations to mark public APIs (#1965)
9bde6d6 is described below

commit 9bde6d616c5ee2ef131037af5b5527a0717cc527
Author: vinoth chandar 
AuthorDate: Thu Aug 13 23:28:17 2020 -0700

[HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to 
mark public APIs (#1965)

- Maturity levels one of : evolving, stable, deprecated
- Took a pass and marked out most of the existing public API
---
 .../java/org/apache/hudi/index/HoodieIndex.java| 11 ++
 .../java/org/apache/hudi/ApiMaturityLevel.java | 42 ++
 .../main/java/org/apache/hudi/PublicAPIClass.java  | 34 --
 .../main/java/org/apache/hudi/PublicAPIMethod.java | 35 --
 .../hudi/common/model/HoodieRecordPayload.java |  8 +
 .../org/apache/hudi/HoodieDataSourceHelpers.java   |  5 +++
 .../java/org/apache/hudi/keygen/KeyGenerator.java  |  8 +
 .../checkpointing/InitialCheckPointProvider.java   |  6 
 .../hudi/utilities/schema/SchemaProvider.java  |  6 
 .../org/apache/hudi/utilities/sources/Source.java  |  5 +++
 .../hudi/utilities/transform/Transformer.java  |  5 +++
 11 files changed, 125 insertions(+), 40 deletions(-)

diff --git a/hudi-client/src/main/java/org/apache/hudi/index/HoodieIndex.java 
b/hudi-client/src/main/java/org/apache/hudi/index/HoodieIndex.java
index 03a965a..4043586 100644
--- a/hudi-client/src/main/java/org/apache/hudi/index/HoodieIndex.java
+++ b/hudi-client/src/main/java/org/apache/hudi/index/HoodieIndex.java
@@ -18,6 +18,9 @@
 
 package org.apache.hudi.index;
 
+import org.apache.hudi.ApiMaturityLevel;
+import org.apache.hudi.PublicAPIClass;
+import org.apache.hudi.PublicAPIMethod;
 import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.model.FileSlice;
 import org.apache.hudi.common.model.HoodieKey;
@@ -45,6 +48,7 @@ import java.io.Serializable;
 /**
  * Base class for different types of indexes to determine the mapping from 
uuid.
  */
+@PublicAPIClass(maturity = ApiMaturityLevel.EVOLVING)
 public abstract class HoodieIndex implements 
Serializable {
 
   protected final HoodieWriteConfig config;
@@ -85,6 +89,7 @@ public abstract class HoodieIndex implements Seri
* Checks if the given [Keys] exists in the hoodie table and returns [Key, 
Option[partitionPath, fileID]] If the
* optional is empty, then the key is not found.
*/
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   public abstract JavaPairRDD>> 
fetchRecordLocation(
   JavaRDD hoodieKeys, final JavaSparkContext jsc, 
HoodieTable hoodieTable);
 
@@ -92,6 +97,7 @@ public abstract class HoodieIndex implements Seri
* Looks up the index and tags each incoming record with a location of a 
file that contains the row (if it is actually
* present).
*/
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   public abstract JavaRDD> 
tagLocation(JavaRDD> recordRDD, JavaSparkContext jsc,
HoodieTable 
hoodieTable) throws HoodieIndexException;
 
@@ -100,12 +106,14 @@ public abstract class HoodieIndex implements Seri
* 
* TODO(vc): We may need to propagate the record as well in a WriteStatus 
class
*/
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   public abstract JavaRDD updateLocation(JavaRDD 
writeStatusRDD, JavaSparkContext jsc,
   HoodieTable 
hoodieTable) throws HoodieIndexException;
 
   /**
* Rollback the efffects of the commit made at instantTime.
*/
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   public abstract boolean rollbackCommit(String instantTime);
 
   /**
@@ -115,6 +123,7 @@ public abstract class HoodieIndex implements Seri
*
* @return whether or not, the index implementation is global in nature
*/
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   public abstract boolean isGlobal();
 
   /**
@@ -123,12 +132,14 @@ public abstract class HoodieIndex implements Seri
*
* @return Returns true/false depending on whether the impl has this 
capability
*/
+  @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public abstract boolean canIndexLogFiles();
 
   /**
* An index is "implicit" with respect to storage, if just writing new data 
to a file slice, updates the index as
* well. This is used by storage, to save memory footprint in certain cases.
*/
+  @PublicAPIMethod(maturity = ApiMaturityLevel.STABLE)
   public abstract boolean isImplicitWithStorage();
 
   /**
diff --git a/hudi-common/src/main/java/org/apache/hudi/ApiMaturityLevel.java 
b/hudi-common/src/main/java/org/apache/hudi/ApiMaturit

[GitHub] [hudi] vinothchandar commented on issue #1786: [SUPPORT] Bulk insert slow on MOR

2020-08-13 Thread GitBox


vinothchandar commented on issue #1786:
URL: https://github.com/apache/hudi/issues/1786#issuecomment-673911154


   All the changes for the sort modes and Spark-native writing are on master and 
will be in the release candidate. @rvd8345 are you interested in helping test 
these? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #1965: [HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to mark public APIs

2020-08-13 Thread GitBox


vinothchandar merged pull request #1965:
URL: https://github.com/apache/hudi/pull/1965


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on issue #1957: [SUPPORT] Small Table Upsert sometimes take a lot of time

2020-08-13 Thread GitBox


bhasudha commented on issue #1957:
URL: https://github.com/apache/hudi/issues/1957#issuecomment-673910635


   @rubenssoto how are you writing the data? Could you paste the spark-submit command? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-13 Thread GitBox


bhasudha commented on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-673910051


   @brandon-stanley Based on your description above, you could try this:
   
   Instead of skipping the precombine field, you could add 
COALESCE(update_date, create_date) as a new column before writing to Hudi and 
pass that new column in as the precombine field. I think you could use 
withColumn() in Spark to do this. Duplicates are then handled based on the 
latest value of the precombine field, which is the COALESCE() described above. 
You wouldn't need to worry about the payload class then.
   
   Please correct me if I am missing something.
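   
   A minimal sketch of that suggestion in Spark/Scala (assumptions: a DataFrame 
`inputDF` with nullable `update_date`/`create_date` columns; the table name, 
record key field, partition field, and base path below are placeholders, while 
the option keys are the standard Hudi datasource write options):
   
   ```
   import org.apache.spark.sql.{DataFrame, SaveMode}
   import org.apache.spark.sql.functions.coalesce
   
   // Derive a single non-null ordering column and use it as the Hudi precombine field.
   def writeWithDerivedPrecombine(inputDF: DataFrame, basePath: String): Unit = {
     val withPrecombine = inputDF.withColumn(
       "precombine_ts", coalesce(inputDF("update_date"), inputDF("create_date")))
   
     withPrecombine.write
       .format("org.apache.hudi")
       .option("hoodie.table.name", "my_table")                            // placeholder
       .option("hoodie.datasource.write.recordkey.field", "record_key")    // placeholder
       .option("hoodie.datasource.write.partitionpath.field", "partition") // placeholder
       .option("hoodie.datasource.write.precombine.field", "precombine_ts")
       .mode(SaveMode.Append)
       .save(basePath)
   }
   ```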
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #1966: [DOCS] Update Slack signup with auto signup link

2020-08-13 Thread GitBox


vinothchandar merged pull request #1966:
URL: https://github.com/apache/hudi/pull/1966


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: [DOCS] Update Slack signup with auto signup link (#1966)

2020-08-13 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 870b8bd  [DOCS] Update Slack signup with auto signup link (#1966)
870b8bd is described below

commit 870b8bd16d5df3866f605cd43b60d07b6bf11538
Author: vinoth chandar 
AuthorDate: Thu Aug 13 23:20:30 2020 -0700

[DOCS] Update Slack signup with auto signup link (#1966)
---
 docs/_pages/community.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/_pages/community.md b/docs/_pages/community.md
index a3bc585..291c096 100644
--- a/docs/_pages/community.md
+++ b/docs/_pages/community.md
@@ -14,7 +14,7 @@ There are several ways to get in touch with the Hudi 
community.
 | For development discussions | Dev Mailing list 
([Subscribe](mailto:dev-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:dev-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?d...@hudi.apache.org)). Empty 
email works for subscribe/unsubscribe. Please use 
[gists](https://gist.github.com) to share code/stacktraces on the email. |
 | For any general questions, user support | Users Mailing list 
([Subscribe](mailto:users-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:users-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?us...@hudi.apache.org)). Empty 
email works for subscribe/unsubscribe. Please use 
[gists](https://gist.github.com) to share code/stacktraces on the email. |
 | For reporting bugs or issues or discover known issues | Please use [ASF Hudi 
JIRA](https://issues.apache.org/jira/projects/HUDI/summary). See 
[#here](#accounts) for access |
-| For quick pings & 1-1 chats | Join our [slack 
group](https://join.slack.com/t/apache-hudi/signup). In case your mail domain 
is not there in pre-approved list for joining slack group, please check out the 
[github issue](https://github.com/apache/hudi/issues/143) |
+| For quick pings & 1-1 chats | Join our [slack 
group](https://join.slack.com/t/apache-hudi/shared_invite/enQtODYyNDAxNzc5MTg2LTE5OTBlYmVhYjM0N2ZhOTJjOWM4YzBmMWU2MjZjMGE4NDc5ZDFiOGQ2N2VkYTVkNzU3ZDQ4OTI1NmFmYWQ0NzE).
 In case this does not work, please leave a comment on this [github 
issue](https://github.com/apache/hudi/issues/143) |
 | For proposing large features, changes | Start a RFC. Instructions 
[here](https://cwiki.apache.org/confluence/display/HUDI/RFC+Process).
  See [#here](#accounts) for wiki access |
 | For stream of commits, pull requests etc | Commits Mailing list 
([Subscribe](mailto:commits-subscr...@hudi.apache.org), 
[Unsubscribe](mailto:commits-unsubscr...@hudi.apache.org), 
[Archives](https://lists.apache.org/list.html?commits@hudi.apache.org)) |



[GitHub] [hudi] vinothchandar commented on a change in pull request #1963: [HUDI-1188] Hbase index MOR tables records not being deduplicated

2020-08-13 Thread GitBox


vinothchandar commented on a change in pull request #1963:
URL: https://github.com/apache/hudi/pull/1963#discussion_r470431566



##
File path: hudi-client/src/main/java/org/apache/hudi/index/hbase/HBaseIndex.java
##
@@ -182,6 +182,7 @@ private boolean checkIfValidCommit(HoodieTableMetaClient 
metaClient, String comm
 // 2) is less than the first commit ts in the timeline
 return !commitTimeline.empty()
 && (commitTimeline.containsInstant(new HoodieInstant(false, 
HoodieTimeline.COMMIT_ACTION, commitTs))

Review comment:
   We could use `metaClient.getCommitsTimeline()` or one of those methods, which 
will automatically determine the commit instant type based on the table type?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1963: [HUDI-1188] Hbase index MOR tables records not being deduplicated

2020-08-13 Thread GitBox


vinothchandar commented on pull request #1963:
URL: https://github.com/apache/hudi/pull/1963#issuecomment-673908103


   @n3nash can you please review this? (We can land it after the release branch is 
cut.) 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar opened a new pull request #1966: [DOCS] Update Slack signup with auto signup link

2020-08-13 Thread GitBox


vinothchandar opened a new pull request #1966:
URL: https://github.com/apache/hudi/pull/1966


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #1961: [SUPPORT] Jetty Not able to find method java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V on Databri

2020-08-13 Thread GitBox


vinothchandar commented on issue #1961:
URL: https://github.com/apache/hudi/issues/1961#issuecomment-673905819


   @saumyasuhagiya https://github.com/apache/hudi/blob/master/README.md has the 
auto signup link. I just opened a PR to update the site to use this instead.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #369

2020-08-13 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.58 KB...]
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${scala.binary.version}:[unknown-version],
 

 line 27, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effec

[jira] [Updated] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1190:
-
Labels: pull-request-available  (was: )

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar opened a new pull request #1965: [HUDI-1190] Introduce @PublicAPIClass and @PublicAPIMethod annotations to mark public APIs

2020-08-13 Thread GitBox


vinothchandar opened a new pull request #1965:
URL: https://github.com/apache/hudi/pull/1965


- Maturity levels one of : evolving, stable, deprecated
- Took a pass and marked out most of the existing public API
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on issue #1961: [SUPPORT] Jetty Not able to find method java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V on Databricks c

2020-08-13 Thread GitBox


bhasudha commented on issue #1961:
URL: https://github.com/apache/hudi/issues/1961#issuecomment-673892377


   > Not able to join the channel as I don't have any email id with the 
mentioned domain. Can you help me to get in? @nsivabalan
   
   There are some resources here: 
https://hudi.apache.org/community.html#engage-with-us. Can you please give 
them a try? If that doesn't work, please send your email id.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] saumyasuhagiya commented on issue #1961: [SUPPORT] Jetty Not able to find method java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V on Databr

2020-08-13 Thread GitBox


saumyasuhagiya commented on issue #1961:
URL: https://github.com/apache/hudi/issues/1961#issuecomment-673890895


   Not able to join the channel as I don't have any email id with the mentioned 
domain. Can you help me to get in? @nsivabalan 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on pull request #1868: [HUDI-1083] Optimization in determining insert bucket location for a given key

2020-08-13 Thread GitBox


shenh062326 commented on pull request #1868:
URL: https://github.com/apache/hudi/pull/1868#issuecomment-673855323


   @nsivabalan Can you take a look at this PR?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] satishkotha opened a new pull request #1964: [HUDI-1191] Add incremental meta client API to query partitions changed

2020-08-13 Thread GitBox


satishkotha opened a new pull request #1964:
URL: https://github.com/apache/hudi/pull/1964


   
   ## What is the purpose of the pull request
   
   Add IncrementalMetaClient as a separate class to query the partitions affected in 
a specified time window.
   
   ## Brief change log
   - Add IncrementalMetaClient as a separate abstraction
   - Modify HiveSync to use this new meta client
   - We also need this for other use cases that need to easily query affected 
partitions (for example: syncing a table between multiple regions)
   
   ## Verify this pull request
   This change added tests 
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-1191) create incremental meta client abstraction to query modified partitions

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1191:
-
Labels: pull-request-available  (was: )

> create incremental meta client abstraction to query modified partitions
> ---
>
> Key: HUDI-1191
> URL: https://issues.apache.org/jira/browse/HUDI-1191
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: satish
>Assignee: satish
>Priority: Minor
>  Labels: pull-request-available
>
> Create incremental client abstraction to query modified partitions for a 
> timeline.
> This can be reused in HiveSync and InputFormats. We also need this as an API 
> for other use cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1191) create incremental meta client abstraction to query modified partitions

2020-08-13 Thread satish (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

satish reassigned HUDI-1191:


Assignee: satish

> create incremental meta client abstraction to query modified partitions
> ---
>
> Key: HUDI-1191
> URL: https://issues.apache.org/jira/browse/HUDI-1191
> Project: Apache Hudi
>  Issue Type: Wish
>Reporter: satish
>Assignee: satish
>Priority: Minor
>
> Create incremental client abstraction to query modified partitions for a 
> timeline.
> This can be reused in HiveSync and InputFormats. We also need this as an API 
> for other use cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1191) create incremental meta client abstraction to query modified partitions

2020-08-13 Thread satish (Jira)
satish created HUDI-1191:


 Summary: create incremental meta client abstraction to query 
modified partitions
 Key: HUDI-1191
 URL: https://issues.apache.org/jira/browse/HUDI-1191
 Project: Apache Hudi
  Issue Type: Wish
Reporter: satish


Create incremental client abstraction to query modified partitions for a 
timeline.

This can be reused in HiveSync and InputFormats. We also need this as an API 
for other use cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-13 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1190:
-
Status: Open  (was: New)

> Annotate all public APIs classes with stability indication
> --
>
> Key: HUDI-1190
> URL: https://issues.apache.org/jira/browse/HUDI-1190
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1190) Annotate all public APIs classes with stability indication

2020-08-13 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-1190:


 Summary: Annotate all public APIs classes with stability indication
 Key: HUDI-1190
 URL: https://issues.apache.org/jira/browse/HUDI-1190
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Code Cleanup
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 0.6.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1184) Support updatePartitionPath for HBaseIndex

2020-08-13 Thread Udit Mehrotra (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Udit Mehrotra reassigned HUDI-1184:
---

Assignee: Ryan Pifer

> Support updatePartitionPath for HBaseIndex
> --
>
> Key: HUDI-1184
> URL: https://issues.apache.org/jira/browse/HUDI-1184
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Index
>Affects Versions: 0.6.1
>Reporter: sivabalan narayanan
>Assignee: Ryan Pifer
>Priority: Major
>
> In the implicit global indexes, we have a config named updatePartitionPath. When 
> an already existing record is upserted into a new partition (compared to where 
> it sits in storage), and the config is set to true, the record is inserted into 
> the new partition and deleted from the old partition. If the config is set to 
> false, the record is upserted into the old partition, ignoring the new partition. 
>  
> We don't think we have this behavior for HBase. We need similar support in HBase too. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1188) MOR hbase index tables not deduplicating records

2020-08-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1188:
-
Labels: pull-request-available  (was: )

> MOR hbase index tables not deduplicating records
> 
>
> Key: HUDI-1188
> URL: https://issues.apache.org/jira/browse/HUDI-1188
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ryan Pifer
>Assignee: Ryan Pifer
>Priority: Major
>  Labels: pull-request-available
>
> After fetching the HBase index entry for a record, Hudi validates that the 
> commit timestamp stored in HBase for that record is a commit on the timeline. 
> This means any record stored in the HBase index during a deltacommit (an upsert 
> on a MOR table) is considered to have an invalid commit and is treated as a new 
> record. This causes the HBase index to be updated every time, which allows 
> records to end up in multiple partitions and even in different file groups 
> within the same partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] rmpifer opened a new pull request #1963: [HUDI-1188] Hbase index MOR tables records not being deduplicated

2020-08-13 Thread GitBox


rmpifer opened a new pull request #1963:
URL: https://github.com/apache/hudi/pull/1963


   ## What is the purpose of the pull request
   
   After fetching the HBase index entry for a record, Hudi validates that the 
commit timestamp stored in HBase for that record is a `commit` on the timeline. 
This means any record stored in the HBase index during a `deltacommit` is 
considered to have an invalid index entry and is treated as a new record. This 
causes the HBase index to be updated every time, which allows records to end up 
in multiple partitions and even in different file groups within the same partition.
   
   ## Brief change log
   
   * Modify HBaseIndex.checkIfValidCommit to consider a DELTA_COMMIT timestamp a 
valid index timestamp
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - Verified the new test failed on a MOR table before the change and succeeds now
 - Manually verified by:
  * Uploading a patched JAR to an EMR 5.30.1 cluster
  * Creating a MOR table with an HBASE index
  * Upserting a record
  * Upserting the record with a new partition
  * Validating that the new partition was not created and the existing 
partition showed the update to the record



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-1189) Change in UserDefinedBulkInsertPartitioner

2020-08-13 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1189:
-

 Summary: Change in UserDefinedBulkInsertPartitioner
 Key: HUDI-1189
 URL: https://issues.apache.org/jira/browse/HUDI-1189
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: sivabalan narayanan


While adding support for multiple sort modes, we renamed the interface 
UserDefinedBulkInsertPartitioner to BulkInsertPartitioner. There is also an 
extra method added to the interface:

/**
 * @return {@code true} if the records within an RDD partition are sorted; 
 *         {@code false} otherwise.
 */
boolean arePartitionRecordsSorted();



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on issue #1911: [SUPPORT] GLOBAL_BLOOM index errors on Upsert operation

2020-08-13 Thread GitBox


nsivabalan commented on issue #1911:
URL: https://github.com/apache/hudi/issues/1911#issuecomment-673747835


   Can you try with the Spark datasource, as shown in the quick start utils, and let 
us know if you can reproduce the issue. 
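   
   For reference, a minimal spark-shell-style sketch of such an upsert through the 
Spark datasource with a GLOBAL_BLOOM index, along the lines of the quick start. 
The tiny dataset, its column names, and the table name/path below are assumptions 
for illustration, not taken from this issue.
   
   ```
   import org.apache.spark.sql.SaveMode
   import spark.implicits._   // in spark-shell, the SparkSession is in scope as `spark`
   
   // Illustrative records: uuid = record key, ts = precombine field, partitionpath = partition field.
   val inputDF = Seq(
     ("id-1", 1597300000L, "2020/08/13"),
     ("id-2", 1597300001L, "2020/08/13")
   ).toDF("uuid", "ts", "partitionpath")
   
   inputDF.write
     .format("org.apache.hudi")
     .option("hoodie.table.name", "test_table")
     .option("hoodie.datasource.write.operation", "upsert")
     .option("hoodie.datasource.write.recordkey.field", "uuid")
     .option("hoodie.datasource.write.partitionpath.field", "partitionpath")
     .option("hoodie.datasource.write.precombine.field", "ts")
     .option("hoodie.index.type", "GLOBAL_BLOOM")
     .mode(SaveMode.Append)
     .save("/tmp/hudi/test_table")
   ```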



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan edited a comment on issue #1961: [SUPPORT] Jetty Not able to find method java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V on Dat

2020-08-13 Thread GitBox


nsivabalan edited a comment on issue #1961:
URL: https://github.com/apache/hudi/issues/1961#issuecomment-673696662


   I don't have any experience with Azure Databricks. Can you post it in [hudi's slack 
channel](https://github.com/apache/hudi/issues/1961#issuecomment-673696662)? 
Someone with experience might help.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #1961: [SUPPORT] Jetty Not able to find method java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V on Databricks

2020-08-13 Thread GitBox


nsivabalan commented on issue #1961:
URL: https://github.com/apache/hudi/issues/1961#issuecomment-673696662


   I don't have any experience with Azure Databricks. Can you post it in hudi's slack 
channel? Someone with experience might help.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on issue #1962: [SUPPORT] Unable to filter hudi table in hive on partition column

2020-08-13 Thread GitBox


nsivabalan commented on issue #1962:
URL: https://github.com/apache/hudi/issues/1962#issuecomment-673694878


   Did you set the Hive input format? Also, can you confirm the settings given 
[here](https://hudi.apache.org/docs/docker_demo.html#step-4-a-run-hive-queries) 
are set? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] brandon-stanley edited a comment on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-13 Thread GitBox


brandon-stanley edited a comment on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-673462785


   @bhasudha Thanks for the response. Does the precombine field have to be a 
non-nullable field/column as well? My dataset may have duplicates but I have 
implemented custom logic to deduplicate since there are two columns within my 
dataset that are used to determine which is the latest record: 
COALESCE(update_date, create_date). I implemented it this way because it is an 
[SCD type 2 
table.](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row)
   
   Also, how would I specify the payload class that would ignore the precombine 
field? I receive the following error when specifying the 
`hoodie.datasource.write.payload.class` configuration property as 
`org.apache.hudi.common.model.HoodieAvroPayload`. Do I need to create a custom 
class that implements the [HoodieRecordPayload 
interface](https://github.com/apache/hudi/blob/release-0.5.2/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java)?
   
   ```
   py4j.protocol.Py4JJavaError: An error occurred while calling o152.save.
   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
in stage 21.0 failed 1 times, most recent failure: Lost task 1.0 in stage 21.0 
(TID 529, localhost, executor driver): java.io.IOException: Could not create 
payload for class: org.apache.hudi.common.model.HoodieAvroPayload
   at 
org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:128)
   at 
org.apache.hudi.DataSourceUtils.createHoodieRecord(DataSourceUtils.java:181)
   at 
org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:103)
   at 
org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:100)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:347)
   at scala.collection.Iterator$class.foreach(Iterator.scala:743)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1174)
   at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
   at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:296)
   at scala.collection.AbstractIterator.to(Iterator.scala:1174)
   at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:288)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1174)
   at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:275)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1174)
   at 
org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
   at 
org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:121)
   at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate 
class
   at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
   at 
org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:125)
   ... 28 more
   Caused by: java.lang.NoSuchMethodException: 
org.apache.hudi.common.model.HoodieAvroPayload.(org.apache.avro.generic.GenericRecord,
 java.lang.Comparable)
   at java.lang.Class.getConstructor0(Class.java:3082)
   at java.lang.Class.getConstructor(Class.java:1825)
   at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:78)
   ... 29 more
   
   Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.sc

[jira] [Created] (HUDI-1188) MOR hbase index tables not deduplicating records

2020-08-13 Thread Ryan Pifer (Jira)
Ryan Pifer created HUDI-1188:


 Summary: MOR hbase index tables not deduplicating records
 Key: HUDI-1188
 URL: https://issues.apache.org/jira/browse/HUDI-1188
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Ryan Pifer
Assignee: Ryan Pifer


After fetching the HBase index entry for a record, Hudi validates that the 
commit timestamp stored in HBase for that record is a commit on the timeline. 
This means any record stored in the HBase index during a deltacommit (an upsert 
on a MOR table) is considered to have an invalid commit and is treated as a new 
record. This causes the HBase index to be updated every time, which allows 
records to end up in multiple partitions and even in different file groups 
within the same partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] brandon-stanley edited a comment on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-13 Thread GitBox


brandon-stanley edited a comment on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-673462785


   @bhasudha Thanks for the response. Does the precombine field have to be a 
non-nullable field/column as well? My dataset may have duplicates but I have 
implemented custom logic to deduplicate since there are two columns within my 
dataset that are used to determine which is the latest record: 
COALESCE(update_date, create_date). I implemented it this way because it is an 
[SCD type 2 
table.](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row)
   
   Also, how would I specify the payload class that would ignore the precombine 
field? I receive the following error when specifying the 
`hoodie.datasource.write.payload.class` configuration property as 
`org.apache.hudi.common.model.HoodieAvroPayload`:
   
   ```
   py4j.protocol.Py4JJavaError: An error occurred while calling o152.save.
   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
in stage 21.0 failed 1 times, most recent failure: Lost task 1.0 in stage 21.0 
(TID 529, localhost, executor driver): java.io.IOException: Could not create 
payload for class: org.apache.hudi.common.model.HoodieAvroPayload
   at 
org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:128)
   at 
org.apache.hudi.DataSourceUtils.createHoodieRecord(DataSourceUtils.java:181)
   at 
org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:103)
   at 
org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:100)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:347)
   at scala.collection.Iterator$class.foreach(Iterator.scala:743)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1174)
   at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
   at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:296)
   at scala.collection.AbstractIterator.to(Iterator.scala:1174)
   at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:288)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1174)
   at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:275)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1174)
   at 
org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
   at 
org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:121)
   at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate 
class
   at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
   at 
org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:125)
   ... 28 more
   Caused by: java.lang.NoSuchMethodException: 
org.apache.hudi.common.model.HoodieAvroPayload.(org.apache.avro.generic.GenericRecord,
 java.lang.Comparable)
   at java.lang.Class.getConstructor0(Class.java:3082)
   at java.lang.Class.getConstructor(Class.java:1825)
   at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:78)
   ... 29 more
   
   Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
   at 
org.apache.spark.sche

[GitHub] [hudi] brandon-stanley edited a comment on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-13 Thread GitBox


brandon-stanley edited a comment on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-673462785


   @bhasudha Thanks for the response. Does the precombine field have to be a 
non-nullable field/column as well? My dataset may have duplicates but I have 
implemented custom logic to deduplicate since there are two columns within my 
dataset that are used to determine which is the latest record: 
COALESCE(update_date, create_date). It is an [SCD type 2 
table.](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row)
   
   Also, how would I specify the payload class that would ignore the precombine 
field? I receive the following error when specifying the 
`hoodie.datasource.write.payload.class` configuration property as 
`org.apache.hudi.common.model.HoodieAvroPayload`:
   
   ```
   py4j.protocol.Py4JJavaError: An error occurred while calling o152.save.
   : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
in stage 21.0 failed 1 times, most recent failure: Lost task 1.0 in stage 21.0 
(TID 529, localhost, executor driver): java.io.IOException: Could not create 
payload for class: org.apache.hudi.common.model.HoodieAvroPayload
   at 
org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:128)
   at 
org.apache.hudi.DataSourceUtils.createHoodieRecord(DataSourceUtils.java:181)
   at 
org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:103)
   at 
org.apache.hudi.HoodieSparkSqlWriter$$anonfun$1.apply(HoodieSparkSqlWriter.scala:100)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:363)
   at scala.collection.Iterator$$anon$10.next(Iterator.scala:347)
   at scala.collection.Iterator$class.foreach(Iterator.scala:743)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1174)
   at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
   at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
   at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:296)
   at scala.collection.AbstractIterator.to(Iterator.scala:1174)
   at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:288)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1174)
   at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:275)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1174)
   at 
org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
   at 
org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
   at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   at org.apache.spark.scheduler.Task.run(Task.scala:121)
   at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.hudi.exception.HoodieException: Unable to instantiate 
class
   at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:80)
   at 
org.apache.hudi.DataSourceUtils.createPayload(DataSourceUtils.java:125)
   ... 28 more
   Caused by: java.lang.NoSuchMethodException: 
org.apache.hudi.common.model.HoodieAvroPayload.<init>(org.apache.avro.generic.GenericRecord, java.lang.Comparable)
   at java.lang.Class.getConstructor0(Class.java:3082)
   at java.lang.Class.getConstructor(Class.java:1825)
   at 
org.apache.hudi.common.util.ReflectionUtils.loadClass(ReflectionUtils.java:78)
   ... 29 more
   
   Driver stacktrace:
   at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
   at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
   at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
   at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGSc

[GitHub] [hudi] tooptoop4 commented on issue #1948: [SUPPORT] DMS example complains about dfs-source.properties

2020-08-13 Thread GitBox


tooptoop4 commented on issue #1948:
URL: https://github.com/apache/hudi/issues/1948#issuecomment-673519517


   I use hoodie-conf as shown in the description, but is the property file mandatory?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] sassai opened a new issue #1962: [SUPPORT] Unable to filter hudi table in hive on partition column

2020-08-13 Thread GitBox


sassai opened a new issue #1962:
URL: https://github.com/apache/hudi/issues/1962


   **Describe the problem you faced**
   
   I'm running a spark structured streaming application that reads data from 
kafka and saves it to a partitioned Hudi MERGE_ON_READ table. Hive sync is 
enabled and I'm able to query the table with the Hive CLI, e.g.:
   
   SELECT * FROM iot_device_ro LIMIT 5; 
   
   ```console
   
++-++-++-+-++---+---+--+-+--++-+---+
   | iot_device_ro._hoodie_commit_time  | iot_device_ro._hoodie_commit_seqno  | 
 iot_device_ro._hoodie_record_key  |
iot_device_ro._hoodie_partition_path |  
iot_device_ro._hoodie_file_name   | iot_device_ro.deviceid  | 
iot_device_ro.sensorid  | iot_device_ro.measurement  | iot_device_ro.measure_ts 
 |  iot_device_ro.uuid   |  iot_device_ro.its   | 
iot_device_ro.year  | iot_device_ro.month  | iot_device_ro.day  | 
iot_device_ro.hour  | iot_device_ro.minute  |
   
++-++-++-+-++---+---+--+-+--++-+---+
   | 20200813121124 | 20200813121124_0_1  | 
uuid:3d387a37-f288-456b-87b7-2b6865cf32e0  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4  | 1   | 30.228266831690732 
| 2020-08-13T08:39:04.528Z  | 3d387a37-f288-456b-87b7-2b6865cf32e0  | 
2020-08-13 12:11:24  | 2020| 8| 13  
   | 12  | 11|
   | 20200813121124 | 20200813121124_0_2  | 
uuid:5bed809e-758f-46dc-b1ab-837ad3eb5a6a  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4  | 1   | 31.453188991515226 
| 2020-08-13T08:39:19.588Z  | 5bed809e-758f-46dc-b1ab-837ad3eb5a6a  | 
2020-08-13 12:11:24  | 2020| 8| 13  
   | 12  | 11|
   | 20200813121124 | 20200813121124_0_3  | 
uuid:6d37be34-6e4b-49b0-b3fe-e6552c2aee22  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4  | 1   | 34.68735798194983  
| 2020-08-13T07:45:05.958Z  | 6d37be34-6e4b-49b0-b3fe-e6552c2aee22  | 
2020-08-13 12:11:24  | 2020| 8| 13  
   | 12  | 11|
   | 20200813121124 | 20200813121124_0_4  | 
uuid:5c2dbea8-9668-4652-84c6-c82d06aa2805  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4  | 1   | 33.680806905962264 
| 2020-08-12T13:33:20.159Z  | 5c2dbea8-9668-4652-84c6-c82d06aa2805  | 
2020-08-13 12:11:24  | 2020| 8| 13  
   | 12  | 11|
   | 20200813121124 | 20200813121124_0_5  | 
uuid:528e6c74-bb44-49da-aa76-059781cc7676  | 
year=2020/month=8/day=13/hour=12/minute=11  | 
53c3c919-ff1c-49f6-ba74-4498b635dfb6-0_0-21-23_20200813121124.parquet | 
iotdevice4  | 1   | 31.38529683936205  
| 2020-08-13T10:57:58.448Z  | 528e6c74-bb44-49da-aa76-059781cc7676  | 
2020-08-13 12:11:24  | 2020| 8| 13  
   | 12  | 11|
   
++-++-++-+-++---+---+--

[GitHub] [hudi] saumyasuhagiya opened a new issue #1961: [SUPPORT] Jetty Not able to find method java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V on Databr

2020-08-13 Thread GitBox


saumyasuhagiya opened a new issue #1961:
URL: https://github.com/apache/hudi/issues/1961


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   Getting java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V on Azure 
Databricks 
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
 
    <dependency>
        <groupId>com.microsoft.pnp</groupId>
        <artifactId>spark-listeners_2.11_2.4.3</artifactId>
        <version>1.0.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.eclipse.jetty.aggregate</groupId>
                <artifactId>jetty-all</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-spark_2.11</artifactId>
        <version>0.5.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.eclipse.jetty.aggregate</groupId>
                <artifactId>jetty-all</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.hive</groupId>
                <artifactId>hive-shims</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-hadoop-mr</artifactId>
        <version>0.5.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hudi</groupId>
        <artifactId>hudi-spark-bundle_2.11</artifactId>
        <version>0.5.3</version>
        <exclusions>
            <exclusion>
                <groupId>org.eclipse.jetty.aggregate</groupId>
                <artifactId>jetty-all</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.eclipse.jetty</groupId>
        <artifactId>jetty-server</artifactId>
        <version>9.4.31.v20200723</version>
    </dependency>
   
   
   I have the above relevant dependencies in the Spark job. I am also adding 
hudi_spark_bundle_2.11_0.5.3 via the --jars option.
   
   Compile and 
   
   
   **Expected behavior**
   
   It should run successfully.
   
   **Environment Description**
   
   * Hudi version :
   
   * Spark version : 2.4.5
   
   * Hive version : 0.5.3
   
   * Hadoop version : As shown in dependency
   
   * Storage (HDFS/S3/GCS..) :  ADLSGen2
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=512m; 
support was removed in 8.0
   20/08/13 08:10:18 ERROR Uncaught throwable from user code: 
java.lang.NoSuchMethodError: 
org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
at 
io.javalin.core.util.JettyServerUtil.defaultSessionHandler(JettyServerUtil.kt:50)
at io.javalin.Javalin.<init>(Javalin.java:94)
at io.javalin.Javalin.create(Javalin.java:107)
at 
org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:102)
at 
org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74)
at 
org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:102)
at 
org.apache.hudi.client.AbstractHoodieClient.<init>(AbstractHoodieClient.java:69)
at 
org.apache.hudi.client.AbstractHoodieWriteClient.<init>(AbstractHoodieWriteClient.java:83)
at 
org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:137)
at 
org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:124)
at 
org.apache.hudi.client.HoodieWriteClient.<init>(HoodieWriteClient.java:120)
at 
org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:195)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:135)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:147)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:135)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:188)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:184)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:135)
at 
org.apache.spark.sql.exe

[jira] [Created] (HUDI-1187) Improvements/Follow up on Bulk Insert V2

2020-08-13 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1187:
-

 Summary: Improvements/Follow up on Bulk Insert V2 
 Key: HUDI-1187
 URL: https://issues.apache.org/jira/browse/HUDI-1187
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: sivabalan narayanan


* Add Javadocs to KeyGeneratorInterface methods.
 * validateRecordKeyFields() in CustomKeyGenerator could be moved up and reused by other key generators. Check and fix it.
 * Unify usage of getters. For example, in SimpleKeyGenerator we have something like RowKeyGeneratorHelper.getRecordKeyFromRow(row, getRecordKeyFields(), recordKeyPositions, false); for recordKeyFields we use getRecordKeyFields(), whereas for recordKeyPositions we use the instance variable directly. Make this uniform in all key generator classes.
 * Remove line 82 in [TestGlobalDeleteKeyGenerator.java|https://github.com/apache/hudi/commit/5dc8182ec308dba7ffd04ef159bd3041ede1b117#diff-4c306975590fe7bf2b27a6f5a9d9ff7e]: keyGenerator.buildFieldPositionMapIfNeeded(KeyGeneratorTestUtilities.structType);
 * Make buildFieldPositionMapIfNeeded(StructType structType) in BuiltinKeyGenerator protected.
 * Introduce a private method (and reuse it) to generate positions for record keys and partition paths; a sketch follows below.
 * Use a boolean positionMapInitialized flag in buildFieldPositionMapIfNeeded instead of the structType null check.
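A rough sketch of the private helper mentioned above (resolveFieldPositions is a hypothetical name, not an existing Hudi method):
{code:java}
// Inside BuiltinKeyGenerator (sketch only): share the position parsing between
// record key fields and partition path fields.
void buildFieldPositionMapIfNeeded(StructType structType) {
  if (this.structType == null) {
    resolveFieldPositions(structType, getRecordKeyFields(), recordKeyPositions, true);
    if (getPartitionPathFields() != null) {
      resolveFieldPositions(structType, getPartitionPathFields(), partitionPathPositions, false);
    }
    this.structType = structType;
  }
}

// Hypothetical shared helper: resolves simple and nested ("a.b.c") field positions.
private void resolveFieldPositions(StructType structType, List<String> fields,
    Map<String, List<Integer>> positions, boolean isRecordKey) {
  for (String field : fields) {
    if (field.isEmpty()) {
      continue; // skip empty entries (the current code filters these for partition paths)
    }
    if (field.contains(".")) {
      // nested field, e.g. "a.b.c"
      positions.put(field, RowKeyGeneratorHelper.getNestedFieldIndices(structType, field, isRecordKey));
    } else if (structType.getFieldIndex(field).isDefined()) {
      positions.put(field, Collections.singletonList((Integer) (structType.getFieldIndex(field).get())));
    } else if (isRecordKey) {
      throw new HoodieKeyException("recordKey value not found for field: \"" + field + "\"");
    } else {
      positions.put(field, Collections.singletonList(-1)); // missing partition path field
    }
  }
}
{code}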



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] brandon-stanley edited a comment on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-13 Thread GitBox


brandon-stanley edited a comment on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-673462785


   @bhasudha Thanks for the response. Does the precombine field have to be a 
non-nullable field/column as well? My dataset may have duplicates but I have 
implemented custom logic to deduplicate since there are two columns within my 
dataset that are used to determine which is the latest record: 
COALESCE(update_date, create_date). It is an [SCD type 2 
table.](https://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2:_add_new_row)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] brandon-stanley commented on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-13 Thread GitBox


brandon-stanley commented on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-673462785


   @bhasudha Thanks for the response. Does the precombine field have to be a 
non-nullable field/column as well? My dataset may have duplicates but I have 
implemented custom logic to deduplicate since there are two columns within my 
dataset that are used to determine which is the latest record: 
COALESCE(update_date, create_date). It is an SCD type 2 table.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-13 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r469922118



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -85,71 +84,40 @@ public final HoodieKey getKey(GenericRecord record) {
 }).collect(Collectors.toList());
   }
 
-  @Override
-  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {
-// parse simple feilds
-getRecordKeyFields().stream()
-.filter(f -> !(f.contains(".")))
-.forEach(f -> {
-  if (structType.getFieldIndex(f).isDefined()) {
-recordKeyPositions.put(f, Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
-  } else {
-throw new HoodieKeyException("recordKey value not found for field: 
\"" + f + "\"");
-  }
-});
-// parse nested fields
-getRecordKeyFields().stream()
-.filter(f -> f.contains("."))
-.forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
-// parse simple fields
-if (getPartitionPathFields() != null) {
-  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
!(f.contains(".")))
+  void buildFieldPositionMapIfNeeded(StructType structType) {
+if (this.structType == null) {
+  // parse simple fields
+  getRecordKeyFields().stream()
+  .filter(f -> !(f.contains(".")))
   .forEach(f -> {
 if (structType.getFieldIndex(f).isDefined()) {
-  partitionPathPositions.put(f,
-  Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
+  recordKeyPositions.put(f, Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
 } else {
-  partitionPathPositions.put(f, Collections.singletonList(-1));
+  throw new HoodieKeyException("recordKey value not found for 
field: \"" + f + "\"");
 }
   });
   // parse nested fields
-  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
f.contains("."))
-  .forEach(f -> partitionPathPositions.put(f,
-  RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
-}
-this.structName = structName;
-this.structType = structType;
-this.recordNamespace = recordNamespace;
-  }
-
-  /**
-   * Fetch record key from {@link Row}.
-   *
-   * @param row instance of {@link Row} from which record key is requested.
-   * @return the record key of interest from {@link Row}.
-   */
-  @Override
-  public String getRecordKey(Row row) {
-if (null == converterFn) {
-  converterFn = AvroConversionHelper.createConverterToAvro(structType, 
structName, recordNamespace);
+  getRecordKeyFields().stream()
+  .filter(f -> f.contains("."))
+  .forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+  // parse simple fields
+  if (getPartitionPathFields() != null) {
+getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f 
-> !(f.contains(".")))
+.forEach(f -> {
+  if (structType.getFieldIndex(f).isDefined()) {
+partitionPathPositions.put(f,
+Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
+  } else {
+partitionPathPositions.put(f, Collections.singletonList(-1));
+  }
+});
+// parse nested fields
+getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f 
-> f.contains("."))
+.forEach(f -> partitionPathPositions.put(f,
+RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
+  }
+  this.structType = structType;

Review comment:
   may I know where is the structType being used ? 
AvroConversionHelper.createConverterToAvro used row.Schema() and so we may not 
need it. probably we should rename this to boolean positionMapInitialized.
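   
   For concreteness, the rename suggested here might look like this (just a sketch, not part of the PR):
   
   ```java
   // Track initialization explicitly instead of relying on structType being null.
   private boolean positionMapInitialized = false;

   void buildFieldPositionMapIfNeeded(StructType structType) {
     if (positionMapInitialized) {
       return;
     }
     // ... existing record key / partition path position parsing ...
     positionMapInitialized = true;
   }
   ```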





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-13 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r469922118



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -85,71 +84,40 @@ public final HoodieKey getKey(GenericRecord record) {
 }).collect(Collectors.toList());
   }
 
-  @Override
-  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {
-// parse simple feilds
-getRecordKeyFields().stream()
-.filter(f -> !(f.contains(".")))
-.forEach(f -> {
-  if (structType.getFieldIndex(f).isDefined()) {
-recordKeyPositions.put(f, Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
-  } else {
-throw new HoodieKeyException("recordKey value not found for field: 
\"" + f + "\"");
-  }
-});
-// parse nested fields
-getRecordKeyFields().stream()
-.filter(f -> f.contains("."))
-.forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
-// parse simple fields
-if (getPartitionPathFields() != null) {
-  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
!(f.contains(".")))
+  void buildFieldPositionMapIfNeeded(StructType structType) {
+if (this.structType == null) {
+  // parse simple fields
+  getRecordKeyFields().stream()
+  .filter(f -> !(f.contains(".")))
   .forEach(f -> {
 if (structType.getFieldIndex(f).isDefined()) {
-  partitionPathPositions.put(f,
-  Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
+  recordKeyPositions.put(f, Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
 } else {
-  partitionPathPositions.put(f, Collections.singletonList(-1));
+  throw new HoodieKeyException("recordKey value not found for 
field: \"" + f + "\"");
 }
   });
   // parse nested fields
-  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
f.contains("."))
-  .forEach(f -> partitionPathPositions.put(f,
-  RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
-}
-this.structName = structName;
-this.structType = structType;
-this.recordNamespace = recordNamespace;
-  }
-
-  /**
-   * Fetch record key from {@link Row}.
-   *
-   * @param row instance of {@link Row} from which record key is requested.
-   * @return the record key of interest from {@link Row}.
-   */
-  @Override
-  public String getRecordKey(Row row) {
-if (null == converterFn) {
-  converterFn = AvroConversionHelper.createConverterToAvro(structType, 
structName, recordNamespace);
+  getRecordKeyFields().stream()
+  .filter(f -> f.contains("."))
+  .forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+  // parse simple fields
+  if (getPartitionPathFields() != null) {
+getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f 
-> !(f.contains(".")))
+.forEach(f -> {
+  if (structType.getFieldIndex(f).isDefined()) {
+partitionPathPositions.put(f,
+Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
+  } else {
+partitionPathPositions.put(f, Collections.singletonList(-1));
+  }
+});
+// parse nested fields
+getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f 
-> f.contains("."))
+.forEach(f -> partitionPathPositions.put(f,
+RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
+  }
+  this.structType = structType;

Review comment:
   may I know where is the structType being used ? 
AvroConversionHelper.createConverterToAvro used row.Schema() and so we may not 
need it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on a change in pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-13 Thread GitBox


nsivabalan commented on a change in pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#discussion_r469922118



##
File path: 
hudi-spark/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java
##
@@ -85,71 +84,40 @@ public final HoodieKey getKey(GenericRecord record) {
 }).collect(Collectors.toList());
   }
 
-  @Override
-  public void initializeRowKeyGenerator(StructType structType, String 
structName, String recordNamespace) {
-// parse simple feilds
-getRecordKeyFields().stream()
-.filter(f -> !(f.contains(".")))
-.forEach(f -> {
-  if (structType.getFieldIndex(f).isDefined()) {
-recordKeyPositions.put(f, Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
-  } else {
-throw new HoodieKeyException("recordKey value not found for field: 
\"" + f + "\"");
-  }
-});
-// parse nested fields
-getRecordKeyFields().stream()
-.filter(f -> f.contains("."))
-.forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
-// parse simple fields
-if (getPartitionPathFields() != null) {
-  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
!(f.contains(".")))
+  void buildFieldPositionMapIfNeeded(StructType structType) {
+if (this.structType == null) {
+  // parse simple fields
+  getRecordKeyFields().stream()
+  .filter(f -> !(f.contains(".")))
   .forEach(f -> {
 if (structType.getFieldIndex(f).isDefined()) {
-  partitionPathPositions.put(f,
-  Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
+  recordKeyPositions.put(f, Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
 } else {
-  partitionPathPositions.put(f, Collections.singletonList(-1));
+  throw new HoodieKeyException("recordKey value not found for 
field: \"" + f + "\"");
 }
   });
   // parse nested fields
-  getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f -> 
f.contains("."))
-  .forEach(f -> partitionPathPositions.put(f,
-  RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
-}
-this.structName = structName;
-this.structType = structType;
-this.recordNamespace = recordNamespace;
-  }
-
-  /**
-   * Fetch record key from {@link Row}.
-   *
-   * @param row instance of {@link Row} from which record key is requested.
-   * @return the record key of interest from {@link Row}.
-   */
-  @Override
-  public String getRecordKey(Row row) {
-if (null == converterFn) {
-  converterFn = AvroConversionHelper.createConverterToAvro(structType, 
structName, recordNamespace);
+  getRecordKeyFields().stream()
+  .filter(f -> f.contains("."))
+  .forEach(f -> recordKeyPositions.put(f, 
RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, true)));
+  // parse simple fields
+  if (getPartitionPathFields() != null) {
+getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f 
-> !(f.contains(".")))
+.forEach(f -> {
+  if (structType.getFieldIndex(f).isDefined()) {
+partitionPathPositions.put(f,
+Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
+  } else {
+partitionPathPositions.put(f, Collections.singletonList(-1));
+  }
+});
+// parse nested fields
+getPartitionPathFields().stream().filter(f -> !f.isEmpty()).filter(f 
-> f.contains("."))
+.forEach(f -> partitionPathPositions.put(f,
+RowKeyGeneratorHelper.getNestedFieldIndices(structType, f, 
false)));
+  }
+  this.structType = structType;

Review comment:
   May I know where the structType is being used?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-909) Introduce hudi-client-flink module to support flink engine

2020-08-13 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-909:
-
Fix Version/s: (was: 0.6.0)
   0.6.1

> Introduce hudi-client-flink module to support flink engine
> --
>
> Key: HUDI-909
> URL: https://issues.apache.org/jira/browse/HUDI-909
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
> Fix For: 0.6.1
>
>
> Introduce hudi-client-flink module to support flink engine based on new 
> abstraction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1150) Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator

2020-08-13 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1150:
--
Status: Open  (was: New)

> Fix unable to parse input partition field :1 exception when using 
> TimestampBasedKeyGenerator 
> -
>
> Key: HUDI-1150
> URL: https://issues.apache.org/jira/browse/HUDI-1150
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Scenario to reproduce:
>  # use TimestampBasedKeyGenerator
>  # set hoodie.deltastreamer.keygen.timebased.timestamp.type = DATE_STRING
>  # partitionpath field value is null
> When the partitionpath field value is null, TimestampBasedKeyGenerator will set it
> to 1L, which cannot be parsed correctly.
>  
> {code:java}
> //
> User class threw exception: java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: Job aborted due to stage failure: 
> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 1.0 (TID 4, prod-t3-data-lake-007, executor 6): 
> org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to parse input 
> partition field :1
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.getPartitionPath(TimestampBasedKeyGenerator.java:156)
>  at 
> org.apache.hudi.keygen.CustomKeyGenerator.getPartitionPath(CustomKeyGenerator.java:108)
>  at 
> org.apache.hudi.keygen.CustomKeyGenerator.getKey(CustomKeyGenerator.java:78)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$9fce03f0$1(DeltaSync.java:343)
>  at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:394)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>  at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
>  at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:121)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.lang.RuntimeException: 
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit is not 
> specified but scalar it supplied as time value
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.convertLongTimeToMillis(TimestampBasedKeyGenerator.java:163)
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.getPartitionPath(TimestampBasedKeyGenerator.java:138)
>  ... 29 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1089) Refactor hudi-client to support multi-engine

2020-08-13 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1089:
--
Fix Version/s: (was: 0.6.0)
   0.6.1

> Refactor hudi-client to support multi-engine
> 
>
> Key: HUDI-1089
> URL: https://issues.apache.org/jira/browse/HUDI-1089
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Usability
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> To make hudi support more engines, we should abstract the current hudi-client
> module.
> This Jira aims to abstract the hudi-client module and implement the Spark engine
> code.
> The structure looks like this:
> hudi-client
>  ├── hudi-client-common
>  ├── hudi-client-spark
>  ├── hudi-client-flink
>  └── hudi-client-java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1150) Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator

2020-08-13 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1150:
--
Status: In Progress  (was: Open)

> Fix unable to parse input partition field :1 exception when using 
> TimestampBasedKeyGenerator 
> -
>
> Key: HUDI-1150
> URL: https://issues.apache.org/jira/browse/HUDI-1150
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Scenario to reproduce:
>  # use TimestampBasedKeyGenerator
>  # set hoodie.deltastreamer.keygen.timebased.timestamp.type = DATE_STRING
>  # partitionpath field value is null
> When the partitionpath field value is null, TimestampBasedKeyGenerator will set it
> to 1L, which cannot be parsed correctly.
>  
> {code:java}
> //
> User class threw exception: java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: Job aborted due to stage failure: 
> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 1.0 (TID 4, prod-t3-data-lake-007, executor 6): 
> org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to parse input 
> partition field :1
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.getPartitionPath(TimestampBasedKeyGenerator.java:156)
>  at 
> org.apache.hudi.keygen.CustomKeyGenerator.getPartitionPath(CustomKeyGenerator.java:108)
>  at 
> org.apache.hudi.keygen.CustomKeyGenerator.getKey(CustomKeyGenerator.java:78)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$9fce03f0$1(DeltaSync.java:343)
>  at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:394)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>  at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
>  at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:121)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.lang.RuntimeException: 
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit is not 
> specified but scalar it supplied as time value
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.convertLongTimeToMillis(TimestampBasedKeyGenerator.java:163)
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.getPartitionPath(TimestampBasedKeyGenerator.java:138)
>  ... 29 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1150) Fix unable to parse input partition field :1 exception when using TimestampBasedKeyGenerator

2020-08-13 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1150:
--
Fix Version/s: (was: 0.6.0)
   0.6.1

> Fix unable to parse input partition field :1 exception when using 
> TimestampBasedKeyGenerator 
> -
>
> Key: HUDI-1150
> URL: https://issues.apache.org/jira/browse/HUDI-1150
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Scenario to reproduce:
>  # use TimestampBasedKeyGenerator
>  # set hoodie.deltastreamer.keygen.timebased.timestamp.type = DATE_STRING
>  # partitionpath field value is null
> When the partitionpath field value is null, TimestampBasedKeyGenerator will set it
> to 1L, which cannot be parsed correctly.
>  
> {code:java}
> //
> User class threw exception: java.util.concurrent.ExecutionException: 
> org.apache.hudi.exception.HoodieException: Job aborted due to stage failure: 
> Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in 
> stage 1.0 (TID 4, prod-t3-data-lake-007, executor 6): 
> org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to parse input 
> partition field :1
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.getPartitionPath(TimestampBasedKeyGenerator.java:156)
>  at 
> org.apache.hudi.keygen.CustomKeyGenerator.getPartitionPath(CustomKeyGenerator.java:108)
>  at 
> org.apache.hudi.keygen.CustomKeyGenerator.getKey(CustomKeyGenerator.java:78)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$9fce03f0$1(DeltaSync.java:343)
>  at 
> org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:394)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:891)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1334)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1334)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1334)
>  at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
>  at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1364)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2121)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>  at org.apache.spark.scheduler.Task.run(Task.scala:121)
>  at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$11.apply(Executor.scala:407)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1408)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:413)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)
>  Caused by: java.lang.RuntimeException: 
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit is not 
> specified but scalar it supplied as time value
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.convertLongTimeToMillis(TimestampBasedKeyGenerator.java:163)
>  at 
> org.apache.hudi.keygen.TimestampBasedKeyGenerator.getPartitionPath(TimestampBasedKeyGenerator.java:138)
>  ... 29 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1186) Add description of write commit callback by kafka to document

2020-08-13 Thread wangxianghu (Jira)
wangxianghu created HUDI-1186:
-

 Summary: Add description of  write commit callback  by kafka to 
document
 Key: HUDI-1186
 URL: https://issues.apache.org/jira/browse/HUDI-1186
 Project: Apache Hudi
  Issue Type: Task
  Components: Docs
Reporter: wangxianghu
Assignee: wangxianghu
 Fix For: 0.6.1






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1122) Introduce a kafka implementation of hoodie write commit callback

2020-08-13 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1122:
--
Fix Version/s: (was: 0.6.0)
   0.6.1

> Introduce a kafka implementation of hoodie write commit callback 
> -
>
> Key: HUDI-1122
> URL: https://issues.apache.org/jira/browse/HUDI-1122
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Discussed 
> here:[https://lists.apache.org/thread.html/r2b29fa11ac06b9c93141afcde78ae84592a50123d92cf004c4a7e41b%40%3Cdev.hudi.apache.org%3E]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bhasudha commented on issue #1948: [SUPPORT] DMS example complains about dfs-source.properties

2020-08-13 Thread GitBox


bhasudha commented on issue #1948:
URL: https://github.com/apache/hudi/issues/1948#issuecomment-673347421


   You would need to set the `--props` config for DeltaStreamer with a valid 
property file - 
https://github.com/apache/hudi/blob/379cf0786fe9fea94ec8c0da7d467ae2fb30dd0b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java#L217/
 . Or pass in the props individually using `--hoodie-conf `. 
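   
   For example, a minimal property file passed via `--props` could look like the sketch below (paths and field names are purely illustrative):
   
   ```
   # illustrative dfs-source.properties
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.partitionpath.field=date
   # root folder the DFS source reads from
   hoodie.deltastreamer.source.dfs.root=s3://your-bucket/dms-output/your_table
   ```
   
   Each of these lines can also be passed individually on the command line as `--hoodie-conf key=value`.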



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on issue #1956: [SUPPORT] DMS for table without PK

2020-08-13 Thread GitBox


bhasudha commented on issue #1956:
URL: https://github.com/apache/hudi/issues/1956#issuecomment-673341610


   > so on single column table it was 
https://github.com/apache/hudi/blob/release-0.5.3/hudi-spark/src/main/java/org/apache/hudi/keygen/SimpleKeyGenerator.java#L58
   > 
   > can I use complexkey class even for single column table ?
   
   It should be possible. It would split record keys separated by comma (in 
case of multiple columns) into a list. 
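   
   For example, a sketch in Java (the table name, columns and base path below are illustrative, not from this issue):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   public class SingleColumnComplexKeyExample {
     // Write a dataset whose primary key is a single column, using ComplexKeyGenerator.
     static void write(Dataset<Row> df, String basePath) {
       df.write().format("org.apache.hudi")
           .option("hoodie.table.name", "my_table")                          // illustrative
           .option("hoodie.datasource.write.recordkey.field", "id")          // single key column
           .option("hoodie.datasource.write.partitionpath.field", "date")    // illustrative
           .option("hoodie.datasource.write.precombine.field", "updated_at") // illustrative
           .option("hoodie.datasource.write.keygenerator.class",
                   "org.apache.hudi.keygen.ComplexKeyGenerator")
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```
   
   As far as I can tell, ComplexKeyGenerator stores the key with the field name prefixed (e.g. `id:123` in `_hoodie_record_key`), which is worth keeping in mind if keys are compared across tables.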



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha edited a comment on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-13 Thread GitBox


bhasudha edited a comment on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-673336838


   @brandon-stanley the `hoodie.datasource.write.precombine.field` is a mandatory field. If not specified, a default field name `ts` is assumed; since your table does not have this field, you are seeing the above error. The payload class invocation is not the issue, because the stack trace you are pointing to happens well before the payload class is invoked. You might want to point `hoodie.datasource.write.precombine.field` to a valid column in the table and also pass in a payload class that ignores the precombine field. You can try it that way.
   
   But this aside, does your dataset not have duplicates?
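   
   If it helps, one way to act on the suggestion above for the COALESCE(update_date, create_date) case is to materialize that expression as a column and point the precombine field at it. A minimal sketch in Java (table name, key column and path are illustrative; the same options apply from PySpark):
   
   ```java
   import static org.apache.spark.sql.functions.coalesce;
   import static org.apache.spark.sql.functions.col;

   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   public class DerivedPrecombineExample {
     // Materialize COALESCE(update_date, create_date) as a column and use it as the precombine field.
     static void upsert(Dataset<Row> df, String basePath) {
       Dataset<Row> withPrecombine = df.withColumn(
           "precombine_ts", coalesce(col("update_date"), col("create_date")));

       withPrecombine.write().format("org.apache.hudi")
           .option("hoodie.table.name", "my_table")                           // illustrative
           .option("hoodie.datasource.write.recordkey.field", "id")           // illustrative
           .option("hoodie.datasource.write.precombine.field", "precombine_ts")
           .option("hoodie.datasource.write.operation", "upsert")
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```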



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on issue #1960: How do you change the 'hoodie.datasource.write.payload.class' configuration property?

2020-08-13 Thread GitBox


bhasudha commented on issue #1960:
URL: https://github.com/apache/hudi/issues/1960#issuecomment-673336838


   @brandon-stanley the `hoodie.datasource.write.precombine.field` is a mandatory field. If not specified, a default field name `ts` is assumed; since your table does not have this field, you are seeing the above error. The payload class invocation is not the issue, because the stack trace you are pointing to happens well before the payload class is invoked. You might want to point `hoodie.datasource.write.precombine.field` to a valid column in the table and also pass in a payload class that ignores the precombine field.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch asf-site updated: Travis CI build asf-site

2020-08-13 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new ac4e0c3  Travis CI build asf-site
ac4e0c3 is described below

commit ac4e0c3d492976d73dd8b23ac15fb8c791b71e24
Author: CI 
AuthorDate: Thu Aug 13 08:07:00 2020 +

Travis CI build asf-site
---
 content/docs/writing_data.html | 68 --
 1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/content/docs/writing_data.html b/content/docs/writing_data.html
index d18be96..8b3212e 100644
--- a/content/docs/writing_data.html
+++ b/content/docs/writing_data.html
@@ -370,6 +370,7 @@
   DeltaStreamer
   MultiTableDeltaStreamer
   Datasource Writer
+  Key Generation
   Syncing to Hive
   Deletes
   Optimized DFS Access
@@ -602,9 +603,7 @@ Available values:
 Available values:
 COW_TABLE_TYPE_OPT_VAL (default), MOR_TABLE_TYPE_OPT_VAL
 
-KEYGENERATOR_CLASS_OPT_KEY: Key generator class, that will 
extract the key out of incoming record. If single column key use SimpleKeyGenerator. For multiple column keys 
use ComplexKeyGenerator. Note: A custom 
key generator class can be written/provided here as well. Primary key columns 
should be provided via RECORDKEY_FIELD_OPT_KEY option.
-Available values:
-classOf[SimpleKeyGenerator].getName 
(default), classOf[NonpartitionedKeyGenerator].getName 
(Non-partitioned tables can currently only have a single key column, https://issues.apache.org/jira/browse/HUDI-1053";>HUDI-1053), classOf[ComplexKeyGenerator].getName
+KEYGENERATOR_CLASS_OPT_KEY: Refer to Key Generation section below.
 
 HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY: If using hive, 
specify if the table should or should not be partitioned.
 Available values:
@@ -624,6 +623,69 @@ Upsert a DataFrame, specifying the necessary field names 
for .save(basePath);
 
 
+Key Generation
+
+Hudi maintains hoodie keys (record key + partition path) for uniquely 
identifying a particular record. Key generator class will extract these out of 
incoming record. Both the tools above have configs to specify the 
+hoodie.datasource.write.keygenerator.class 
property. For DeltaStreamer this would come from the property file specified in 
--props and 
+DataSource writer takes this config directly using DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY().
+The default value for this config is SimpleKeyGenerator. Note: A custom key 
generator class can be written/provided here as well. Primary key columns 
should be provided via RECORDKEY_FIELD_OPT_KEY option.
+
+Hudi currently supports different combinations of record keys and partition 
paths as below -
+
+
+  Simple record key (consisting of only one field) and simple partition 
path (with optional hive style partitioning)
+  Simple record key and custom timestamp based partition path (with 
optional hive style partitioning)
+  Composite record keys (combination of multiple fields) and composite 
partition paths
+  Composite record keys and timestamp based partition paths (composite 
also supported)
+  Non partitioned table
+
+
+CustomKeyGenerator.java (part of 
hudi-spark module) class provides great support for generating hoodie keys of 
all the above listed types. All you need to do is supply values for the 
following properties properly to create your desired keys -
+
+hoodie.datasource.write.recordkey.field
+hoodie.datasource.write.partitionpath.field
+hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.
+
+For having composite record keys, you need to provide comma separated 
fields like
+hoodie.datasource.write.recordkey.field=field1,field2
+
+
+This will create your record key in the format field1:value1,field2:value2 and so on, 
otherwise you can specify only one field in case of simple record keys. CustomKeyGenerator class defines an enum PartitionKeyType for configuring partition 
paths. It can take two possible values - SIMPLE and TIMESTAMP. 
+The value for hoodie.datasource.write.partitionpath.field 
property in case of partitioned tables needs to be provided in the format field1:PartitionKeyType1,field2:PartitionKeyType2
 and so on. For example, if you want to create partition path using 2 fields 
country and date where the latter has timestamp based 
values and needs to be c [...]
+
+hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:
+This will create the partition path in the format / or country=/date= 
depending on whether you want hive style partitioning or not.
+
+TimestampBasedKeyGenerator class 
defines the following properties which can be used for doing the customizations 
for timestamp based partition paths
+
+hoodie.deltastreamer.keygen.timebased.timestamp.type
+  This defines the type of 
the value that your field 
contains. It can be in string  [...]
+hoodie.deltastreamer.keygen.timebased.timestamp.scalar

[hudi] branch asf-site updated: [HUDI-859]: Added section for key generation in writing data docs (#1816)

2020-08-13 Thread bhavanisudha
This is an automated email from the ASF dual-hosted git repository.

bhavanisudha pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 3df2674  [HUDI-859]: Added section for key generation in writing data 
docs (#1816)
3df2674 is described below

commit 3df26743ebf1f125c072804a6e50775d5c53f6e0
Author: Pratyaksh Sharma 
AuthorDate: Thu Aug 13 13:34:49 2020 +0530

[HUDI-859]: Added section for key generation in writing data docs (#1816)

* [HUDI-859]: Added section for key generation in Writing data page

Co-authored-by: Bhavani Sudha Saktheeswaran 
---
 docs/_docs/2_2_writing_data.md | 70 +++---
 1 file changed, 66 insertions(+), 4 deletions(-)

diff --git a/docs/_docs/2_2_writing_data.md b/docs/_docs/2_2_writing_data.md
index 43fc046..9b63d13 100644
--- a/docs/_docs/2_2_writing_data.md
+++ b/docs/_docs/2_2_writing_data.md
@@ -238,10 +238,7 @@ Available values:
 Available values:
 [`COW_TABLE_TYPE_OPT_VAL`](/docs/concepts.html#copy-on-write-table) (default), 
[`MOR_TABLE_TYPE_OPT_VAL`](/docs/concepts.html#merge-on-read-table)
 
-**KEYGENERATOR_CLASS_OPT_KEY**: Key generator class, that will extract the key 
out of incoming record. If single column key use `SimpleKeyGenerator`. For 
multiple column keys use `ComplexKeyGenerator`. Note: A custom key generator 
class can be written/provided here as well. Primary key columns should be 
provided via `RECORDKEY_FIELD_OPT_KEY` option.
-Available values:
-`classOf[SimpleKeyGenerator].getName` (default), 
`classOf[NonpartitionedKeyGenerator].getName` (Non-partitioned tables can 
currently only have a single key column, 
[HUDI-1053](https://issues.apache.org/jira/browse/HUDI-1053)), 
`classOf[ComplexKeyGenerator].getName`
-
+**KEYGENERATOR_CLASS_OPT_KEY**: Refer to [Key Generation](#key-generation) 
section below.
 
 **HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY**: If using hive, specify if the 
table should or should not be partitioned.
 Available values:
@@ -263,6 +260,71 @@ inputDF.write()
.save(basePath);
 ```
 
+## Key Generation
+
+Hudi maintains hoodie keys (record key + partition path) for uniquely 
identifying a particular record. Key generator class will extract these out of 
incoming record. Both the tools above have configs to specify the 
+`hoodie.datasource.write.keygenerator.class` property. For DeltaStreamer this 
would come from the property file specified in `--props` and 
+DataSource writer takes this config directly using 
`DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY()`.
+The default value for this config is `SimpleKeyGenerator`. Note: A custom key 
generator class can be written/provided here as well. Primary key columns 
should be provided via `RECORDKEY_FIELD_OPT_KEY` option.
+ 
+Hudi currently supports different combinations of record keys and partition 
paths as below - 
+
+ - Simple record key (consisting of only one field) and simple partition path 
(with optional hive style partitioning)
+ - Simple record key and custom timestamp based partition path (with optional 
hive style partitioning)
+ - Composite record keys (combination of multiple fields) and composite 
partition paths
+ - Composite record keys and timestamp based partition paths (composite also 
supported)
+ - Non partitioned table
+
+`CustomKeyGenerator.java` (part of hudi-spark module) class provides great 
support for generating hoodie keys of all the above listed types. All you need 
to do is supply values for the following properties properly to create your 
desired keys - 
+
+```java
+hoodie.datasource.write.recordkey.field
+hoodie.datasource.write.partitionpath.field
+hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
+```
+
+For having composite record keys, you need to provide comma separated fields 
like
+```java
+hoodie.datasource.write.recordkey.field=field1,field2
+```
+
+This will create your record key in the format `field1:value1,field2:value2` 
and so on, otherwise you can specify only one field in case of simple record 
keys. `CustomKeyGenerator` class defines an enum `PartitionKeyType` for 
configuring partition paths. It can take two possible values - SIMPLE and 
TIMESTAMP. 
+The value for `hoodie.datasource.write.partitionpath.field` property in case 
of partitioned tables needs to be provided in the format 
`field1:PartitionKeyType1,field2:PartitionKeyType2` and so on. For example, if 
you want to create partition path using 2 fields `country` and `date` where the 
latter has timestamp based values and needs to be customised in a given format, 
you can specify the following 
+
+```java
+hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:TIMESTAMP
+``` 
+This will create the partition path in the format `/` or 
`country=/date=` depending on whether you want hive style 
partitioning or not.
+
+`TimestampBasedKeyGenerator` class
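
Putting the new section's pieces together, a property set for the country/date example could look like the sketch below; the two dateformat keys are assumptions to be verified against the TimestampBasedKeyGenerator docs, while the rest appears in the diff above.

```
hoodie.datasource.write.recordkey.field=field1,field2
hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:TIMESTAMP
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
# assumed property names -- verify against the key generation docs
hoodie.deltastreamer.keygen.timebased.input.dateformat=yyyy-MM-dd
hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
```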

[GitHub] [hudi] bhasudha merged pull request #1816: [HUDI-859]: Added section for key generation in writing data docs

2020-08-13 Thread GitBox


bhasudha merged pull request #1816:
URL: https://github.com/apache/hudi/pull/1816


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on issue #1947: datadog monitor hudi

2020-08-13 Thread GitBox


xushiyan commented on issue #1947:
URL: https://github.com/apache/hudi/issues/1947#issuecomment-673322896


   Hi, I haven't tried this myself, but at a cursory look `option("hoodie.metrics.on", true)` may be a problem, as it passes a `boolean` for the value. Can you try `option("hoodie.metrics.on", "true")`? And have you tried these settings with other metrics reporters?
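   
   For what it's worth, a minimal sketch of the string-valued form (whether the DATADOG reporter type is available in the Hudi version in use is an assumption to verify, not something confirmed here):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;

   public class MetricsOptionsExample {
     // Sketch only: metrics configs passed as strings on the Hudi write path (other options elided).
     static void writeWithMetrics(Dataset<Row> df, String basePath) {
       df.write().format("org.apache.hudi")
           .option("hoodie.table.name", "my_table")           // illustrative
           .option("hoodie.metrics.on", "true")               // string "true" rather than boolean true
           .option("hoodie.metrics.reporter.type", "DATADOG") // assumed reporter type -- verify for your version
           .mode(SaveMode.Append)
           .save(basePath);
     }
   }
   ```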



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-13 Thread GitBox


vinothchandar merged pull request #1834:
URL: https://github.com/apache/hudi/pull/1834


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1834: [HUDI-1013] Adding Bulk Insert V2 implementation

2020-08-13 Thread GitBox


vinothchandar commented on pull request #1834:
URL: https://github.com/apache/hudi/pull/1834#issuecomment-673310825


   @nsivabalan this is ready. I am going ahead and merging. I also re-ran the benchmark; it seems to clock the same 30 mins against spark.write.parquet.
   
   Please carefully go over the changes I have made in the last commits here and see if anything needs follow-on fixing. Our timelines are tight; we need to do it tomorrow, if at all.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] saumyasuhagiya commented on issue #827: java.lang.ClassNotFoundException: com.uber.hoodie.hadoop.HoodieInputFormat

2020-08-13 Thread GitBox


saumyasuhagiya commented on issue #827:
URL: https://github.com/apache/hudi/issues/827#issuecomment-673310364


   Thanks @bvaradar. For now I have resolved it by adding the jar as an external dependency using --jars. I will open a new issue if required.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org