[GitHub] [incubator-hudi] taherk77 commented on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-10-31 Thread GitBox
taherk77 commented on issue #969: [HUDI-251] JDBC incremental load to HUDI 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-548677043
 
 
   > @taherk77 if you could resovle comments after addressing them, that would 
be very helpful for reviewing incrementally. small tip :)
   
   Apologies. Will keep and mind


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] 02/02: synchronized lock on conf object instead of class

2019-10-31 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git

commit ee0fd06de73e0191365549fa7c4f6c71c1bbc08d
Author: Wenning Ding 
AuthorDate: Wed Oct 30 11:48:21 2019 -0700

synchronized lock on conf object instead of class
---
 .../realtime/HoodieParquetRealtimeInputFormat.java  | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
index 3e42724..ba325e1 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
@@ -200,14 +200,17 @@ public class HoodieParquetRealtimeInputFormat extends 
HoodieParquetInputFormat i
   /**
* Hive will append read columns' ids to old columns' ids during 
getRecordReader. In some cases, e.g. SELECT COUNT(*),
* the read columns' id is an empty string and Hive will combine it with 
Hoodie required projection ids and becomes
-   * e.g. ",2,0,3" and will cause an error. This method is used to avoid this 
situation.
+   * e.g. ",2,0,3" and will cause an error. Actually this method is a 
temporary solution because the real bug is from
+   * Hive. Hive has fixed this bug after 3.0.0, but the version before that 
would still face this problem. (HIVE-22438)
*/
-  private static synchronized Configuration 
cleanProjectionColumnIds(Configuration conf) {
-String columnIds = 
conf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR);
-if (!columnIds.isEmpty() && columnIds.charAt(0) == ',') {
-  conf.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, 
columnIds.substring(1));
-  if (LOG.isDebugEnabled()) {
-LOG.debug("The projection Ids: {" + columnIds + "} start with ','. 
First comma is removed");
+  private static Configuration cleanProjectionColumnIds(Configuration conf) {
+synchronized (conf) {
+  String columnIds = 
conf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR);
+  if (!columnIds.isEmpty() && columnIds.charAt(0) == ',') {
+conf.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, 
columnIds.substring(1));
+if (LOG.isDebugEnabled()) {
+  LOG.debug("The projection Ids: {" + columnIds + "} start with ','. 
First comma is removed");
+}
   }
 }
 return conf;



[GitHub] [incubator-hudi] n3nash merged pull request #972: [HUDI-313] Fix select count star error when querying a realtime table

2019-10-31 Thread GitBox
n3nash merged pull request #972: [HUDI-313] Fix select count star error when 
querying a realtime table
URL: https://github.com/apache/incubator-hudi/pull/972
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #972: [HUDI-313] Fix select count star error when querying a realtime table

2019-10-31 Thread GitBox
n3nash commented on issue #972: [HUDI-313] Fix select count star error when 
querying a realtime table
URL: https://github.com/apache/incubator-hudi/pull/972#issuecomment-548666016
 
 
   @zhedoubushishi Thanks for addressing the comments. I'm planning to add some 
more changes on top of this PR and will add the JIRA in the comments when I 
open the PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] 01/02: [HUDI-313] Fix select count star error when querying a realtime table

2019-10-31 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git

commit 3251d62bd3c740b25139029a1913d1cf5a57173f
Author: Wenning Ding 
AuthorDate: Wed Oct 23 13:53:57 2019 -0700

[HUDI-313] Fix select count star error when querying a realtime table
---
 .../realtime/HoodieParquetRealtimeInputFormat.java  | 17 +
 1 file changed, 17 insertions(+)

diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
index d37ae2a..3e42724 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
@@ -197,10 +197,27 @@ public class HoodieParquetRealtimeInputFormat extends 
HoodieParquetInputFormat i
 return configuration;
   }
 
+  /**
+   * Hive will append read columns' ids to old columns' ids during 
getRecordReader. In some cases, e.g. SELECT COUNT(*),
+   * the read columns' id is an empty string and Hive will combine it with 
Hoodie required projection ids and becomes
+   * e.g. ",2,0,3" and will cause an error. This method is used to avoid this 
situation.
+   */
+  private static synchronized Configuration 
cleanProjectionColumnIds(Configuration conf) {
+String columnIds = 
conf.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR);
+if (!columnIds.isEmpty() && columnIds.charAt(0) == ',') {
+  conf.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, 
columnIds.substring(1));
+  if (LOG.isDebugEnabled()) {
+LOG.debug("The projection Ids: {" + columnIds + "} start with ','. 
First comma is removed");
+  }
+}
+return conf;
+  }
+
   @Override
   public RecordReader getRecordReader(final 
InputSplit split, final JobConf job,
   final Reporter reporter) throws IOException {
 
+this.conf = cleanProjectionColumnIds(job);
 LOG.info("Before adding Hoodie columns, Projections :" + 
job.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR)
 + ", Ids :" + job.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR));
 



[incubator-hudi] branch master updated (eda472a -> ee0fd06)

2019-10-31 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git.


from eda472a  [MINOR] Fix avro schema warnings in build
 new 3251d62  [HUDI-313] Fix select count star error when querying a 
realtime table
 new ee0fd06  synchronized lock on conf object instead of class

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../realtime/HoodieParquetRealtimeInputFormat.java   | 20 
 1 file changed, 20 insertions(+)



[GitHub] [incubator-hudi] n3nash merged pull request #988: [MINOR] Fix avro schema warnings in builds

2019-10-31 Thread GitBox
n3nash merged pull request #988: [MINOR] Fix avro schema warnings in builds
URL: https://github.com/apache/incubator-hudi/pull/988
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [MINOR] Fix avro schema warnings in build

2019-10-31 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new eda472a  [MINOR] Fix avro schema warnings in build
eda472a is described below

commit eda472adb09ce82063f58200eafcb7be361b4cb8
Author: Guru107 
AuthorDate: Wed Oct 30 11:48:29 2019 +0530

[MINOR] Fix avro schema warnings in build
---
 hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc  | 16 +---
 hudi-common/src/main/avro/HoodieCommitMetadata.avsc |  6 --
 hudi-common/src/main/avro/HoodieCompactionMetadata.avsc | 15 ++-
 hudi-common/src/main/avro/HoodieRestoreMetadata.avsc|  4 ++--
 4 files changed, 25 insertions(+), 16 deletions(-)

diff --git a/hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc 
b/hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc
index 626b478..7c57e04 100644
--- a/hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc
+++ b/hudi-common/src/main/avro/HoodieArchivedMetaEntry.avsc
@@ -26,7 +26,7 @@
 "null",
 "HoodieCommitMetadata"
  ],
- "default": "null"
+ "default": null
   },
   {
  "name":"hoodieCleanMetadata",
@@ -34,7 +34,7 @@
 "null",
 "HoodieCleanMetadata"
  ],
- "default": "null"
+ "default": null
   },
   {
  "name":"hoodieCompactionMetadata",
@@ -42,7 +42,7 @@
 "null",
 "HoodieCompactionMetadata"
  ],
- "default": "null"
+ "default": null
   },
   {
  "name":"hoodieRollbackMetadata",
@@ -50,7 +50,7 @@
 "null",
 "HoodieRollbackMetadata"
  ],
- "default": "null"
+ "default": null
   },
   {
  "name":"hoodieSavePointMetadata",
@@ -58,15 +58,17 @@
 "null",
 "HoodieSavepointMetadata"
  ],
- "default": "null"
+ "default": null
   },
   {
  "name":"commitTime",
- "type":["null","string"]
+ "type":["null","string"],
+ "default": null
   },
   {
  "name":"actionType",
- "type":["null","string"]
+ "type":["null","string"],
+ "default": null
   },
   {
  "name":"version",
diff --git a/hudi-common/src/main/avro/HoodieCommitMetadata.avsc 
b/hudi-common/src/main/avro/HoodieCommitMetadata.avsc
index 7796d99..bdd2aca 100644
--- a/hudi-common/src/main/avro/HoodieCommitMetadata.avsc
+++ b/hudi-common/src/main/avro/HoodieCommitMetadata.avsc
@@ -118,14 +118,16 @@
   ]
}
 }
- }]
+ }],
+ "default": null
   },
   {
  "name":"extraMetadata",
  "type":["null", {
 "type":"map",
 "values":"string"
- }]
+ }],
+ "default": null
   },
   {
  "name":"version",
diff --git a/hudi-common/src/main/avro/HoodieCompactionMetadata.avsc 
b/hudi-common/src/main/avro/HoodieCompactionMetadata.avsc
index b4da80e..3d2ac43 100644
--- a/hudi-common/src/main/avro/HoodieCompactionMetadata.avsc
+++ b/hudi-common/src/main/avro/HoodieCompactionMetadata.avsc
@@ -32,23 +32,28 @@
   "fields":[
  {
 "name":"partitionPath",
-"type":["null","string"]
+"type":["null","string"],
+"default": null
  },
  {
 "name":"totalLogRecords",
-"type":["null","long"]
+"type":["null","long"],
+"default": null
  },
  {
 "name":"totalLogFiles",
-"type":["null","long"]
+"type":["null","long"],
+"default": null
  },
  {
 "name":"totalUpdatedRecordsCompacted",
-"type":["null","long"]
+"type":["null","long"],
+"default": null
  },
  {
 "name":"hoodieWriteStat",
-"type":["null","HoodieWriteStat"]
+"type":["null","HoodieWriteStat"],
+"default": null
  }
   ]
}
diff --git a/hudi-common/src/main/avro/HoodieRestoreMetadata.avsc 
b/hudi-common/src/main/avro/HoodieRestoreMetadata.avsc
index 03defbb..28e111d 100644
--- a/hudi-common/src/main/avro/HoodieRestoreMetadata.avsc
+++ b/hudi-common/src/main/avro/HoodieRestoreMetadata.avsc
@@ -25,8 +25,8 @@
  {"name": "hoodieRestoreMetadata", "type": {

[GitHub] [incubator-hudi] n3nash commented on issue #988: [MINOR] Fix avro schema warnings in builds

2019-10-31 Thread GitBox
n3nash commented on issue #988: [MINOR] Fix avro schema warnings in builds
URL: https://github.com/apache/incubator-hudi/pull/988#issuecomment-548665278
 
 
   LGTM, default values!
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #623: Hudi Test Suite

2019-10-31 Thread GitBox
vinothchandar commented on issue #623: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/623#issuecomment-548663470
 
 
   Just want to ensure we have a plan going forward.. @yanghua how far along 
are you? @n3nash can you please work with vino to see how/if both the works can 
be merged?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #85

2019-10-31 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.17 KB...]
/home/jenkins/tools/maven/apache-maven-3.5.4/bin:
m2.conf
mvn
mvn.cmd
mvnDebug
mvnDebug.cmd
mvnyjp

/home/jenkins/tools/maven/apache-maven-3.5.4/boot:
plexus-classworlds-2.5.2.jar

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.5.1-SNAPSHOT'
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Hudi   [pom]
[INFO] hudi-common[jar]
[INFO] hudi-timeline-service  [jar]
[INFO] hudi-hadoop-mr [jar]
[INFO] hudi-client[jar]
[INFO] hudi-hive  [jar]
[INFO] hudi-spark [jar]
[INFO] hudi-utilities [jar]
[INFO] hudi-cli   [jar]
[INFO] hudi-hadoop-mr-bundle  [jar]
[INFO] hudi-hive-bundle   [jar]
[INFO] hudi-spark-bundle  [jar]
[INFO] hudi-presto-bundle [jar]
[INFO] hudi-utilities-bundle  [jar]
[INFO] hudi-timeline-server-bundle[ja

[GitHub] [incubator-hudi] yanghua commented on issue #623: Hudi Test Suite

2019-10-31 Thread GitBox
yanghua commented on issue #623: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/623#issuecomment-548649441
 
 
   Thanks for updating this PR. I have tried to pick this PR to my local then 
fix the conflicts and continue to work based on it. However, I found there are 
many conflicts so I gave up this thought.
   
   The things what I did in my local is:
   
   - created a new module named `hudi-end-to-end-test`;
   - copied your most file of this PR into my local module;
   - tried to fix the errors (there are still some errors to be fixed)
   - refactored dag package and provided an `ExecutionContext` POJO and unified 
`execute` method as an abstract method and refactored `DagScheduler`
   - understanding parsing yaml file business logic
   
   It's OK about working together. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-321) Support bulkinsert in HDFSParquetImporter

2019-10-31 Thread Raymond Xu (Jira)
Raymond Xu created HUDI-321:
---

 Summary: Support bulkinsert in HDFSParquetImporter
 Key: HUDI-321
 URL: https://issues.apache.org/jira/browse/HUDI-321
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Utilities
Reporter: Raymond Xu


Currently, HDFSParquetImporter only support upsert and insert mode. It is 
useful to have bulk insert mode.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-320) Keep docs on master instead of asf-site branch

2019-10-31 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964472#comment-16964472
 ] 

Ethan Guo commented on HUDI-320:


It happened to me that my colleague followed the latest docs to use the renamed 
package name for running incremental pulls in production, where the release 
we're using is pre-0.5.0.  Then I realized that there's a gap here. 

 

[~vinoth] [~vbalaji]  Do you think we should do this going forward?

> Keep docs on master instead of asf-site branch
> --
>
> Key: HUDI-320
> URL: https://issues.apache.org/jira/browse/HUDI-320
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Ethan Guo
>Priority: Minor
>
> Given that each version has new features and improvements, some involving 
> configuration and parameter changes, compared to previous versions, it would 
> be good to keep the docs for each version.  This can be achieved by having 
> the docs on master, so each release has its version of docs.  Developers can 
> always refer to the docs of a specific release if they would like to.
>  
> Currently we only have one version of docs kept at the asf-site branch for 
> the latest release.  This can create confusion for users using a previous 
> release of Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-320) Keep docs on master instead of asf-site branch

2019-10-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-320:
---
Description: 
Given that each version has new features and improvements, some involving 
configuration and parameter changes, compared to previous versions, it would be 
good to keep the docs for each version.  This can be achieved by having the 
docs on master, so each release has its version of docs.  Developers can always 
refer to the docs of a specific release if they would like to.

 

Currently we only have one version of docs kept at the asf-site branch for the 
latest release.  This can create confusion for users using a previous release 
of Hudi.

  was:
Given that each version has new features and improvements, some involving 
configuration and parameter changes, compared to previous versions, it would be 
good to keep the docs for each version.  This can be achieved by having the 
docs on master, so each release has its version of docs.

 

Currently we only have one version of docs kept at the asf-site branch for the 
latest release.  This can create confusion for users using a previous release 
of Hudi.


> Keep docs on master instead of asf-site branch
> --
>
> Key: HUDI-320
> URL: https://issues.apache.org/jira/browse/HUDI-320
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Ethan Guo
>Priority: Minor
>
> Given that each version has new features and improvements, some involving 
> configuration and parameter changes, compared to previous versions, it would 
> be good to keep the docs for each version.  This can be achieved by having 
> the docs on master, so each release has its version of docs.  Developers can 
> always refer to the docs of a specific release if they would like to.
>  
> Currently we only have one version of docs kept at the asf-site branch for 
> the latest release.  This can create confusion for users using a previous 
> release of Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-320) Keep docs on master instead of asf-site branch

2019-10-31 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-320:
--

 Summary: Keep docs on master instead of asf-site branch
 Key: HUDI-320
 URL: https://issues.apache.org/jira/browse/HUDI-320
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
  Components: Docs
Reporter: Ethan Guo


Given that each version has new features and improvements, some involving 
configuration and parameter changes, compared to previous versions, it would be 
good to keep the docs for each version.  This can be achieved by having the 
docs on master, so each release has its version of docs.

 

Currently we only have one version of docs kept at the asf-site branch for the 
latest release.  This can create confusion for users using a previous release 
of Hudi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-319) Create online javadocs based on the jar

2019-10-31 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo reassigned HUDI-319:
--

Assignee: Ethan Guo

> Create online javadocs based on the jar
> ---
>
> Key: HUDI-319
> URL: https://issues.apache.org/jira/browse/HUDI-319
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Docs
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Minor
>  Labels: Documentation
>
> It makes the development easier to have the online javadocs on the side and 
> understand the public APIs provided by Hudi when necessary, instead of always 
> going into the source code.
>  
> Example of Spark online javadocs: 
> [https://spark.apache.org/docs/latest/api/java/index.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-319) Create online javadocs based on the jar

2019-10-31 Thread Ethan Guo (Jira)
Ethan Guo created HUDI-319:
--

 Summary: Create online javadocs based on the jar
 Key: HUDI-319
 URL: https://issues.apache.org/jira/browse/HUDI-319
 Project: Apache Hudi (incubating)
  Issue Type: New Feature
  Components: Docs
Reporter: Ethan Guo


It makes the development easier to have the online javadocs on the side and 
understand the public APIs provided by Hudi when necessary, instead of always 
going into the source code.

 

Example of Spark online javadocs: 
[https://spark.apache.org/docs/latest/api/java/index.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-76) CSV Source support for Hudi Delta Streamer

2019-10-31 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964457#comment-16964457
 ] 

Ethan Guo edited comment on HUDI-76 at 10/31/19 11:50 PM:
--

I'll write a PoC in the next few days and create a WIP PR.


was (Author: guoyihua):
I'll write a PoC in the next few days and create WIP PR.

> CSV Source support for Hudi Delta Streamer
> --
>
> Key: HUDI-76
> URL: https://issues.apache.org/jira/browse/HUDI-76
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer, Incremental Pull
>Reporter: Balaji Varadarajan
>Assignee: Ethan Guo
>Priority: Minor
>
> DeltaStreamer does not have support to pull CSV data from sources (hdfs log 
> files/kafka). THis ticket is to provide support for csv sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-76) CSV Source support for Hudi Delta Streamer

2019-10-31 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964457#comment-16964457
 ] 

Ethan Guo commented on HUDI-76:
---

I'll write a PoC in the next few days and create WIP PR.

> CSV Source support for Hudi Delta Streamer
> --
>
> Key: HUDI-76
> URL: https://issues.apache.org/jira/browse/HUDI-76
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer, Incremental Pull
>Reporter: Balaji Varadarajan
>Assignee: Ethan Guo
>Priority: Minor
>
> DeltaStreamer does not have support to pull CSV data from sources (hdfs log 
> files/kafka). THis ticket is to provide support for csv sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-76) CSV Source support for Hudi Delta Streamer

2019-10-31 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-76?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964456#comment-16964456
 ] 

Ethan Guo commented on HUDI-76:
---

After some exploration, here are my initial thoughts on how to implement this 
feature.
 * Source type: add a new source type `CSV` in SourceType
 * Create `CSVSource`, `CSVDFSSource`, and `CSVKafkaSource` classes to fetch 
new data
 ** Internally, the class need to convert text of CSV format to Avro and Row 
format.  Given that the conversion from Row to Avro is expensive, the design 
choice is to implement the conversion from CSV to Avro (Avro to Row conversion 
has already been there in Hudi).
 ** For the conversion from CSV to Avro, I've looked at the following libraries
 *** avro-tools: supports Avro to CSV conversion, not the reverse
 *** spark: Spark can read CSV files to get DataFrames / set of rows.  It uses 
[Univocity 
parser|[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala]]
 to parse CSV text and construct InternalRow.  
[univocity-parsers|[https://www.univocity.com/pages/univocity_parsers_tutorial.html]]
 is a collection of extremely fast and reliable Java-based parsers for CSV, TSV 
and Fixed Width files.  We can reuse part of its logic to construct Avro 
records from CSV text.
 * In terms of the CSV parsing options, we can provide the same semantics as 
what Spark has to be consistent.  A set of new Hudi CSV options will be added.  
These CSV parsing options can be passed to the Univocity parser directly.
 * Bridge the gap in `SourceFormatAdapter` for the new CSV SourceType

> CSV Source support for Hudi Delta Streamer
> --
>
> Key: HUDI-76
> URL: https://issues.apache.org/jira/browse/HUDI-76
> Project: Apache Hudi (incubating)
>  Issue Type: Improvement
>  Components: deltastreamer, Incremental Pull
>Reporter: Balaji Varadarajan
>Assignee: Ethan Guo
>Priority: Minor
>
> DeltaStreamer does not have support to pull CSV data from sources (hdfs log 
> files/kafka). THis ticket is to provide support for csv sources.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-275) Translate Documentation -> Querying Data page

2019-10-31 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-275.
--
Resolution: Fixed

Fixed via asf-site: 4b3b197b8a6e983f20067ed3ef00694e19edf9f9

> Translate Documentation -> Querying Data page
> -
>
> Key: HUDI-275
> URL: https://issues.apache.org/jira/browse/HUDI-275
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: docs-chinese
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Translate this page into Chinese:
>  
> [http://hudi.apache.org/querying_data.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-221) Translate concept page

2019-10-31 Thread leesf (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

leesf closed HUDI-221.
--
Fix Version/s: 0.5.1
   Resolution: Fixed

Fixed via asf-site: 4fd3c7f737a2cf2a5d506896ea641e4d62d103ce

> Translate concept page
> --
>
> Key: HUDI-221
> URL: https://issues.apache.org/jira/browse/HUDI-221
> Project: Apache Hudi (incubating)
>  Issue Type: Sub-task
>  Components: docs-chinese
>Reporter: vinoyang
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The online HTML web page: [https://hudi.apache.org/concepts.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] yihua commented on a change in pull request #986: [HUDI-317] change quickstart page spark-shell command

2019-10-31 Thread GitBox
yihua commented on a change in pull request #986: [HUDI-317] change quickstart 
page spark-shell command 
URL: https://github.com/apache/incubator-hudi/pull/986#discussion_r341387141
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -12,29 +12,14 @@ code snippets that allows you to insert and update a Hudi 
dataset of default sto
 [Copy on Write](https://hudi.apache.org/concepts.html#copy-on-write-storage). 
 After each write operation we will also show how to read the data both 
snapshot and incrementally.
 
-## Build Hudi spark bundle jar
-
-Hudi requires Java 8 to be installed on a *nix system. Check out 
[code](https://github.com/apache/incubator-hudi) and 
-normally build the maven project, from command line:
-
-``` 
-# checkout and build
-git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
-mvn clean install -DskipTests -DskipITs
-
-# Export the location of hudi-spark-bundle for later 
-mkdir -p /tmp/hudi && cp 
packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar  
/tmp/hudi/hudi-spark-bundle.jar 
-export HUDI_SPARK_BUNDLE_PATH=/tmp/hudi/hudi-spark-bundle.jar
-```
-
 
 Review comment:
   Maybe we can put it in a separate page named sth like "Building Hudi", with 
all the commands to build Hudi and run tests?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhedoubushishi commented on a change in pull request #972: [HUDI-313] Fix select count star error when querying a realtime table

2019-10-31 Thread GitBox
zhedoubushishi commented on a change in pull request #972: [HUDI-313] Fix 
select count star error when querying a realtime table
URL: https://github.com/apache/incubator-hudi/pull/972#discussion_r341382778
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
 ##
 @@ -197,10 +197,27 @@ private static synchronized Configuration 
addRequiredProjectionFields(Configurat
 return configuration;
   }
 
+  /**
+   * Hive will append read columns' ids to old columns' ids during 
getRecordReader. In some cases, e.g. SELECT COUNT(*),
+   * the read columns' id is an empty string and Hive will combine it with 
Hoodie required projection ids and becomes
+   * e.g. ",2,0,3" and will cause an error. This method is used to avoid this 
situation.
+   */
+  private static synchronized Configuration 
cleanProjectionColumnIds(Configuration conf) {
 
 Review comment:
   > @zhedoubushishi : You can synchronize on the passed conf object instead of 
static synchronization which becomes a global lock at the JVM level.
   > 
   > You can do something like
   > synchronized(conf) {
   > 
   > }
   > inside your cleanProjectionColumnIds.
   
   That make sense. Code changes are done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] zhedoubushishi commented on a change in pull request #972: [HUDI-313] Fix select count star error when querying a realtime table

2019-10-31 Thread GitBox
zhedoubushishi commented on a change in pull request #972: [HUDI-313] Fix 
select count star error when querying a realtime table
URL: https://github.com/apache/incubator-hudi/pull/972#discussion_r341378841
 
 

 ##
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieParquetRealtimeInputFormat.java
 ##
 @@ -197,10 +197,27 @@ private static synchronized Configuration 
addRequiredProjectionFields(Configurat
 return configuration;
   }
 
+  /**
+   * Hive will append read columns' ids to old columns' ids during 
getRecordReader. In some cases, e.g. SELECT COUNT(*),
+   * the read columns' id is an empty string and Hive will combine it with 
Hoodie required projection ids and becomes
+   * e.g. ",2,0,3" and will cause an error. This method is used to avoid this 
situation.
+   */
 
 Review comment:
   Yeah, after the discussion and some investigations, Hive is the first place 
causes this bug and creates the projection column ids like ",2,0,3". What my 
code does actually is to handle this bug inside Hudi. 
   Hive has fixed this bug after 3.0.0, but before 3.0.0 we would still face 
this problem. The Jira for Hive is here: 
https://issues.apache.org/jira/browse/HIVE-22438.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] n3nash commented on issue #623: Hudi Test Suite

2019-10-31 Thread GitBox
n3nash commented on issue #623: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/623#issuecomment-548566462
 
 
   @vinothchandar I've updated the PR. I worked on it a while ago but never 
upstreamed the local changes. Ran into some issues making Hive queries work 
which is also fixed. I'll rebase this by tomorrow and hopefully address most of 
the comments by Monday.
   @yanghua Please take a look at this PR when you get a chance and we can 
follow it up with thoughts/ideas you may have.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #986: [HUDI-317] change quickstart page spark-shell command

2019-10-31 Thread GitBox
bhasudha commented on a change in pull request #986: [HUDI-317] change 
quickstart page spark-shell command 
URL: https://github.com/apache/incubator-hudi/pull/986#discussion_r341331911
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -12,29 +12,14 @@ code snippets that allows you to insert and update a Hudi 
dataset of default sto
 [Copy on Write](https://hudi.apache.org/concepts.html#copy-on-write-storage). 
 After each write operation we will also show how to read the data both 
snapshot and incrementally.
 
-## Build Hudi spark bundle jar
-
-Hudi requires Java 8 to be installed on a *nix system. Check out 
[code](https://github.com/apache/incubator-hudi) and 
-normally build the maven project, from command line:
-
-``` 
-# checkout and build
-git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
-mvn clean install -DskipTests -DskipITs
-
-# Export the location of hudi-spark-bundle for later 
-mkdir -p /tmp/hudi && cp 
packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar  
/tmp/hudi/hudi-spark-bundle.jar 
-export HUDI_SPARK_BUNDLE_PATH=/tmp/hudi/hudi-spark-bundle.jar
-```
-
 
 Review comment:
   > I assume that the user doesn't need to build the Hudi spark bundle jar in 
any use case (e.g., integrating Hudi into Spark in production) but instead, the 
user can just get the bundles in maven, so we can remove this part. Is that the 
case?
   
   yes thats right @yihua . That said, may be we can keep the instruction 
somewhere to build Hudi if they need to. Wondering if that should be in 
quickstart page or FAQ. Any thoughts?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #986: [HUDI-317] change quickstart page spark-shell command

2019-10-31 Thread GitBox
bhasudha commented on a change in pull request #986: [HUDI-317] change 
quickstart page spark-shell command 
URL: https://github.com/apache/incubator-hudi/pull/986#discussion_r341330728
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -12,29 +12,14 @@ code snippets that allows you to insert and update a Hudi 
dataset of default sto
 [Copy on Write](https://hudi.apache.org/concepts.html#copy-on-write-storage). 
 After each write operation we will also show how to read the data both 
snapshot and incrementally.
 
-## Build Hudi spark bundle jar
-
-Hudi requires Java 8 to be installed on a *nix system. Check out 
[code](https://github.com/apache/incubator-hudi) and 
-normally build the maven project, from command line:
-
-``` 
-# checkout and build
-git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
-mvn clean install -DskipTests -DskipITs
-
-# Export the location of hudi-spark-bundle for later 
-mkdir -p /tmp/hudi && cp 
packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar  
/tmp/hudi/hudi-spark-bundle.jar 
-export HUDI_SPARK_BUNDLE_PATH=/tmp/hudi/hudi-spark-bundle.jar
-```
-
 ## Setup spark-shell
 Hudi works with Spark-2.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for 
 setting up spark. 
 
 From the extracted directory run spark-shell with Hudi as:
 
 ```
-bin/spark-shell --jars $HUDI_SPARK_BUNDLE_PATH --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+bin/spark-shell bin/spark-shell --packages 
org.apache.hudi:hudi-spark-bundle:0.5.0-incubating --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 
 Review comment:
   nice catch. my bad. fixed it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] bhasudha commented on a change in pull request #986: [HUDI-317] change quickstart page spark-shell command

2019-10-31 Thread GitBox
bhasudha commented on a change in pull request #986: [HUDI-317] change 
quickstart page spark-shell command 
URL: https://github.com/apache/incubator-hudi/pull/986#discussion_r341329970
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -12,29 +12,14 @@ code snippets that allows you to insert and update a Hudi 
dataset of default sto
 [Copy on Write](https://hudi.apache.org/concepts.html#copy-on-write-storage). 
 After each write operation we will also show how to read the data both 
snapshot and incrementally.
 
-## Build Hudi spark bundle jar
-
-Hudi requires Java 8 to be installed on a *nix system. Check out 
[code](https://github.com/apache/incubator-hudi) and 
-normally build the maven project, from command line:
-
-``` 
-# checkout and build
-git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
-mvn clean install -DskipTests -DskipITs
-
-# Export the location of hudi-spark-bundle for later 
-mkdir -p /tmp/hudi && cp 
packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar  
/tmp/hudi/hudi-spark-bundle.jar 
-export HUDI_SPARK_BUNDLE_PATH=/tmp/hudi/hudi-spark-bundle.jar
-```
-
 
 Review comment:
   I agree. Should we move that into FAQ ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] xushiyan commented on a change in pull request #987: Support bulkinsert in HDFSParquetImporter

2019-10-31 Thread GitBox
xushiyan commented on a change in pull request #987: Support bulkinsert in 
HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/987#discussion_r341302256
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/UtilHelpers.java
 ##
 @@ -190,7 +190,9 @@ public static HoodieWriteClient 
createHoodieClient(JavaSparkContext jsc, String
 
.withCompactionStrategy(ReflectionUtils.loadClass(strategy)).build())
 
.orElse(HoodieCompactionConfig.newBuilder().withInlineCompaction(false).build());
 HoodieWriteConfig config =
-
HoodieWriteConfig.newBuilder().withPath(basePath).withParallelism(parallelism, 
parallelism)
+HoodieWriteConfig.newBuilder().withPath(basePath)
+.withParallelism(parallelism, parallelism)
+.withBulkInsertParallelism(parallelism)
 
 Review comment:
   @garyli1019 @vinothchandar I assume this won't have any side effect, will 
it? Though it is also used by `HoodieCompactor`, the `parallelism` setting 
should only take effect for one of the preset modes (upsert/insert/bulkinsert).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] xushiyan commented on issue #987: Support bulkinsert in HDFSParquetImporter

2019-10-31 Thread GitBox
xushiyan commented on issue #987: Support bulkinsert in HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/987#issuecomment-548507965
 
 
   @garyli1019 @vinothchandar Thank you for the reviews!
   
   Understood that the plan of phasing out the importer. Will try out the delta 
streamer and let you know the outcome.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #623: Hudi Test Suite

2019-10-31 Thread GitBox
vinothchandar commented on issue #623: Hudi Test Suite
URL: https://github.com/apache/incubator-hudi/pull/623#issuecomment-548503609
 
 
   @n3nash I thought you were nt working on this. @yanghua  is also working on 
something similar in HUDI-289 .. Can you share some context? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (HUDI-289) Implement a long running test for Hudi writing and querying end-end

2019-10-31 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964261#comment-16964261
 ] 

Nishith Agarwal edited comment on HUDI-289 at 10/31/19 6:04 PM:


[~yanghua] I'm working on resolving the comments on this PR : 
[https://github.com/apache/incubator-hudi/pull/623]. Let's sync after Monday 
next week, have had some changes lying locally for a while which I've pushed to 
the PR now.


was (Author: nishith29):
[~yanghua] I'm working on resolving the comments on this PR : 
[https://github.com/apache/incubator-hudi/pull/623]. Let's sync after Monday 
next week..

> Implement a long running test for Hudi writing and querying end-end
> ---
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.5.1
>
>
> We would need an equivalent of an end-end test which runs some workload for 
> few hours atleast, triggers various actions like commit, deltacopmmit, 
> rollback, compaction and ensures correctness of code before every release
> P.S: Learn from all the CSS issues managing compaction.. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-289) Implement a long running test for Hudi writing and querying end-end

2019-10-31 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964261#comment-16964261
 ] 

Nishith Agarwal commented on HUDI-289:
--

[~yanghua] I'm working on resolving the comments on this PR : 
[https://github.com/apache/incubator-hudi/pull/623]. Let's sync after Monday 
next week..

> Implement a long running test for Hudi writing and querying end-end
> ---
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi (incubating)
>  Issue Type: Test
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: vinoyang
>Priority: Major
> Fix For: 0.5.1
>
>
> We would need an equivalent of an end-end test which runs some workload for 
> few hours atleast, triggers various actions like commit, deltacopmmit, 
> rollback, compaction and ensures correctness of code before every release
> P.S: Learn from all the CSS issues managing compaction.. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] garyli1019 commented on issue #987: Support bulkinsert in HDFSParquetImporter

2019-10-31 Thread GitBox
garyli1019 commented on issue #987: Support bulkinsert in HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/987#issuecomment-548475339
 
 
   https://issues.apache.org/jira/browse/HUDI-318
   Sure, I can update the page once I tested it myself. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (HUDI-318) Update Migration Guide to Include Delta Streamer

2019-10-31 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-318:
---

 Summary: Update Migration Guide to Include Delta Streamer
 Key: HUDI-318
 URL: https://issues.apache.org/jira/browse/HUDI-318
 Project: Apache Hudi (incubating)
  Issue Type: Improvement
Reporter: Yanjia Gary Li
Assignee: Yanjia Gary Li


[http://hudi.apache.org/migration_guide.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-10-31 Thread GitBox
vinothchandar commented on a change in pull request #969: [HUDI-251] JDBC 
incremental load to HUDI DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#discussion_r341245781
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,235 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import java.util.Set;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LogManager.getLogger(JDBCSource.class);
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+DataFrameReader dataFrameReader = null;
+FSDataInputStream passwordFileStream = null;
+try {
+  dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (properties.containsKey(Config.PASSWORD) && !StringUtils
+  .isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+LOG.info("Reading JDBC password from properties file");
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else if (properties.containsKey(Config.PASSWORD_FILE) && !StringUtils
+  .isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info(
+String.format("Reading JDBC password from password file %s", 
properties.getString(Config.PASSWORD_FILE)));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, new 
String(bytes));
+  } else {
+throw new IllegalArgumentException(String.format("JDBCSource needs 
either a %s or %s to connect to RDBMS "
++ "datasource", Config.PASSWORD_FILE, Config.PASSWORD));
+  }
+
+  addExtraJdbcOptions(properties, dataFrameReader);
+
+  if (properties.containsKey(Config.IS_INCREMENTAL) && StringUtils
+  .isNullOrEmpty(properties.getString(Config.IS_INCREMENTAL))) {
+DataSourceUtils.checkRequiredProperties(properties, 
Arrays.asList(Config.INCREMENTAL_COLUMN));
+  }
+  return dataFrameReader;
+} catch (Exception e) {
+  throw new HoodieException(e);
+} finally {
+  IOUtils.closeStream(passwordFileStream);
+}
+  }
+
+  private static void addExtraJdbcOptions(TypedProperties properties, 
DataFrameReader dataFrameReader) {
+Set objects = properties.keySet();
+for (Object property : objects) {
+  String prop = (String) property;
+  if (prop.startsWith(Config.EXTRA_OPTIONS)) {
+String[] split = prop.split("\\.");
+String key = split[split.length - 1];
+String value = properties.getString(prop);
+LOG.info(String.format("Adding %s -> %s to jdbc options", key, value));
+dataFrameReader.option(key, value);
+  }
+}
+  }
+
+  @Override
+  protected Pair>, String> fetchNextBatch(Option 
lastCkptStr, long sourceLimit) {
 
 Review comment:
   yes usually the jdbc url is like `jdbc:mysql:`. and `jdbc:postgre

[GitHub] [incubator-hudi] vinothchandar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-10-31 Thread GitBox
vinothchandar commented on issue #969: [HUDI-251] JDBC incremental load to HUDI 
DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#issuecomment-548461664
 
 
   @taherk77 if you could resovle comments after addressing them, that would be 
very helpful for reviewing incrementally. small tip :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-10-31 Thread GitBox
vinothchandar commented on a change in pull request #969: [HUDI-251] JDBC 
incremental load to HUDI DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#discussion_r341244277
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,239 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LogManager.getLogger(JDBCSource.class);
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+DataFrameReader dataFrameReader = null;
+FSDataInputStream passwordFileStream = null;
+try {
+  dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (properties.containsKey(Config.PASSWORD) && !StringUtils
+  .isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+LOG.info("Reading JDBC password from properties file");
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else if (properties.containsKey(Config.PASSWORD_FILE) && !StringUtils
+  .isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info(
+String.format("Reading JDBC password from password file %s", 
properties.getString(Config.PASSWORD_FILE)));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, new 
String(bytes));
+  } else {
+throw new IllegalArgumentException(String.format("JDBCSource needs 
either a %s or %s to connect to RDBMS "
++ "datasource", Config.PASSWORD_FILE, Config.PASSWORD));
+  }
+
+  addExtraJdbcOptions(properties, dataFrameReader);
+
+  if (properties.getBoolean(Config.IS_INCREMENTAL)) {
+DataSourceUtils.checkRequiredProperties(properties, 
Arrays.asList(Config.INCREMENTAL_COLUMN));
+  }
+  return dataFrameReader;
+} catch (Exception e) {
+  throw new HoodieException(e);
+} finally {
+  IOUtils.closeStream(passwordFileStream);
+}
+  }
+
+  private static void addExtraJdbcOptions(TypedProperties properties, 
DataFrameReader dataFrameReader) {
+Set<Object> objects = properties.keySet();
+for (Object property : objects) {
+  String prop = (String) property;
+  if (prop.startsWith(Config.EXTRA_OPTIONS)) {
+String key = Arrays.asList(prop.split(Config.EXTRA_OPTIONS)).stream()
+.collect(Collectors.joining());
+String value = properties.getString(prop);
+if (!StringUtils.isNullOrEmpty(value)) {
+  LOG.info(String.format("Adding %s -> %s to jdbc options", key, 
value));
+  dataFrameReader.option(key, value);
+} else {
+  LOG.warn(String.format("Skipping %s jdbc option as value is null or 
empty", key));
+}
+  }
+}
+  }
+
+  @Override
+  protect

[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #969: [HUDI-251] JDBC incremental load to HUDI DeltaStreamer

2019-10-31 Thread GitBox
vinothchandar commented on a change in pull request #969: [HUDI-251] JDBC 
incremental load to HUDI DeltaStreamer
URL: https://github.com/apache/incubator-hudi/pull/969#discussion_r341243826
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JDBCSource.java
 ##
 @@ -0,0 +1,239 @@
+package org.apache.hudi.utilities.sources;
+
+import java.util.Arrays;
+import java.util.Set;
+import java.util.stream.Collectors;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.IOUtils;
+import org.apache.hudi.DataSourceUtils;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.StringUtils;
+import org.apache.hudi.common.util.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.utilities.schema.SchemaProvider;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.DataFrameReader;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+import org.apache.spark.sql.functions;
+import org.apache.spark.sql.types.DataTypes;
+import org.jetbrains.annotations.NotNull;
+
+
+public class JDBCSource extends RowSource {
+
+  private static Logger LOG = LogManager.getLogger(JDBCSource.class);
+
+  public JDBCSource(TypedProperties props, JavaSparkContext sparkContext, 
SparkSession sparkSession,
+  SchemaProvider schemaProvider) {
+super(props, sparkContext, sparkSession, schemaProvider);
+  }
+
+  private static DataFrameReader validatePropsAndGetDataFrameReader(final 
SparkSession session,
+  final TypedProperties properties)
+  throws HoodieException {
+DataFrameReader dataFrameReader = null;
+FSDataInputStream passwordFileStream = null;
+try {
+  dataFrameReader = session.read().format("jdbc");
+  dataFrameReader = dataFrameReader.option(Config.URL_PROP, 
properties.getString(Config.URL));
+  dataFrameReader = dataFrameReader.option(Config.USER_PROP, 
properties.getString(Config.USER));
+  dataFrameReader = dataFrameReader.option(Config.DRIVER_PROP, 
properties.getString(Config.DRIVER_CLASS));
+  dataFrameReader = dataFrameReader
+  .option(Config.RDBMS_TABLE_PROP, 
properties.getString(Config.RDBMS_TABLE_NAME));
+
+  if (properties.containsKey(Config.PASSWORD) && !StringUtils
+  .isNullOrEmpty(properties.getString(Config.PASSWORD))) {
+LOG.info("Reading JDBC password from properties file");
+dataFrameReader = dataFrameReader.option(Config.PASSWORD_PROP, 
properties.getString(Config.PASSWORD));
+  } else if (properties.containsKey(Config.PASSWORD_FILE) && !StringUtils
+  .isNullOrEmpty(properties.getString(Config.PASSWORD_FILE))) {
+LOG.info(
+String.format("Reading JDBC password from password file %s", 
properties.getString(Config.PASSWORD_FILE)));
+FileSystem fileSystem = FileSystem.get(new Configuration());
+passwordFileStream = fileSystem.open(new 
Path(properties.getString(Config.PASSWORD_FILE)));
+byte[] bytes = new byte[passwordFileStream.available()];
+passwordFileStream.read(bytes);
 
 Review comment:
   but you can pass the `passwordFileStream` to `FileIoUtils.readAsByteArray`, correct? It's just another InputStream?
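   
   For illustration, a minimal plain-JDK sketch (not Hudi code) of draining the stream fully into a byte array instead of sizing the buffer with `available()`, which only reports how many bytes can be read without blocking:
   
```
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hedged sketch: read an InputStream (e.g. the stream opened on the password
// file) to the end and return its contents as a byte array.
final class StreamBytes {
  static byte[] readAllBytes(InputStream in) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
      out.write(buf, 0, n);
    }
    return out.toByteArray();
  }
}
```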


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #961: [HUDI-306] Support Glue catalog and other hive metastore implementations

2019-10-31 Thread GitBox
vinothchandar commented on issue #961: [HUDI-306] Support Glue catalog and 
other hive metastore implementations
URL: https://github.com/apache/incubator-hudi/pull/961#issuecomment-548444063
 
 
   Latest on that is summarized here: https://issues.apache.org/jira/browse/HUDI-312. 
We are actively debugging it and will try again today. If you can take a crack at it, sure, by all means :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (HUDI-312) Investigate recent flaky CI runs

2019-10-31 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16964154#comment-16964154
 ] 

Vinoth Chandar commented on HUDI-312:
-

> Embedded timeline had been enabled for some time now. So, it is still not 
> clear if embedded timeline server is causing it.

Given it's a feature we recommend to users, I would suggest not disabling this 
right away. 

@uditme This is where we are now. 

Next steps could be: 
 * Dump out the logs as the command makes progress (currently logs are only printed after the command succeeds) and see where the hang is; a thread-dump sketch follows after this list.
 * Try force-killing the JVM after that point and see if it at least exits. (I tried adding System.exit(0) to the last line of DeltaStreamer::main and it did not do the trick.)
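
A minimal sketch of the kind of thread dump meant above (a suggestion, not code already in the repo); calling it right before the step that appears to hang would show in the CI log what the JVM is blocked on:

```
import java.util.Map;

// Hedged sketch: print every live thread's name, state and stack trace to
// stderr, so the build log captures what the test JVM is waiting on.
final class HangDump {
  static void dumpAllThreads() {
    for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
      Thread t = e.getKey();
      System.err.println("Thread: " + t.getName() + " state=" + t.getState());
      for (StackTraceElement frame : e.getValue()) {
        System.err.println("    at " + frame);
      }
    }
  }
}
```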

> Investigate recent flaky CI runs
> 
>
> Key: HUDI-312
> URL: https://issues.apache.org/jira/browse/HUDI-312
> Project: Apache Hudi (incubating)
>  Issue Type: New Feature
>  Components: Testing
>Reporter: Vinoth Chandar
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
> Attachments: Builds - apache_incubator-hudi - Travis CI.pdf
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> master used to be solid green. noticing that nowadays PRs and even some 
> master merges fail with 
> - No output received for 10m
> - Exceeded runtime of 50m 
> - VM exit crash 
> We saw this earlier in the year as well. It was due to the apache org queue 
> in travis being busy/stressed. I think we should shadow azure CI or circle CI 
> parallely and weed out code vs environment issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [incubator-hudi] vinothchandar commented on issue #989: [HUDI-312] Disable embedded timeline server to mitigate integration test hanging issue

2019-10-31 Thread GitBox
vinothchandar commented on issue #989: [HUDI-312] Disable embedded timeline 
server to mitigate integration test hanging issue
URL: https://github.com/apache/incubator-hudi/pull/989#issuecomment-548437171
 
 
   Why would this suddenly change? Did you see where exactly the hang was using 
the log statements? I am also confused as to why it's always in MOR ingest. 
   
   I would like to spend a bit more time before we make this fix, to understand 
why this is the right fix.  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch asf-site updated: [HUDI-221] Translate concept page (#977)

2019-10-31 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 4fd3c7f  [HUDI-221] Translate concept page (#977)
4fd3c7f is described below

commit 4fd3c7f737a2cf2a5d506896ea641e4d62d103ce
Author: leesf <490081...@qq.com>
AuthorDate: Thu Oct 31 23:01:20 2019 +0800

[HUDI-221] Translate concept page (#977)
---
 docs/concepts.cn.md | 203 
 1 file changed, 93 insertions(+), 110 deletions(-)

diff --git a/docs/concepts.cn.md b/docs/concepts.cn.md
index 5c38ea3..98a1692 100644
--- a/docs/concepts.cn.md
+++ b/docs/concepts.cn.md
@@ -4,168 +4,151 @@ keywords: hudi, design, storage, views, timeline
 sidebar: mydoc_sidebar
 permalink: concepts.html
 toc: false
-summary: "Here we introduce some basic concepts & give a broad technical 
overview of Hudi"
+summary: "这里我们将介绍Hudi的一些基本概念并提供关于Hudi的技术概述"
 ---
 
-Apache Hudi (pronounced “Hudi”) provides the following streaming primitives 
over datasets on DFS
+Apache Hudi(发音为“Hudi”)在DFS的数据集上提供以下流原语
 
- * Upsert (how do I change the dataset?)
- * Incremental pull   (how do I fetch data that changed?)
+ * 插入更新   (如何改变数据集?)
+ * 增量拉取   (如何获取变更的数据?)
 
-In this section, we will discuss key concepts & terminologies that are 
important to understand, to be able to effectively use these primitives.
+在本节中,我们将讨论重要的概念和术语,这些概念和术语有助于理解并有效使用这些原语。
 
-## Timeline
-At its core, Hudi maintains a `timeline` of all actions performed on the 
dataset at different `instants` of time that helps provide instantaneous views 
of the dataset,
-while also efficiently supporting retrieval of data in the order of arrival. A 
Hudi instant consists of the following components 
+## 时间轴
+在它的核心,Hudi维护一条包含在不同的`即时`时间所有对数据集操作的`时间轴`,从而提供,从不同时间点出发得到不同的视图下的数据集。Hudi即时包含以下组件
 
- * `Action type` : Type of action performed on the dataset
- * `Instant time` : Instant time is typically a timestamp (e.g: 
20190117010349), which monotonically increases in the order of action's begin 
time.
- * `state` : current state of the instant
- 
-Hudi guarantees that the actions performed on the timeline are atomic & 
timeline consistent based on the instant time.
+ * `操作类型` : 对数据集执行的操作类型
+ * `即时时间` : 即时时间通常是一个时间戳(例如:20190117010349),该时间戳按操作开始时间的顺序单调增加。
+ * `状态` : 即时的状态
 
-Key actions performed include
+Hudi保证在时间轴上执行的操作的原子性和基于即时时间的时间轴一致性。
 
- * `COMMITS` - A commit denotes an **atomic write** of a batch of records into 
a dataset.
- * `CLEANS` - Background activity that gets rid of older versions of files in 
the dataset, that are no longer needed.
- * `DELTA_COMMIT` - A delta commit refers to an **atomic write** of a batch of 
records into a  MergeOnRead storage type of dataset, where some/all of the data 
could be just written to delta logs.
- * `COMPACTION` - Background activity to reconcile differential data 
structures within Hudi e.g: moving updates from row based log files to columnar 
formats. Internally, compaction manifests as a special commit on the timeline
- * `ROLLBACK` - Indicates that a commit/delta commit was unsuccessful & rolled 
back, removing any partial files produced during such a write
- * `SAVEPOINT` - Marks certain file groups as "saved", such that cleaner will 
not delete them. It helps restore the dataset to a point on the timeline, in 
case of disaster/data recovery scenarios.
+执行的关键操作包括
 
-Any given instant can be 
-in one of the following states
+ * `COMMITS` - 一次提交表示将一组记录**原子写入**到数据集中。
+ * `CLEANS` - 删除数据集中不再需要的旧文件版本的后台活动。
+ * `DELTA_COMMIT` - 
增量提交是指将一批记录**原子写入**到MergeOnRead存储类型的数据集中,其中一些/所有数据都可以只写到增量日志中。
+ * `COMPACTION` - 协调Hudi中差异数据结构的后台活动,例如:将更新从基于行的日志文件变成列格式。在内部,压缩表现为时间轴上的特殊提交。
+ * `ROLLBACK` - 表示提交/增量提交不成功且已回滚,删除在写入过程中产生的所有部分文件。
+ * `SAVEPOINT` - 
将某些文件组标记为"已保存",以便清理程序不会将其删除。在发生灾难/数据恢复的情况下,它有助于将数据集还原到时间轴上的某个点。
 
- * `REQUESTED` - Denotes an action has been scheduled, but has not initiated
- * `INFLIGHT` - Denotes that the action is currently being performed
- * `COMPLETED` - Denotes completion of an action on the timeline
+任何给定的即时都可以处于以下状态之一
+
+ * `REQUESTED` - 表示已调度但尚未启动的操作。
+ * `INFLIGHT` - 表示当前正在执行该操作。
+ * `COMPLETED` - 表示在时间轴上完成了该操作。
 
 
 
 
 
-Example above shows upserts happenings between 10:00 and 10:20 on a Hudi 
dataset, roughly every 5 mins, leaving commit metadata on the Hudi timeline, 
along
-with other background cleaning/compactions. One key observation to make is 
that the commit time indicates the `arrival time` of the data (10:20AM), while 
the actual data
-organization reflects the actual time or `event time`, the data was intended 
for (hourly buckets from 07:00). These are two key concepts when reasoning 
about tradeoffs between latency and completeness of data.
+上面的示例显示了在Hudi数据集上大约10:00到10:20之间发生的更新事件,大约每5分钟一次,将提交元数据以及其他后台清理/压缩保留在Hudi时间轴上。
+观察的关键点是:提交时间指示数据的`到达时间`(上午10:20),而实际数据组

[GitHub] [incubator-hudi] vinothchandar merged pull request #977: [HUDI-221] Translate concept page

2019-10-31 Thread GitBox
vinothchandar merged pull request #977: [HUDI-221] Translate concept page
URL: https://github.com/apache/incubator-hudi/pull/977
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar merged pull request #985: [HUDI-275] Translate the Querying Data page into Chinese documentation

2019-10-31 Thread GitBox
vinothchandar merged pull request #985: [HUDI-275] Translate the Querying Data 
page into Chinese documentation
URL: https://github.com/apache/incubator-hudi/pull/985
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch asf-site updated: [HUDI-275] Translate the Querying Data page into Chinese documentation (#985)

2019-10-31 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 4b3b197  [HUDI-275] Translate the Querying Data page into Chinese 
documentation (#985)
4b3b197 is described below

commit 4b3b197b8a6e983f20067ed3ef00694e19edf9f9
Author: Y Ethan Guo 
AuthorDate: Thu Oct 31 08:00:37 2019 -0700

[HUDI-275] Translate the Querying Data page into Chinese documentation 
(#985)
---
 docs/querying_data.cn.md | 174 +++
 1 file changed, 87 insertions(+), 87 deletions(-)

diff --git a/docs/querying_data.cn.md b/docs/querying_data.cn.md
index 1653b08..c690385 100644
--- a/docs/querying_data.cn.md
+++ b/docs/querying_data.cn.md
@@ -1,102 +1,102 @@
 ---
-title: Querying Hudi Datasets
+title: 查询 Hudi 数据集
 keywords: hudi, hive, spark, sql, presto
 sidebar: mydoc_sidebar
 permalink: querying_data.html
 toc: false
-summary: In this page, we go over how to enable SQL queries on Hudi built 
tables.
+summary: 在这一页里,我们介绍了如何在Hudi构建的表上启用SQL查询。
 ---
 
-Conceptually, Hudi stores data physically once on DFS, while providing 3 
logical views on top, as explained [before](concepts.html#views). 
-Once the dataset is synced to the Hive metastore, it provides external Hive 
tables backed by Hudi's custom inputformats. Once the proper hudi
-bundle has been provided, the dataset can be queried by popular query engines 
like Hive, Spark and Presto.
+从概念上讲,Hudi物理存储一次数据到DFS上,同时在其上提供三个逻辑视图,如[之前](concepts.html#views)所述。
+数据集同步到Hive Metastore后,它将提供由Hudi的自定义输入格式支持的Hive外部表。一旦提供了适当的Hudi捆绑包,
+就可以通过Hive、Spark和Presto之类的常用查询引擎来查询数据集。
 
-Specifically, there are two Hive tables named off [table 
name](configurations.html#TABLE_NAME_OPT_KEY) passed during write. 
-For e.g, if `table name = hudi_tbl`, then we get  
+具体来说,在写入过程中传递了两个由[table name](configurations.html#TABLE_NAME_OPT_KEY)命名的Hive表。
+例如,如果`table name = hudi_tbl`,我们得到
 
- - `hudi_tbl` realizes the read optimized view of the dataset backed by 
`HoodieParquetInputFormat`, exposing purely columnar data.
- - `hudi_tbl_rt` realizes the real time view of the dataset  backed by 
`HoodieParquetRealtimeInputFormat`, exposing merged view of base and log data.
+ - `hudi_tbl` 实现了由 `HoodieParquetInputFormat` 支持的数据集的读优化视图,从而提供了纯列式数据。
+ - `hudi_tbl_rt` 实现了由 `HoodieParquetRealtimeInputFormat` 
支持的数据集的实时视图,从而提供了基础数据和日志数据的合并视图。
 
-As discussed in the concepts section, the one key primitive needed for 
[incrementally 
processing](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop),
-is `incremental pulls` (to obtain a change stream/log from a dataset). Hudi 
datasets can be pulled incrementally, which means you can get ALL and ONLY the 
updated & new rows 
-since a specified instant time. This, together with upserts, are particularly 
useful for building data pipelines where 1 or more source Hudi tables are 
incrementally pulled (streams/facts),
-joined with other tables (datasets/dimensions), to [write out 
deltas](writing_data.html) to a target Hudi dataset. Incremental view is 
realized by querying one of the tables above, 
-with special configurations that indicates to query planning that only 
incremental data needs to be fetched out of the dataset. 
+如概念部分所述,[增量处理](https://www.oreilly.com/ideas/ubers-case-for-incremental-processing-on-hadoop)所需要的
+一个关键原语是`增量拉取`(以从数据集中获取更改流/日志)。您可以增量提取Hudi数据集,这意味着自指定的即时时间起,
+您可以只获得全部更新和新行。 这与插入更新一起使用,对于构建某些数据管道尤其有用,包括将1个或多个源Hudi表(数据流/事实)以增量方式拉出(流/事实)
+并与其他表(数据集/维度)结合以[写出增量](write_data.html)到目标Hudi数据集。增量视图是通过查询上表之一实现的,并具有特殊配置,
+该特殊配置指示查询计划仅需要从数据集中获取增量数据。
 
-In sections, below we will discuss in detail how to access all the 3 views on 
each query engine.
+接下来,我们将详细讨论在每个查询引擎上如何访问所有三个视图。
 
 ## Hive
 
-In order for Hive to recognize Hudi datasets and query correctly, the 
HiveServer2 needs to be provided with the 
`hudi-hadoop-mr-bundle-x.y.z-SNAPSHOT.jar` 
-in its [aux jars 
path](https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cm_mc_hive_udf.html#concept_nc3_mms_lr).
 This will ensure the input format 
-classes with its dependencies are available for query planning & execution. 
-
-### Read Optimized table {#hive-ro-view}
-In addition to setup above, for beeline cli access, the `hive.input.format` 
variable needs to be set to the  fully qualified path name of the 
-inputformat `org.apache.hudi.hadoop.HoodieParquetInputFormat`. For Tez, 
additionally the `hive.tez.input.format` needs to be set 
-to `org.apache.hadoop.hive.ql.io.HiveInputFormat`
-
-### Real time table {#hive-rt-view}
-In addition to installing the hive bundle jar on the HiveServer2, it needs to 
be put on the hadoop/hive installation across the cluster, so that
-queries can pick up the custom RecordReader as well.
-
-### Incremental Pulling {#hive-incr-pull}
-
-`HiveIncrementalPuller` allows incrementally extracting changes from larg

[GitHub] [incubator-hudi] vinothchandar commented on issue #990: [MINOR] Fix annotation error in TestHiveSyncTool

2019-10-31 Thread GitBox
vinothchandar commented on issue #990: [MINOR] Fix annotation error in 
TestHiveSyncTool
URL: https://github.com/apache/incubator-hudi/pull/990#issuecomment-548417321
 
 
   Good catch @leesf  


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[incubator-hudi] branch master updated: [MINOR] fix annotation in teardown (#990)

2019-10-31 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 7c7403a  [MINOR] fix annotation in teardown (#990)
7c7403a is described below

commit 7c7403a59dedc012fe2945e230a50f90139fce4b
Author: leesf <490081...@qq.com>
AuthorDate: Thu Oct 31 22:59:35 2019 +0800

[MINOR] fix annotation in teardown (#990)
---
 hudi-hive/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hudi-hive/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java 
b/hudi-hive/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
index f7cdb93..b253114 100644
--- a/hudi-hive/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
+++ b/hudi-hive/src/test/java/org/apache/hudi/hive/TestHiveSyncTool.java
@@ -39,6 +39,7 @@ import org.apache.parquet.schema.OriginalType;
 import org.apache.parquet.schema.PrimitiveType;
 import org.apache.parquet.schema.Types;
 import org.joda.time.DateTime;
+import org.junit.After;
 import org.junit.Before;
 import org.junit.Test;
 import org.junit.runner.RunWith;
@@ -65,7 +66,7 @@ public class TestHiveSyncTool {
 TestUtil.setUp();
   }
 
-  @Before
+  @After
   public void teardown() throws IOException, InterruptedException {
 TestUtil.clear();
   }



[GitHub] [incubator-hudi] vinothchandar merged pull request #990: [MINOR] Fix annotation error in TestHiveSyncTool

2019-10-31 Thread GitBox
vinothchandar merged pull request #990: [MINOR] Fix annotation error in 
TestHiveSyncTool
URL: https://github.com/apache/incubator-hudi/pull/990
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on issue #987: Support bulkinsert in HDFSParquetImporter

2019-10-31 Thread GitBox
vinothchandar commented on issue #987: Support bulkinsert in HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/987#issuecomment-548416042
 
 
   As @garyli1019 pointed out, you can still use the DeltaStreamer itself to do 
the one-time import. 
   
   @garyli1019 maybe we can add a command/example either to the wiki as a blog 
page, or to http://hudi.apache.org/migration_guide.html ? :) 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #987: Support bulkinsert in HDFSParquetImporter

2019-10-31 Thread GitBox
vinothchandar commented on a change in pull request #987: Support bulkinsert in 
HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/987#discussion_r341183789
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HDFSParquetImporter.java
 ##
 @@ -64,20 +65,17 @@
 
   private static volatile Logger log = 
LogManager.getLogger(HDFSParquetImporter.class);
 
-  public static final SimpleDateFormat PARTITION_FORMATTER = new 
SimpleDateFormat("yyyy/MM/dd");
-  private static volatile Logger logger = 
LogManager.getLogger(HDFSParquetImporter.class);
+  private static final DateTimeFormatter PARTITION_FORMATTER = 
DateTimeFormatter.ofPattern("yyyy/MM/dd")
+  .withZone(ZoneOffset.UTC);
 
 Review comment:
   Can we leave this to the default zone on the box? Typically, servers run in 
UTC anyway.
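   
   For illustration, a minimal sketch of the two options being weighed (not code from the PR):
   
```
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hedged sketch: pin the partition formatter to UTC vs. defer to whatever
// zone the host JVM runs in.
class PartitionFormatterZones {
  static final DateTimeFormatter UTC_FORMATTER =
      DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);
  static final DateTimeFormatter DEFAULT_ZONE_FORMATTER =
      DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneId.systemDefault());
}
```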


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #987: Support bulkinsert in HDFSParquetImporter

2019-10-31 Thread GitBox
vinothchandar commented on a change in pull request #987: Support bulkinsert in 
HDFSParquetImporter
URL: https://github.com/apache/incubator-hudi/pull/987#discussion_r341184146
 
 

 ##
 File path: 
hudi-utilities/src/main/java/org/apache/hudi/utilities/HDFSParquetImporter.java
 ##
 @@ -64,20 +65,17 @@
 
   private static volatile Logger log = 
LogManager.getLogger(HDFSParquetImporter.class);
 
-  public static final SimpleDateFormat PARTITION_FORMATTER = new 
SimpleDateFormat("yyyy/MM/dd");
-  private static volatile Logger logger = 
LogManager.getLogger(HDFSParquetImporter.class);
+  private static final DateTimeFormatter PARTITION_FORMATTER = 
DateTimeFormatter.ofPattern("yyyy/MM/dd")
+  .withZone(ZoneOffset.UTC);
   private final Config cfg;
   private transient FileSystem fs;
   /**
* Bag of properties with source, hoodie client, key generator etc.
*/
   private TypedProperties props;
 
-  public HDFSParquetImporter(Config cfg) throws IOException {
+  public HDFSParquetImporter(Config cfg) {
 this.cfg = cfg;
-this.props = cfg.propsFilePath == null ? 
UtilHelpers.buildProperties(cfg.configs)
-: UtilHelpers.readConfig(fs, new Path(cfg.propsFilePath), 
cfg.configs).getConfig();
 
 Review comment:
   yikes, @ovj fyi 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #986: [HUDI-317] change quickstart page spark-shell command

2019-10-31 Thread GitBox
vinothchandar commented on a change in pull request #986: [HUDI-317] change 
quickstart page spark-shell command 
URL: https://github.com/apache/incubator-hudi/pull/986#discussion_r341181665
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -12,29 +12,14 @@ code snippets that allows you to insert and update a Hudi 
dataset of default sto
 [Copy on Write](https://hudi.apache.org/concepts.html#copy-on-write-storage). 
 After each write operation we will also show how to read the data both 
snapshot and incrementally.
 
-## Build Hudi spark bundle jar
-
-Hudi requires Java 8 to be installed on a *nix system. Check out 
[code](https://github.com/apache/incubator-hudi) and 
-normally build the maven project, from command line:
-
-``` 
-# checkout and build
-git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
-mvn clean install -DskipTests -DskipITs
-
-# Export the location of hudi-spark-bundle for later 
-mkdir -p /tmp/hudi && cp 
packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar  
/tmp/hudi/hudi-spark-bundle.jar 
-export HUDI_SPARK_BUNDLE_PATH=/tmp/hudi/hudi-spark-bundle.jar
-```
-
 
 Review comment:
   I think we should keep instructions to build Hudi, maybe in a different 
place, but simple to find. Oftentimes, folks are trying out code from 
master. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #986: [HUDI-317] change quickstart page spark-shell command

2019-10-31 Thread GitBox
vinothchandar commented on a change in pull request #986: [HUDI-317] change 
quickstart page spark-shell command 
URL: https://github.com/apache/incubator-hudi/pull/986#discussion_r341182000
 
 

 ##
 File path: docs/quickstart.md
 ##
 @@ -12,29 +12,14 @@ code snippets that allows you to insert and update a Hudi 
dataset of default sto
 [Copy on Write](https://hudi.apache.org/concepts.html#copy-on-write-storage). 
 After each write operation we will also show how to read the data both 
snapshot and incrementally.
 
-## Build Hudi spark bundle jar
-
-Hudi requires Java 8 to be installed on a *nix system. Check out 
[code](https://github.com/apache/incubator-hudi) and 
-normally build the maven project, from command line:
-
-``` 
-# checkout and build
-git clone https://github.com/apache/incubator-hudi.git && cd incubator-hudi
-mvn clean install -DskipTests -DskipITs
-
-# Export the location of hudi-spark-bundle for later 
-mkdir -p /tmp/hudi && cp 
packaging/hudi-spark-bundle/target/hudi-spark-bundle-*.*.*-SNAPSHOT.jar  
/tmp/hudi/hudi-spark-bundle.jar 
-export HUDI_SPARK_BUNDLE_PATH=/tmp/hudi/hudi-spark-bundle.jar
-```
-
 ## Setup spark-shell
 Hudi works with Spark-2.x versions. You can follow instructions 
[here](https://spark.apache.org/downloads.html) for 
 setting up spark. 
 
 From the extracted directory run spark-shell with Hudi as:
 
 ```
-bin/spark-shell --jars $HUDI_SPARK_BUNDLE_PATH --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer'
+bin/spark-shell bin/spark-shell --packages 
org.apache.hudi:hudi-spark-bundle:0.5.0-incubating --conf 
'spark.serializer=org.apache.spark.serializer.KryoSerializer'
 
 Review comment:
   +1. Please make sure you can copy-paste this into a shell as-is and things 
work end-to-end. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] leesf opened a new pull request #990: [MINOR] Fix annotation error in TestHiveSyncTool

2019-10-31 Thread GitBox
leesf opened a new pull request #990: [MINOR] Fix annotation error in 
TestHiveSyncTool
URL: https://github.com/apache/incubator-hudi/pull/990
 
 
   Fix annotation error in TestHiveSyncTool
   Cc @vinothchandar 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services