[jira] [Assigned] (HUDI-1016) [Minor] Code optimization

2020-06-08 Thread Hong Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong Shen reassigned HUDI-1016:
---

Assignee: Hong Shen

> [Minor] Code optimization
> -
>
> Key: HUDI-1016
> URL: https://issues.apache.org/jira/browse/HUDI-1016
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hong Shen
>Assignee: Hong Shen
>Priority: Minor
> Attachments: image-2020-06-09-13-04-15-008.png
>
>
> Some code can be optimized.
>  !image-2020-06-09-13-04-15-008.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1016) [Minor] Code optimization

2020-06-08 Thread Hong Shen (Jira)
Hong Shen created HUDI-1016:
---

 Summary: [Minor] Code optimization
 Key: HUDI-1016
 URL: https://issues.apache.org/jira/browse/HUDI-1016
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Hong Shen
 Attachments: image-2020-06-09-13-04-15-008.png

Some code can be optimized.
 !image-2020-06-09-13-04-15-008.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1006) deltastreamer use kafkaSource with offset reset strategy: latest can't consume data

2020-06-08 Thread Tianye Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianye Li updated HUDI-1006:

Summary: deltastreamer use kafkaSource with offset reset strategy: latest 
can't consume data  (was: deltastreamer use kafkaSource set 
auto.offset.reset=latest can't consume data)

> deltastreamer use kafkaSource with offset reset strategy: latest can't 
> consume data
> ---
>
> Key: HUDI-1006
> URL: https://issues.apache.org/jira/browse/HUDI-1006
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: Tianye Li
>Priority: Major
> Fix For: 0.6.0
>
>
> org.apache.hudi.utilities.sources.JsonKafkaSource#fetchNewData
> if (totalNewMsgs <= 0) {
>   return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : "");
> }
> I think it should not be empty here; it should be:
> if (totalNewMsgs <= 0) {
>   return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : CheckpointUtils.offsetsToStr(offsetRanges));
> }
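
For context, a small self-contained illustration of the consequence being
described (stand-in types and names; this is not Hudi's actual InputBatch /
CheckpointUtils API): returning an empty checkpoint string discards the
offsets that were just resolved under the latest reset strategy.

{code}
import java.util.Optional;

// Self-contained sketch; offsetsToStr stands in for
// CheckpointUtils.offsetsToStr(offsetRanges).
public class CheckpointFallbackSketch {

  static String offsetsToStr(long[] offsets) {
    StringBuilder sb = new StringBuilder("topic");
    for (int p = 0; p < offsets.length; p++) {
      sb.append(',').append(p).append(':').append(offsets[p]);
    }
    return sb.toString();
  }

  // Current behavior: no new messages and no prior checkpoint => "" is
  // persisted, so the next run resolves offsets from scratch again.
  static String currentCheckpoint(Optional<String> last, long[] resolved, long totalNewMsgs) {
    return totalNewMsgs <= 0 ? last.orElse("") : offsetsToStr(resolved);
  }

  // Proposed behavior: fall back to the offsets just resolved from Kafka.
  static String proposedCheckpoint(Optional<String> last, long[] resolved, long totalNewMsgs) {
    return totalNewMsgs <= 0 ? last.orElse(offsetsToStr(resolved)) : offsetsToStr(resolved);
  }

  public static void main(String[] args) {
    long[] resolved = {42L, 17L}; // offsets resolved with auto.offset.reset=latest
    System.out.println(currentCheckpoint(Optional.empty(), resolved, 0));  // "" -- checkpoint lost
    System.out.println(proposedCheckpoint(Optional.empty(), resolved, 0)); // topic,0:42,1:17
  }
}
{code}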



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1006) deltastreamer use kafkaSource set auto.offset.reset=latest can't consume data

2020-06-08 Thread Tianye Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tianye Li updated HUDI-1006:

Summary: deltastreamer use kafkaSource set auto.offset.reset=latest can't 
consume data  (was: deltastreamer set auto.offset.reset=latest can't consume 
data)

> deltastreamer use kafkaSource set auto.offset.reset=latest can't consume data
> -
>
> Key: HUDI-1006
> URL: https://issues.apache.org/jira/browse/HUDI-1006
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: Tianye Li
>Priority: Major
> Fix For: 0.6.0
>
>
> org.apache.hudi.utilities.sources.JsonKafkaSource#fetchNewData
> if (totalNewMsgs <= 0) {
>   return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : "");
> }
> I think it should not be empty here; it should be:
> if (totalNewMsgs <= 0) {
>   return new InputBatch<>(Option.empty(), lastCheckpointStr.isPresent() ? lastCheckpointStr.get() : CheckpointUtils.offsetsToStr(offsetRanges));
> }



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Build failed in Jenkins: hudi-snapshot-deployment-0.5 #303

2020-06-08 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.40 KB...]
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-timeline-service:jar:0.6.0-SNAPSHOT
[WARNING] 'build.plugins.plugin.(groupId:artifactId)' must be unique but found 
duplicate declaration of plugin org.jacoco:jacoco-maven-plugin @ 
org.apache.hudi:hudi-timeline-service:[unknown-version], 

 line 58, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 

[hudi] branch master updated: HUDI-494 fix incorrect record size estimation

2020-06-08 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 22cd824  HUDI-494 fix incorrect record size estimation
22cd824 is described below

commit 22cd824d993bf43d88121ea89bad3a1f23a28518
Author: garyli1019 
AuthorDate: Thu May 14 20:20:44 2020 -0700

HUDI-494 fix incorrect record size estimation
---
 .../apache/hudi/config/HoodieCompactionConfig.java |  13 +++
 .../org/apache/hudi/config/HoodieWriteConfig.java  |   4 +
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  |  31 --
 .../table/action/commit/UpsertPartitioner.java |   9 +-
 .../TestHoodieClientOnCopyOnWriteStorage.java  |   8 +-
 .../apache/hudi/table/TestHoodieRecordSizing.java  | 116 -
 .../table/action/commit/TestUpsertPartitioner.java |  91 
 .../hudi/testutils/HoodieTestDataGenerator.java|   8 +-
 8 files changed, 125 insertions(+), 155 deletions(-)

diff --git 
a/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java 
b/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
index 5e295ac..f89fc06 100644
--- 
a/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
+++ 
b/hudi-client/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
@@ -54,6 +54,12 @@ public class HoodieCompactionConfig extends 
DefaultHoodieConfig {
   public static final String PARQUET_SMALL_FILE_LIMIT_BYTES = 
"hoodie.parquet.small.file.limit";
   // By default, treat any file <= 100MB as a small file.
   public static final String DEFAULT_PARQUET_SMALL_FILE_LIMIT_BYTES = 
String.valueOf(104857600);
+  // Hudi will use the previous commit to calculate the estimated record size 
by totalBytesWritten/totalRecordsWritten.
+  // If the previous commit is too small to make an accurate estimation, Hudi 
will search commits in the reverse order,
+  // until find a commit has totalBytesWritten larger than 
(PARQUET_SMALL_FILE_LIMIT_BYTES * RECORD_SIZE_ESTIMATION_THRESHOLD)
+  public static final String RECORD_SIZE_ESTIMATION_THRESHOLD_PROP = 
"hoodie.record.size.estimation.threshold";
+  public static final String DEFAULT_RECORD_SIZE_ESTIMATION_THRESHOLD = "1.0";
+
   /**
* Configs related to specific table types.
*/
@@ -173,6 +179,11 @@ public class HoodieCompactionConfig extends 
DefaultHoodieConfig {
   return this;
 }
 
+public Builder compactionRecordSizeEstimateThreshold(double threshold) {
+  props.setProperty(RECORD_SIZE_ESTIMATION_THRESHOLD_PROP, 
String.valueOf(threshold));
+  return this;
+}
+
 public Builder insertSplitSize(int insertSplitSize) {
   props.setProperty(COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE, 
String.valueOf(insertSplitSize));
   return this;
@@ -254,6 +265,8 @@ public class HoodieCompactionConfig extends 
DefaultHoodieConfig {
   DEFAULT_MIN_COMMITS_TO_KEEP);
   setDefaultOnCondition(props, 
!props.containsKey(PARQUET_SMALL_FILE_LIMIT_BYTES), 
PARQUET_SMALL_FILE_LIMIT_BYTES,
   DEFAULT_PARQUET_SMALL_FILE_LIMIT_BYTES);
+  setDefaultOnCondition(props, 
!props.containsKey(RECORD_SIZE_ESTIMATION_THRESHOLD_PROP), 
RECORD_SIZE_ESTIMATION_THRESHOLD_PROP,
+  DEFAULT_RECORD_SIZE_ESTIMATION_THRESHOLD);
   setDefaultOnCondition(props, 
!props.containsKey(COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE),
   COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE, 
DEFAULT_COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE);
   setDefaultOnCondition(props, 
!props.containsKey(COPY_ON_WRITE_TABLE_AUTO_SPLIT_INSERTS),
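
As a hedged illustration of the search described in the config comment added
above (invented names, not the actual UpsertPartitioner code): walk commits
newest-first and use the first one large enough for bytes/records to be a
trustworthy average.

{code}
import java.util.List;

class CommitStats {
  final long totalBytesWritten;
  final long totalRecordsWritten;
  CommitStats(long totalBytesWritten, long totalRecordsWritten) {
    this.totalBytesWritten = totalBytesWritten;
    this.totalRecordsWritten = totalRecordsWritten;
  }
}

class RecordSizeEstimationSketch {
  // Use the first commit whose totalBytesWritten exceeds
  // smallFileLimitBytes * threshold; smaller commits give noisy averages.
  static long estimateRecordSize(List<CommitStats> commitsNewestFirst,
                                 long smallFileLimitBytes, double threshold,
                                 long fallbackEstimate) {
    long minBytes = (long) (smallFileLimitBytes * threshold);
    for (CommitStats c : commitsNewestFirst) {
      if (c.totalBytesWritten > minBytes && c.totalRecordsWritten > 0) {
        return c.totalBytesWritten / c.totalRecordsWritten;
      }
    }
    return fallbackEstimate; // no commit was large enough to trust
  }
}
{code}
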
diff --git 
a/hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java 
b/hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
index d6527fa..d899257 100644
--- a/hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
+++ b/hudi-client/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
@@ -272,6 +272,10 @@ public class HoodieWriteConfig extends DefaultHoodieConfig 
{
 return 
Integer.parseInt(props.getProperty(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES));
   }
 
+  public double getRecordSizeEstimationThreshold() {
+return 
Double.parseDouble(props.getProperty(HoodieCompactionConfig.RECORD_SIZE_ESTIMATION_THRESHOLD_PROP));
+  }
+
   public int getCopyOnWriteInsertSplitSize() {
 return 
Integer.parseInt(props.getProperty(HoodieCompactionConfig.COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE));
   }
diff --git 
a/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java 
b/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
index ed29180..974d847 100644
--- 
a/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
+++ 
b/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
@@ -29,14 +29,12 @@ import 

[jira] [Comment Edited] (HUDI-781) Re-design test utilities

2020-06-08 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128778#comment-17128778
 ] 

Raymond Xu edited comment on HUDI-781 at 6/9/20, 2:41 AM:
--

[~yanghua] [~vinoth] [~nishith29] [~garyli1019]

Here is an execution plan for the subtasks:
 * To begin with, I'm trying to finish subtask #1 as it can be a quick win. As 
shown in [https://github.com/apache/hudi/pull/1619#issuecomment-627610722], we 
can reduce CI time by 10+ min by simply splitting the test tasks.
 * In parallel we can start #3. The proposed `hudi-testutils` module is to 
encompass all `testutils` from each module, which makes the test dependencies 
clearer. It will clean up some misplaced tests found during the package 
restructure. 
 ** org.apache.hudi.execution.TestBoundedInMemoryQueue in `hudi-client` should 
be put in `hudi-common` (misplaced due to client test harness dependency)
 ** org.apache.hudi.utilities.inline.fs.TestParquetInLining in `hudi-utilities` 
should be put in `hudi-common` (misplaced due to data generator dependency)
 * Once a minimum setup of `hudi-testutils` is done, we can start #4
 ** Implement a shared spark session provider there (a rough sketch follows 
below)
 ** Use the shared spark session provider for test suites, which group 
functional tests with similar setup/teardown logic (we may need to figure out 
the JUnit 5 equivalent of JUnit 4 test suites with Rule/ClassRule)
 ** By using the new provider class on functional tests one by one, we should 
start observing reduced test times for the hudi-client module and others
 * #2 and #5 can be done in parallel

Each subtask has its own detailed points in its ticket. Please review this 
rough plan and give feedback accordingly. Thanks!
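
A minimal sketch of the shared spark session provider mentioned in the plan
above, assuming a plain JUnit 5 per-class lifecycle (the real hudi-testutils
version would likely need a JUnit 5 extension to share the session across
test classes):

{code}
import org.apache.spark.sql.SparkSession;
import org.junit.jupiter.api.AfterAll;
import org.junit.jupiter.api.BeforeAll;

// Functional tests with similar setup/teardown extend this instead of
// creating their own SparkSession per test.
public abstract class SharedSparkSessionProvider {

  protected static SparkSession spark;

  @BeforeAll
  static void startSparkSession() {
    spark = SparkSession.builder()
        .master("local[2]")
        .appName("hudi-functional-tests")
        .getOrCreate(); // reuses an already-running local session if present
  }

  @AfterAll
  static void stopSparkSession() {
    if (spark != null) {
      spark.stop();
      spark = null;
    }
  }
}
{code}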


was (Author: rxu):
[~yanghua] [~vinoth] [~nishith29] [~garyli1019]

Here is an execution plan for the subtasks:
 * To begin with, I'm trying to finish subtask #1 as it can be a quick win. As 
shown in [https://github.com/apache/hudi/pull/1619#issuecomment-627610722], we 
can reduce CI time by 10+ min by simply splitting the test tasks.
 * In parallel we can start #3. The proposed `hudi-testutils` module is to 
encompass all `testutils` from each module, which makes the test dependencies 
clearer. It will clean up some misplaced tests found during the package 
restructure. 
 ** org.apache.hudi.execution.TestBoundedInMemoryQueue in `hudi-client` should 
be put in `hudi-common` (due to client test harness dependency)
 ** org.apache.hudi.utilities.inline.fs.TestParquetInLining in `hudi-utilities` 
should be put in `hudi-common` (due to data generator dependency)
 * Once a minimum setup of `hudi-testutils` is done, we can start #4
 ** Implement a shared spark session provider there
 ** Use the shared spark session provider for test suites, which group 
functional tests with similar setup/teardown logic (we may need to figure out 
the JUnit 5 equivalent of JUnit 4 test suites with Rule/ClassRule)
 ** By using the new provider class on functional tests one by one, we should 
start observing reduced test times for the hudi-client module and others
 * #2 and #5 can be done in parallel

Each subtask has its own detailed points in its ticket. Please review this 
rough plan and give feedback accordingly. Thanks!

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> Test utility classes are to be re-designed with considerations like:
>  * Use more mocking
>  * Reduce spark context setup
>  * Improve/clean up the data generator
> An RFC would be preferred for illustrating the design work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (HUDI-781) Re-design test utilities

2020-06-08 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128778#comment-17128778
 ] 

Raymond Xu edited comment on HUDI-781 at 6/9/20, 2:36 AM:
--

[~yanghua] [~vinoth] [~nishith29] [~garyli1019]

Here is an execution plan for the subtasks:
 * To begin with, I'm trying to finish subtask #1 as it can be a quick win. As 
shown in [https://github.com/apache/hudi/pull/1619#issuecomment-627610722], we 
can reduce CI time by 10+ min by simply splitting the test tasks.
 * In parallel we can start #3. The proposed `hudi-testutils` module is to 
encompass all `testutils` from each module, which makes the test dependencies 
clearer. It will clean up some misplaced tests found during the package 
restructure. 
 ** org.apache.hudi.execution.TestBoundedInMemoryQueue in `hudi-client` should 
be put in `hudi-common` (due to client test harness dependency)
 ** org.apache.hudi.utilities.inline.fs.TestParquetInLining in `hudi-utilities` 
should be put in `hudi-common` (due to data generator dependency)
 * Once a minimum setup of `hudi-testutils` is done, we can start #4
 ** Implement a shared spark session provider there
 ** Use the shared spark session provider for test suites, which group 
functional tests with similar setup/teardown logic (we may need to figure out 
the JUnit 5 equivalent of JUnit 4 test suites with Rule/ClassRule)
 ** By using the new provider class on functional tests one by one, we should 
start observing reduced test times for the hudi-client module and others
 * #2 and #5 can be done in parallel

Each subtask has its own detailed points in its ticket. Please review this 
rough plan and give feedback accordingly. Thanks!


was (Author: rxu):
[~yanghua] [~yanghua] [~nishith29] [~garyli1019]

Here is an execution plan for the subtasks:
 * To begin with, I'm trying to finish subtask #1 as it can be a quick win. As 
shown in [https://github.com/apache/hudi/pull/1619#issuecomment-627610722], we 
can reduce CI time by 10+ min by simply splitting the test tasks.
 * In parallel we can start #3. The proposed `hudi-testutils` module is to 
encompass all `testutils` from each module, which makes the test dependencies 
clearer. It will clean up some misplaced tests found during the package 
restructure. 
 ** org.apache.hudi.execution.TestBoundedInMemoryQueue in `hudi-client` should 
be put in `hudi-common` (due to client test harness dependency)
 ** org.apache.hudi.utilities.inline.fs.TestParquetInLining in `hudi-utilities` 
should be put in `hudi-common` (due to data generator dependency)
 * Once a minimum setup of `hudi-testutils` is done, we can start #4
 ** Implement a shared spark session provider there
 ** Use the shared spark session provider for test suites, which group 
functional tests with similar setup/teardown logic (we may need to figure out 
the JUnit 5 equivalent of JUnit 4 test suites with Rule/ClassRule)
 ** By using the new provider class on functional tests one by one, we should 
start observing reduced test times for the hudi-client module and others
 * #2 and #5 can be done in parallel

Each subtask has its own detailed points in its ticket. Please review this 
rough plan and give feedback accordingly. Thanks!

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> Test utility classes are to be re-designed with considerations like:
>  * Use more mocking
>  * Reduce spark context setup
>  * Improve/clean up the data generator
> An RFC would be preferred for illustrating the design work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-781) Re-design test utilities

2020-06-08 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128778#comment-17128778
 ] 

Raymond Xu commented on HUDI-781:
-

[~yanghua] [~yanghua] [~nishith29] [~garyli1019]

Here is an execution plan for the subtasks:
 * To begin with, I'm trying to finish subtask #1 as it can be a quick win. As 
shown in [https://github.com/apache/hudi/pull/1619#issuecomment-627610722], we 
can reduce CI time by 10+ min by simply splitting the test tasks.
 * In parallel we can start #3. The proposed `hudi-testutils` module is to 
encompass all `testutils` from each module, which makes the test dependencies 
clearer. It will clean up some misplaced tests found during the package 
restructure. 
 ** org.apache.hudi.execution.TestBoundedInMemoryQueue in `hudi-client` should 
be put in `hudi-common` (due to client test harness dependency)
 ** org.apache.hudi.utilities.inline.fs.TestParquetInLining in `hudi-utilities` 
should be put in `hudi-common` (due to data generator dependency)
 * Once a minimum setup of `hudi-testutils` is done, we can start #4
 ** Implement a shared spark session provider there
 ** Use the shared spark session provider for test suites, which group 
functional tests with similar setup/teardown logic (we may need to figure out 
the JUnit 5 equivalent of JUnit 4 test suites with Rule/ClassRule)
 ** By using the new provider class on functional tests one by one, we should 
start observing reduced test times for the hudi-client module and others
 * #2 and #5 can be done in parallel

Each subtask has its own detailed points in its ticket. Please review this 
rough plan and give feedback accordingly. Thanks!

> Re-design test utilities
> 
>
> Key: HUDI-781
> URL: https://issues.apache.org/jira/browse/HUDI-781
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> Test utility classes are to be re-designed with considerations like:
>  * Use more mocking
>  * Reduce spark context setup
>  * Improve/clean up the data generator
> An RFC would be preferred for illustrating the design work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch release-0.5.3 updated (ed4bcbc -> e0c45f6)

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard ed4bcbc  [HUDI-988] Fix More Unit Test Flakiness
 new e0c45f6  [HUDI-988] Fix More Unit Test Flakiness

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (ed4bcbc)
\
 N -- N -- N   refs/heads/release-0.5.3 (e0c45f6)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 hudi-client/src/test/java/org/apache/hudi/index/TestHoodieIndex.java | 1 -
 1 file changed, 1 deletion(-)



[hudi] 01/01: [HUDI-988] Fix More Unit Test Flakiness

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit e0c45f62818da1e285781a5a30622f69accde1af
Author: garyli1019 
AuthorDate: Fri Jun 5 17:25:59 2020 -0700

[HUDI-988] Fix More Unit Test Flakiness
---
 .../hudi/client/TestCompactionAdminClient.java |  17 +---
 .../java/org/apache/hudi/client/TestMultiFS.java   |   4 +-
 .../hudi/client/TestUpdateSchemaEvolution.java |   3 +-
 .../hudi/common/HoodieClientTestHarness.java   |  67 +---
 .../execution/TestBoundedInMemoryExecutor.java |   2 +-
 .../hudi/execution/TestBoundedInMemoryQueue.java   |   3 +-
 .../org/apache/hudi/index/TestHoodieIndex.java |  11 +-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   4 +-
 .../index/bloom/TestHoodieGlobalBloomIndex.java|   5 +-
 .../apache/hudi/io/TestHoodieCommitArchiveLog.java |   3 +-
 .../org/apache/hudi/io/TestHoodieMergeHandle.java  |   6 +-
 .../apache/hudi/table/TestConsistencyGuard.java|   2 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java|   4 +-
 .../apache/hudi/table/TestMergeOnReadTable.java| 112 ++---
 .../hudi/table/compact/TestAsyncCompaction.java|   2 +-
 .../hudi/table/compact/TestHoodieCompactor.java|   5 +-
 .../table/view/HoodieTableFileSystemView.java  |   6 ++
 .../timeline/service/FileSystemViewHandler.java|   2 +-
 18 files changed, 137 insertions(+), 121 deletions(-)

diff --git 
a/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
 
b/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
index 8e94857..41fb16c 100644
--- 
a/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
+++ 
b/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
@@ -33,9 +33,9 @@ import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.table.compact.OperationResult;
+
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
-import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
 import org.junit.Test;
@@ -67,13 +67,6 @@ public class TestCompactionAdminClient extends 
TestHoodieClientBase {
 client = new CompactionAdminClient(jsc, basePath);
   }
 
-  @After
-  public void tearDown() {
-client.close();
-metaClient = null;
-cleanupSparkContexts();
-  }
-
   @Test
   public void testUnscheduleCompactionPlan() throws Exception {
 int numEntriesPerInstant = 10;
@@ -273,10 +266,10 @@ public class TestCompactionAdminClient extends 
TestHoodieClientBase {
 new HoodieTableFileSystemView(metaClient, 
metaClient.getCommitsAndCompactionTimeline());
 // Expect all file-slice whose base-commit is same as compaction commit to 
contain no new Log files
 
newFsView.getLatestFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0],
 compactionInstant, true)
-.filter(fs -> 
fs.getBaseInstantTime().equals(compactionInstant)).forEach(fs -> {
-  Assert.assertFalse("No Data file must be present", 
fs.getBaseFile().isPresent());
-  Assert.assertEquals("No Log Files", 0, fs.getLogFiles().count());
-});
+.filter(fs -> 
fs.getBaseInstantTime().equals(compactionInstant)).forEach(fs -> {
+  Assert.assertFalse("No Data file must be present", 
fs.getBaseFile().isPresent());
+  Assert.assertEquals("No Log Files", 0, fs.getLogFiles().count());
+});
 
 // Ensure same number of log-files before and after renaming per fileId
 Map fileIdToCountsAfterRenaming =
diff --git a/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java 
b/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
index 8d3fa13..24ecc8e 100644
--- a/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
+++ b/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
@@ -63,9 +63,7 @@ public class TestMultiFS extends HoodieClientTestHarness {
 
   @After
   public void tearDown() throws Exception {
-cleanupSparkContexts();
-cleanupDFS();
-cleanupTestDataGenerator();
+cleanupResources();
   }
 
   protected HoodieWriteConfig getHoodieWriteConfig(String basePath) {
diff --git 
a/hudi-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java
 
b/hudi-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java
index ab6e940..de853f5 100644
--- 
a/hudi-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java
+++ 
b/hudi-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java
@@ -60,8 +60,7 @@ public class TestUpdateSchemaEvolution extends 
HoodieClientTestHarness {
 
   @After
   public void tearDown() throws IOException {
-cleanupSparkContexts();
-

[jira] [Updated] (HUDI-1007) When earliestOffsets is greater than checkpoint, Hudi will not be able to successfully consume data

2020-06-08 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-1007:

Status: Open  (was: New)

> When earliestOffsets is greater than checkpoint, Hudi will not be able to 
> successfully consume data
> ---
>
> Key: HUDI-1007
> URL: https://issues.apache.org/jira/browse/HUDI-1007
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Use deltastreamer to consume kafka.
>  When earliestOffsets is greater than the checkpoint, Hudi will not be able 
> to successfully consume data.
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen#checkupValidOffsets
> boolean checkpointOffsetReseter = checkpointOffsets.entrySet().stream()
>   .anyMatch(offset -> offset.getValue() < earliestOffsets.get(offset.getKey()));
> return checkpointOffsetReseter ? earliestOffsets : checkpointOffsets;
> Kafka data is continuously generated, which means that some data keeps 
> expiring.
>  When earliestOffsets is greater than the checkpoint, earliestOffsets will 
> be taken. But by that moment some of the data has already expired, so 
> consumption fails again, and the process becomes an endless cycle. I can 
> understand that this design may be meant to avoid data loss, but it leads to 
> this situation. I want to fix this problem and would like to hear your 
> opinion.
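
For illustration, a self-contained sketch of the failure mode described above
(plain Map types standing in for the actual KafkaOffsetGen structures):

{code}
import java.util.Map;

public class OffsetResetLoopSketch {

  // Mirrors checkupValidOffsets: if any checkpointed offset has expired past
  // the earliest retained offset, restart the whole consume from earliest.
  static Map<Integer, Long> checkupValidOffsets(Map<Integer, Long> checkpointOffsets,
                                                Map<Integer, Long> earliestOffsets) {
    boolean checkpointOffsetReset = checkpointOffsets.entrySet().stream()
        .anyMatch(e -> e.getValue() < earliestOffsets.get(e.getKey()));
    // While retention expires data faster than the job catches up, this
    // branch is taken on every run and consumption never makes progress.
    return checkpointOffsetReset ? earliestOffsets : checkpointOffsets;
  }

  public static void main(String[] args) {
    Map<Integer, Long> checkpoint = Map.of(0, 100L, 1, 250L);
    Map<Integer, Long> earliest = Map.of(0, 120L, 1, 200L); // partition 0 already expired
    System.out.println(checkupValidOffsets(checkpoint, earliest)); // falls back to earliest
  }
}
{code}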



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-914) support different target data clusters

2020-06-08 Thread liujinhui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128757#comment-17128757
 ] 

liujinhui commented on HUDI-914:


Some business teams only want the Hudi dataset to appear on their own 
clusters, and they do not want to pay attention to the specific tasks that 
produce it.
[~vinoth]

> support different target data clusters
> --
>
> Key: HUDI-914
> URL: https://issues.apache.org/jira/browse/HUDI-914
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Currently hudi-DeltaStreamer does not support writing to different target 
> clusters. The scenario is as follows: Hudi tasks generally run on an 
> independent cluster, and writing data to a target data cluster normally 
> relies on that cluster's core-site.xml and hdfs-site.xml. Sometimes, 
> however, data must be written to a target cluster whose core-site.xml and 
> hdfs-site.xml are not present on the cluster running the Hudi task. Writing 
> is possible by specifying the namenode IP address of the target cluster, but 
> this loses HDFS high availability. So I plan to use the contents of the 
> target cluster's core-site.xml and hdfs-site.xml files as configuration 
> items and configure them in Hudi's dfs-source.properties or 
> kafka-source.properties file.
> Is there a better way to solve this problem?
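
One hypothetical shape of that proposal, purely for illustration (the
"target.cluster." prefix is invented here, not an agreed config name): read
the target cluster's site settings out of the source properties and apply
them to the Hadoop Configuration, so HA nameservice resolution keeps working.

{code}
import java.util.Properties;
import org.apache.hadoop.conf.Configuration;

public class TargetClusterConfSketch {

  // Copy prefixed entries (e.g. target.cluster.dfs.nameservices=...) from
  // dfs-source.properties into the Configuration used by the writer.
  static Configuration withTargetCluster(Configuration base, Properties props) {
    Configuration conf = new Configuration(base);
    String prefix = "target.cluster.";
    for (String name : props.stringPropertyNames()) {
      if (name.startsWith(prefix)) {
        conf.set(name.substring(prefix.length()), props.getProperty(name));
      }
    }
    return conf;
  }
}
{code}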



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-635:

Status: Open  (was: New)

> MergeHandle's DiskBasedMap entries can be thinner
> -
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> Instead of , we can just track  ... Helps 
> with use-cases like HUDI-625



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-69:
---
Status: Patch Available  (was: In Progress)

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-684:

Status: Patch Available  (was: In Progress)

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: help-requested, pull-request-available
> Fix For: 0.6.0
>
>
> We may have different combinations of base and log data:
>  
> parquet, avro (today)
> parquet, parquet 
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should be solved once, generically, 
> for all of these.
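
One possible shape of such an abstraction, sketched with invented names (this
is not Hudi's actual API): hide the concrete base/log formats behind a
per-combination interface so the table machinery is written once.

{code}
import java.io.IOException;
import java.util.Iterator;

// A format-agnostic seam for one file group's storage: implementations would
// exist per (base, log) combination -- parquet+avro, parquet+parquet,
// hfile+hfile -- while the read/write/compaction machinery is written once
// against this interface.
interface FileGroupFormat<R> {
  Iterator<R> readBaseFile(String path) throws IOException;
  Iterator<R> readLogFile(String path) throws IOException;
  void writeBaseFile(String path, Iterator<R> records) throws IOException;
}
{code}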



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-684:

Fix Version/s: 0.6.0

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Major
>  Labels: help-requested, pull-request-available
> Fix For: 0.6.0
>
>
> We may have different combinations of base and log data:
>  
> parquet, avro (today)
> parquet, parquet 
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should be solved once, generically, 
> for all of these.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-684) Introduce abstraction for writing and reading and compacting from FileGroups

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-684:

Priority: Blocker  (was: Major)

> Introduce abstraction for writing and reading and compacting from FileGroups 
> -
>
> Key: HUDI-684
> URL: https://issues.apache.org/jira/browse/HUDI-684
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Code Cleanup, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Prashant Wason
>Priority: Blocker
>  Labels: help-requested, pull-request-available
> Fix For: 0.6.0
>
>
> We may have different combinations of base and log data:
>  
> parquet, avro (today)
> parquet, parquet 
> hfile, hfile (indexing, RFC-08)
>  
> The reading/writing/compaction machinery should be solved once, generically, 
> for all of these.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

2020-06-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-635:


Assignee: sivabalan narayanan  (was: Vinoth Chandar)

> MergeHandle's DiskBasedMap entries can be thinner
> -
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> Instead of , we can just track  ... Helps 
> with use-cases like HUDI-625



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-242) Support Efficient bootstrap of large parquet datasets to Hudi

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-242:

Status: Patch Available  (was: In Progress)

> Support Efficient bootstrap of large parquet datasets to Hudi
> -
>
> Key: HUDI-242
> URL: https://issues.apache.org/jira/browse/HUDI-242
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Usability
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
>  Support Efficient bootstrap of large parquet tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-882) Update documentation with new configs for 0.6.0 release

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-882:

Status: New  (was: Open)

> Update documentation with new configs for 0.6.0 release
> ---
>
> Key: HUDI-882
> URL: https://issues.apache.org/jira/browse/HUDI-882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Umbrella ticket to track new configurations that need to be added to the 
> docs page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-818:

Status: New  (was: Open)

> Optimize the default value of hoodie.memory.merge.max.size option
> -
>
> Key: HUDI-818
> URL: https://issues.apache.org/jira/browse/HUDI-818
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: bug-bash-0.6.0, help-requested
> Fix For: 0.6.0
>
>
> The default value of the hoodie.memory.merge.max.size option is incapable of 
> meeting some users' performance requirements
> [https://github.com/apache/incubator-hudi/issues/1491]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-802:

Status: New  (was: Open)

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Priority: Blocker
> Fix For: 0.6.0
>
>
> The provided AWSDmsAvroPayload class 
> ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
>  currently handles cases where the "Op" column is a "D" for updates, and 
> successfully removes the row from the resulting table. 
> However, when an insert is quickly followed by a delete on the row (e.g. DMS 
> processes them together and puts the update records together in the same 
> parquet file), the row incorrectly appears in the resulting table. In this 
> case, the record is not in the table and getInsertValue is called rather than 
> combineAndGetUpdateValue. Since the logic to check for a delete is in 
> combineAndGetUpdateValue, it is skipped and the delete is missed. Something 
> like this could fix this issue: 
> [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
>  
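
The gist of the linked fix, reduced to a hedged sketch (the field name "Op"
follows the AWSDmsAvroPayload convention; the method shape is simplified and
is not the exact payload API): apply the delete check on the insert path too.

{code}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.hudi.common.util.Option;

class DmsDeleteAwareSketch {
  // An insert followed by a delete within the same batch should not surface
  // the row, so honor Op == "D" even when getInsertValue is the entry point.
  static Option<IndexedRecord> getInsertValue(IndexedRecord record, Schema schema) {
    Object op = ((GenericRecord) record).get("Op");
    if (op != null && "D".equals(op.toString())) {
      return Option.empty(); // row was deleted before it was ever committed
    }
    return Option.of(record);
  }
}
{code}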



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-686:

Status: In Progress  (was: Open)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 
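
A rough, standalone sketch of the per-record flow such an index implies
(stand-in types; not the actual BloomIndexV2 implementation): check candidate
files by key range, then by a bloom-style membership test, streaming over
file metadata instead of building cached join RDDs.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FileIndexEntry {
  final String fileId;
  final String minKey;
  final String maxKey;
  final Predicate<String> mayContain; // bloom-filter-like membership test

  FileIndexEntry(String fileId, String minKey, String maxKey, Predicate<String> mayContain) {
    this.fileId = fileId;
    this.minKey = minKey;
    this.maxKey = maxKey;
    this.mayContain = mayContain;
  }
}

class SimpleBloomIndexSketch {
  // Candidate files for one record key; false positives are later resolved
  // by actually reading the candidate files.
  static List<String> candidateFiles(String key, List<FileIndexEntry> files) {
    List<String> out = new ArrayList<>();
    for (FileIndexEntry f : files) {
      boolean inRange = key.compareTo(f.minKey) >= 0 && key.compareTo(f.maxKey) <= 0;
      if (inRange && f.mayContain.test(key)) {
        out.add(f.fileId);
      }
    }
    return out;
  }
}
{code}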



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-289) Implement a test suite to support long running test for Hudi writing and querying end-end

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-289:

Status: Patch Available  (was: In Progress)

> Implement a test suite to support long running test for Hudi writing and 
> querying end-end
> -
>
> Key: HUDI-289
> URL: https://issues.apache.org/jira/browse/HUDI-289
> Project: Apache Hudi
>  Issue Type: Test
>  Components: Usability
>Reporter: Vinoth Chandar
>Assignee: Nishith Agarwal
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> We would need an equivalent of an end-to-end test which runs some workload 
> for a few hours at least, triggers various actions like commit, deltacommit, 
> rollback, and compaction, and ensures correctness of the code before every 
> release.
> P.S.: Learn from all the CSS issues managing compaction.
> The feature branch is here: 
> [https://github.com/apache/incubator-hudi/tree/hudi_test_suite_refactor]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-860:

Status: New  (was: Open)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-686:

Status: Patch Available  (was: In Progress)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Index, Performance
>Reporter: Vinoth Chandar
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-855) Run Auto Cleaner in parallel with ingestion

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-855:

Status: New  (was: Open)

> Run Auto Cleaner in parallel with ingestion
> ---
>
> Key: HUDI-855
> URL: https://issues.apache.org/jira/browse/HUDI-855
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently auto clean runs synchronously after ingestion finishes. As 
> cleaning and ingestion can safely happen in parallel, we can take advantage 
> of this and schedule them to run concurrently.
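
A minimal sketch of the scheduling idea (plain java.util.concurrent, not
Hudi's actual client code): start cleaning asynchronously and only join on it
before shutdown, letting it overlap with the next ingestion round.

{code}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncCleanSketch {
  public static void main(String[] args) {
    ExecutorService cleanerPool = Executors.newSingleThreadExecutor();

    // Kick off cleaning of older file versions in the background.
    CompletableFuture<Void> cleaning = CompletableFuture.runAsync(
        () -> System.out.println("cleaning older file slices"), cleanerPool);

    // Ingestion proceeds without waiting on the cleaner.
    System.out.println("ingesting next batch");

    cleaning.join();       // ensure the cleaner finished before exiting
    cleanerPool.shutdown();
  }
}
{code}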



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-651) Incremental Query on Hive via Spark SQL does not return expected results

2020-06-08 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128742#comment-17128742
 ] 

Vinoth Chandar commented on HUDI-651:
-

[~bhavanisudha] can you please push your draft impl to a PR/branch so someone 
can pick it up? 

> Incremental Query on Hive via Spark SQL does not return expected results
> 
>
> Key: HUDI-651
> URL: https://issues.apache.org/jira/browse/HUDI-651
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Using the docker demo, I added two delta commits to a MOR table and was 
> hoping to incrementally consume them as in Hive QL. Something is amiss.
> {code}
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.start.timestamp","20200302210147")
> scala> 
> spark.sparkContext.hadoopConfiguration.set("hoodie.stock_ticks_mor_rt.consume.mode","INCREMENTAL")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> +---+
> |_hoodie_commit_time|
> +---+
> |20200302210010 |
> |20200302210147 |
> +---+
> scala> sc.setLogLevel("INFO")
> scala> spark.sql("select distinct `_hoodie_commit_time` from 
> stock_ticks_mor_rt").show(100, false)
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO aggregate.HashAggregateExec: 
> spark.sql.codegen.aggregate.map.twolevel.enabled is set to true, but current 
> version of codegened fast hashmap does not support this aggregate.
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44 stored as 
> values in memory (estimated size 292.3 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO memory.MemoryStore: Block broadcast_44_piece0 stored 
> as bytes in memory (estimated size 25.4 KB, free 365.3 MB)
> 20/03/02 21:15:37 INFO storage.BlockManagerInfo: Added broadcast_44_piece0 in 
> memory on adhoc-1:45623 (size: 25.4 KB, free: 366.2 MB)
> 20/03/02 21:15:37 INFO spark.SparkContext: Created broadcast 44 from 
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@5a66fc27, 
> file:/etc/hadoop/hive-site.xml], FileSystem: 
> [DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1645984031_1, ugi=root 
> (auth:SIMPLE)]]]
> 20/03/02 21:15:37 INFO table.HoodieTableConfig: Loading table properties from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor/.hoodie/hoodie.properties
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Finished Loading Table of 
> type MERGE_ON_READ(version=1) from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO mapred.FileInputFormat: Total input paths to process : 
> 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Found a total of 1 
> groups
> 20/03/02 21:15:37 INFO timeline.HoodieActiveTimeline: Loaded instants 
> [[20200302210010__clean__COMPLETED], 
> [20200302210010__deltacommit__COMPLETED], [20200302210147__clean__COMPLETED], 
> [20200302210147__deltacommit__COMPLETED]]
> 20/03/02 21:15:37 INFO view.HoodieTableFileSystemView: Adding file-groups for 
> partition :2018/08/31, #FileGroups=1
> 20/03/02 21:15:37 INFO view.AbstractTableFileSystemView: addFilesToView: 
> NumFiles=1, FileGroupsCreationTime=0, StoreTimeTaken=0
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Total paths to 
> process after hoodie filter 1
> 20/03/02 21:15:37 INFO hadoop.HoodieParquetInputFormat: Reading hoodie 
> metadata from path hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO table.HoodieTableMetaClient: Loading 
> HoodieTableMetaClient from 
> hdfs://namenode:8020/user/hive/warehouse/stock_ticks_mor
> 20/03/02 21:15:37 INFO util.FSUtils: Hadoop Configuration: fs.defaultFS: 
> [hdfs://namenode:8020], Config:[Configuration: core-default.xml, 
> core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, 
> yarn-site.xml, hdfs-default.xml, hdfs-site.xml, 
> 

[jira] [Updated] (HUDI-472) Make sortBy() inside bulkInsertInternal() configurable for bulk_insert

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-472:

Status: New  (was: Open)

> Make sortBy() inside bulkInsertInternal() configurable for bulk_insert
> --
>
> Key: HUDI-472
> URL: https://issues.apache.org/jira/browse/HUDI-472
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1013) Bulk Insert w/o converting to RDD

2020-06-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1013:
--
Summary: Bulk Insert w/o converting to RDD  (was: Bulk Insert w/ converting 
to RDD)

> Bulk Insert w/o converting to RDD
> -
>
> Key: HUDI-1013
> URL: https://issues.apache.org/jira/browse/HUDI-1013
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-115) Enhance OverwriteWithLatestAvroPayload to also respect ordering value of record in storage

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-115:

Status: Patch Available  (was: In Progress)

> Enhance OverwriteWithLatestAvroPayload to also respect ordering value of 
> record in storage
> --
>
> Key: HUDI-115
> URL: https://issues.apache.org/jira/browse/HUDI-115
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> https://lists.apache.org/thread.html/45035cc88901b37e3f985b72def90ee5529c4caf87e48d650c00327d@
>  
> context here 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-635:

Priority: Blocker  (was: Major)

> MergeHandle's DiskBasedMap entries can be thinner
> -
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: help-requested
>
> Instead of , we can just track  ... Helps 
> with use-cases like HUDI-625



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-635) MergeHandle's DiskBasedMap entries can be thinner

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-635:

Fix Version/s: 0.6.0

> MergeHandle's DiskBasedMap entries can be thinner
> -
>
> Key: HUDI-635
> URL: https://issues.apache.org/jira/browse/HUDI-635
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> Instead of , we can just track  ... Helps 
> with use-cases like HUDI-625



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2020-06-08 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-1015:


 Summary: Audit all getAllPartitionPaths() calls and keep em out of 
fast path
 Key: HUDI-1015
 URL: https://issues.apache.org/jira/browse/HUDI-1015
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Common Core, Writer Core
Reporter: Vinoth Chandar
 Fix For: 0.6.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1013) Bulk Insert w/ converting to RDD

2020-06-08 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-1013:
--
Fix Version/s: 0.6.0

> Bulk Insert w/ converting to RDD
> 
>
> Key: HUDI-1013
> URL: https://issues.apache.org/jira/browse/HUDI-1013
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1014) Design and Implement upgrade-downgrade infrastructure

2020-06-08 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-1014:


 Summary: Design and Implement upgrade-downgrade infrastructure
 Key: HUDI-1014
 URL: https://issues.apache.org/jira/browse/HUDI-1014
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Common Core, Writer Core
Reporter: Vinoth Chandar
Assignee: Vinoth Chandar
 Fix For: 0.6.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-839:

Status: In Progress  (was: Open)

> Implement rollbacks using marker files instead of relying on commit metadata
> 
>
> Key: HUDI-839
> URL: https://issues.apache.org/jira/browse/HUDI-839
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is more efficient and avoids the need for caching the input into 
> memory. 
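
A minimal sketch of the marker-file idea, purely illustrative: every file a write touches gets a zero-byte marker under a per-instant directory, so rollback can list markers instead of caching the input or re-reading commit metadata. The .hoodie/.temp/<instant> layout and the assumption that each marker mirrors a data file's relative path are illustrative, not Hudi's exact implementation:

{code:java}
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class MarkerRollbackSketch {

  // Delete every data file that 'instantTime' created by walking its marker
  // directory, instead of reading commit metadata or cached input.
  public static void rollback(Path tableBase, String instantTime) throws IOException {
    Path markerDir = tableBase.resolve(".hoodie/.temp").resolve(instantTime); // assumed layout
    if (!Files.isDirectory(markerDir)) {
      return; // nothing was written for this instant
    }
    try (Stream<Path> markers = Files.walk(markerDir)) {
      markers.filter(Files::isRegularFile).forEach(marker -> {
        // Assumption: each marker mirrors the relative path of one data file.
        Path dataFile = tableBase.resolve(markerDir.relativize(marker).toString());
        try {
          Files.deleteIfExists(dataFile);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    }
  }
}
{code}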



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-839:

Status: Open  (was: New)

> Implement rollbacks using marker files instead of relying on commit metadata
> 
>
> Key: HUDI-839
> URL: https://issues.apache.org/jira/browse/HUDI-839
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is more efficient and avoids the need for caching the input into 
> memory. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1013) Bulk Insert w/ converting to RDD

2020-06-08 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-1013:
-

 Summary: Bulk Insert w/ converting to RDD
 Key: HUDI-1013
 URL: https://issues.apache.org/jira/browse/HUDI-1013
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: sivabalan narayanan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-839) Implement rollbacks using marker files instead of relying on commit metadata

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-839:

Priority: Blocker  (was: Major)

> Implement rollbacks using marker files instead of relying on commit metadata
> 
>
> Key: HUDI-839
> URL: https://issues.apache.org/jira/browse/HUDI-839
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> This is more efficient and avoids the need for caching the input into 
> memory. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-305) Presto MOR "_rt" queries only read base parquet file

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-305:

Priority: Blocker  (was: Major)

> Presto MOR "_rt" queries only read base parquet file 
> --
>
> Key: HUDI-305
> URL: https://issues.apache.org/jira/browse/HUDI-305
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Presto Integration
> Environment: On AWS EMR
>Reporter: Brandon Scheller
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Code example to reproduce.
> {code:java}
> import org.apache.hudi.DataSourceWriteOptions
> import org.apache.hudi.config.HoodieWriteConfig
> import org.apache.spark.sql.SaveMode
> val df = Seq(
>   ("100", "event_name_900", "2015-01-01T13:51:39.340396Z", "type1"),
>   ("101", "event_name_546", "2015-01-01T12:14:58.597216Z", "type2"),
>   ("104", "event_name_123", "2015-01-01T12:15:00.512679Z", "type1"),
>   ("105", "event_name_678", "2015-01-01T13:51:42.248818Z", "type2")
>   ).toDF("event_id", "event_name", "event_ts", "event_type")
> var tableName = "hudi_events_mor_1"
> var tablePath = "s3://emr-users/wenningd/hudi/tables/events/" + tableName
> // write hudi dataset
> df.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, tableName)
>   .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL)
>   .option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>   .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>   .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>   .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>   .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>   .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>   .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>   .option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>   .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>   .mode(SaveMode.Overwrite)
>   .save(tablePath)
> // update a record with event_name "event_name_123" => "event_name_changed"
> val df1 = spark.read.format("org.apache.hudi").load(tablePath + "/*/*")
> val df2 = df1.filter($"event_id" === "104")
> val df3 = df2.withColumn("event_name", lit("event_name_changed"))
> // update hudi dataset
> df3.write.format("org.apache.hudi")
>.option(HoodieWriteConfig.TABLE_NAME, tableName)
>.option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>.option(DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_STORAGE_TYPE_OPT_VAL)
>.option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "event_id")
>.option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, "event_type") 
>.option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, "event_ts")
>.option("hoodie.compact.inline", "false")
>.option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>.option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, tableName)
>.option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, "event_type")
>.option(DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY, "false")
>.option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> "org.apache.hudi.hive.MultiPartKeysValueExtractor")
>.mode(SaveMode.Append)
>.save(tablePath)
> {code}
> Now when querying the real-time table from Hive, we have no issue seeing the 
> updated value:
> {code:java}
> hive> select event_name from hudi_events_mor_1_rt;
> OK
> event_name_900
> event_name_changed
> event_name_546
> event_name_678
> Time taken: 0.103 seconds, Fetched: 4 row(s)
> {code}
> But when querying the real-time table from Presto, we only read the base 
> parquet file and do not see the update that should be merged in from the log 
> file.
> {code:java}
> presto:default> select event_name from hudi_events_mor_1_rt;
>event_name
> 
>  event_name_900
>  event_name_123
>  event_name_546
>  event_name_678
> (4 rows)
> {code}
> Our current understanding of this issue is that while the 
> HoodieParquetRealtimeInputFormat correctly generates the splits, the 
> RealtimeCompactedRecordReader record reader is not used, so Presto never 
> reads the log file and only reads the base parquet file.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-860:

Priority: Blocker  (was: Major)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-575:

Priority: Blocker  (was: Major)

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Prasanna Rajaperumal
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Currently, only inline compaction is supported for structured streaming 
> writes. 
>  
> We need to:
>  * Enable configuring async compaction for streaming writes (see the sketch 
> below)
>  * Implement a parallel compaction process like we did for the delta streamer
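
A hedged sketch of what the requested switch might look like from Spark structured streaming. The option hoodie.datasource.compaction.async.enable is an assumed name for the flag this ticket would introduce; hoodie.compact.inline and the storage-type option already appear elsewhere in this thread:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class AsyncCompactionSketch {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("hudi-async-compaction").getOrCreate();
    // Any streaming source works; the built-in "rate" source keeps this self-contained.
    Dataset<Row> df = spark.readStream().format("rate").load();
    StreamingQuery query = df.writeStream()
        .format("org.apache.hudi")
        .option("hoodie.table.name", "stream_table")
        .option("hoodie.datasource.write.storage.type", "MERGE_ON_READ")
        .option("hoodie.datasource.write.recordkey.field", "value")
        .option("hoodie.datasource.write.precombine.field", "timestamp")
        .option("hoodie.compact.inline", "false")                    // no inline compaction
        .option("hoodie.datasource.compaction.async.enable", "true") // assumed new flag
        .option("checkpointLocation", "/tmp/hudi-stream-checkpoint")
        .start("/tmp/stream_table");
    query.awaitTermination();
  }
}
{code}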



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-575) Support Async Compaction for spark streaming writes to hudi table

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-575:
---

Assignee: Balaji Varadarajan  (was: Prasanna Rajaperumal)

> Support Async Compaction for spark streaming writes to hudi table
> -
>
> Key: HUDI-575
> URL: https://issues.apache.org/jira/browse/HUDI-575
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Currently, only inline compaction is supported for structured streaming 
> writes. 
>  
> We need to:
>  * Enable configuring async compaction for streaming writes
>  * Implement a parallel compaction process like we did for the delta streamer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-818) Optimize the default value of hoodie.memory.merge.max.size option

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-818:

Priority: Blocker  (was: Major)

> Optimize the default value of hoodie.memory.merge.max.size option
> -
>
> Key: HUDI-818
> URL: https://issues.apache.org/jira/browse/HUDI-818
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: lamber-ken
>Assignee: lamber-ken
>Priority: Blocker
>  Labels: bug-bash-0.6.0, help-requested
> Fix For: 0.6.0
>
>
> The default value of the hoodie.memory.merge.max.size option is incapable of 
> meeting some users' performance requirements; see
> [https://github.com/apache/incubator-hudi/issues/1491]
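
Until the default is retuned, the merge buffer can be sized explicitly at write time. A minimal sketch; the 1 GB value is an arbitrary example rather than a recommendation, and the table path and field names are placeholders:

{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class MergeMemorySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("merge-memory").getOrCreate();
    Dataset<Row> df = spark.read().parquet("/tmp/upsert_batch"); // placeholder input
    df.write().format("org.apache.hudi")
        .option("hoodie.table.name", "example_table")
        .option("hoodie.datasource.write.recordkey.field", "event_id")
        .option("hoodie.datasource.write.precombine.field", "event_ts")
        // Explicitly sized merge buffer: 1 GB here, purely as an example.
        .option("hoodie.memory.merge.max.size", String.valueOf(1024L * 1024 * 1024))
        .mode(SaveMode.Append)
        .save("/tmp/example_table");
  }
}
{code}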



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-979) AWSDMSPayload delete handling with MOR

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-979:

Priority: Blocker  (was: Major)

> AWSDMSPayload delete handling with MOR
> --
>
> Key: HUDI-979
> URL: https://issues.apache.org/jira/browse/HUDI-979
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1549] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-855) Run Auto Cleaner in parallel with ingestion

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-855:

Priority: Blocker  (was: Major)

> Run Auto Cleaner in parallel with ingestion
> ---
>
> Key: HUDI-855
> URL: https://issues.apache.org/jira/browse/HUDI-855
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> Currently, auto clean runs synchronously after ingestion finishes. Since 
> cleaning and ingestion can safely happen in parallel, we can take advantage 
> of this and schedule them to run concurrently, as sketched below.
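
A minimal sketch of one way to get that overlap (illustrative only, not Hudi's implementation): hand the clean triggered by batch N to a background executor, so it runs while batch N+1 ingests:

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCleanSketch {

  private final ExecutorService cleanerPool = Executors.newSingleThreadExecutor();
  private Future<?> pendingClean; // clean scheduled after the previous commit

  // Run one ingestion round, then schedule its clean in the background so the
  // clean overlaps with the next round's ingestion instead of blocking this one.
  public void ingestAndScheduleClean(Runnable ingestBatch, Runnable clean) throws Exception {
    ingestBatch.run();                        // write + commit the current batch
    if (pendingClean != null) {
      pendingClean.get();                     // never stack two cleans
    }
    pendingClean = cleanerPool.submit(clean); // clean runs alongside the next batch
  }
}
{code}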



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-845) Allow parallel writing and move the pending rollback work into cleaner

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-845:

Priority: Blocker  (was: Major)

> Allow parallel writing and move the pending rollback work into cleaner
> --
>
> Key: HUDI-845
> URL: https://issues.apache.org/jira/browse/HUDI-845
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Priority: Blocker
>  Labels: help-requested
> Fix For: 0.6.0
>
>
> Things to think about:
>  * Commit time has to be unique across writers (see the toy sketch below).
>  * Parallel writers can finish commits out of order, i.e. c2 commits before c1.
>  * MOR log blocks fence uncommitted data.
>  * Cleaner should loudly complain if it cannot finish cleaning up partial 
> writes.  
>  
> P.S.: think about what is left for the general case: log files may arrive in 
> a different order, and inserts may violate the uniqueness constraint.
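
A toy sketch for the first bullet, showing why second-granular instant times (Hudi's yyyyMMddHHmmss format) collide and how process-local de-duplication could look. Real parallel writers live in separate processes, so some external coordination would be needed; designing that is precisely the open question here:

{code:java}
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class UniqueInstantSketch {

  private static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("yyyyMMddHHmmss");
  // Process-local only: cross-process uniqueness needs external coordination.
  private static final Set<String> ISSUED = ConcurrentHashMap.newKeySet();

  public static String newInstantTime() {
    LocalDateTime t = LocalDateTime.now();
    String instant = t.format(FMT);
    while (!ISSUED.add(instant)) { // already handed out: bump one second forward
      t = t.plusSeconds(1);
      instant = t.format(FMT);
    }
    return instant;
  }
}
{code}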



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-853) Deprecate/Remove Clean_by_versions functionality in Hudi

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-853:

Status: Open  (was: New)

> Deprecate/Remove Clean_by_versions functionality in Hudi 
> -
>
> Key: HUDI-853
> URL: https://issues.apache.org/jira/browse/HUDI-853
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Cleaning by versions is not widely used, and it does not lend itself well to 
> incremental cleaning. 
> Can we go ahead and deprecate it in 0.6.0? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-853) Deprecate/Remove Clean_by_versions functionality in Hudi

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-853.
---
Resolution: Won't Fix

> Deprecate/Remove Clean_by_versions functionality in Hudi 
> -
>
> Key: HUDI-853
> URL: https://issues.apache.org/jira/browse/HUDI-853
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Cleaner
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.0
>
>
> Cleaning by versions is not widely used, and it does not lend itself well to 
> incremental cleaning. 
> Can we go ahead and deprecate it in 0.6.0? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-719) Exception during clean phase: Found org.apache.hudi.avro.model.HoodieCleanMetadata, expecting org.apache.hudi.avro.model.HoodieCleanerPlan

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-719:

Status: Open  (was: New)

> Exception during clean phase: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan
> --
>
> Key: HUDI-719
> URL: https://issues.apache.org/jira/browse/HUDI-719
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> Dataset was written using 0.5; when moving to the latest master:
> {code:java}
>  Exception in thread "main" org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
>  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
>  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>  at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
>  at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
>  at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>  at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>  at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
>  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
>  at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141)
>  at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>  at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86)
>  at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843)
>  at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:397)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>  at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>  at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-719) Exception during clean phase: Found org.apache.hudi.avro.model.HoodieCleanMetadata, expecting org.apache.hudi.avro.model.HoodieCleanerPlan

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-719.
-
Resolution: Fixed

> Exception during clean phase: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan
> --
>
> Key: HUDI-719
> URL: https://issues.apache.org/jira/browse/HUDI-719
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Alexander Filipchik
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> Dataset was written using 0.5; when moving to the latest master:
> {code:java}
>  Exception in thread "main" org.apache.avro.AvroTypeException: Found 
> org.apache.hudi.avro.model.HoodieCleanMetadata, expecting 
> org.apache.hudi.avro.model.HoodieCleanerPlan, missing required field policy
>  at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:292)
>  at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
>  at 
> org.apache.avro.io.ResolvingDecoder.readFieldOrder(ResolvingDecoder.java:130)
>  at 
> org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:215)
>  at 
> org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
>  at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
>  at 
> org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
>  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:233)
>  at org.apache.avro.file.DataFileStream.next(DataFileStream.java:220)
>  at 
> org.apache.hudi.common.util.AvroUtils.deserializeAvroMetadata(AvroUtils.java:149)
>  at 
> org.apache.hudi.common.util.CleanerUtils.getCleanerPlan(CleanerUtils.java:87)
>  at 
> org.apache.hudi.client.HoodieCleanClient.runClean(HoodieCleanClient.java:141)
>  at 
> org.apache.hudi.client.HoodieCleanClient.lambda$clean$0(HoodieCleanClient.java:88)
>  at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
>  at org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:86)
>  at org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:843)
>  at 
> org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:520)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:168)
>  at 
> org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:111)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:397)
>  at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:237)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:121)
>  at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:294)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>  at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
>  at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>  at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>  at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>  at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
>  at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
>  at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-882) Update documentation with new configs for 0.6.0 release

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-882:

Priority: Blocker  (was: Major)

> Update documentation with new configs for 0.6.0 release
> ---
>
> Key: HUDI-882
> URL: https://issues.apache.org/jira/browse/HUDI-882
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs
>Reporter: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> Umbrella ticket to track new configurations that need to be added to the 
> docs page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-115) Enhance OverwriteWithLatestAvroPayload to also respect ordering value of record in storage

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-115:

Priority: Blocker  (was: Major)

> Enhance OverwriteWithLatestAvroPayload to also respect ordering value of 
> record in storage
> --
>
> Key: HUDI-115
> URL: https://issues.apache.org/jira/browse/HUDI-115
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Bhavani Sudha
>Priority: Blocker
>  Labels: bug-bash-0.6.0
> Fix For: 0.6.0
>
>
> https://lists.apache.org/thread.html/45035cc88901b37e3f985b72def90ee5529c4caf87e48d650c00327d@
>  
> context here 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-920) Incremental view on MOR table using Spark Datasource

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-920:

Priority: Blocker  (was: Major)

> Incremental view on MOR table using Spark Datasource
> 
>
> Key: HUDI-920
> URL: https://issues.apache.org/jira/browse/HUDI-920
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Blocker
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-844) Store Avro schema string as first-level entity in commit metadata

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-844:

Status: Open  (was: New)

> Store Avro schema string as first-level entity in commit metadata
> -
>
> Key: HUDI-844
> URL: https://issues.apache.org/jira/browse/HUDI-844
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: newbie
> Fix For: 0.6.0
>
>
> Currently, we store the avro schema string in commit metadata inside a map 
> structure, extraMetadata. We are building logic that expects this avro 
> schema to be present in the metadata. It would be cleaner to store the avro 
> schema the same way we store write-stats in commit metadata. 
> We need to use the MigrationHandler framework to handle the upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-69) Support realtime view in Spark datasource #136

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-69:
---
Priority: Blocker  (was: Major)

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-844) Store Avro schema string as first-level entity in commit metadata

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-844.
---
Resolution: Won't Fix

> Store Avro schema string as first-level entity in commit metadata
> -
>
> Key: HUDI-844
> URL: https://issues.apache.org/jira/browse/HUDI-844
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Priority: Major
>  Labels: newbie
> Fix For: 0.6.0
>
>
> Currently, we store the avro schema string in commit metadata inside a map 
> structure, extraMetadata. We are building logic that expects this avro 
> schema to be present in the metadata. It would be cleaner to store the avro 
> schema the same way we store write-stats in commit metadata. 
> We need to use the MigrationHandler framework to handle the upgrade.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-472) Make sortBy() inside bulkInsertInternal() configurable for bulk_insert

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-472:

Status: Open  (was: New)

> Make sortBy() inside bulkInsertInternal() configurable for bulk_insert
> --
>
> Key: HUDI-472
> URL: https://issues.apache.org/jira/browse/HUDI-472
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-472) Make sortBy() inside bulkInsertInternal() configurable for bulk_insert

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-472:

Priority: Blocker  (was: Major)

> Make sortBy() inside bulkInsertInternal() configurable for bulk_insert
> --
>
> Key: HUDI-472
> URL: https://issues.apache.org/jira/browse/HUDI-472
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-586) Revisit the release guide

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-586:
---

Assignee: sivabalan narayanan  (was: leesf)

> Revisit the release guide
> -
>
> Key: HUDI-586
> URL: https://issues.apache.org/jira/browse/HUDI-586
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Release & Administrative
>Reporter: leesf
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.6.0
>
>
> Currently, the release guide is not very standard, mainly in the 
> finalize-the-release step. We could refer to the FLINK guide 
> [https://cwiki.apache.org/confluence/display/FLINK/Creating+a+Flink+Release];
> the main change might be not adding rc-\{RC_NUM} to the pom.xml.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-672) Spark DataSource - Upsert for S3 Hudi dataset with large partitions takes a lot of time in writing

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar closed HUDI-672.
---
Resolution: Duplicate

> Spark DataSource - Upsert for S3 Hudi dataset with large partitions takes a 
> lot of time in writing
> --
>
> Key: HUDI-672
> URL: https://issues.apache.org/jira/browse/HUDI-672
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Major
> Fix For: 0.6.0
>
>
> Github Issue : [https://github.com/apache/incubator-hudi/issues/1371]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-672) Spark DataSource - Upsert for S3 Hudi dataset with large partitions takes a lot of time in writing

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-672:

Status: Open  (was: New)

> Spark DataSource - Upsert for S3 Hudi dataset with large partitions takes a 
> lot of time in writing
> --
>
> Key: HUDI-672
> URL: https://issues.apache.org/jira/browse/HUDI-672
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Udit Mehrotra
>Priority: Major
> Fix For: 0.6.0
>
>
> Github Issue : [https://github.com/apache/incubator-hudi/issues/1371]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-802:

Priority: Blocker  (was: Major)

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Priority: Blocker
> Fix For: 0.6.0
>
>
> The provided AWSDmsAvroPayload class 
> ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
>  currently handles cases where the "Op" column is a "D" for updates, and 
> successfully removes the row from the resulting table. 
> However, when an insert is quickly followed by a delete on the row (e.g. DMS 
> processes them together and puts the update records together in the same 
> parquet file), the row incorrectly appears in the resulting table. In this 
> case, the record is not in the table and getInsertValue is called rather than 
> combineAndGetUpdateValue. Since the logic to check for a delete is in 
> combineAndGetUpdateValue, it is skipped and the delete is missed. Something 
> like this could fix this issue: 
> [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
>  
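
A minimal sketch of the fix the description points to: honor the "Op" column on the insert path as well, so an insert immediately followed by a delete in the same batch produces no row. This mirrors the idea in AWSDmsAvroPayload but is illustrative, not the shipped class:

{code:java}
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;

import java.util.Optional;

public class DeleteAwareInsertSketch {

  // Returning empty signals "nothing to write", i.e. the row is dropped even
  // though it was never present in the table before this batch.
  public static Optional<IndexedRecord> getInsertValue(IndexedRecord record) {
    Object op = ((GenericRecord) record).get("Op");
    if (op != null && "D".equals(op.toString())) {
      return Optional.empty();
    }
    return Optional.of(record);
  }
}
{code}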



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-802) AWSDmsTransformer does not handle insert -> delete of a row in a single batch correctly

2020-06-08 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-802:

Status: Open  (was: New)

> AWSDmsTransformer does not handle insert -> delete of a row in a single batch 
> correctly
> ---
>
> Key: HUDI-802
> URL: https://issues.apache.org/jira/browse/HUDI-802
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Christopher Weaver
>Priority: Major
> Fix For: 0.6.0
>
>
> The provided AWSDmsAvroPayload class 
> ([https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/payload/AWSDmsAvroPayload.java])
>  currently handles cases where the "Op" column is a "D" for updates, and 
> successfully removes the row from the resulting table. 
> However, when an insert is quickly followed by a delete on the row (e.g. DMS 
> processes them together and puts the update records together in the same 
> parquet file), the row incorrectly appears in the resulting table. In this 
> case, the record is not in the table and getInsertValue is called rather than 
> combineAndGetUpdateValue. Since the logic to check for a delete is in 
> combineAndGetUpdateValue, it is skipped and the delete is missed. Something 
> like this could fix this issue: 
> [https://github.com/Weves/incubator-hudi/blob/release-0.5.1/hudi-spark/src/main/java/org/apache/hudi/payload/CustomAWSDmsAvroPayload.java].
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[hudi] branch release-0.5.3 updated (864a7cd -> ed4bcbc)

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard 864a7cd  [HUDI-988] Fix More Unit Test Flakiness
 new ed4bcbc  [HUDI-988] Fix More Unit Test Flakiness

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (864a7cd)
\
 N -- N -- N   refs/heads/release-0.5.3 (ed4bcbc)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 .../hudi/client/TestCompactionAdminClient.java | 28 +++---
 .../hudi/common/HoodieClientTestHarness.java   |  2 --
 .../org/apache/hudi/index/TestHoodieIndex.java | 19 ---
 .../apache/hudi/table/TestMergeOnReadTable.java| 12 ++
 4 files changed, 21 insertions(+), 40 deletions(-)



[hudi] 01/01: [HUDI-988] Fix More Unit Test Flakiness

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ed4bcbcf54945d52871e954855b7e8d470dfff26
Author: garyli1019 
AuthorDate: Fri Jun 5 17:25:59 2020 -0700

[HUDI-988] Fix More Unit Test Flakiness
---
 .../hudi/client/TestCompactionAdminClient.java |  17 +---
 .../java/org/apache/hudi/client/TestMultiFS.java   |   4 +-
 .../hudi/client/TestUpdateSchemaEvolution.java |   3 +-
 .../hudi/common/HoodieClientTestHarness.java   |  67 +---
 .../execution/TestBoundedInMemoryExecutor.java |   2 +-
 .../hudi/execution/TestBoundedInMemoryQueue.java   |   3 +-
 .../org/apache/hudi/index/TestHoodieIndex.java |  10 +-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   4 +-
 .../index/bloom/TestHoodieGlobalBloomIndex.java|   5 +-
 .../apache/hudi/io/TestHoodieCommitArchiveLog.java |   3 +-
 .../org/apache/hudi/io/TestHoodieMergeHandle.java  |   6 +-
 .../apache/hudi/table/TestConsistencyGuard.java|   2 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java|   4 +-
 .../apache/hudi/table/TestMergeOnReadTable.java| 112 ++---
 .../hudi/table/compact/TestAsyncCompaction.java|   2 +-
 .../hudi/table/compact/TestHoodieCompactor.java|   5 +-
 .../table/view/HoodieTableFileSystemView.java  |   6 ++
 .../timeline/service/FileSystemViewHandler.java|   2 +-
 18 files changed, 137 insertions(+), 120 deletions(-)

diff --git 
a/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
 
b/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
index 8e94857..41fb16c 100644
--- 
a/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
+++ 
b/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
@@ -33,9 +33,9 @@ import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.table.compact.OperationResult;
+
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
-import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
 import org.junit.Test;
@@ -67,13 +67,6 @@ public class TestCompactionAdminClient extends 
TestHoodieClientBase {
 client = new CompactionAdminClient(jsc, basePath);
   }
 
-  @After
-  public void tearDown() {
-client.close();
-metaClient = null;
-cleanupSparkContexts();
-  }
-
   @Test
   public void testUnscheduleCompactionPlan() throws Exception {
 int numEntriesPerInstant = 10;
@@ -273,10 +266,10 @@ public class TestCompactionAdminClient extends 
TestHoodieClientBase {
 new HoodieTableFileSystemView(metaClient, 
metaClient.getCommitsAndCompactionTimeline());
 // Expect all file-slice whose base-commit is same as compaction commit to 
contain no new Log files
 
newFsView.getLatestFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0],
 compactionInstant, true)
-.filter(fs -> 
fs.getBaseInstantTime().equals(compactionInstant)).forEach(fs -> {
-  Assert.assertFalse("No Data file must be present", 
fs.getBaseFile().isPresent());
-  Assert.assertEquals("No Log Files", 0, fs.getLogFiles().count());
-});
+.filter(fs -> 
fs.getBaseInstantTime().equals(compactionInstant)).forEach(fs -> {
+  Assert.assertFalse("No Data file must be present", 
fs.getBaseFile().isPresent());
+  Assert.assertEquals("No Log Files", 0, fs.getLogFiles().count());
+});
 
 // Ensure same number of log-files before and after renaming per fileId
 Map fileIdToCountsAfterRenaming =
diff --git a/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java 
b/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
index 8d3fa13..24ecc8e 100644
--- a/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
+++ b/hudi-client/src/test/java/org/apache/hudi/client/TestMultiFS.java
@@ -63,9 +63,7 @@ public class TestMultiFS extends HoodieClientTestHarness {
 
   @After
   public void tearDown() throws Exception {
-cleanupSparkContexts();
-cleanupDFS();
-cleanupTestDataGenerator();
+cleanupResources();
   }
 
   protected HoodieWriteConfig getHoodieWriteConfig(String basePath) {
diff --git 
a/hudi-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java
 
b/hudi-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java
index ab6e940..de853f5 100644
--- 
a/hudi-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java
+++ 
b/hudi-client/src/test/java/org/apache/hudi/client/TestUpdateSchemaEvolution.java
@@ -60,8 +60,7 @@ public class TestUpdateSchemaEvolution extends 
HoodieClientTestHarness {
 
   @After
   public void tearDown() throws IOException {
-cleanupSparkContexts();
-

[hudi] 03/04: Making a few fixes after cherry picking

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 84ca5b0cae72d9b33271045efee93a4cf1a0cff5
Author: Sivabalan Narayanan 
AuthorDate: Sun Jun 7 16:23:40 2020 -0400

Making a few fixes after cherry picking
---
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  52 --
 .../org/apache/hudi/cli/commands/StatsCommand.java |   4 +-
 .../apache/hudi/client/TestHoodieClientBase.java   | 917 +++--
 .../hudi/common/HoodieClientTestHarness.java   | 426 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java|   3 -
 .../apache/hudi/table/TestMergeOnReadTable.java|   2 +
 .../hudi/table/compact/TestHoodieCompactor.java|   6 +-
 .../table/string/TestHoodieActiveTimeline.java |   2 +-
 8 files changed, 680 insertions(+), 732 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
deleted file mode 100644
index 708ae29..000
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
+++ /dev/null
@@ -1,52 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *  http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.hudi.cli;
-
-/**
- * Fields of print table header.
- */
-public class HoodieTableHeaderFields {
-
-  public static final String HEADER_PARTITION = "Partition";
-  public static final String HEADER_PARTITION_PATH = HEADER_PARTITION + " 
Path";
-  /**
-   * Fields of Repair.
-   */
-  public static final String HEADER_METADATA_PRESENT = "Metadata Present?";
-  public static final String HEADER_REPAIR_ACTION = "Action";
-  public static final String HEADER_HOODIE_PROPERTY = "Property";
-  public static final String HEADER_OLD_VALUE = "Old Value";
-  public static final String HEADER_NEW_VALUE = "New Value";
-
-  /**
-   * Fields of Stats.
-   */
-  public static final String HEADER_COMMIT_TIME = "CommitTime";
-  public static final String HEADER_TOTAL_UPSERTED = "Total Upserted";
-  public static final String HEADER_TOTAL_WRITTEN = "Total Written";
-  public static final String HEADER_WRITE_AMPLIFICATION_FACTOR = "Write 
Amplification Factor";
-  public static final String HEADER_HISTOGRAM_MIN = "Min";
-  public static final String HEADER_HISTOGRAM_10TH = "10th";
-  public static final String HEADER_HISTOGRAM_50TH = "50th";
-  public static final String HEADER_HISTOGRAM_AVG = "avg";
-  public static final String HEADER_HISTOGRAM_95TH = "95th";
-  public static final String HEADER_HISTOGRAM_MAX = "Max";
-  public static final String HEADER_HISTOGRAM_NUM_FILES = "NumFiles";
-  public static final String HEADER_HISTOGRAM_STD_DEV = "StdDev";
-}
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
index 4874777..b05aee2 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
@@ -54,7 +54,7 @@ import java.util.stream.Collectors;
 @Component
 public class StatsCommand implements CommandMarker {
 
-  public static final int MAX_FILES = 100;
+  private static final int MAX_FILES = 100;
 
   @CliCommand(value = "stats wa", help = "Write Amplification. Ratio of how 
many records were upserted to how many "
   + "records were actually written")
@@ -97,7 +97,7 @@ public class StatsCommand implements CommandMarker {
 return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, 
descending, limit, headerOnly, rows);
   }
 
-  public Comparable[] printFileSizeHistogram(String commitTime, Snapshot s) {
+  private Comparable[] printFileSizeHistogram(String commitTime, Snapshot s) {
 return new Comparable[] {commitTime, s.getMin(), s.getValue(0.1), 
s.getMedian(), s.getMean(), s.get95thPercentile(),
 s.getMax(), s.size(), s.getStdDev()};
   }
diff --git 
a/hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientBase.java 
b/hudi-client/src/test/java/org/apache/hudi/client/TestHoodieClientBase.java
index 6e6458b..6856489 100644
--- 

[hudi] 01/04: [HUDI-988] Fix Unit Test Flakiness: Ensure all instantiations of HoodieWriteClient are closed properly. Fix bug in TestRollbacks. Make CLI unit tests for Hudi CLI check skip rendering strings

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 6dcd0a3524fe7be0bbbd3e673ed7e1d4b035e0cb
Author: Balaji Varadarajan 
AuthorDate: Tue Jun 2 01:49:37 2020 -0700

[HUDI-988] Fix Unit Test Flakiness: Ensure all instantiations of 
HoodieWriteClient are closed properly. Fix bug in TestRollbacks. Make CLI unit 
tests for Hudi CLI check skip rendering strings
---
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  16 +
 .../org/apache/hudi/cli/commands/StatsCommand.java |   4 +-
 .../cli/commands/AbstractShellIntegrationTest.java |   2 +-
 .../hudi/cli/commands/TestRepairsCommand.java  | 206 -
 .../org/apache/hudi/client/HoodieWriteClient.java  |   2 +-
 .../apache/hudi/client/TestHoodieClientBase.java   | 938 ++---
 .../java/org/apache/hudi/client/TestMultiFS.java   |   4 -
 .../hudi/client/TestUpdateSchemaEvolution.java |   4 +-
 .../hudi/common/HoodieClientTestHarness.java   | 426 +-
 .../hudi/index/TestHBaseQPSResourceAllocator.java  |   2 +-
 .../java/org/apache/hudi/index/TestHbaseIndex.java |  17 +-
 .../org/apache/hudi/index/TestHoodieIndex.java |   2 +-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   2 +-
 .../index/bloom/TestHoodieGlobalBloomIndex.java|   2 +-
 .../org/apache/hudi/io/TestHoodieMergeHandle.java  |  12 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java|   5 +-
 .../apache/hudi/table/TestMergeOnReadTable.java|  38 +-
 .../hudi/table/compact/TestHoodieCompactor.java|  12 +-
 pom.xml|   1 +
 19 files changed, 745 insertions(+), 950 deletions(-)

diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
index 2e3bc01..708ae29 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/HoodieTableHeaderFields.java
@@ -33,4 +33,20 @@ public class HoodieTableHeaderFields {
   public static final String HEADER_HOODIE_PROPERTY = "Property";
   public static final String HEADER_OLD_VALUE = "Old Value";
   public static final String HEADER_NEW_VALUE = "New Value";
+
+  /**
+   * Fields of Stats.
+   */
+  public static final String HEADER_COMMIT_TIME = "CommitTime";
+  public static final String HEADER_TOTAL_UPSERTED = "Total Upserted";
+  public static final String HEADER_TOTAL_WRITTEN = "Total Written";
+  public static final String HEADER_WRITE_AMPLIFICATION_FACTOR = "Write 
Amplification Factor";
+  public static final String HEADER_HISTOGRAM_MIN = "Min";
+  public static final String HEADER_HISTOGRAM_10TH = "10th";
+  public static final String HEADER_HISTOGRAM_50TH = "50th";
+  public static final String HEADER_HISTOGRAM_AVG = "avg";
+  public static final String HEADER_HISTOGRAM_95TH = "95th";
+  public static final String HEADER_HISTOGRAM_MAX = "Max";
+  public static final String HEADER_HISTOGRAM_NUM_FILES = "NumFiles";
+  public static final String HEADER_HISTOGRAM_STD_DEV = "StdDev";
 }
diff --git 
a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java 
b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
index b05aee2..4874777 100644
--- a/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
+++ b/hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
@@ -54,7 +54,7 @@ import java.util.stream.Collectors;
 @Component
 public class StatsCommand implements CommandMarker {
 
-  private static final int MAX_FILES = 100;
+  public static final int MAX_FILES = 100;
 
   @CliCommand(value = "stats wa", help = "Write Amplification. Ratio of how 
many records were upserted to how many "
   + "records were actually written")
@@ -97,7 +97,7 @@ public class StatsCommand implements CommandMarker {
 return HoodiePrintHelper.print(header, new HashMap<>(), sortByField, 
descending, limit, headerOnly, rows);
   }
 
-  private Comparable[] printFileSizeHistogram(String commitTime, Snapshot s) {
+  public Comparable[] printFileSizeHistogram(String commitTime, Snapshot s) {
 return new Comparable[] {commitTime, s.getMin(), s.getValue(0.1), 
s.getMedian(), s.getMean(), s.get95thPercentile(),
 s.getMax(), s.size(), s.getStdDev()};
   }
diff --git 
a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/AbstractShellIntegrationTest.java
 
b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/AbstractShellIntegrationTest.java
index ad81af5..d9f1688 100644
--- 
a/hudi-cli/src/test/java/org/apache/hudi/cli/commands/AbstractShellIntegrationTest.java
+++ 
b/hudi-cli/src/test/java/org/apache/hudi/cli/commands/AbstractShellIntegrationTest.java
@@ -58,4 +58,4 @@ public abstract class AbstractShellIntegrationTest extends 
HoodieClientTestHarne
   protected static JLineShellComponent getShell() {
 

[hudi] branch release-0.5.3 updated (5fcc461 -> 864a7cd)

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git.


omit 5fcc461  Bumping release candidate number 1
 new 6dcd0a3  [HUDI-988] Fix Unit Test Flakiness: Ensure all 
instantiations of HoodieWriteClient are closed properly. Fix bug in 
TestRollbacks. Make CLI unit tests for Hudi CLI check skip rendering strings
 new ae48ecb  [HUDI-990] Timeline API : 
filterCompletedAndCompactionInstants needs to handle requested state correctly. 
Also ensure timeline gets reloaded after we revert committed transactions
 new 84ca5b0  Making a few fixes after cherry picking
 new 864a7cd  [HUDI-988] Fix More Unit Test Flakiness

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (5fcc461)
\
 N -- N -- N   refs/heads/release-0.5.3 (864a7cd)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 docker/hoodie/hadoop/base/pom.xml  |   2 +-
 docker/hoodie/hadoop/datanode/pom.xml  |   2 +-
 docker/hoodie/hadoop/historyserver/pom.xml |   2 +-
 docker/hoodie/hadoop/hive_base/pom.xml |   2 +-
 docker/hoodie/hadoop/namenode/pom.xml  |   2 +-
 docker/hoodie/hadoop/pom.xml   |   2 +-
 docker/hoodie/hadoop/prestobase/pom.xml|   2 +-
 docker/hoodie/hadoop/spark_base/pom.xml|   2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml|   2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml   |   2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml   |   2 +-
 hudi-cli/pom.xml   |   2 +-
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  36 
 .../cli/commands/AbstractShellIntegrationTest.java |   2 +-
 .../hudi/cli/commands/TestRepairsCommand.java  | 206 -
 hudi-client/pom.xml|   2 +-
 .../org/apache/hudi/client/HoodieWriteClient.java  |   2 +-
 .../client/embedded/EmbeddedTimelineService.java   |   4 +-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  |   2 +
 .../apache/hudi/table/HoodieMergeOnReadTable.java  |   2 +
 .../hudi/client/TestCompactionAdminClient.java |  35 ++--
 .../apache/hudi/client/TestHoodieClientBase.java   | 187 +--
 .../java/org/apache/hudi/client/TestMultiFS.java   |   8 +-
 .../hudi/client/TestUpdateSchemaEvolution.java |   5 +-
 .../hudi/common/HoodieClientTestHarness.java   | 101 +++---
 .../execution/TestBoundedInMemoryExecutor.java |   2 +-
 .../hudi/execution/TestBoundedInMemoryQueue.java   |   3 +-
 .../hudi/index/TestHBaseQPSResourceAllocator.java  |   2 +-
 .../java/org/apache/hudi/index/TestHbaseIndex.java |  17 +-
 .../org/apache/hudi/index/TestHoodieIndex.java |  29 ++-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   4 +-
 .../index/bloom/TestHoodieGlobalBloomIndex.java|   5 +-
 .../apache/hudi/io/TestHoodieCommitArchiveLog.java |   3 +-
 .../org/apache/hudi/io/TestHoodieMergeHandle.java  |  14 +-
 .../apache/hudi/table/TestConsistencyGuard.java|   2 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java|   6 +-
 .../apache/hudi/table/TestMergeOnReadTable.java| 145 +++
 .../hudi/table/compact/TestAsyncCompaction.java|   2 +-
 .../hudi/table/compact/TestHoodieCompactor.java|  17 +-
 hudi-common/pom.xml|   2 +-
 .../table/timeline/HoodieDefaultTimeline.java  |   2 +-
 .../table/view/FileSystemViewStorageConfig.java|  21 +++
 .../table/view/HoodieTableFileSystemView.java  |   6 +
 .../table/string/TestHoodieActiveTimeline.java |   2 +-
 hudi-hadoop-mr/pom.xml |   2 +-
 hudi-hive/pom.xml  |   2 +-
 hudi-integ-test/pom.xml|   2 +-
 hudi-spark/pom.xml |   2 +-
 hudi-timeline-service/pom.xml  |   2 +-
 .../timeline/service/FileSystemViewHandler.java|   2 +-
 hudi-utilities/pom.xml |   2 +-
 packaging/hudi-hadoop-mr-bundle/pom.xml|   2 +-
 packaging/hudi-hive-bundle/pom.xml |  

[hudi] 04/04: [HUDI-988] Fix More Unit Test Flakiness

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit 864a7cd880cf80aac056aac0658ee94f53b36ac9
Author: garyli1019 
AuthorDate: Fri Jun 5 17:25:59 2020 -0700

[HUDI-988] Fix More Unit Test Flakiness
---
 .../hudi/client/TestCompactionAdminClient.java |  35 +++
 .../java/org/apache/hudi/client/TestMultiFS.java   |   4 +-
 .../hudi/client/TestUpdateSchemaEvolution.java |   3 +-
 .../hudi/common/HoodieClientTestHarness.java   |  69 +++---
 .../execution/TestBoundedInMemoryExecutor.java |   2 +-
 .../hudi/execution/TestBoundedInMemoryQueue.java   |   3 +-
 .../org/apache/hudi/index/TestHoodieIndex.java |  29 +-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   4 +-
 .../index/bloom/TestHoodieGlobalBloomIndex.java|   5 +-
 .../apache/hudi/io/TestHoodieCommitArchiveLog.java |   3 +-
 .../org/apache/hudi/io/TestHoodieMergeHandle.java  |   6 +-
 .../apache/hudi/table/TestConsistencyGuard.java|   2 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java|   4 +-
 .../apache/hudi/table/TestMergeOnReadTable.java| 104 ++---
 .../hudi/table/compact/TestAsyncCompaction.java|   2 +-
 .../hudi/table/compact/TestHoodieCompactor.java|   5 +-
 .../table/view/HoodieTableFileSystemView.java  |   6 ++
 .../timeline/service/FileSystemViewHandler.java|   2 +-
 18 files changed, 162 insertions(+), 126 deletions(-)

diff --git a/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java b/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
index 8e94857..b82863f 100644
--- a/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
+++ b/hudi-client/src/test/java/org/apache/hudi/client/TestCompactionAdminClient.java
@@ -33,9 +33,9 @@ import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieException;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.table.compact.OperationResult;
+
 import org.apache.log4j.LogManager;
 import org.apache.log4j.Logger;
-import org.junit.After;
 import org.junit.Assert;
 import org.junit.Before;
 import org.junit.Test;
@@ -67,13 +67,6 @@ public class TestCompactionAdminClient extends TestHoodieClientBase {
 client = new CompactionAdminClient(jsc, basePath);
   }
 
-  @After
-  public void tearDown() {
-client.close();
-metaClient = null;
-cleanupSparkContexts();
-  }
-
   @Test
   public void testUnscheduleCompactionPlan() throws Exception {
 int numEntriesPerInstant = 10;
@@ -142,13 +135,13 @@ public class TestCompactionAdminClient extends TestHoodieClientBase {
     List<Pair<HoodieLogFile, HoodieLogFile>> undoFiles =
         result.stream().flatMap(r -> getRenamingActionsToAlignWithCompactionOperation(metaClient,
             compactionInstant, r.getOperation(), Option.empty()).stream()).map(rn -> {
-          try {
-            renameLogFile(metaClient, rn.getKey(), rn.getValue());
-          } catch (IOException e) {
-            throw new HoodieIOException(e.getMessage(), e);
-          }
-          return rn;
-        }).collect(Collectors.toList());
+              try {
+                renameLogFile(metaClient, rn.getKey(), rn.getValue());
+              } catch (IOException e) {
+                throw new HoodieIOException(e.getMessage(), e);
+              }
+              return rn;
+            }).collect(Collectors.toList());
     Map<String, String> renameFilesFromUndo = undoFiles.stream()
         .collect(Collectors.toMap(p -> p.getRight().getPath().toString(), x -> x.getLeft().getPath().toString()));
     Map<String, String> expRenameFiles = renameFiles.stream()
@@ -274,9 +267,9 @@ public class TestCompactionAdminClient extends TestHoodieClientBase {
     // Expect all file-slice whose base-commit is same as compaction commit to contain no new Log files
     newFsView.getLatestFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0], compactionInstant, true)
         .filter(fs -> fs.getBaseInstantTime().equals(compactionInstant)).forEach(fs -> {
-      Assert.assertFalse("No Data file must be present", fs.getBaseFile().isPresent());
-      Assert.assertEquals("No Log Files", 0, fs.getLogFiles().count());
-    });
+          Assert.assertFalse("No Data file must be present", fs.getBaseFile().isPresent());
+          Assert.assertEquals("No Log Files", 0, fs.getLogFiles().count());
+        });
 
     // Ensure same number of log-files before and after renaming per fileId
     Map<String, Long> fileIdToCountsAfterRenaming =
@@ -335,9 +328,9 @@ public class TestCompactionAdminClient extends TestHoodieClientBase {
     newFsView.getLatestFileSlicesBeforeOrOn(HoodieTestUtils.DEFAULT_PARTITION_PATHS[0], compactionInstant, true)
         .filter(fs -> fs.getBaseInstantTime().equals(compactionInstant))
         .filter(fs -> fs.getFileId().equals(op.getFileId())).forEach(fs 

[hudi] 02/04: [HUDI-990] Timeline API : filterCompletedAndCompactionInstants needs to handle requested state correctly. Also ensure timeline gets reloaded after we revert committed transactions

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git

commit ae48ecbe232eb55267d1a138baeec13baa1fb249
Author: Balaji Varadarajan 
AuthorDate: Wed Jun 3 00:35:14 2020 -0700

[HUDI-990] Timeline API : filterCompletedAndCompactionInstants needs to 
handle requested state correctly. Also ensure timeline gets reloaded after we 
revert committed transactions
---
 .../client/embedded/EmbeddedTimelineService.java|  4 +++-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java   |  2 ++
 .../apache/hudi/table/HoodieMergeOnReadTable.java   |  2 ++
 .../org/apache/hudi/table/TestMergeOnReadTable.java |  3 +++
 .../table/timeline/HoodieDefaultTimeline.java   |  2 +-
 .../table/view/FileSystemViewStorageConfig.java | 21 +
 6 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/hudi-client/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java b/hudi-client/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
index 5afee3f..c7c4f7b 100644
--- a/hudi-client/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
+++ b/hudi-client/src/main/java/org/apache/hudi/client/embedded/EmbeddedTimelineService.java
@@ -89,7 +89,9 @@ public class EmbeddedTimelineService {
    * Retrieves proper view storage configs for remote clients to access this service.
    */
   public FileSystemViewStorageConfig getRemoteFileSystemViewConfig() {
-    return FileSystemViewStorageConfig.newBuilder().withStorageType(FileSystemViewStorageType.REMOTE_FIRST)
+    FileSystemViewStorageType viewStorageType = config.shouldEnableBackupForRemoteFileSystemView()
+        ? FileSystemViewStorageType.REMOTE_FIRST : FileSystemViewStorageType.REMOTE_ONLY;
+    return FileSystemViewStorageConfig.newBuilder().withStorageType(viewStorageType)
         .withRemoteServerHost(hostAddr).withRemoteServerPort(serverPort).build();
   }
 
diff --git a/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java b/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
index 4c91c77..c74af2d 100644
--- a/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
+++ b/hudi-client/src/main/java/org/apache/hudi/table/HoodieCopyOnWriteTable.java
@@ -359,6 +359,8 @@ public class HoodieCopyOnWriteTable extends HoodieTable
 if (instant.isCompleted()) {
   LOG.info("Unpublishing instant " + instant);
   instant = activeTimeline.revertToInflight(instant);
+  // reload meta-client to reflect latest timeline status
+  metaClient.reloadActiveTimeline();
 }
 
 // For Requested State (like failure during index lookup), there is 
nothing to do rollback other than
diff --git a/hudi-client/src/main/java/org/apache/hudi/table/HoodieMergeOnReadTable.java b/hudi-client/src/main/java/org/apache/hudi/table/HoodieMergeOnReadTable.java
index 938a5fd..5f56369 100644
--- a/hudi-client/src/main/java/org/apache/hudi/table/HoodieMergeOnReadTable.java
+++ b/hudi-client/src/main/java/org/apache/hudi/table/HoodieMergeOnReadTable.java
@@ -179,6 +179,8 @@ public class HoodieMergeOnReadTable extends HoodieCopyOnWriteTable
 if (instant.isCompleted()) {
   LOG.error("Un-publishing instant " + instant + ", deleteInstants=" + 
deleteInstants);
   instant = this.getActiveTimeline().revertToInflight(instant);
+  // reload meta-client to reflect latest timeline status
+  metaClient.reloadActiveTimeline();
 }
 
 List allRollbackStats = new ArrayList<>();
diff --git a/hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java b/hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
index fdc968d..9f3eaea 100644
--- a/hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
+++ b/hudi-client/src/test/java/org/apache/hudi/table/TestMergeOnReadTable.java
@@ -44,6 +44,7 @@ import org.apache.hudi.common.table.TableFileSystemView.SliceView;
 import org.apache.hudi.common.table.timeline.HoodieActiveTimeline;
 import org.apache.hudi.common.table.timeline.HoodieInstant;
 import org.apache.hudi.common.table.timeline.HoodieInstant.State;
+import org.apache.hudi.common.table.view.FileSystemViewStorageConfig;
 import org.apache.hudi.common.table.view.HoodieTableFileSystemView;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.config.HoodieCompactionConfig;
@@ -1219,6 +1220,8 @@ public class TestMergeOnReadTable extends HoodieClientTestHarness {
             .withInlineCompaction(false).withMaxNumDeltaCommitsBeforeCompaction(1).build())
         .withStorageConfig(HoodieStorageConfig.newBuilder().limitFileSize(1024 * 1024 * 1024).build())
         .withEmbeddedTimelineServerEnabled(true).forTable("test-trip-table")
+        .withFileSystemViewConfig(new FileSystemViewStorageConfig.Builder()
+

[hudi] branch master updated: HUDI-515 Resolve API conflict for Hive 2 & Hive 3

2020-06-08 Thread nagarwal
This is an automated email from the ASF dual-hosted git repository.

nagarwal pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 7d40f19  HUDI-515 Resolve API conflict for Hive 2 & Hive 3
7d40f19 is described below

commit 7d40f19f395460bc78a12ac452831a9f0393ab49
Author: Wenning Ding 
AuthorDate: Sat May 16 13:39:53 2020 -0700

HUDI-515 Resolve API conflict for Hive 2 & Hive 3
---
 .../hadoop/hive/HoodieCombineHiveInputFormat.java  | 43 ++
 .../hudi/hive/testutils/HiveTestService.java   |  4 +-
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java
index 9f024e9..fd7ffb2 100644
--- a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java
+++ b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/hive/HoodieCombineHiveInputFormat.java
@@ -70,6 +70,7 @@ import org.apache.log4j.Logger;
 import java.io.DataInput;
 import java.io.DataOutput;
 import java.io.IOException;
+import java.lang.reflect.Method;
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.Collections;
@@ -137,7 +138,7 @@ public class HoodieCombineHiveInputFormat
     Set<String> poolSet = new HashSet<>();
 
     for (Path path : paths) {
-      PartitionDesc part = HiveFileFormatUtils.getPartitionDescFromPathRecursively(pathToPartitionInfo, path,
+      PartitionDesc part = getPartitionFromPath(pathToPartitionInfo, path,
           IOPrepareCache.get().allocatePartitionDescMap());
       TableDesc tableDesc = part.getTableDesc();
       if ((tableDesc != null) && tableDesc.isNonNative()) {
@@ -376,6 +377,34 @@ public class HoodieCombineHiveInputFormat
+  public static PartitionDesc getPartitionFromPath(Map<Path, PartitionDesc> pathToPartitionInfo, Path dir,
+      Map<Map<Path, PartitionDesc>, Map<Path, PartitionDesc>> cacheMap)
+      throws IOException {
+    Method method;
+    try {
+      Class<?> hiveUtilsClass = Class.forName("org.apache.hadoop.hive.ql.io.HiveFileFormatUtils");
+      try {
+        // HiveFileFormatUtils.getPartitionDescFromPathRecursively method only available in Hive 2.x
+        method = hiveUtilsClass.getMethod("getPartitionDescFromPathRecursively", Map.class, Path.class, Map.class);
+      } catch (NoSuchMethodException e) {
+        // HiveFileFormatUtils.getFromPathRecursively method only available in Hive 3.x
+        method = hiveUtilsClass.getMethod("getFromPathRecursively", Map.class, Path.class, Map.class);
+      }
+      return (PartitionDesc) method.invoke(null, pathToPartitionInfo, dir, cacheMap);
+    } catch (ReflectiveOperationException e) {
+      throw new IOException(e);
+    }
+  }
+
   /**
    * MOD - Just added this for visibility.
    */
   Path[] getInputPaths(JobConf job) throws IOException {
@@ -568,8 +597,8 @@ public class HoodieCombineHiveInputFormat
     if (ipaths.length > 0) {
-      PartitionDesc part = HiveFileFormatUtils.getPartitionDescFromPathRecursively(this.pathToPartitionInfo,
-          ipaths[0], IOPrepareCache.get().getPartitionDescMap());
+      PartitionDesc part = getPartitionFromPath(this.pathToPartitionInfo, ipaths[0],
+          IOPrepareCache.get().getPartitionDescMap());
       inputFormatClassName = part.getInputFileFormatClass().getName();
     }
   }
@@ -703,8 +732,8 @@ public class HoodieCombineHiveInputFormat
     public Set<Integer> call() throws Exception {
       Set<Integer> nonCombinablePathIndices = new HashSet<Integer>();
       for (int i = 0; i < length; i++) {
-        PartitionDesc part = HiveFileFormatUtils.getPartitionDescFromPathRecursively(pathToPartitionInfo,
-            paths[i + start], IOPrepareCache.get().allocatePartitionDescMap());
+        PartitionDesc part = getPartitionFromPath(pathToPartitionInfo, paths[i + start],
+            IOPrepareCache.get().allocatePartitionDescMap());
         // Use HiveInputFormat if any of the paths is not splittable
         Class inputFormatClass = part.getInputFileFormatClass();
         InputFormat inputFormat = getInputFormatFromCache(inputFormatClass, conf);
diff --git a/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestService.java b/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestService.java
index e325094..2f98803 100644
--- a/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestService.java
+++ b/hudi-hive-sync/src/test/java/org/apache/hudi/hive/testutils/HiveTestService.java
@@ -27,6 +27,7 @@ import org.apache.hadoop.hive.conf.HiveConf;
 import org.apache.hadoop.hive.metastore.HiveMetaStore;
 import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
 import org.apache.hadoop.hive.metastore.IHMSHandler;
+import org.apache.hadoop.hive.metastore.RetryingHMSHandler;
 import org.apache.hadoop.hive.metastore.TSetIpAddressProcessor;
 import org.apache.hadoop.hive.metastore.TUGIBasedProcessor;
 import 
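
The reflection fallback above can be reduced to a small reusable pattern. A minimal standalone sketch, with hypothetical parameter names standing in for the Hive 2.x/3.x variants of HiveFileFormatUtils:

    import java.lang.reflect.Method;

    public final class ReflectiveCompat {
      // Try the Hive 2.x method name first and fall back to the Hive 3.x name,
      // resolving whichever variant the runtime classpath actually provides.
      public static Method resolve(String className, String hive2Name, String hive3Name, Class<?>... signature)
          throws ReflectiveOperationException {
        Class<?> clazz = Class.forName(className);
        try {
          return clazz.getMethod(hive2Name, signature);
        } catch (NoSuchMethodException e) {
          return clazz.getMethod(hive3Name, signature);
        }
      }
    }

Caching the resolved Method in a static field would avoid repeating the lookup on every call; the commit above performs the lookup inline each time.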

[hudi] branch release-0.5.3 updated (41fb6c2 -> 5fcc461)

2020-06-08 Thread sivabalan
This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a change to branch release-0.5.3
in repository https://gitbox.apache.org/repos/asf/hudi.git.


 discard 41fb6c2  Bumping release candidate number 2
 discard d3afcba  Making a few fixes after cherry picking
 discard ae48ecb  [HUDI-990] Timeline API : filterCompletedAndCompactionInstants needs to handle requested state correctly. Also ensure timeline gets reloaded after we revert committed transactions
 discard 6dcd0a3  [HUDI-988] Fix Unit Test Flakiness : Ensure all instantiations of HoodieWriteClient are closed properly. Fix bug in TestRollbacks. Make CLI unit tests for Hudi CLI check skip rendering strings
     add 5fcc461  Bumping release candidate number 1

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (41fb6c2)
            \
             N -- N -- N   refs/heads/release-0.5.3 (5fcc461)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

No new revisions were added by this update.

Summary of changes:
 docker/hoodie/hadoop/base/pom.xml  |   2 +-
 docker/hoodie/hadoop/datanode/pom.xml  |   2 +-
 docker/hoodie/hadoop/historyserver/pom.xml |   2 +-
 docker/hoodie/hadoop/hive_base/pom.xml |   2 +-
 docker/hoodie/hadoop/namenode/pom.xml  |   2 +-
 docker/hoodie/hadoop/pom.xml   |   2 +-
 docker/hoodie/hadoop/prestobase/pom.xml|   2 +-
 docker/hoodie/hadoop/spark_base/pom.xml|   2 +-
 docker/hoodie/hadoop/sparkadhoc/pom.xml|   2 +-
 docker/hoodie/hadoop/sparkmaster/pom.xml   |   2 +-
 docker/hoodie/hadoop/sparkworker/pom.xml   |   2 +-
 hudi-cli/pom.xml   |   2 +-
 .../apache/hudi/cli/HoodieTableHeaderFields.java   |  16 --
 .../org/apache/hudi/cli/commands/StatsCommand.java |   4 +-
 .../cli/commands/AbstractShellIntegrationTest.java |   2 +-
 .../hudi/cli/commands/TestRepairsCommand.java  | 206 +
 hudi-client/pom.xml|   2 +-
 .../org/apache/hudi/client/HoodieWriteClient.java  |   2 +-
 .../client/embedded/EmbeddedTimelineService.java   |   4 +-
 .../apache/hudi/table/HoodieCopyOnWriteTable.java  |   2 -
 .../apache/hudi/table/HoodieMergeOnReadTable.java  |   2 -
 .../apache/hudi/client/TestHoodieClientBase.java   | 187 ++-
 .../java/org/apache/hudi/client/TestMultiFS.java   |   4 +
 .../hudi/client/TestUpdateSchemaEvolution.java |   4 +-
 .../hudi/common/HoodieClientTestHarness.java   |  54 ++
 .../hudi/index/TestHBaseQPSResourceAllocator.java  |   2 +-
 .../java/org/apache/hudi/index/TestHbaseIndex.java |  17 +-
 .../org/apache/hudi/index/TestHoodieIndex.java |   2 +-
 .../hudi/index/bloom/TestHoodieBloomIndex.java |   2 +-
 .../index/bloom/TestHoodieGlobalBloomIndex.java|   2 +-
 .../org/apache/hudi/io/TestHoodieMergeHandle.java  |  12 +-
 .../apache/hudi/table/TestCopyOnWriteTable.java|   5 +-
 .../apache/hudi/table/TestMergeOnReadTable.java|  43 +++--
 .../hudi/table/compact/TestHoodieCompactor.java|  14 +-
 hudi-common/pom.xml|   2 +-
 .../table/timeline/HoodieDefaultTimeline.java  |   2 +-
 .../table/view/FileSystemViewStorageConfig.java|  21 ---
 .../table/string/TestHoodieActiveTimeline.java |   2 +-
 hudi-hadoop-mr/pom.xml |   2 +-
 hudi-hive/pom.xml  |   2 +-
 hudi-integ-test/pom.xml|   2 +-
 hudi-spark/pom.xml |   2 +-
 hudi-timeline-service/pom.xml  |   2 +-
 hudi-utilities/pom.xml |   2 +-
 packaging/hudi-hadoop-mr-bundle/pom.xml|   2 +-
 packaging/hudi-hive-bundle/pom.xml |   2 +-
 packaging/hudi-presto-bundle/pom.xml   |   2 +-
 packaging/hudi-spark-bundle/pom.xml|   2 +-
 packaging/hudi-timeline-server-bundle/pom.xml  |   2 +-
 packaging/hudi-utilities-bundle/pom.xml|   2 +-
 pom.xml|   3 +-
 51 files changed, 419 insertions(+), 247 deletions(-)
 create mode 100644 
hudi-cli/src/test/java/org/apache/hudi/cli/commands/TestRepairsCommand.java



[jira] [Created] (HUDI-1012) add test for snapshot reads

2020-06-08 Thread satish (Jira)
satish created HUDI-1012:


 Summary: add test for snapshot reads
 Key: HUDI-1012
 URL: https://issues.apache.org/jira/browse/HUDI-1012
 Project: Apache Hudi
  Issue Type: Test
Reporter: satish
Assignee: satish


For MOR tables, there are tests for incremental reads, but tests are missing 
for snapshot reads. Add additional tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-1011.


> Refactor hudi-client unit tests structure
> -
>
> Key: HUDI-1011
> URL: https://issues.apache.org/jira/browse/HUDI-1011
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
>
> The hudi-client unit tests are the most time-consuming test module and are not 
> stable. We initialize and clean up resources for every single unit test, 
> which is inefficient. We can refactor the hudi-client test structure to run 
> more tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu resolved HUDI-1011.
--
Resolution: Duplicate

> Refactor hudi-client unit tests structure
> -
>
> Key: HUDI-1011
> URL: https://issues.apache.org/jira/browse/HUDI-1011
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
>
> The hudi-client unit tests are the most time-consuming test module and are not 
> stable. We initialize and clean up resources for every single unit test, 
> which is inefficient. We can refactor the hudi-client test structure to run 
> more tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-996) Use shared spark session provider

2020-06-08 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128497#comment-17128497
 ] 

Raymond Xu commented on HUDI-996:
-

Notes by [~garyli1019]

The hudi-client unit tests are the most time-consuming test module and are not 
stable. We initialize and clean up resources for every single unit test, which is 
inefficient. We can refactor the hudi-client test structure to run more tests 
per initialization.

> Use shared spark session provider 
> --
>
> Key: HUDI-996
> URL: https://issues.apache.org/jira/browse/HUDI-996
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Raymond Xu
>Priority: Major
>
> * implement a shared spark session provider to be used by test suites, setting up 
> and tearing down fewer spark sessions and other mini servers (see the sketch below)
>  * add functional tests with setup logic similar to the test suites, to make use 
> of the shared spark session
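
A minimal sketch of what such a shared provider could look like; the class and method names here are hypothetical, not the project's API:

    import org.apache.spark.sql.SparkSession;

    public final class SharedSparkSessionProvider {
      private static SparkSession session;

      // Lazily create a single local session and hand the same instance to every
      // suite, instead of building and stopping one session per test.
      public static synchronized SparkSession get() {
        if (session == null) {
          session = SparkSession.builder()
              .master("local[2]")
              .appName("shared-test-session")
              .getOrCreate();
        }
        return session;
      }
    }

A JVM shutdown hook (or a suite-level teardown) would still need to stop the session once at the end of the run.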



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-1010:
-
Parent: HUDI-781
Issue Type: Sub-task  (was: Bug)

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. The memory consumption accumulates 
> over time and leads to Travis CI failures. 
> Using the IntelliJ memory analysis tool, we can see that the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PR: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1011:
-
Status: Open  (was: New)

> Refactor hudi-client unit tests structure
> -
>
> Key: HUDI-1011
> URL: https://issues.apache.org/jira/browse/HUDI-1011
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
>
> The hudi-client unit tests are the most time-consuming test module and are not 
> stable. We initialize and clean up resources for every single unit test, 
> which is inefficient. We can refactor the hudi-client test structure to run 
> more tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1011:
-
Component/s: Testing

> Refactor hudi-client unit tests structure
> -
>
> Key: HUDI-1011
> URL: https://issues.apache.org/jira/browse/HUDI-1011
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
>
> The hudi-client unit tests are the most time-consuming test module and are not 
> stable. We initialize and clean up resources for every single unit test, 
> which is inefficient. We can refactor the hudi-client test structure to run 
> more tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Status: Open  (was: New)

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. The memory consumption accumulates 
> over time and leads to Travis CI failures. 
> Using the IntelliJ memory analysis tool, we can see that the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PR: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Component/s: Testing

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Testing
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. The memory consumption accumulates 
> over time and leads to Travis CI failures. 
> Using the IntelliJ memory analysis tool, we can see that the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PR: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1011:
-
Labels: help-wanted  (was: )

> Refactor hudi-client unit tests structure
> -
>
> Key: HUDI-1011
> URL: https://issues.apache.org/jira/browse/HUDI-1011
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
>
> The hudi-client unit tests are the most time-consuming test module and are not 
> stable. We initialize and clean up resources for every single unit test, 
> which is inefficient. We can refactor the hudi-client test structure to run 
> more tests per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1011) Refactor hudi-client unit tests structure

2020-06-08 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1011:


 Summary: Refactor hudi-client unit tests structure
 Key: HUDI-1011
 URL: https://issues.apache.org/jira/browse/HUDI-1011
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Yanjia Gary Li


The hudi-client unit tests are the most time-consuming test module and are not stable. 
We initialize and clean up resources for every single unit test, which is 
inefficient. We can refactor the hudi-client test structure to run more tests 
per initialization.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Description: 
The hudi-client unit tests have a memory leak, likely because some resources are not 
properly released during cleanup. The memory consumption accumulates 
over time and leads to Travis CI failures. 

Using the IntelliJ memory analysis tool, we can see that the major leaks were 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.

Related PR: [https://github.com/apache/hudi/pull/1707]

[https://github.com/apache/hudi/pull/1697]

!image-2020-06-08-09-22-08-864.png!

  was:
hudi-client unit test has a memory leak, which could be some resources are not 
properly released during the cleanup. The memory consumption was accumulating 
over time and lead to the Travis CI failure. 

By using the IntelliJ memory analysis tool, we can find the major leak was 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, e.t.c

!image-2020-06-08-09-22-08-864.png!


> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. The memory consumption accumulates 
> over time and leads to Travis CI failures. 
> Using the IntelliJ memory analysis tool, we can see that the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> Related PR: [https://github.com/apache/hudi/pull/1707]
> [https://github.com/apache/hudi/pull/1697]
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Description: 
The hudi-client unit tests have a memory leak, likely because some resources are not 
properly released during cleanup. The memory consumption accumulates 
over time and leads to Travis CI failures. 

Using the IntelliJ memory analysis tool, we can see that the major leaks were 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.

!image-2020-06-08-09-22-08-864.png!

  was:
hudi-client unit test has a memory leak, which could be some resources are not 
released during the cleanup. The memory consumption was accumulating overtime 
and lead the Travis CI failure. 

By using the IntelliJ memory analysis tool, we can find the major leak was 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, e.t.c

!image-2020-06-08-09-22-08-864.png!


> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not properly released during cleanup. The memory consumption accumulates 
> over time and leads to Travis CI failures. 
> Using the IntelliJ memory analysis tool, we can see that the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1010:
-
Labels: help-wanted  (was: )

> Fix the memory leak for hudi-client unit tests
> --
>
> Key: HUDI-1010
> URL: https://issues.apache.org/jira/browse/HUDI-1010
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Yanjia Gary Li
>Priority: Major
>  Labels: help-wanted
> Attachments: image-2020-06-08-09-22-08-864.png
>
>
> The hudi-client unit tests have a memory leak, likely because some resources 
> are not released during cleanup. The memory consumption accumulates over time 
> and leads to Travis CI failures. 
> Using the IntelliJ memory analysis tool, we can see that the major leaks were 
> HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.
> !image-2020-06-08-09-22-08-864.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1010) Fix the memory leak for hudi-client unit tests

2020-06-08 Thread Yanjia Gary Li (Jira)
Yanjia Gary Li created HUDI-1010:


 Summary: Fix the memory leak for hudi-client unit tests
 Key: HUDI-1010
 URL: https://issues.apache.org/jira/browse/HUDI-1010
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Yanjia Gary Li
 Attachments: image-2020-06-08-09-22-08-864.png

The hudi-client unit tests have a memory leak, likely because some resources are not 
released during cleanup. The memory consumption accumulates over time 
and leads to Travis CI failures. 

Using the IntelliJ memory analysis tool, we can see that the major leaks were 
HoodieLogFormatWriter, HoodieWrapperFileSystem, HoodieLogFileReader, etc.

!image-2020-06-08-09-22-08-864.png!
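
A minimal sketch of the kind of teardown discipline that avoids this accumulation; the class and field names are illustrative stand-ins, not the harness's actual API:

    import org.junit.After;

    public abstract class ExampleClientTestBase {
      // Stand-ins for the leaked resources named above (write clients, wrapper
      // file systems, log readers/writers are all close-style objects).
      protected AutoCloseable writeClient;
      protected AutoCloseable logWriter;

      @After
      public void tearDownResources() throws Exception {
        // Close and null out every resource after each test so nothing keeps
        // accumulating across the run and exhausting CI memory.
        if (logWriter != null) {
          logWriter.close();
          logWriter = null;
        }
        if (writeClient != null) {
          writeClient.close();
          writeClient = null;
        }
      }
    }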



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1009) Handle insert for recordkey is not unique

2020-06-08 Thread liwei (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128410#comment-17128410
 ] 

liwei commented on HUDI-1009:
-

Duplicate of https://issues.apache.org/jira/browse/HUDI-1008.

> Handle insert for recordkey is not unique
> -
>
> Key: HUDI-1009
> URL: https://issues.apache.org/jira/browse/HUDI-1009
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> Currently, when using Hudi, hoodie.datasource.write.recordkey.field must be set. If the 
> key is not unique, inserted data with the same key will be unpredictable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-1009) Handle insert for recordkey is not unique

2020-06-08 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei closed HUDI-1009.
---
Resolution: Duplicate

> Handle insert for recordkey is not unique
> -
>
> Key: HUDI-1009
> URL: https://issues.apache.org/jira/browse/HUDI-1009
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> Currently, when using Hudi, hoodie.datasource.write.recordkey.field must be set. If the 
> key is not unique, inserted data with the same key will be unpredictable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1009) Handle insert for recordkey is not unique

2020-06-08 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1009:
---

Assignee: liwei

> Handle insert for recordkey is not unique
> -
>
> Key: HUDI-1009
> URL: https://issues.apache.org/jira/browse/HUDI-1009
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> Currently, when using Hudi, hoodie.datasource.write.recordkey.field must be set. If the 
> key is not unique, inserted data with the same key will be unpredictable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1009) Handle insert for recordkey is not unique

2020-06-08 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei updated HUDI-1009:

Status: Open  (was: New)

> Handle insert for recordkey is not unique
> -
>
> Key: HUDI-1009
> URL: https://issues.apache.org/jira/browse/HUDI-1009
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> Currently, when using Hudi, hoodie.datasource.write.recordkey.field must be set. If the 
> key is not unique, inserted data with the same key will be unpredictable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1008) Handle insert for recordkey is not unique

2020-06-08 Thread liwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liwei reassigned HUDI-1008:
---

Assignee: liwei

> Handle insert for recordkey is not unique
> -
>
> Key: HUDI-1008
> URL: https://issues.apache.org/jira/browse/HUDI-1008
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: liwei
>Assignee: liwei
>Priority: Major
>
> Hudi currently requires hoodie.datasource.write.recordkey.field to be set; when the key 
> is not unique, inserted data will be unpredictable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1009) Handle insert for recordkey is not unique

2020-06-08 Thread liwei (Jira)
liwei created HUDI-1009:
---

 Summary: Handle insert for recordkey is not unique
 Key: HUDI-1009
 URL: https://issues.apache.org/jira/browse/HUDI-1009
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: liwei


Currently, when using Hudi, hoodie.datasource.write.recordkey.field must be set. If the key is 
not unique, inserted data with the same key will be unpredictable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1008) Handle insert for recordkey is not unique

2020-06-08 Thread liwei (Jira)
liwei created HUDI-1008:
---

 Summary: Handle insert for recordkey is not unique
 Key: HUDI-1008
 URL: https://issues.apache.org/jira/browse/HUDI-1008
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: liwei


Hudi currently requires hoodie.datasource.write.recordkey.field to be set; when the key is 
not unique, inserted data will be unpredictable.
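
A minimal sketch of the write path this refers to, showing where the record key option is set; the field names, table name, and path are hypothetical:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;

    public class HudiInsertSketch {
      static void write(Dataset<Row> df) {
        df.write().format("hudi")
            // Records sharing the same value in this field collide on insert,
            // which is the unpredictability described above.
            .option("hoodie.datasource.write.recordkey.field", "uuid")
            .option("hoodie.datasource.write.precombine.field", "ts")
            .option("hoodie.table.name", "example_table")
            .mode(SaveMode.Append)
            .save("/tmp/hudi/example_table");
      }
    }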



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (HUDI-918) Fix kafkaOffsetGen can not read kafka data bug

2020-06-08 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui closed HUDI-918.
--

> Fix kafkaOffsetGen can not read kafka data bug
> --
>
> Key: HUDI-918
> URL: https://issues.apache.org/jira/browse/HUDI-918
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When the sourceLimit is less than the number of Kafka partitions, Hudi cannot 
> get the data.
> Steps to reproduce:
> 1. Use deltastreamer to consume data from kafka
> 2. Set the value of sourceLimit to be less than the number of kafka partitions
> 3. INFO DeltaSync:313 - No new data, source checkpoint has not changed. 
> Nothing to commit. Old checkpoint
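
A minimal sketch of how a naive proportional split produces this symptom, with a per-partition floor as one possible guard; this is an illustration, not the patch that closed the issue:

    public class SourceLimitSketch {
      // With sourceLimit < numPartitions, integer division allocates 0 events to
      // every partition, matching the "No new data" log line above.
      static long[] split(long sourceLimit, int numPartitions) {
        long[] allocation = new long[numPartitions];
        long perPartition = sourceLimit / numPartitions;
        for (int i = 0; i < numPartitions; i++) {
          // Floor of one event per partition; a real implementation would also
          // cap the total at sourceLimit and at the available lag per partition.
          allocation[i] = Math.max(1, perPartition);
        }
        return allocation;
      }
    }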



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-918) Fix kafkaOffsetGen can not read kafka data bug

2020-06-08 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui resolved HUDI-918.

Resolution: Fixed

> Fix kafkaOffsetGen can not read kafka data bug
> --
>
> Key: HUDI-918
> URL: https://issues.apache.org/jira/browse/HUDI-918
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When the sourceLimit is less than the number of Kafka partitions, Hudi cannot 
> get the data.
> Steps to reproduce:
> 1. Use deltastreamer to consume data from kafka
> 2. Set the value of sourceLimit to be less than the number of kafka partitions
> 3. INFO DeltaSync:313 - No new data, source checkpoint has not changed. 
> Nothing to commit. Old checkpoint



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1007) When earliestOffsets is greater than checkpoint, Hudi will not be able to successfully consume data

2020-06-08 Thread liujinhui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17128400#comment-17128400
 ] 

liujinhui commented on HUDI-1007:
-

Yes, every run checks the offsets stored in the checkpoint against the earliest 
offsets available in Kafka. Since the data is out of date, earliest is definitely 
chosen.

> When earliestOffsets is greater than checkpoint, Hudi will not be able to 
> successfully consume data
> ---
>
> Key: HUDI-1007
> URL: https://issues.apache.org/jira/browse/HUDI-1007
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Major
> Fix For: 0.6.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Use deltastreamer to consume kafka.
>  When earliestOffsets is greater than the checkpoint, Hudi will not be able to 
> successfully consume data.
> org.apache.hudi.utilities.sources.helpers.KafkaOffsetGen#checkupValidOffsets
> boolean checkpointOffsetReseter = checkpointOffsets.entrySet().stream()
>     .anyMatch(offset -> offset.getValue() < earliestOffsets.get(offset.getKey()));
> return checkpointOffsetReseter ? earliestOffsets : checkpointOffsets;
> Kafka data is continuously generated, which means that some data will 
> continue to expire.
>  When earliestOffsets is greater than the checkpoint, earliestOffsets will be 
> taken. But at that moment, some data has already expired, so in the end consumption 
> fails, and the process repeats in an endless cycle. I can understand that this design 
> may be intended to avoid data loss, but since it leads to this situation, I want to fix 
> the problem and would like to hear your opinion.  
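
A minimal sketch of the check described above and of one possible per-partition clamp, using plain String-to-Long maps in place of the real offset types; the clamp is a suggestion, not the project's confirmed fix:

    import java.util.HashMap;
    import java.util.Map;

    public class OffsetResetSketch {
      // Current behavior: if any checkpointed offset has expired (fallen below
      // the earliest retained offset), restart everything from earliestOffsets.
      static Map<String, Long> checkupValidOffsets(Map<String, Long> checkpoint, Map<String, Long> earliest) {
        boolean reset = checkpoint.entrySet().stream()
            .anyMatch(e -> e.getValue() < earliest.get(e.getKey()));
        return reset ? earliest : checkpoint;
      }

      // Possible fix: clamp each partition to max(checkpoint, earliest), so
      // partitions whose checkpoint is still valid keep their position instead
      // of looping forever on expired data.
      static Map<String, Long> clampOffsets(Map<String, Long> checkpoint, Map<String, Long> earliest) {
        Map<String, Long> clamped = new HashMap<>();
        checkpoint.forEach((partition, offset) -> clamped.put(partition, Math.max(offset, earliest.get(partition))));
        return clamped;
      }
    }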



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

