[GitHub] [hudi] wangxianghu commented on pull request #1935: [HUDI-1121][DOC]Provide a document describing how to use callback

2020-08-07 Thread GitBox


wangxianghu commented on pull request #1935:
URL: https://github.com/apache/hudi/pull/1935#issuecomment-670828182


   @yanghua @leesf please take a look when you are free



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu opened a new pull request #1935: [HUDI-1121][DOC]Provide a document describing how to use callback

2020-08-07 Thread GitBox


wangxianghu opened a new pull request #1935:
URL: https://github.com/apache/hudi/pull/1935


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *Add description of the config of write commit callback to describe how to 
use callback*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Updated] (HUDI-1121) Provide a document describing how to use callback

2020-08-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1121:
-
Labels: pull-request-available  (was: )

> Provide a document describing how to use callback
> -
>
> Key: HUDI-1121
> URL: https://issues.apache.org/jira/browse/HUDI-1121
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] bvaradar commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


bvaradar commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670826574


   @umehrot2 : Can you confirm whether all review comments are resolved and the PR 
is otherwise ready.







[GitHub] [hudi] bvaradar commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


bvaradar commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670826438


   @umehrot2 : Thanks for the update. Yeah, the integration test flakiness is a 
known issue and the logs show the same pattern. Let me do one pass of it along 
with the other bootstrap PRs from @zhedoubushishi and land them. If there are any 
minor review comments, I will update the PRs myself to speed up landing. 







[GitHub] [hudi] xushiyan opened a new pull request #1934: [MINOR] Move a test method to Transformations

2020-08-07 Thread GitBox


xushiyan opened a new pull request #1934:
URL: https://github.com/apache/hudi/pull/1934


   - Move TestHoodieKeyLocationFetchHandle#getRecordsPerPartition to 
Transformations
   - Improve some var namings
   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] leesf commented on pull request #1931: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


leesf commented on pull request #1931:
URL: https://github.com/apache/hudi/pull/1931#issuecomment-670820501


   rerun tests







[hudi] branch master updated: [HUDI-1026] Removed slf4j dependency from HoodieClientTestHarness (#1928)

2020-08-07 Thread leesf
This is an automated email from the ASF dual-hosted git repository.

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 1072f27  [HUDI-1026] Removed slf4j dependency from HoodieClientTestHarness (#1928)
1072f27 is described below

commit 1072f2748a3e6802b0c1c9492edf7573779645ac
Author: cheshta2904 <69254936+cheshta2...@users.noreply.github.com>
AuthorDate: Sat Aug 8 09:37:22 2020 +0530

[HUDI-1026] Removed slf4j dependency from HoodieClientTestHarness (#1928)
---
 .../java/org/apache/hudi/testutils/HoodieClientTestHarness.java | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestHarness.java
 
b/hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestHarness.java
index 17b9c35..7369598 100644
--- 
a/hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestHarness.java
+++ 
b/hudi-client/src/test/java/org/apache/hudi/testutils/HoodieClientTestHarness.java
@@ -42,8 +42,8 @@ import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.SQLContext;
 import org.junit.jupiter.api.BeforeEach;
 import org.junit.jupiter.api.TestInfo;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
+import org.apache.log4j.Logger;
+import org.apache.log4j.LogManager;
 
 import java.io.IOException;
 import java.io.Serializable;
@@ -56,7 +56,7 @@ import java.util.concurrent.atomic.AtomicInteger;
  */
 public abstract class HoodieClientTestHarness extends HoodieCommonTestHarness implements Serializable {
 
-  private static final Logger LOG = LoggerFactory.getLogger(HoodieClientTestHarness.class);
+  private static final Logger LOG = LogManager.getLogger(HoodieClientTestHarness.class);
   
   private String testMethodName;
   protected transient JavaSparkContext jsc = null;



[GitHub] [hudi] leesf merged pull request #1928: [HUDI-1026]: removed slf4j dependency from HoodieClientTestHarness

2020-08-07 Thread GitBox


leesf merged pull request #1928:
URL: https://github.com/apache/hudi/pull/1928


   







[GitHub] [hudi] leesf merged pull request #1932: [MINOR]Remove unused import

2020-08-07 Thread GitBox


leesf merged pull request #1932:
URL: https://github.com/apache/hudi/pull/1932


   







[hudi] branch master updated: [MINOR] Remove unused import (#1932)

2020-08-07 Thread leesf

leesf pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 8b66524  [MINOR] Remove unused import (#1932)
8b66524 is described below

commit 8b66524090a5f21a11c2546580c952aa917bd0ad
Author: Yungthuis <36870105+yungth...@users.noreply.github.com>
AuthorDate: Sat Aug 8 12:04:31 2020 +0800

[MINOR] Remove unused import (#1932)

Co-authored-by: tom_glb 
---
 .../org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java| 2 --
 .../java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java | 3 ---
 hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java| 1 -
 3 files changed, 6 deletions(-)

diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
index f28e6bf..5179e89 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/HoodieDeltaStreamerWrapper.java
@@ -18,8 +18,6 @@
 
 package org.apache.hudi.integ.testsuite;
 
-import org.apache.hadoop.fs.FileSystem;
-import org.apache.hadoop.hive.conf.HiveConf;
 import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.model.HoodieRecord;
 import org.apache.hudi.common.util.collection.Pair;
diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java
index 930f307..c9d129e 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/DeltaGenerator.java
@@ -41,9 +41,6 @@ import org.apache.hudi.integ.testsuite.writer.DeltaWriteStats;
 import org.apache.hudi.integ.testsuite.writer.DeltaWriterAdapter;
 import org.apache.hudi.integ.testsuite.writer.DeltaWriterFactory;
 import org.apache.hudi.keygen.BuiltinKeyGenerator;
-import org.apache.hudi.keygen.ComplexKeyGenerator;
-import org.apache.hudi.keygen.KeyGenerator;
-import org.apache.hudi.keygen.SimpleKeyGenerator;
 import org.apache.hudi.integ.testsuite.configuration.DFSDeltaConfig;
 import org.apache.hudi.integ.testsuite.configuration.DeltaConfig;
 import org.apache.hudi.integ.testsuite.configuration.DeltaConfig.Config;
diff --git 
a/hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java 
b/hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java
index 3ffa0cf..0423103 100644
--- a/hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java
+++ b/hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java
@@ -18,7 +18,6 @@
 
 package org.apache.hudi.integ;
 
-import java.util.concurrent.TimeUnit;
 import java.util.concurrent.TimeoutException;
 import org.apache.hudi.common.util.FileIOUtils;
 import org.apache.hudi.common.util.collection.Pair;



[GitHub] [hudi] garyli1019 commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


garyli1019 commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670818469


   The integration test fails sometimes for no reason. I have seen this a few 
times. Maybe a rerun will fix it, if we are lucky.







Build failed in Jenkins: hudi-snapshot-deployment-0.5 #363

2020-08-07 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.59 KB...]
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${scala.binary.version}:[unknown-version],
 

 line 27, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the 

[GitHub] [hudi] umehrot2 commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


umehrot2 commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670814850


   @vinothchandar the unit test issues are resolved now. But the integration 
tests are behaving erratically. They passed the last time, and failed now even though 
I didn't make any code change. They are getting stuck for some reason. I think 
you mentioned this issue to me.







[GitHub] [hudi] umehrot2 commented on a change in pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


umehrot2 commented on a change in pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#discussion_r46735



##
File path: 
hudi-spark/src/test/scala/org/apache/hudi/functional/TestDataSourceForBootstrap.scala
##
@@ -0,0 +1,616 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import java.time.Instant
+import java.util.Collections
+
+import collection.JavaConverters._
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
+import org.apache.hudi.client.TestBootstrap
+import org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions, HoodieDataSourceHelpers}
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.timeline.HoodieTimeline
+import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieCompactionConfig, HoodieWriteConfig}
+import org.apache.hudi.keygen.SimpleKeyGenerator
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.{SaveMode, SparkSession}
+import org.junit.jupiter.api.Assertions.assertEquals
+import org.junit.jupiter.api.{BeforeEach, Test}
+import org.junit.jupiter.api.io.TempDir
+
+class TestDataSourceForBootstrap {
+
+  var spark: SparkSession = _
+  val commonOpts = Map(
+    HoodieWriteConfig.INSERT_PARALLELISM -> "4",
+    HoodieWriteConfig.UPSERT_PARALLELISM -> "4",
+    HoodieWriteConfig.DELETE_PARALLELISM -> "4",
+    HoodieWriteConfig.BULKINSERT_PARALLELISM -> "4",
+    HoodieWriteConfig.FINALIZE_WRITE_PARALLELISM -> "4",
+    HoodieBootstrapConfig.BOOTSTRAP_PARALLELISM -> "4",
+    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_row_key",
+    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition",
+    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp",
+    HoodieWriteConfig.TABLE_NAME -> "hoodie_test"
+  )
+  var basePath: String = _
+  var srcPath: String = _
+  var fs: FileSystem = _
+
+  @BeforeEach def initialize(@TempDir tempDir: java.nio.file.Path) {
+    spark = SparkSession.builder
+      .appName("Hoodie Datasource test")
+      .master("local[2]")
+      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+      .getOrCreate
+    basePath = tempDir.toAbsolutePath.toString + "/base"
+    srcPath = tempDir.toAbsolutePath.toString + "/src"
+    fs = FSUtils.getFs(basePath, spark.sparkContext.hadoopConfiguration)
+  }
+

Review comment:
   Thanks @garyli1019 . You were right, I wasn't cleaning up the spark 
contexts after my test runs. Fixed it now.









[GitHub] [hudi] garyli1019 commented on a change in pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


garyli1019 commented on a change in pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#discussion_r467341322



##
File path: 
hudi-spark/src/test/scala/org/apache/hudi/functional/TestDataSourceForBootstrap.scala
##
@@ -0,0 +1,616 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import java.time.Instant
+import java.util.Collections
+
+import collection.JavaConverters._
+import org.apache.hadoop.fs.FileSystem
+import org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider
+import org.apache.hudi.client.TestBootstrap
+import org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
+import org.apache.hudi.{DataSourceReadOptions, DataSourceWriteOptions, HoodieDataSourceHelpers}
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.table.timeline.HoodieTimeline
+import org.apache.hudi.config.{HoodieBootstrapConfig, HoodieCompactionConfig, HoodieWriteConfig}
+import org.apache.hudi.keygen.SimpleKeyGenerator
+import org.apache.spark.api.java.JavaSparkContext
+import org.apache.spark.sql.functions.{col, lit}
+import org.apache.spark.sql.{SaveMode, SparkSession}
+import org.junit.jupiter.api.Assertions.assertEquals
+import org.junit.jupiter.api.{BeforeEach, Test}
+import org.junit.jupiter.api.io.TempDir
+
+class TestDataSourceForBootstrap {
+
+  var spark: SparkSession = _
+  val commonOpts = Map(
+    HoodieWriteConfig.INSERT_PARALLELISM -> "4",
+    HoodieWriteConfig.UPSERT_PARALLELISM -> "4",
+    HoodieWriteConfig.DELETE_PARALLELISM -> "4",
+    HoodieWriteConfig.BULKINSERT_PARALLELISM -> "4",
+    HoodieWriteConfig.FINALIZE_WRITE_PARALLELISM -> "4",
+    HoodieBootstrapConfig.BOOTSTRAP_PARALLELISM -> "4",
+    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "_row_key",
+    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "partition",
+    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "timestamp",
+    HoodieWriteConfig.TABLE_NAME -> "hoodie_test"
+  )
+  var basePath: String = _
+  var srcPath: String = _
+  var fs: FileSystem = _
+
+  @BeforeEach def initialize(@TempDir tempDir: java.nio.file.Path) {
+    spark = SparkSession.builder
+      .appName("Hoodie Datasource test")
+      .master("local[2]")
+      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+      .getOrCreate
+    basePath = tempDir.toAbsolutePath.toString + "/base"
+    srcPath = tempDir.toAbsolutePath.toString + "/src"
+    fs = FSUtils.getFs(basePath, spark.sparkContext.hadoopConfiguration)
+  }
+

Review comment:
   I believe adding 
https://github.com/apache/hudi/blob/master/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala#L55
 will resolve this issue.









[GitHub] [hudi] garyli1019 commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


garyli1019 commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670799557


   > @vinothchandar I fixed the rebase issue, and resolved the `bootstrap` 
related test failures. I still see `MOR data source` related unit test failures 
because of `spark context`. Is this something you are already aware about ?
   
   hi @umehrot2 , the datasource tests initialize a Spark context before each 
run. If the previous run didn't close the Spark context properly, this error will 
come up. See 
https://github.com/apache/hudi/commit/4f74a84607d46249e9bb6e1397246f8dc076b390#diff-b9deb8bdc09b0440cafdf6354fe9068dR104
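
   The failure mode described above — a test reusing a session leaked by the 
previous test — can be illustrated without Spark. The sketch below is a toy 
model of `getOrCreate` singleton semantics (the `ToySession` class is invented 
for illustration; it is not Hudi or Spark code):

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for a process-wide session singleton, mimicking
// SparkSession.getOrCreate semantics: a new instance is created only
// when no active one exists, otherwise the existing one is reused.
class ToySession {
    private static ToySession active;
    final List<String> conf = new ArrayList<>();
    private ToySession() {}

    static ToySession getOrCreate(String setting) {
        if (active == null) {
            active = new ToySession();
        }
        active.conf.add(setting);  // settings accumulate on the reused instance
        return active;
    }

    void stop() { active = null; }  // proper teardown clears the singleton
}

public class LifecycleDemo {
    public static void main(String[] args) {
        // Test 1 forgets to stop() its session...
        ToySession s1 = ToySession.getOrCreate("test1.conf");

        // ...so test 2 silently reuses it, inheriting test 1's configuration.
        ToySession s2 = ToySession.getOrCreate("test2.conf");
        System.out.println(s1 == s2);  // true: same leaked instance
        System.out.println(s2.conf);   // [test1.conf, test2.conf]

        // With an @AfterEach-style stop(), the next test gets a clean session.
        s2.stop();
        ToySession s3 = ToySession.getOrCreate("test3.conf");
        System.out.println(s3 == s2);  // false: fresh instance
        System.out.println(s3.conf);   // [test3.conf]
    }
}
```

   This is why the suggested fix is a teardown that stops the session after 
each test, rather than relying on each test to build a fresh one.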







[GitHub] [hudi] umehrot2 commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


umehrot2 commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670793211


   @vinothchandar I fixed the rebase issue, and resolved the `bootstrap` 
related test failures. I still see `MOR data source` related unit test failures 
because of `spark context`. Is this something you are already aware about ?







[GitHub] [hudi] umehrot2 commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


umehrot2 commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670717238


   > @umehrot2 some tests are failing . looking at them later today.
   > 
   > Before we head into the weekend, is this PR ready from your perspective. 
if so, I will take care of making the final changes and land.
   
   @vinothchandar the rebase has some issues. With the introduction of Spark 
datasource support for real-time queries, we need to handle the bootstrap case 
there. For bootstrapped tables, real-time queries are still not supported; only 
read-optimized queries will work for the MOR case with bootstrapped tables for now. 
I will fix this, and hopefully that should fix at least the unit test failures.







[GitHub] [hudi] zhedoubushishi commented on a change in pull request #1869: [HUDI-427] Implement CLI support for performing bootstrap

2020-08-07 Thread GitBox


zhedoubushishi commented on a change in pull request #1869:
URL: https://github.com/apache/hudi/pull/1869#discussion_r467272326



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
##
@@ -240,13 +240,21 @@ private HoodieBootstrapIndexInfo fetchBootstrapIndexInfo() throws IOException {
 @Override
public List<String> getIndexedPartitionPaths() {
   HFileScanner scanner = partitionIndexReader().getScanner(true, true);
-  return getAllKeys(scanner);
+  List<String> cellKeys = getAllKeys(scanner);
+  // cellKey is in this format:
+  // part=datestr=2452537//LATEST_TIMESTAMP/Put/vlen=2405/seqid=0
+  return cellKeys.stream().map(key -> key.split("//")[0].substring(5))
+  .distinct().collect(Collectors.toList());
 }
 
 @Override
 public List<String> getIndexedFileIds() {
   HFileScanner scanner = fileIdIndexReader().getScanner(true, true);
-  return getAllKeys(scanner);
+  List<String> cellKeys = getAllKeys(scanner);
+  // cellKey is in this format:
+  // 
part=datestr=2452537;fileid=baab9c50-c35e-49d1-b928-695aa7e37833//LATEST_TIMESTAMP/Put/vlen=2312/seqid=0
+  return cellKeys.stream().map(key -> key.split("//")[0].split(";")[1].split("=")[1])

Review comment:
   @bvaradar thanks! PR #1933 looks good to me. Migrated it into the current PR.
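
   The key parsing in the diff above can be checked in isolation. Below is a 
minimal standalone sketch (the `BootstrapKeyParsing` class and method names are 
invented for illustration; the key formats and `split` logic come from the diff):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class BootstrapKeyParsing {

    // Partition-index cellKey format (from the diff comment):
    //   part=<partitionPath>//LATEST_TIMESTAMP/Put/vlen=.../seqid=...
    static String extractPartition(String cellKey) {
        // Keep everything before "//", then strip the 5-char "part=" prefix.
        return cellKey.split("//")[0].substring(5);
    }

    // FileId-index cellKey format (from the diff comment):
    //   part=<partitionPath>;fileid=<fileId>//LATEST_TIMESTAMP/Put/vlen=.../seqid=...
    static String extractFileId(String cellKey) {
        // Keep everything before "//", take the ";"-separated "fileid=<id>"
        // segment, then the value after "=".
        return cellKey.split("//")[0].split(";")[1].split("=")[1];
    }

    public static void main(String[] args) {
        List<String> partitionKeys = Arrays.asList(
            "part=datestr=2452537//LATEST_TIMESTAMP/Put/vlen=2405/seqid=0");
        System.out.println(partitionKeys.stream()
            .map(BootstrapKeyParsing::extractPartition)
            .distinct().collect(Collectors.toList()));
        // [datestr=2452537]

        System.out.println(extractFileId(
            "part=datestr=2452537;fileid=baab9c50-c35e-49d1-b928-695aa7e37833"
            + "//LATEST_TIMESTAMP/Put/vlen=2312/seqid=0"));
        // baab9c50-c35e-49d1-b928-695aa7e37833
    }
}
```

   Note that `String.split` takes a regex; `"//"`, `";"`, and `"="` contain no 
regex metacharacters, so they match literally here.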









[jira] [Resolved] (HUDI-69) Support realtime view in Spark datasource #136

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-69?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-69.

Resolution: Fixed

> Support realtime view in Spark datasource #136
> --
>
> Key: HUDI-69
> URL: https://issues.apache.org/jira/browse/HUDI-69
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Vinoth Chandar
>Assignee: Yanjia Gary Li
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> [https://github.com/uber/hudi/issues/136]
> RFC: 
> [https://cwiki.apache.org/confluence/display/HUDI/RFC+-+16+Abstraction+for+HoodieInputFormat+and+RecordReader]
> PR: [https://github.com/apache/incubator-hudi/pull/1592]





[jira] [Resolved] (HUDI-1052) Support vectorized reader for MOR datasource reader

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-1052.
--
Resolution: Fixed

> Support vectorized reader for MOR datasource reader
> ---
>
> Key: HUDI-1052
> URL: https://issues.apache.org/jira/browse/HUDI-1052
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>






[jira] [Resolved] (HUDI-1050) Support filter pushdown and column pruning for MOR table on Spark Datasource

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li resolved HUDI-1050.
--
Resolution: Fixed

> Support filter pushdown and column pruning for MOR table on Spark Datasource
> 
>
> Key: HUDI-1050
> URL: https://issues.apache.org/jira/browse/HUDI-1050
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Major
> Fix For: 0.6.0
>
>
> We need to use the information provided by PrunedFilteredScan to push down 
> the filter and column projection. 





[jira] [Updated] (HUDI-1052) Support vectorized reader for MOR datasource reader

2020-08-07 Thread Yanjia Gary Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanjia Gary Li updated HUDI-1052:
-
Status: In Progress  (was: Open)

> Support vectorized reader for MOR datasource reader
> ---
>
> Key: HUDI-1052
> URL: https://issues.apache.org/jira/browse/HUDI-1052
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Yanjia Gary Li
>Assignee: Yanjia Gary Li
>Priority: Minor
>






[GitHub] [hudi] rufferjr commented on issue #1923: [SUPPORT] Hive Sync fails to add decimal partition

2020-08-07 Thread GitBox


rufferjr commented on issue #1923:
URL: https://github.com/apache/hudi/issues/1923#issuecomment-670636479


   @bvaradar would you like the S3 partition path? If so, the following 
examples may be of use:
   
   s3://data-beta/vault/cod_combinations/partition_val=1003
   s3://data-beta/vault/cod_combinations/partition_val=1008
   ... etc.







[jira] [Comment Edited] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-07 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173348#comment-17173348
 ] 

Balaji Varadarajan edited comment on HUDI-1146 at 8/7/20, 5:25 PM:
---

[~bdscheller]:

I think InputBatch::getSchemaProvider will be called irrespective of whether 
the input batch is empty or not. I suspect this is similar to HUDI-1091, 
where an empty input batch triggers this case.

Can you try this change?

 
{code:java}
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -321,7 +321,7 @@ public class DeltaSync implements Serializable {
 .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc,
 dataAndCheckpoint.getSchemaProvider(),
 new RowBasedSchemaProvider(r.schema(
- .orElse(dataAndCheckpoint.getSchemaProvider());
+ .orElseGet((dataAndCheckpoint::getSchemaProvider));
 avroRDDOptional = transformed
 .map(t -> AvroConversionUtils.createRdd(
 t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD());
{code}
 


was (Author: vbalaji):
[~bdscheller]:

I think InputBatch::getSchemaProvider will be called irrespective of whether 
the input batch is empty or not. I suspect this is similar to HUDI-1091, 
where an empty input batch triggers this case.

Can you try this change?

 

```

--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -321,7 +321,7 @@ public class DeltaSync implements Serializable {
 .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc,
 dataAndCheckpoint.getSchemaProvider(),
 new RowBasedSchemaProvider(r.schema(
- .orElse(dataAndCheckpoint.getSchemaProvider());
+ .orElseGet((dataAndCheckpoint::getSchemaProvider));
 avroRDDOptional = transformed
 .map(t -> AvroConversionUtils.createRdd(
 t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD());

```

 

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Priority: Major
>
> DeltaStreamer issue — happens with both COW and MOR. Restarting the 
> DeltaStreamer process crashes; that is, the 2nd run does nothing.
> Steps:
>  Run Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> due to step above
>  2nd run crashes with the below error ( it does not crash if we delete the 
> output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
>  }}
> The "orElse" block here doesn't seem to make sense as if "transformed" is 
> empty then it is likely "dataAndCheckpoint" will have a null schema provider
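The orElse/orElseGet distinction behind the suggested fix is easy to reproduce in isolation: Optional.orElse evaluates its argument even when the Optional holds a value, so a fallback that throws (as InputBatch.getSchemaProvider does when no provider was supplied) fails eagerly, while orElseGet only invokes the supplier when the Optional is empty. A minimal sketch, with illustrative names rather than Hudi's actual classes:

```java
import java.util.Optional;

public class OrElseDemo {

    // Stands in for InputBatch.getSchemaProvider(), which throws
    // when no schema provider was supplied.
    static String throwingFallback() {
        throw new IllegalStateException("Please provide a valid schema provider class!");
    }

    // orElse: the fallback expression is evaluated before orElse is even
    // called, so this throws although the Optional holds a value.
    static String eager(Optional<String> provider) {
        return provider.orElse(throwingFallback());
    }

    // orElseGet: the supplier is only invoked when the Optional is empty,
    // so a present value is returned without touching the fallback.
    static String lazy(Optional<String> provider) {
        return provider.orElseGet(OrElseDemo::throwingFallback);
    }

    public static void main(String[] args) {
        System.out.println(lazy(Optional.of("rowSchema"))); // prints rowSchema
        try {
            eager(Optional.of("rowSchema"));
        } catch (IllegalStateException e) {
            System.out.println("eager fallback evaluated: " + e.getMessage());
        }
    }
}
```

This mirrors why switching `.orElse(dataAndCheckpoint.getSchemaProvider())` to `.orElseGet(dataAndCheckpoint::getSchemaProvider)` avoids the HoodieException when a row-based schema provider is already present.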





[jira] [Comment Edited] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-07 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173348#comment-17173348
 ] 

Balaji Varadarajan edited comment on HUDI-1146 at 8/7/20, 5:25 PM:
---

[~bdscheller]:

I think InputBatch::getSchemaProvider will be called irrespective of whether 
the input batch is empty or not. I suspect this is similar to HUDI-1091, 
where an empty input batch triggers this case.

Can you try this change?

 
{code:java}
--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -321,7 +321,7 @@ public class DeltaSync implements Serializable {
 .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc,
 dataAndCheckpoint.getSchemaProvider(),
 new RowBasedSchemaProvider(r.schema(
-.orElse(dataAndCheckpoint.getSchemaProvider());
+.orElseGet((dataAndCheckpoint::getSchemaProvider));
 avroRDDOptional = transformed
 .map(t -> AvroConversionUtils.createRdd(
 t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD());
{code}


was (Author: vbalaji):
[~bdscheller]:

I think InputBatch::getSchemaProvider will be called irrespective of whether 
the input batch is empty or not. I suspect this is similar to HUDI-1091, 
where an empty input batch triggers this case.

Can you try this change?

 
{code:java}
--- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -321,7 +321,7 @@ public class DeltaSync implements Serializable {
 .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc,
 dataAndCheckpoint.getSchemaProvider(),
 new RowBasedSchemaProvider(r.schema(
- .orElse(dataAndCheckpoint.getSchemaProvider());
+ .orElseGet((dataAndCheckpoint::getSchemaProvider));
 avroRDDOptional = transformed
 .map(t -> AvroConversionUtils.createRdd(
 t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD());
{code}
 

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Priority: Major
>
> DeltaStreamer issue — happens with both COW and MOR. Restarting the 
> DeltaStreamer process crashes; that is, the 2nd run does nothing.
> Steps:
>  Run Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> due to step above
>  2nd run crashes with the below error ( it does not crash if we delete the 
> output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
>  }}
> The "orElse" block here doesn't seem to make sense as if "transformed" is 
> empty then it is likely "dataAndCheckpoint" will have a null schema provider





[jira] [Comment Edited] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-07 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173348#comment-17173348
 ] 

Balaji Varadarajan edited comment on HUDI-1146 at 8/7/20, 5:24 PM:
---

[~bdscheller]:

I think InputBatch::getSchemaProvider will be called irrespective of whether 
the input batch is empty or not. I suspect this is similar to HUDI-1091, 
where an empty input batch triggers this case.

Can you try this change?

 

```

--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -321,7 +321,7 @@ public class DeltaSync implements Serializable {
 .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc,
 dataAndCheckpoint.getSchemaProvider(),
 new RowBasedSchemaProvider(r.schema(
- .orElse(dataAndCheckpoint.getSchemaProvider());
+ .orElseGet((dataAndCheckpoint::getSchemaProvider));
 avroRDDOptional = transformed
 .map(t -> AvroConversionUtils.createRdd(
 t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD());

```

 


was (Author: vbalaji):
[~bdscheller]:

I think InputBatch::getSchemaProvider will be called irrespective of whether 
the input batch is empty or not. I suspect this is similar to HUDI-1091, 
where an empty input batch triggers this case.

Can you try this change?

 

--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -321,7 +321,7 @@ public class DeltaSync implements Serializable {
 .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc,
 dataAndCheckpoint.getSchemaProvider(),
 new RowBasedSchemaProvider(r.schema(
- .orElse(dataAndCheckpoint.getSchemaProvider());
+ .orElseGet((dataAndCheckpoint::getSchemaProvider));
 avroRDDOptional = transformed
 .map(t -> AvroConversionUtils.createRdd(
 t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD());

 

 

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Priority: Major
>
> DeltaStreamer issue — happens with both COW and MOR. Restarting the 
> DeltaStreamer process crashes; that is, the 2nd run does nothing.
> Steps:
>  Run Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> due to step above
>  2nd run crashes with the below error ( it does not crash if we delete the 
> output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
>  }}
> The "orElse" block here doesn't seem to make sense as if "transformed" is 
> empty then it is likely "dataAndCheckpoint" will have a null schema provider





[jira] [Commented] (HUDI-1146) DeltaStreamer fails to start when No updated records + schemaProvider not supplied

2020-08-07 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173348#comment-17173348
 ] 

Balaji Varadarajan commented on HUDI-1146:
--

[~bdscheller]:

I think InputBatch::getSchemaProvider will be called irrespective of whether 
the input batch is empty or not. I suspect this is similar to HUDI-1091, 
where an empty input batch triggers this case.

Can you try this change?

 

--- 
a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
+++ 
b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java
@@ -321,7 +321,7 @@ public class DeltaSync implements Serializable {
 .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc,
 dataAndCheckpoint.getSchemaProvider(),
 new RowBasedSchemaProvider(r.schema(
- .orElse(dataAndCheckpoint.getSchemaProvider());
+ .orElseGet((dataAndCheckpoint::getSchemaProvider));
 avroRDDOptional = transformed
 .map(t -> AvroConversionUtils.createRdd(
 t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD());

 

 

> DeltaStreamer fails to start when No updated records + schemaProvider not 
> supplied
> --
>
> Key: HUDI-1146
> URL: https://issues.apache.org/jira/browse/HUDI-1146
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: Brandon Scheller
>Priority: Major
>
> DeltaStreamer issue — happens with both COW and MOR. Restarting the 
> DeltaStreamer process crashes; that is, the 2nd run does nothing.
> Steps:
>  Run Hudi DeltaStreamer job in --continuous mode
>  Run the same job again without deleting the output parquet files generated 
> due to step above
>  2nd run crashes with the below error ( it does not crash if we delete the 
> output parquet file)
> {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a 
> valid schema provider class!}}
> {{ at 
> org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}}
> {{ at 
> org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}}
>  
> {{This looks to be because of this line:}}
> {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315]
>  }}
> The "orElse" block here doesn't seem to make sense as if "transformed" is 
> empty then it is likely "dataAndCheckpoint" will have a null schema provider





[jira] [Created] (HUDI-1171) Hudi 0.5.2 with ScalaTest and Spark 2.4.0 java.lang.NoClassDefFoundError: Could not initialize class org.apache.hudi.avro.model.HoodieCleanerPlan

2020-08-07 Thread Prashanth (Jira)
Prashanth created HUDI-1171:
---

 Summary: Hudi 0.5.2 with ScalaTest and Spark 2.4.0 
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.hudi.avro.model.HoodieCleanerPlan
 Key: HUDI-1171
 URL: https://issues.apache.org/jira/browse/HUDI-1171
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Prashanth


I am using Hudi 0.5.2 with the ScalaTest 2.2.5 framework on Scala 2.11, but I see 
the following error when saving, even though I can run the main code on the Spark 
cluster with no errors. Is there a compatibility issue between ScalaTest and this 
Hudi version? If so, which version should I use? I tried ScalaTest 3.0.0 as well, 
but the issue is the same.

[scalatest]  Cause: java.lang.NoClassDefFoundError: Could 
not initialize class org.apache.hudi.avro.model.HoodieCleanerPlan
[scalatest]  at 
org.apache.hudi.table.HoodieCopyOnWriteTable.scheduleClean(HoodieCopyOnWriteTable.java:295)
[scalatest]  at 
org.apache.hudi.client.HoodieCleanClient.scheduleClean(HoodieCleanClient.java:114)
[scalatest]  at 
org.apache.hudi.client.HoodieCleanClient.clean(HoodieCleanClient.java:91)
[scalatest]  at 
org.apache.hudi.client.HoodieWriteClient.clean(HoodieWriteClient.java:835)
[scalatest]  at 
org.apache.hudi.client.HoodieWriteClient.postCommit(HoodieWriteClient.java:512)
[scalatest]  at 
org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:157)
[scalatest]  at 
org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:101)
[scalatest]  at 
org.apache.hudi.client.AbstractHoodieWriteClient.commit(AbstractHoodieWriteClient.java:92)
[scalatest]  at 
org.apache.hudi.HoodieSparkSqlWriter$.checkWriteStatus(HoodieSparkSqlWriter.scala:262)
[scalatest]  at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:184)
[scalatest]  at 
org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:91)
[scalatest]  at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
[scalatest]  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
[scalatest]  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
[scalatest]  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
[scalatest]  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
[scalatest]  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
[scalatest]  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
[scalatest]  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
[scalatest]  at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
[scalatest]  at 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
[scalatest]  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
[scalatest]  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
[scalatest]  at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
[scalatest]  at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
[scalatest]  at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
[scalatest]  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
[scalatest]  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
[scalatest]  at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
[scalatest]  at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
[scalatest]  at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
[scalatest]  at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)





[GitHub] [hudi] zhedoubushishi commented on pull request #1933: [HUDI-971] Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread GitBox


zhedoubushishi commented on pull request #1933:
URL: https://github.com/apache/hudi/pull/1933#issuecomment-670594285


   > @vinothchandar : It looks like @zhedoubushishi had addressed the same 
issue in his original PR. So, I am going to close this one. @zhedoubushishi : 
Can you use the changes in this PR regarding getting the partition and file id 
to show up correctly?
   
   Sure. Thanks for bringing this fix up. I will update my PR based on it.







[GitHub] [hudi] UZi5136225 commented on pull request #1931: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


UZi5136225 commented on pull request #1931:
URL: https://github.com/apache/hudi/pull/1931#issuecomment-670594131


   @leesf Please review







[jira] [Assigned] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-971:
---

Assignee: Wenning Ding  (was: Balaji Varadarajan)

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Assignee: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return 
> unclean partitions because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].





[GitHub] [hudi] bvaradar commented on a change in pull request #1869: [HUDI-427] Implement CLI support for performing bootstrap

2020-08-07 Thread GitBox


bvaradar commented on a change in pull request #1869:
URL: https://github.com/apache/hudi/pull/1869#discussion_r467135559



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
##
@@ -240,13 +240,21 @@ private HoodieBootstrapIndexInfo fetchBootstrapIndexInfo() throws IOException {
 @Override
 public List<String> getIndexedPartitionPaths() {
   HFileScanner scanner = partitionIndexReader().getScanner(true, true);
-  return getAllKeys(scanner);
+  List<String> cellKeys = getAllKeys(scanner);
+  // cellKey is in this format:
+  // part=datestr=2452537//LATEST_TIMESTAMP/Put/vlen=2405/seqid=0
+  return cellKeys.stream().map(key -> key.split("//")[0].substring(5))
+  .distinct().collect(Collectors.toList());
 }

 @Override
 public List<String> getIndexedFileIds() {
   HFileScanner scanner = fileIdIndexReader().getScanner(true, true);
-  return getAllKeys(scanner);
+  List<String> cellKeys = getAllKeys(scanner);
+  // cellKey is in this format:
+  // part=datestr=2452537;fileid=baab9c50-c35e-49d1-b928-695aa7e37833//LATEST_TIMESTAMP/Put/vlen=2312/seqid=0
+  return cellKeys.stream().map(key -> key.split("//")[0].split(";")[1].split("=")[1])

Review comment:
   @zhedoubushishi : I will assign the original ticket to you 









[GitHub] [hudi] bvaradar edited a comment on pull request #1933: [HUDI-971] Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread GitBox


bvaradar edited a comment on pull request #1933:
URL: https://github.com/apache/hudi/pull/1933#issuecomment-670592925


   @vinothchandar : It looks like @zhedoubushishi had addressed the same issue 
in his original PR. So, I am going to close this one. @zhedoubushishi : Can you 
use the changes in this PR regarding getting the partition and file id to show 
up correctly?







[GitHub] [hudi] bvaradar commented on pull request #1933: [HUDI-971] Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread GitBox


bvaradar commented on pull request #1933:
URL: https://github.com/apache/hudi/pull/1933#issuecomment-670592925


   @vinothchandar : It looks like @zhedoubushishi had addressed it in his 
original PR. So, I am going to close this one. @zhedoubushishi : Can you use the 
changes in this PR regarding getting the partition and file id to show up 
correctly?







[GitHub] [hudi] bvaradar closed pull request #1933: [HUDI-971] Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread GitBox


bvaradar closed pull request #1933:
URL: https://github.com/apache/hudi/pull/1933


   







[GitHub] [hudi] bvaradar commented on a change in pull request #1869: [HUDI-427] Implement CLI support for performing bootstrap

2020-08-07 Thread GitBox


bvaradar commented on a change in pull request #1869:
URL: https://github.com/apache/hudi/pull/1869#discussion_r467133890



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
##
@@ -240,13 +240,21 @@ private HoodieBootstrapIndexInfo fetchBootstrapIndexInfo() throws IOException {
 @Override
 public List<String> getIndexedPartitionPaths() {
   HFileScanner scanner = partitionIndexReader().getScanner(true, true);
-  return getAllKeys(scanner);
+  List<String> cellKeys = getAllKeys(scanner);
+  // cellKey is in this format:
+  // part=datestr=2452537//LATEST_TIMESTAMP/Put/vlen=2405/seqid=0
+  return cellKeys.stream().map(key -> key.split("//")[0].substring(5))
+  .distinct().collect(Collectors.toList());
 }

 @Override
 public List<String> getIndexedFileIds() {
   HFileScanner scanner = fileIdIndexReader().getScanner(true, true);
-  return getAllKeys(scanner);
+  List<String> cellKeys = getAllKeys(scanner);
+  // cellKey is in this format:
+  // part=datestr=2452537;fileid=baab9c50-c35e-49d1-b928-695aa7e37833//LATEST_TIMESTAMP/Put/vlen=2312/seqid=0
+  return cellKeys.stream().map(key -> key.split("//")[0].split(";")[1].split("=")[1])

Review comment:
   @zhedoubushishi : Can you instead copy the code from PR 
https://github.com/apache/hudi/pull/1933? It is essentially the same thing I 
fixed. I did not realize you had already tried to address the issue here. IMO, 
PR 1933 is cleaner.









[GitHub] [hudi] bvaradar commented on a change in pull request #1869: [HUDI-427] Implement CLI support for performing bootstrap

2020-08-07 Thread GitBox


bvaradar commented on a change in pull request #1869:
URL: https://github.com/apache/hudi/pull/1869#discussion_r467133890



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/bootstrap/index/HFileBootstrapIndex.java
##
@@ -240,13 +240,21 @@ private HoodieBootstrapIndexInfo fetchBootstrapIndexInfo() throws IOException {
 @Override
 public List<String> getIndexedPartitionPaths() {
   HFileScanner scanner = partitionIndexReader().getScanner(true, true);
-  return getAllKeys(scanner);
+  List<String> cellKeys = getAllKeys(scanner);
+  // cellKey is in this format:
+  // part=datestr=2452537//LATEST_TIMESTAMP/Put/vlen=2405/seqid=0
+  return cellKeys.stream().map(key -> key.split("//")[0].substring(5))
+  .distinct().collect(Collectors.toList());
 }

 @Override
 public List<String> getIndexedFileIds() {
   HFileScanner scanner = fileIdIndexReader().getScanner(true, true);
-  return getAllKeys(scanner);
+  List<String> cellKeys = getAllKeys(scanner);
+  // cellKey is in this format:
+  // part=datestr=2452537;fileid=baab9c50-c35e-49d1-b928-695aa7e37833//LATEST_TIMESTAMP/Put/vlen=2312/seqid=0
+  return cellKeys.stream().map(key -> key.split("//")[0].split(";")[1].split("=")[1])

Review comment:
   @zhedoubushishi : Can you instead copy the code from PR 
https://github.com/apache/hudi/pull/1933? It is essentially the same thing I 
fixed. I did not realize you had already tried to address the issue here. IMO, 
PR 1933 is cleaner.









[jira] [Commented] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2020-08-07 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17173242#comment-17173242
 ] 

Balaji Varadarajan commented on HUDI-1015:
--

Subtasks added to track all locations where we list all partitions. 
https://issues.apache.org/jira/browse/HUDI-1170 tracks the above log file 
listing case.

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1170) File Listing during log file rollback is affecting ingestion latency in S3

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1170:
-
Fix Version/s: 0.6.1

> File Listing during log file rollback is affecting ingestion latency in S3
> --
>
> Key: HUDI-1170
> URL: https://issues.apache.org/jira/browse/HUDI-1170
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> (Source : [https://github.com/apache/hudi/issues/1852])
>  
> : 
> sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:259)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:167)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:124)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:180)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listFiles(AzureBlobFileSystemStore.java:549)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:628)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:532)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:344)
>  org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
>  org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
>  
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:487)
>  org.apache.hudi.common.fs.FSUtils.getAllLogFiles(FSUtils.java:409)
>  org.apache.hudi.common.fs.FSUtils.getLatestLogVersion(FSUtils.java:420)
>  org.apache.hudi.common.fs.FSUtils.computeNextLogVersion(FSUtils.java:434)
>  org.apache.hudi.common.model.HoodieLogFile.rollOver(HoodieLogFile.java:115)
>  
> org.apache.hudi.common.table.log.HoodieLogFormatWriter.<init>(HoodieLogFormatWriter.java:101)
>  
> org.apache.hudi.common.table.log.HoodieLogFormat$WriterBuilder.build(HoodieLogFormat.java:249)
>  
> org.apache.hudi.io.HoodieAppendHandle.createLogWriter(HoodieAppendHandle.java:291)
>  org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:141)
>  org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:197)
>  
> org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:77)
>  
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:246)
>  
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
>  
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor$$Lambda$192/1449069739.call(Unknown
>  Source)
>  
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:105)





[jira] [Updated] (HUDI-1170) File Listing during log file rollback is affecting ingestion latency in S3

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1170:
-
Status: Open  (was: New)

> File Listing during log file rollback is affecting ingestion latency in S3
> --
>
> Key: HUDI-1170
> URL: https://issues.apache.org/jira/browse/HUDI-1170
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Balaji Varadarajan
>Priority: Major
>
> (Source : [https://github.com/apache/hudi/issues/1852])
>  
> : 
> sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:259)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:167)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:124)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:180)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listFiles(AzureBlobFileSystemStore.java:549)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:628)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:532)
>  
> shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:344)
>  org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
>  org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
>  
> org.apache.hudi.common.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:487)
>  org.apache.hudi.common.fs.FSUtils.getAllLogFiles(FSUtils.java:409)
>  org.apache.hudi.common.fs.FSUtils.getLatestLogVersion(FSUtils.java:420)
>  org.apache.hudi.common.fs.FSUtils.computeNextLogVersion(FSUtils.java:434)
>  org.apache.hudi.common.model.HoodieLogFile.rollOver(HoodieLogFile.java:115)
>  
> org.apache.hudi.common.table.log.HoodieLogFormatWriter.<init>(HoodieLogFormatWriter.java:101)
>  
> org.apache.hudi.common.table.log.HoodieLogFormat$WriterBuilder.build(HoodieLogFormat.java:249)
>  
> org.apache.hudi.io.HoodieAppendHandle.createLogWriter(HoodieAppendHandle.java:291)
>  org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:141)
>  org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:197)
>  
> org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:77)
>  
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:246)
>  
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
>  
> org.apache.hudi.table.action.commit.BaseCommitActionExecutor$$Lambda$192/1449069739.call(Unknown
>  Source)
>  
> org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:105)





[jira] [Updated] (HUDI-1170) File Listing during log file rollback is affecting ingestion latency in S3

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1170:
-
Description: 
(Source : [https://github.com/apache/hudi/issues/1852])

 

: 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:259)
 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:167)
 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:124)
 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:180)
 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listFiles(AzureBlobFileSystemStore.java:549)
 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:628)
 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:532)
 
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:344)
 org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
 org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
 
org.apache.hudi.common.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:487)
 org.apache.hudi.common.fs.FSUtils.getAllLogFiles(FSUtils.java:409)
 org.apache.hudi.common.fs.FSUtils.getLatestLogVersion(FSUtils.java:420)
 org.apache.hudi.common.fs.FSUtils.computeNextLogVersion(FSUtils.java:434)
 org.apache.hudi.common.model.HoodieLogFile.rollOver(HoodieLogFile.java:115)
 
org.apache.hudi.common.table.log.HoodieLogFormatWriter.<init>(HoodieLogFormatWriter.java:101)
 
org.apache.hudi.common.table.log.HoodieLogFormat$WriterBuilder.build(HoodieLogFormat.java:249)
 
org.apache.hudi.io.HoodieAppendHandle.createLogWriter(HoodieAppendHandle.java:291)
 org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:141)
 org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:197)
 
org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:77)
 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:246)
 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
 
org.apache.hudi.table.action.commit.BaseCommitActionExecutor$$Lambda$192/1449069739.call(Unknown
 Source)
 
org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:105)

  was:
: 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:259)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:167)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:124)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:180)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listFiles(AzureBlobFileSystemStore.java:549)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:628)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:532)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:344)
org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
org.apache.hudi.common.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:487)
org.apache.hudi.common.fs.FSUtils.getAllLogFiles(FSUtils.java:409)
org.apache.hudi.common.fs.FSUtils.getLatestLogVersion(FSUtils.java:420)
org.apache.hudi.common.fs.FSUtils.computeNextLogVersion(FSUtils.java:434)
org.apache.hudi.common.model.HoodieLogFile.rollOver(HoodieLogFile.java:115)
org.apache.hudi.common.table.log.HoodieLogFormatWriter.<init>(HoodieLogFormatWriter.java:101)
org.apache.hudi.common.table.log.HoodieLogFormat$WriterBuilder.build(HoodieLogFormat.java:249)
org.apache.hudi.io.HoodieAppendHandle.createLogWriter(HoodieAppendHandle.java:291)

[jira] [Created] (HUDI-1170) File Listing during log file rollback is affecting ingestion latency in S3

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1170:


 Summary: File Listing during log file rollback is affecting 
ingestion latency in S3
 Key: HUDI-1170
 URL: https://issues.apache.org/jira/browse/HUDI-1170
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Writer Core
Reporter: Balaji Varadarajan


: 
sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:352)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsHttpOperation.processResponse(AbfsHttpOperation.java:259)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:167)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:124)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.services.AbfsClient.listPath(AbfsClient.java:180)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listFiles(AzureBlobFileSystemStore.java:549)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:628)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.listStatus(AzureBlobFileSystemStore.java:532)
shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.listStatus(AzureBlobFileSystem.java:344)
org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1517)
org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1557)
org.apache.hudi.common.fs.HoodieWrapperFileSystem.listStatus(HoodieWrapperFileSystem.java:487)
org.apache.hudi.common.fs.FSUtils.getAllLogFiles(FSUtils.java:409)
org.apache.hudi.common.fs.FSUtils.getLatestLogVersion(FSUtils.java:420)
org.apache.hudi.common.fs.FSUtils.computeNextLogVersion(FSUtils.java:434)
org.apache.hudi.common.model.HoodieLogFile.rollOver(HoodieLogFile.java:115)
org.apache.hudi.common.table.log.HoodieLogFormatWriter.<init>(HoodieLogFormatWriter.java:101)
org.apache.hudi.common.table.log.HoodieLogFormat$WriterBuilder.build(HoodieLogFormat.java:249)
org.apache.hudi.io.HoodieAppendHandle.createLogWriter(HoodieAppendHandle.java:291)
org.apache.hudi.io.HoodieAppendHandle.init(HoodieAppendHandle.java:141)
org.apache.hudi.io.HoodieAppendHandle.doAppend(HoodieAppendHandle.java:197)
org.apache.hudi.table.action.deltacommit.DeltaCommitActionExecutor.handleUpdate(DeltaCommitActionExecutor.java:77)
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.handleUpsertPartition(BaseCommitActionExecutor.java:246)
org.apache.hudi.table.action.commit.BaseCommitActionExecutor.lambda$execute$caffe4c4$1(BaseCommitActionExecutor.java:102)
org.apache.hudi.table.action.commit.BaseCommitActionExecutor$$Lambda$192/1449069739.call(Unknown
 Source)
org.apache.spark.api.java.JavaRDDLike$$anonfun$mapPartitionsWithIndex$1.apply(JavaRDDLike.scala:105)





[jira] [Updated] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1015:
-
Priority: Major  (was: Blocker)

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>






[jira] [Updated] (HUDI-1015) Audit all getAllPartitionPaths() calls and keep em out of fast path

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1015:
-
Fix Version/s: (was: 0.6.0)
   0.6.1

> Audit all getAllPartitionPaths() calls and keep em out of fast path
> ---
>
> Key: HUDI-1015
> URL: https://issues.apache.org/jira/browse/HUDI-1015
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Common Core, Writer Core
>Reporter: Vinoth Chandar
>Assignee: Balaji Varadarajan
>Priority: Blocker
> Fix For: 0.6.1
>
>






[jira] [Created] (HUDI-1169) Audit Partition Listing : Snapshot Copier and Exporter Utilities

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1169:


 Summary: Audit Partition Listing : Snapshot Copier and Exporter 
Utilities
 Key: HUDI-1169
 URL: https://issues.apache.org/jira/browse/HUDI-1169
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Utilities
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


These are new tools in Hudi to copy/export Hudi data. Rarely used operation. 





[jira] [Created] (HUDI-1168) Audit Partition Listing : Savepoint Creation

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1168:


 Summary: Audit Partition Listing : Savepoint Creation
 Key: HUDI-1168
 URL: https://issues.apache.org/jira/browse/HUDI-1168
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


This is a seldom-used operation, but documenting it for completeness. 





[GitHub] [hudi] UZi5136225 commented on pull request #1931: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


UZi5136225 commented on pull request #1931:
URL: https://github.com/apache/hudi/pull/1931#issuecomment-670584668


   
![9C0310E1-BE19-4cbf-9476-5351C72908FC](https://user-images.githubusercontent.com/25769285/89663903-e600cf80-d908-11ea-9d96-ada9f7a039f2.png)
   
![9C0310E1-BE19-4cbf-9476-5351C72908FC](https://user-images.githubusercontent.com/25769285/89663997-0a5cac00-d909-11ea-9d74-9c36a6749022.png)
   







[GitHub] [hudi] UZi5136225 commented on pull request #1931: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


UZi5136225 commented on pull request #1931:
URL: https://github.com/apache/hudi/pull/1931#issuecomment-670584994


   
![9C0310E1-BE19-4cbf-9476-5351C72908FC](https://user-images.githubusercontent.com/25769285/89664026-19435e80-d909-11ea-936a-dff502005fff.png)
   
![BC04E798-8AFA-40a0-8FFF-43D6F89ED990](https://user-images.githubusercontent.com/25769285/89664057-27917a80-d909-11ea-8bb7-3e7c19224482.png)
   







[GitHub] [hudi] vinothchandar commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


vinothchandar commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670584927


   @umehrot2 some tests are failing; looking at them later today. 
   
   Before we head into the weekend, is this PR ready from your perspective? If 
so, I will take care of making the final changes and land. 
   
   







[GitHub] [hudi] UZi5136225 removed a comment on pull request #1931: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


UZi5136225 removed a comment on pull request #1931:
URL: https://github.com/apache/hudi/pull/1931#issuecomment-670584668


   
![9C0310E1-BE19-4cbf-9476-5351C72908FC](https://user-images.githubusercontent.com/25769285/89663903-e600cf80-d908-11ea-9d96-ada9f7a039f2.png)
   
![9C0310E1-BE19-4cbf-9476-5351C72908FC](https://user-images.githubusercontent.com/25769285/89663997-0a5cac00-d909-11ea-9d74-9c36a6749022.png)
   







[jira] [Created] (HUDI-1167) Audit Partition Listing : Hive Syncing

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1167:


 Summary: Audit Partition Listing : Hive Syncing
 Key: HUDI-1167
 URL: https://issues.apache.org/jira/browse/HUDI-1167
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Hive Integration
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


This is only done the first time, when we do not have lastCommitTimeSynced. 
Again, use consolidated metadata to avoid listing here.





[GitHub] [hudi] UZi5136225 commented on pull request #1931: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


UZi5136225 commented on pull request #1931:
URL: https://github.com/apache/hudi/pull/1931#issuecomment-670583885


   
![61960BDB-6E83-4086-BA4C-F0F0DBBC6722](https://user-images.githubusercontent.com/25769285/89663836-c8cc0100-d908-11ea-9d96-a0666441986a.png)
   







[jira] [Created] (HUDI-1166) Audit Partition Listing : Rollback By Listing

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1166:


 Summary: Audit Partition Listing : Rollback By Listing
 Key: HUDI-1166
 URL: https://issues.apache.org/jira/browse/HUDI-1166
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: Balaji Varadarajan


By default, we employ rollback by listing, which scans all partitions. There 
is a new strategy in place, rollback by marker files, which avoids listing. 
Once rollback by marker files is stabilized, we should make it the default.
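The marker-file strategy mentioned above avoids the scan because each write first drops a small marker recording what it is about to create; rollback then lists only the marker directory instead of every partition. A toy, self-contained sketch of the idea (the directory layout and marker naming are illustrative, not Hudi's actual format):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class MarkerRollbackSketch {

  // On write: create the data file and a marker recording what was written.
  static void write(Path table, Path markers, String relativeFile) throws IOException {
    Path data = table.resolve(relativeFile);
    Files.createDirectories(data.getParent());
    Files.writeString(data, "data");
    // Marker name encodes the relative data-file path (slashes escaped).
    Files.createDirectories(markers);
    Files.writeString(markers.resolve(relativeFile.replace('/', '%')), "");
  }

  // On rollback: list only the (small) marker directory, delete what it points to.
  // No per-partition listing of the table itself.
  static void rollback(Path table, Path markers) throws IOException {
    try (Stream<Path> s = Files.list(markers)) {
      for (Path marker : (Iterable<Path>) s::iterator) {
        String relative = marker.getFileName().toString().replace('%', '/');
        Files.deleteIfExists(table.resolve(relative));
        Files.delete(marker);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempDirectory("tbl");
    Path markers = tmp.resolve(".markers");
    write(tmp, markers, "2020/08/07/file1.parquet");
    rollback(tmp, markers);
    System.out.println(Files.exists(tmp.resolve("2020/08/07/file1.parquet"))); // false
  }
}
```

The trade-off is one extra small write per data file, in exchange for a rollback whose cost is proportional to the failed commit rather than the whole table.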





[jira] [Created] (HUDI-1165) Audit Partition Listing : Compaction Scheduling

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1165:


 Summary: Audit Partition Listing : Compaction Scheduling
 Key: HUDI-1165
 URL: https://issues.apache.org/jira/browse/HUDI-1165
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Compaction
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


When scheduling compaction, we list all partition paths to generate file-slices 
for compaction.





[GitHub] [hudi] jpugliesi commented on issue #1925: [SUPPORT] Support for Confluent Cloud SchemaRegistryProvider

2020-08-07 Thread GitBox


jpugliesi commented on issue #1925:
URL: https://github.com/apache/hudi/issues/1925#issuecomment-670581206


   @bvaradar brilliant, didn't think of this - I'll give it a try and report 
back.







[GitHub] [hudi] vinothchandar commented on pull request #1933: [HUDI-971] Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread GitBox


vinothchandar commented on pull request #1933:
URL: https://github.com/apache/hudi/pull/1933#issuecomment-670580147


   @bvaradar is this a release blocker? Sounds like it. 







[jira] [Updated] (HUDI-1164) Audit Partition Listing Location : CleanPlanner.getPartitionPathsForFullCleaning

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1164:
-
Description: 
This ticket is to track all known locations where we call 
FSUtils.getAllPartitionPaths 

This should only impact the first time we clean. The incremental cleaner 
should reduce the scope of partition-paths to be listed. But again, 
consolidated metadata would effectively avoid file-system-level listing.

  was:
This ticket is to track all known locations where we call 
FSUtils.getAllPartitionPaths 

Consolidated Metadata would help avoid this listing. 


> Audit Partition Listing Location : 
> CleanPlanner.getPartitionPathsForFullCleaning
> 
>
> Key: HUDI-1164
> URL: https://issues.apache.org/jira/browse/HUDI-1164
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> This ticket is to track all known locations where we call 
> FSUtils.getAllPartitionPaths 
> This should only impact the first time we do clean. Incremental cleaner 
> should reduce the scope of partition-paths to be listed.  But again, 
> consolidated metadata would effectively avoid file-system level listing 
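The incremental-cleaner idea above can be made concrete: rather than listing every partition on the file system, derive the candidate partitions from the commit metadata written since the last clean. A small stand-alone sketch (the `Commit` record and method names are illustrative stand-ins, not Hudi's classes):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class IncrementalCleanSketch {

  // Minimal stand-in for a commit's metadata: which partitions it touched.
  record Commit(String instantTime, Set<String> partitionsTouched) {}

  // Full clean would need a file-system listing of all partitions.
  // Incremental clean: union of partitions touched by commits after the
  // last clean's instant -- no file-system listing at all.
  static Set<String> partitionsToClean(List<Commit> timeline, String lastCleanInstant) {
    return timeline.stream()
        .filter(c -> c.instantTime().compareTo(lastCleanInstant) > 0)
        .flatMap(c -> c.partitionsTouched().stream())
        .collect(Collectors.toSet());
  }

  public static void main(String[] args) {
    List<Commit> timeline = List.of(
        new Commit("001", Set.of("2020/08/05")),
        new Commit("002", Set.of("2020/08/06", "2020/08/07")),
        new Commit("003", Set.of("2020/08/07")));
    // Only partitions touched after instant 001 are candidates for cleaning.
    System.out.println(partitionsToClean(timeline, "001"));
  }
}
```

This is why the listing cost only bites on the very first clean: after that, the timeline itself bounds the partitions to inspect.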





[jira] [Updated] (HUDI-1164) Audit Partition Listing Location : CleanPlanner.getPartitionPathsForFullCleaning

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1164:
-
Status: Open  (was: New)

> Audit Partition Listing Location : 
> CleanPlanner.getPartitionPathsForFullCleaning
> 
>
> Key: HUDI-1164
> URL: https://issues.apache.org/jira/browse/HUDI-1164
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> This ticket is to track all known locations where we call 
> FSUtils.getAllPartitionPaths 
> Consolidated Metadata would help avoid this listing. 





[jira] [Created] (HUDI-1164) Audit Partition Listing Location : CleanPlanner.getPartitionPathsForFullCleaning

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1164:


 Summary: Audit Partition Listing Location : 
CleanPlanner.getPartitionPathsForFullCleaning
 Key: HUDI-1164
 URL: https://issues.apache.org/jira/browse/HUDI-1164
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Index
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


This ticket is to track all known locations where we call 
FSUtils.getAllPartitionPaths 

Consolidated Metadata would help avoid this listing. 





[jira] [Updated] (HUDI-1163) Audit Partition Listing Location : Global Simple Index lookup

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1163:
-
Status: Open  (was: New)

> Audit Partition Listing Location : Global Simple Index lookup
> -
>
> Key: HUDI-1163
> URL: https://issues.apache.org/jira/browse/HUDI-1163
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> This ticket is to track all known locations where we call 
> FSUtils.getAllPartitionPaths 
> Consolidated Metadata would help avoid this listing. 





[jira] [Created] (HUDI-1163) Audit Partition Listing Location : Global Simple Index lookup

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1163:


 Summary: Audit Partition Listing Location : Global Simple Index 
lookup
 Key: HUDI-1163
 URL: https://issues.apache.org/jira/browse/HUDI-1163
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Index
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


This ticket is to track all known locations where we call 
FSUtils.getAllPartitionPaths 

Consolidated Metadata would help avoid this listing. 





[jira] [Updated] (HUDI-1162) Audit Partition Listing Location : Global Bloom Index lookup

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-1162:
-
Status: Open  (was: New)

> Audit Partition Listing Location : Global Bloom Index lookup
> 
>
> Key: HUDI-1162
> URL: https://issues.apache.org/jira/browse/HUDI-1162
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> This ticket is to track all known locations where we call 
> FSUtils.getAllPartitionPaths 
> Consolidated Metadata would help avoid this listing. 





[jira] [Created] (HUDI-1162) Audit Partition Listing Location : Global Bloom Index lookup

2020-08-07 Thread Balaji Varadarajan (Jira)
Balaji Varadarajan created HUDI-1162:


 Summary: Audit Partition Listing Location : Global Bloom Index 
lookup
 Key: HUDI-1162
 URL: https://issues.apache.org/jira/browse/HUDI-1162
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Index
Reporter: Balaji Varadarajan
 Fix For: 0.6.1


This ticket is to track all known locations where we call 
FSUtils.getAllPartitionPaths 

Consolidated Metadata would help avoid this listing. 





[jira] [Updated] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-971:

Status: Patch Available  (was: In Progress)

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return 
> unclean partitions because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].





[jira] [Updated] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan updated HUDI-971:

Status: In Progress  (was: Open)

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return 
> unclean partitions because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].





[jira] [Updated] (HUDI-971) Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-971:

Labels: pull-request-available  (was: )

> Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean 
> partition name
> ---
>
> Key: HUDI-971
> URL: https://issues.apache.org/jira/browse/HUDI-971
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Assignee: Balaji Varadarajan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> When calling HFileBootstrapIndexReader.getIndexedPartitions(), it will return 
> unclean partitions because of 
> [https://github.com/apache/hbase/blob/rel/1.2.3/hbase-common/src/main/java/org/apache/hadoop/hbase/CellUtil.java#L768].





[GitHub] [hudi] bvaradar commented on pull request #1933: [HUDI-971] Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread GitBox


bvaradar commented on pull request #1933:
URL: https://github.com/apache/hudi/pull/1933#issuecomment-670573069


   @zhedoubushishi: Can you review this? It would impact your bootstrap CLI. 
 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bvaradar opened a new pull request #1933: [HUDI-971] Fix HFileBootstrapIndexReader.getIndexedPartitions() returns unclean partition name

2020-08-07 Thread GitBox


bvaradar opened a new pull request #1933:
URL: https://github.com/apache/hudi/pull/1933


   







[GitHub] [hudi] Yungthuis opened a new pull request #1932: [MINOR]Remove unused import

2020-08-07 Thread GitBox


Yungthuis opened a new pull request #1932:
URL: https://github.com/apache/hudi/pull/1932


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] UZi5136225 opened a new pull request #1931: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


UZi5136225 opened a new pull request #1931:
URL: https://github.com/apache/hudi/pull/1931


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   hudi support prometheus/pushgateway
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
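For reference, enabling a Prometheus pushgateway reporter would look roughly like the following writer configuration. These property names reflect how the feature surfaced in later Hudi releases; the exact names in this PR may differ:

```properties
hoodie.metrics.on=true
hoodie.metrics.reporter.type=PROMETHEUS_PUSHGATEWAY
hoodie.metrics.pushgateway.host=localhost
hoodie.metrics.pushgateway.port=9091
hoodie.metrics.pushgateway.report.period.seconds=30
```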







[GitHub] [hudi] UZi5136225 closed pull request #1930: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


UZi5136225 closed pull request #1930:
URL: https://github.com/apache/hudi/pull/1930


   







[GitHub] [hudi] UZi5136225 opened a new pull request #1930: [HUDI-210] hudi-support-prometheus-pushgateway

2020-08-07 Thread GitBox


UZi5136225 opened a new pull request #1930:
URL: https://github.com/apache/hudi/pull/1930


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   Hudi support prometheus
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] UZi5136225 closed pull request #1726: [HUDI-210]Hudi support prometheus

2020-08-07 Thread GitBox


UZi5136225 closed pull request #1726:
URL: https://github.com/apache/hudi/pull/1726


   







[jira] [Updated] (HUDI-1159) Parquet encryption policy interface

2020-08-07 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated HUDI-1159:
---
Description: Provide an interface for Parquet column encryption policy 
engine clients.

> Parquet encryption policy interface
> ---
>
> Key: HUDI-1159
> URL: https://issues.apache.org/jira/browse/HUDI-1159
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Priority: Major
>
> Provide an interface for Parquet column encryption policy engine clients.
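One possible shape for such an interface is sketched below. All names here are illustrative assumptions for discussion, not the actual Hudi or parquet-mr API: a policy decides which columns to encrypt and which master-key id to use per column, and engine clients plug in their own implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical policy interface for Parquet column encryption.
interface ColumnEncryptionPolicy {
  /** Columns (by dotted path) that must be written encrypted. */
  List<String> encryptedColumns(String tableName);

  /** Key id used to look up the master key for a given column. */
  String keyIdForColumn(String tableName, String columnPath);
}

// A trivial policy: encrypt a fixed set of PII columns, one key id per column.
class PiiColumnPolicy implements ColumnEncryptionPolicy {
  @Override public List<String> encryptedColumns(String tableName) {
    return Arrays.asList("user.email", "user.ssn");
  }
  @Override public String keyIdForColumn(String tableName, String columnPath) {
    return tableName + ":" + columnPath;
  }
}

public class PolicyDemo {
  public static void main(String[] args) {
    ColumnEncryptionPolicy policy = new PiiColumnPolicy();
    System.out.println(policy.encryptedColumns("trips").contains("user.ssn")); // true
    System.out.println(policy.keyIdForColumn("trips", "user.email"));          // trips:user.email
  }
}
```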





[jira] [Updated] (HUDI-1159) Parquet encryption policy interface

2020-08-07 Thread Gidon Gershinsky (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gidon Gershinsky updated HUDI-1159:
---
Summary: Parquet encryption policy interface  (was: Encryption policy 
interface)

> Parquet encryption policy interface
> ---
>
> Key: HUDI-1159
> URL: https://issues.apache.org/jira/browse/HUDI-1159
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Gidon Gershinsky
>Priority: Major
>






[GitHub] [hudi] Mathieu1124 commented on pull request #1901: [HUDI-532]Add java doc for hudi test suite test classes

2020-08-07 Thread GitBox


Mathieu1124 commented on pull request #1901:
URL: https://github.com/apache/hudi/pull/1901#issuecomment-670506233


@cheshta2904 @pratyakshsharma I have addressed all your concerns, thanks 
for your detailed review :)







[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1901: [HUDI-532]Add java doc for hudi test suite test classes

2020-08-07 Thread GitBox


Mathieu1124 commented on a change in pull request #1901:
URL: https://github.com/apache/hudi/pull/1901#discussion_r467025816



##
File path: 
hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/dag/HiveSyncDagGenerator.java
##
@@ -31,6 +31,9 @@
 import org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode;
 import org.apache.hudi.integ.testsuite.dag.nodes.InsertNode;
 
+/**
+ * An implementation of {@link WorkflowDagGenerator}, that generate hive sync 
workflowDag.

Review comment:
   done

##
File path: 
hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/dag/HiveSyncDagGeneratorMOR.java
##
@@ -31,6 +31,9 @@
 import org.apache.hudi.integ.testsuite.dag.nodes.HiveSyncNode;
 import org.apache.hudi.integ.testsuite.dag.nodes.InsertNode;
 
+/**
+ * An implementation of {@link WorkflowDagGenerator}, that generate hive sync 
workflowDag for MOR table.

Review comment:
   done









[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1901: [HUDI-532]Add java doc for hudi test suite test classes

2020-08-07 Thread GitBox


Mathieu1124 commented on a change in pull request #1901:
URL: https://github.com/apache/hudi/pull/1901#discussion_r467025753



##
File path: 
hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/dag/ComplexDagGenerator.java
##
@@ -33,6 +33,9 @@
 import org.apache.hudi.integ.testsuite.dag.nodes.ValidateNode;
 import org.apache.spark.api.java.JavaRDD;
 
+/**
+ * An implementation of {@link WorkflowDagGenerator}, that generate complex 
workflowDag.

Review comment:
   > generate -> generates.
   
   done









[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1901: [HUDI-532]Add java doc for hudi test suite test classes

2020-08-07 Thread GitBox


Mathieu1124 commented on a change in pull request #1901:
URL: https://github.com/apache/hudi/pull/1901#discussion_r467025989



##
File path: 
hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/utils/TestUtils.java
##
@@ -28,6 +28,9 @@
 import org.apache.spark.api.java.JavaSparkContext;
 import org.apache.spark.sql.SparkSession;
 
+/**
+ * A utility class for test purpose.

Review comment:
   done

##
File path: 
hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/utils/TestUtils.java
##
@@ -45,6 +48,15 @@
 return dataGenerator.generateGenericRecords(numRecords);
   }
 
+  /**
+   * Method help to create avro files and save it to file.
+   *
+   * @param jsc   {@link JavaSparkContext}.
+   * @param sparkSession  {@link SparkSession}.
+   * @param basePath  The basePath where files written to.

Review comment:
   done

##
File path: 
hudi-integ-test/src/test/java/org/apache/hudi/integ/testsuite/TestDFSHoodieTestSuiteWriterAdapter.java
##
@@ -52,6 +52,9 @@
 import org.junit.jupiter.api.Test;
 import org.mockito.Mockito;
 
+/**
+ * An adapter of {@link HoodieTestSuiteWriter} help to test write DFS file.

Review comment:
   done









[GitHub] [hudi] Mathieu1124 commented on a change in pull request #1901: [HUDI-532]Add java doc for hudi test suite test classes

2020-08-07 Thread GitBox


Mathieu1124 commented on a change in pull request #1901:
URL: https://github.com/apache/hudi/pull/1901#discussion_r467025252



##
File path: hudi-integ-test/src/test/java/org/apache/hudi/integ/ITTestBase.java
##
@@ -48,6 +48,9 @@
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertNotEquals;
 
+/**
+ * Base test class for ITTest help to run cmd and generate data.

Review comment:
   done, thanks for your review 









[GitHub] [hudi] leesf opened a new pull request #1929: [HUDI-1160] Support update partial fields for CoW table

2020-08-07 Thread GitBox


leesf opened a new pull request #1929:
URL: https://github.com/apache/hudi/pull/1929


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Updated] (HUDI-1160) Support update partial fields for CoW table

2020-08-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1160:
-
Labels: pull-request-available  (was: )

> Support update partial fields for CoW table
> ---
>
> Key: HUDI-1160
> URL: https://issues.apache.org/jira/browse/HUDI-1160
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: leesf
>Assignee: leesf
>Priority: Major
>  Labels: pull-request-available
>
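The semantics of a "partial fields" update can be illustrated with a simple merge. This is a sketch of the concept only, using plain maps rather than Hudi's record payload API: fields the incoming record leaves null keep their current value, which is what a partial-update payload for a CoW table would have to implement when combining records at write time.

```java
import java.util.HashMap;
import java.util.Map;

public class PartialUpdate {
  // Merge an incoming partial record into the current record: non-null
  // incoming fields win, null fields fall back to the current value.
  static Map<String, Object> merge(Map<String, Object> current, Map<String, Object> incoming) {
    Map<String, Object> out = new HashMap<>(current);
    incoming.forEach((k, v) -> { if (v != null) out.put(k, v); });
    return out;
  }

  public static void main(String[] args) {
    Map<String, Object> cur = new HashMap<>();
    cur.put("id", 1); cur.put("name", "a"); cur.put("city", "sf");
    Map<String, Object> upd = new HashMap<>();
    upd.put("id", 1); upd.put("city", "la"); upd.put("name", null); // name untouched

    Map<String, Object> merged = merge(cur, upd);
    System.out.println(merged.get("name")); // a
    System.out.println(merged.get("city")); // la
  }
}
```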






[jira] [Created] (HUDI-1161) Support update partial fields for MoR table

2020-08-07 Thread leesf (Jira)
leesf created HUDI-1161:
---

 Summary: Support update partial fields for MoR table
 Key: HUDI-1161
 URL: https://issues.apache.org/jira/browse/HUDI-1161
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: leesf
Assignee: leesf








[jira] [Created] (HUDI-1160) Support update partial fields for CoW table

2020-08-07 Thread leesf (Jira)
leesf created HUDI-1160:
---

 Summary: Support update partial fields for CoW table
 Key: HUDI-1160
 URL: https://issues.apache.org/jira/browse/HUDI-1160
 Project: Apache Hudi
  Issue Type: Sub-task
  Components: Writer Core
Reporter: leesf
Assignee: leesf








[GitHub] [hudi] nsivabalan commented on pull request #1912: [HUDI-1098] Adding TimedWaitOnAppearConsistencyGuard

2020-08-07 Thread GitBox


nsivabalan commented on pull request #1912:
URL: https://github.com/apache/hudi/pull/1912#issuecomment-670480225


   @umehrot2: Would appreciate it if you could confirm you agree with the 
approach here. Before I go ahead and address the feedback, I want to have 
consensus. 







[GitHub] [hudi] nsivabalan commented on pull request #1912: [HUDI-1098] Adding TimedWaitOnAppearConsistencyGuard

2020-08-07 Thread GitBox


nsivabalan commented on pull request #1912:
URL: https://github.com/apache/hudi/pull/1912#issuecomment-670479909


   @bvaradar: since you suggested making the TimedWaitOnAppearCG the default 
opt-in, I would suggest introducing a new config for the sleep time so that we 
can set its default to 2 or 3 seconds. The existing config we are repurposing 
has a default value of 400 ms, and users may not realize they need to adjust 
it, since this is going to be opt-in by default. 
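The guard being discussed boils down to a timed polling loop. A minimal sketch, with names, defaults, and a local-filesystem check all assumed for illustration (the real guard polls an object store through Hudi's FileSystem abstraction): poll until the path becomes visible or the timeout elapses, sleeping a configurable interval between checks — the interval whose default is under debate above.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class TimedWaitOnAppear {
  // Return true once path exists; give up after timeoutMs, sleeping sleepMs
  // between visibility checks.
  static boolean waitForAppear(Path path, long timeoutMs, long sleepMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!Files.exists(path)) {
      if (System.currentTimeMillis() >= deadline) return false;
      try {
        Thread.sleep(sleepMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    Path p = Files.createTempFile("marker", ".tmp");
    System.out.println(waitForAppear(p, 1000, 100));                          // true
    System.out.println(waitForAppear(p.resolveSibling("missing"), 300, 100)); // false
  }
}
```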







[GitHub] [hudi] nsivabalan commented on a change in pull request #1858: [HUDI-1014] Adding Upgrade and downgrade infra for smooth transitioning from list based rollback to marker based rollback

2020-08-07 Thread GitBox


nsivabalan commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r466991187



##
File path: 
hudi-client/src/main/java/org/apache/hudi/client/AbstractHoodieWriteClient.java
##
@@ -186,10 +188,14 @@ public HoodieMetrics getMetrics() {
* Get HoodieTable and init {@link Timer.Context}.
*
* @param operationType write operation type
+   * @param instantTime current inflight instant time
* @return HoodieTable
*/
-  protected HoodieTable getTableAndInitCtx(WriteOperationType operationType) {
+  protected HoodieTable getTableAndInitCtx(WriteOperationType operationType, 
String instantTime) {
 HoodieTableMetaClient metaClient = createMetaClient(true);
+if (config.shouldRollbackUsingMarkers()) {

Review comment:
   This was my thinking behind this guard: if someone wishes to stay with 
list-based rollback, why execute an upgrade step that specifically does work 
to assist marker-based rollback, which would then never be used? I am not very 
strong on this though. But the tests would need some fixes, as I currently 
rely on creating commits and marker files through the client with marker-based 
rollback disabled. If we remove this guard, the tests would need to manually 
create all data files and marker files. I am not saying that as a reason to 
keep the guard, just noting there is some extra work to be done. 

##
File path: 
hudi-client/src/main/java/org/apache/hudi/table/action/rollback/RollbackUtils.java
##
@@ -63,4 +84,156 @@ static HoodieRollbackStat 
mergeRollbackStat(HoodieRollbackStat stat1, HoodieRoll
 return new HoodieRollbackStat(stat1.getPartitionPath(), 
successDeleteFiles, failedDeleteFiles, commandBlocksCount);
   }
 
+  /**

Review comment:
   yes









[GitHub] [hudi] nsivabalan commented on a change in pull request #1858: [HUDI-1014] Adding Upgrade and downgrade infra for smooth transitioning from list based rollback to marker based rollback

2020-08-07 Thread GitBox


nsivabalan commented on a change in pull request #1858:
URL: https://github.com/apache/hudi/pull/1858#discussion_r466989376



##
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/SparkMain.java
##
@@ -329,9 +341,34 @@ private static int deleteSavepoint(JavaSparkContext jsc, 
String savepointTime, S
 }
   }
 
+  /**
+   * Upgrade or downgrade hoodie table.
+   * @param jsc instance of {@link JavaSparkContext} to use.
+   * @param basePath base path of the dataset.
+   * @param toVersion version to which upgrade/downgrade to be done.
+   * @return 0 if success, else -1.
+   * @throws Exception
+   */
+  protected static int upgradeOrDowngradeHoodieDataset(JavaSparkContext jsc, 
String basePath, String toVersion) throws Exception {
+HoodieWriteConfig config = getWriteConfig(basePath);
+HoodieTableMetaClient metaClient = 
ClientUtils.createMetaClient(jsc.hadoopConfiguration(), config, false);
+try {
+  UpgradeDowngradeUtil.doUpgradeOrDowngrade(metaClient, 
HoodieTableVersion.valueOf(toVersion), config, jsc, null);

Review comment:
   I am not sure "migrate" is the right terminology to use here. Isn't 
"migrate" used for moving from one system to another? This is more of a 
version upgrade or downgrade within the same system (Hudi).  









[jira] [Created] (HUDI-1159) Encryption policy interface

2020-08-07 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created HUDI-1159:
--

 Summary: Encryption policy interface
 Key: HUDI-1159
 URL: https://issues.apache.org/jira/browse/HUDI-1159
 Project: Apache Hudi
  Issue Type: New Feature
Reporter: Gidon Gershinsky








[GitHub] [hudi] Mathieu1124 commented on pull request #1886: [HUDI-1122]Introduce a kafka implementation of hoodie write commit ca…

2020-08-07 Thread GitBox


Mathieu1124 commented on pull request #1886:
URL: https://github.com/apache/hudi/pull/1886#issuecomment-670453889


   > 
   > 
   > > I was wondering, can we move this implementation to the hudi-client 
module, just like the way all the implementations of metrics do?
   > 
   > I think we can move this down the line. `hudi-client` or `hudi-spark` 
taking a direct dependency on kafka does not feel that clean to me. Maybe 
file a follow-up JIRA?
   
   ok 







[GitHub] [hudi] vinothchandar commented on pull request #1702: [HUDI-426] Bootstrap datasource integration

2020-08-07 Thread GitBox


vinothchandar commented on pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#issuecomment-670414375


   @umehrot2 I rebased this after landing @garyli1019 's PR. Please take a look 
at `DefaultSource` again to make sure things are ok 







[GitHub] [hudi] vinothchandar closed pull request #1678: [HUDI-242] Metadata Bootstrap changes

2020-08-07 Thread GitBox


vinothchandar closed pull request #1678:
URL: https://github.com/apache/hudi/pull/1678


   







[GitHub] [hudi] vinothchandar commented on pull request #1678: [HUDI-242] Metadata Bootstrap changes

2020-08-07 Thread GitBox


vinothchandar commented on pull request #1678:
URL: https://github.com/apache/hudi/pull/1678#issuecomment-670393921


   closing this. There is a followup JIRA assigned to you @bvaradar with some 
of the unaddressed comments from here







[GitHub] [hudi] vinothchandar commented on a change in pull request #1760: [HUDI-1040] Update apis for spark3 compatibility

2020-08-07 Thread GitBox


vinothchandar commented on a change in pull request #1760:
URL: https://github.com/apache/hudi/pull/1760#discussion_r466890310



##
File path: hudi-spark/src/main/scala/org/apache/hudi/AvroConversionUtils.scala
##
@@ -78,4 +79,21 @@ object AvroConversionUtils {
   def convertAvroSchemaToStructType(avroSchema: Schema): StructType = {
 SchemaConverters.toSqlType(avroSchema).dataType.asInstanceOf[StructType]
   }
+
+  private def deserializeRow(encoder: ExpressionEncoder[Row], internalRow: 
InternalRow): Row = {
+// First attempt to use spark2 API for deserialization, otherwise attempt 
with spark3 API
+try {
+  val spark2method = encoder.getClass.getMethods.filter(method => 
method.getName.equals("fromRow")).last
+  spark2method.invoke(encoder, internalRow).asInstanceOf[Row]
+} catch {
+  case e: NoSuchElementException => spark3Deserialize(encoder, internalRow)

Review comment:
   @nsivabalan are you able to take a swing at this for 0.6.0? this would 
be good to have 
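The pattern in the quoted Scala snippet — resolve an API by method name via reflection and fall back when it is absent — can be shown generically in Java. The classes below are stand-ins, not Spark's ExpressionEncoder; only the lookup-then-fallback structure mirrors the snippet (spark2's `fromRow` plays the role of the old-style method).

```java
import java.lang.reflect.Method;

public class ReflectiveFallback {
  public static class OldApi {                        // stands in for the spark2-era API
    public String fromRow(String row) { return "old:" + row; }
  }
  public static class NewApi {                        // stands in for the newer API
    public String createDeserializer(String row) { return "new:" + row; }
  }

  // Prefer a method named "fromRow" if the object has one; otherwise fall
  // back to the newer entry point. Checked reflection failures are wrapped.
  static String deserialize(Object encoder, String row) {
    try {
      Method m = null;
      for (Method cand : encoder.getClass().getMethods()) {
        if (cand.getName().equals("fromRow")) { m = cand; break; }
      }
      if (m == null) {
        m = encoder.getClass().getMethod("createDeserializer", String.class);
      }
      return (String) m.invoke(encoder, row);
    } catch (ReflectiveOperationException e) {
      throw new RuntimeException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(deserialize(new OldApi(), "r1")); // old:r1
    System.out.println(deserialize(new NewApi(), "r1")); // new:r1
  }
}
```

Note the Scala version relies on `filter(...).last` throwing `NoSuchElementException` to trigger the fallback; the explicit null check above avoids using an exception for control flow.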









[GitHub] [hudi] pratyakshsharma commented on pull request #1928: [HUDI-1026]: removed slf4j dependency from HoodieClientTestHarness

2020-08-07 Thread GitBox


pratyakshsharma commented on pull request #1928:
URL: https://github.com/apache/hudi/pull/1928#issuecomment-670390527


   LGTM!







[GitHub] [hudi] cheshta2904 opened a new pull request #1928: [HUDI-1026]: removed slf4j dependency from HoodieClientTestHarness

2020-08-07 Thread GitBox


cheshta2904 opened a new pull request #1928:
URL: https://github.com/apache/hudi/pull/1928


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Updated] (HUDI-1026) Remove slf4j dependency from HoodieClientTestHarness

2020-08-07 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1026:
-
Labels: pull-request-available  (was: )

> Remove slf4j dependency from HoodieClientTestHarness
> 
>
> Key: HUDI-1026
> URL: https://issues.apache.org/jira/browse/HUDI-1026
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: newbie
>Reporter: Nishith Agarwal
>Assignee: Cheshta Sharma
>Priority: Minor
>  Labels: pull-request-available
>
> Right now, the HoodieClientTestHarness is using slf4j while the whole project 
> is on log4j.





[GitHub] [hudi] luffyd commented on issue #1913: [SUPPORT][MOR]Too many open files on IOException and Crash

2020-08-07 Thread GitBox


luffyd commented on issue #1913:
URL: https://github.com/apache/hudi/issues/1913#issuecomment-670385009


   It is this one, which seems to be the latest. It is whatever comes with AWS 
EMR:
   
/mnt2/yarn/usercache/hadoop/appcache/application_1596743154329_0001/container_1596743154329_0001_01_01/__spark_libs__/parquet-hadoop-1.10.1-spark-amzn-1.jar
   
   
/mnt2/yarn/usercache/hadoop/appcache/application_1596743154329_0001/container_1596743154329_0001_01_01/__spark_libs__/parquet-hadoop-bundle-1.6.0.jar
   
   






