[GitHub] [hudi] Trevor-zhang commented on a change in pull request #1779: [HUDI-1062]Remove unnecessary maxEvent check and add some log in KafkaOffsetGen

2020-07-03 Thread GitBox


Trevor-zhang commented on a change in pull request #1779:
URL: https://github.com/apache/hudi/pull/1779#discussion_r449744347



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestKafkaSource.java
##
@@ -191,13 +191,13 @@ public void testJsonKafkaSourceWithDefaultUpperCap() {
  */
 testUtils.sendMessages(TEST_TOPIC_NAME, 
Helpers.jsonifyRecords(dataGenerator.generateInserts("000", 1000)));
 InputBatch> fetch1 = 
kafkaSource.fetchNewDataInAvroFormat(Option.empty(), Long.MAX_VALUE);
-assertEquals(500, fetch1.getBatch().get().count());
+assertEquals(1000, fetch1.getBatch().get().count());

Review comment:
   > Hi @Trevor-zhang Would you please add a new test case to verify the case 
where `sourceLimit` is less than the number of generated insert records?
   
   @yanghua OK, I'd like to.
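   For reference, a minimal sketch of such a test (a suggestion only, reusing 
the helpers, fields, and generic types already present in TestKafkaSource, and 
assuming `sourceLimit` caps the number of events read):
   
   ```java
   @Test
   public void testJsonKafkaSourceWithSourceLimitBelowRecordCount() {
     // Produce 1000 records, but cap the read at 500 via sourceLimit.
     testUtils.sendMessages(TEST_TOPIC_NAME,
         Helpers.jsonifyRecords(dataGenerator.generateInserts("000", 1000)));
     InputBatch<JavaRDD<GenericRecord>> fetch =
         kafkaSource.fetchNewDataInAvroFormat(Option.empty(), 500);
     // Only sourceLimit records should come back.
     assertEquals(500, fetch.getBatch().get().count());
   }
   ```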





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #1779: [HUDI-1062]Remove unnecessary maxEvent check and add some log in KafkaOffsetGen

2020-07-03 Thread GitBox


yanghua commented on a change in pull request #1779:
URL: https://github.com/apache/hudi/pull/1779#discussion_r449744177



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestKafkaSource.java
##
@@ -191,13 +191,13 @@ public void testJsonKafkaSourceWithDefaultUpperCap() {
  */
 testUtils.sendMessages(TEST_TOPIC_NAME, 
Helpers.jsonifyRecords(dataGenerator.generateInserts("000", 1000)));
 InputBatch> fetch1 = 
kafkaSource.fetchNewDataInAvroFormat(Option.empty(), Long.MAX_VALUE);
-assertEquals(500, fetch1.getBatch().get().count());
+assertEquals(1000, fetch1.getBatch().get().count());

Review comment:
   Hi @Trevor-zhang Would you please add a new test case to verify the case 
where `sourceLimit` is less than the number of generated insert records?





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] bhasudha commented on issue #1777: [SUPPORT] org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were

2020-07-03 Thread GitBox


bhasudha commented on issue #1777:
URL: https://github.com/apache/hudi/issues/1777#issuecomment-653726518


   How big is your `results` df? You don't have to do the CONCAT to combine 
multiple columns into the record key. You could simply pass them as 
comma-separated columns and set the config 
`hoodie.datasource.write.keygenerator.class` to 
`org.apache.hudi.keygen.ComplexKeyGenerator`. Other questions:
   
   1. Are the above all the configs you have passed, with defaults used for the rest?
   2. Could you also share your Spark UI to help debug further?
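   
   For illustration, a hedged sketch of that approach using the Java Spark API 
(the column names, table name, precombine field, and path below are 
placeholders, not values from this issue):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   class ComplexKeyWriteSketch {
     // Pass the record key columns comma-separated and let ComplexKeyGenerator
     // combine them, instead of building a CONCAT column by hand.
     static void write(Dataset<Row> results) {
       results.write()
           .format("org.apache.hudi")
           .option("hoodie.table.name", "my_table")
           .option("hoodie.datasource.write.recordkey.field", "col_a,col_b")
           .option("hoodie.datasource.write.keygenerator.class",
               "org.apache.hudi.keygen.ComplexKeyGenerator")
           .option("hoodie.datasource.write.precombine.field", "ts")
           .mode(SaveMode.Append)
           .save("s3://bucket/path/my_table");
     }
   }
   ```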
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Build failed in Jenkins: hudi-snapshot-deployment-0.5 #328

2020-07-03 Thread Apache Jenkins Server
See 


Changes:


--
[...truncated 2.35 KB...]

/home/jenkins/tools/maven/apache-maven-3.5.4/conf:
logging
settings.xml
toolchains.xml

/home/jenkins/tools/maven/apache-maven-3.5.4/conf/logging:
simplelogger.properties

/home/jenkins/tools/maven/apache-maven-3.5.4/lib:
aopalliance-1.0.jar
cdi-api-1.0.jar
cdi-api.license
commons-cli-1.4.jar
commons-cli.license
commons-io-2.5.jar
commons-io.license
commons-lang3-3.5.jar
commons-lang3.license
ext
guava-20.0.jar
guice-4.2.0-no_aop.jar
jansi-1.17.1.jar
jansi-native
javax.inject-1.jar
jcl-over-slf4j-1.7.25.jar
jcl-over-slf4j.license
jsr250-api-1.0.jar
jsr250-api.license
maven-artifact-3.5.4.jar
maven-artifact.license
maven-builder-support-3.5.4.jar
maven-builder-support.license
maven-compat-3.5.4.jar
maven-compat.license
maven-core-3.5.4.jar
maven-core.license
maven-embedder-3.5.4.jar
maven-embedder.license
maven-model-3.5.4.jar
maven-model-builder-3.5.4.jar
maven-model-builder.license
maven-model.license
maven-plugin-api-3.5.4.jar
maven-plugin-api.license
maven-repository-metadata-3.5.4.jar
maven-repository-metadata.license
maven-resolver-api-1.1.1.jar
maven-resolver-api.license
maven-resolver-connector-basic-1.1.1.jar
maven-resolver-connector-basic.license
maven-resolver-impl-1.1.1.jar
maven-resolver-impl.license
maven-resolver-provider-3.5.4.jar
maven-resolver-provider.license
maven-resolver-spi-1.1.1.jar
maven-resolver-spi.license
maven-resolver-transport-wagon-1.1.1.jar
maven-resolver-transport-wagon.license
maven-resolver-util-1.1.1.jar
maven-resolver-util.license
maven-settings-3.5.4.jar
maven-settings-builder-3.5.4.jar
maven-settings-builder.license
maven-settings.license
maven-shared-utils-3.2.1.jar
maven-shared-utils.license
maven-slf4j-provider-3.5.4.jar
maven-slf4j-provider.license
org.eclipse.sisu.inject-0.3.3.jar
org.eclipse.sisu.inject.license
org.eclipse.sisu.plexus-0.3.3.jar
org.eclipse.sisu.plexus.license
plexus-cipher-1.7.jar
plexus-cipher.license
plexus-component-annotations-1.7.1.jar
plexus-component-annotations.license
plexus-interpolation-1.24.jar
plexus-interpolation.license
plexus-sec-dispatcher-1.4.jar
plexus-sec-dispatcher.license
plexus-utils-3.1.0.jar
plexus-utils.license
slf4j-api-1.7.25.jar
slf4j-api.license
wagon-file-3.1.0.jar
wagon-file.license
wagon-http-3.1.0-shaded.jar
wagon-http.license
wagon-provider-api-3.1.0.jar
wagon-provider-api.license

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/ext:
README.txt

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native:
freebsd32
freebsd64
linux32
linux64
osx
README.txt
windows32
windows64

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/freebsd64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux32:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/linux64:
libjansi.so

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/osx:
libjansi.jnilib

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows32:
jansi.dll

/home/jenkins/tools/maven/apache-maven-3.5.4/lib/jansi-native/windows64:
jansi.dll
Finished /home/jenkins/tools/maven/apache-maven-3.5.4 Directory Listing :
Detected current version as: 
'HUDI_home=
0.6.0-SNAPSHOT'
[INFO] Scanning for projects...
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-spark-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-spark-bundle_${scala.binary.version}:[unknown-version], 

 line 26, column 15
[WARNING] 
[WARNING] Some problems were encountered while building the effective model for 
org.apache.hudi:hudi-utilities-bundle_2.11:jar:0.6.0-SNAPSHOT
[WARNING] 'artifactId' contains an expression but should be a constant. @ 
org.apache.hudi:hudi-utilities-bundle_${s

[GitHub] [hudi] Trevor-zhang commented on a change in pull request #1779: [HUDI-1062]Remove unnecessary maxEvent check and add some log in KafkaOffsetGen

2020-07-03 Thread GitBox


Trevor-zhang commented on a change in pull request #1779:
URL: https://github.com/apache/hudi/pull/1779#discussion_r449731135



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestKafkaSource.java
##
@@ -191,13 +191,13 @@ public void testJsonKafkaSourceWithDefaultUpperCap() {
  */
 testUtils.sendMessages(TEST_TOPIC_NAME, 
Helpers.jsonifyRecords(dataGenerator.generateInserts("000", 1000)));
 InputBatch> fetch1 = 
kafkaSource.fetchNewDataInAvroFormat(Option.empty(), Long.MAX_VALUE);
-assertEquals(500, fetch1.getBatch().get().count());
+assertEquals(1000, fetch1.getBatch().get().count());

Review comment:
   > why exactly does this test have to change? could you please clarify
   
   Hi @vinothchandar, the value of `maxEventsToReadFromKafka` has changed in 
this commit, so the value of `fetch1.getBatch().get().count()` also changes in 
the test case.
   
   > there is an empty `git` file in this commit.. can you please remove this?
   
   @vinothchandar I have removed this file.
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on issue #1787: Exception During Insert

2020-07-03 Thread GitBox


leesf commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-653707211


   @asheeshgarg In fact, enabling the embedded timeline server reduces calls to 
FileSystem/S3. Before version 0.5.3 it was disabled by default, so I think it 
is OK to disable it in 0.5.3.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf edited a comment on issue #1787: Exception During Insert

2020-07-03 Thread GitBox


leesf edited a comment on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-653707211


   @asheeshgarg In fact, enabling the embedded timeline server reduces calls to 
FileSystem/S3, with no other impact. Before version 0.5.3 it was disabled by 
default, so I think it is OK to disable it in 0.5.3.
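   
   For context, a hedged sketch of turning the embedded timeline server off on 
the write path (Java Spark API; the dataset, table name, key/precombine fields, 
and path are placeholders, and the remaining write options stay whatever the 
job already uses):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   class DisableTimelineServerSketch {
     static void write(Dataset<Row> df) {
       df.write()
           .format("org.apache.hudi")
           // Matches the pre-0.5.3 default; the enabled server is what reduces
           // FileSystem/S3 calls, so disabling it is only a workaround.
           .option("hoodie.embed.timeline.server", "false")
           .option("hoodie.table.name", "my_table")                   // placeholder
           .option("hoodie.datasource.write.recordkey.field", "id")   // placeholder
           .option("hoodie.datasource.write.precombine.field", "ts")  // placeholder
           .mode(SaveMode.Append)
           .save("s3://bucket/path/my_table");
     }
   }
   ```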



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on issue #1791: [SUPPORT] Does DeltaStreamer support listening to multiple kafka topics and upserting to multiple tables?

2020-07-03 Thread GitBox


vinothchandar commented on issue #1791:
URL: https://github.com/apache/hudi/issues/1791#issuecomment-653706540


   Support for this has landed on master.
   @pratyakshsharma can you chime in here and possibly work closely with 
@masterlemmi and get it hardened more before the 0.6.0 release?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] masterlemmi opened a new issue #1791: [SUPPORT] Does DeltaStreamer support listening to multiple kafka topics and upserting to multiple tables?

2020-07-03 Thread GitBox


masterlemmi opened a new issue #1791:
URL: https://github.com/apache/hudi/issues/1791


   **_Tips before filing an issue_**
   
   - Have you gone through our 
[FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)? Yes
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   1. I need to listen to multiple Kafka topics and save messages to 
corresponding tables. Does DeltaStreamer allow that? And does the processing of 
each stream/topic run in parallel?
   
   2. I am also exploring Spark Streaming and using the Spark DataSource, 
basically something like this:
   `DSTREAM.map(x => (x.topic, List(x.value())))
 .reduceByKey(_ ::: _)
 .map(processAndSavetoHudi)
 .print()`
   
   Is it possible to run Hudi upserts from the executor tasks (i.e. from the 
DStream.map function)? The foreachRDD function doesn't process streams in 
parallel, so I am trying to use the map function and save each stream to Hudi 
from the workers.
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1.
   2.
   3.
   4.
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 5.2
   
   * Spark version : 2.4.5.
   
   * Hive version : 2.3.3
   
   * Hadoop version : 2.8
   
   * Storage (HDFS/S3/GCS..) : no
   
   * Running on Docker? (yes/no) :
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stacktrace of the error.```
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-979) AWSDMSPayload delete handling with MOR

2020-07-03 Thread leesf (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151144#comment-17151144
 ] 

leesf commented on HUDI-979:


[~vinoth] We support the delete flag in MoR in our internal branch, and I think 
we can help fix it. [~309637554], do you have interest in picking it up?

> AWSDMSPayload delete handling with MOR
> --
>
> Key: HUDI-979
> URL: https://issues.apache.org/jira/browse/HUDI-979
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: DeltaStreamer
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1549] 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on a change in pull request #1746: [HUDI-996] Add functional test suite for hudi-utilities

2020-07-03 Thread GitBox


vinothchandar commented on a change in pull request #1746:
URL: https://github.com/apache/hudi/pull/1746#discussion_r449719578



##
File path: 
hudi-client/src/test/java/org/apache/hudi/testutils/FunctionalTestHarness.java
##
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.testutils;
+
+import org.apache.hudi.client.HoodieWriteClient;
+import org.apache.hudi.common.testutils.minicluster.HdfsTestService;
+
+import org.apache.hadoop.fs.FileStatus;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.hdfs.DistributedFileSystem;
+import org.apache.hadoop.hdfs.MiniDFSCluster;
+import org.apache.spark.api.java.JavaSparkContext;
+import org.apache.spark.sql.SQLContext;
+import org.apache.spark.sql.SparkSession;
+import org.junit.jupiter.api.AfterAll;
+import org.junit.jupiter.api.BeforeEach;
+import org.junit.jupiter.api.io.TempDir;
+
+import java.io.IOException;
+
+public class FunctionalTestHarness implements SparkProvider, DFSProvider {
+
+  private static transient SparkSession spark;
+  private static transient SQLContext sqlContext;
+  private static transient JavaSparkContext jsc;
+
+  private static transient HdfsTestService hdfsTestService;
+  private static transient MiniDFSCluster dfsCluster;
+  private static transient DistributedFileSystem dfs;
+
+  /**
+   * An indicator of the initialization status.
+   */
+  protected boolean initialized = false;

Review comment:
   Food for thought: if tests are run parallely, in the same jvm. 
(parallelism option in surefire)... this boolean may not be sufficient for 
synchronization.. i.e two tests can attempt to create these test resources in 
parallel.
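   
   For illustration only (not the project's code), a double-checked 
initialization sketch that would keep the shared setup safe if Surefire runs 
several test classes in parallel in one JVM; the resource names are stand-ins:
   
   ```java
   public class SharedTestResources {
     private static volatile boolean initialized = false;
     private static final Object LOCK = new Object();
   
     // Only the first caller performs the expensive setup; later callers skip it.
     public static void ensureInitialized(Runnable setup) {
       if (!initialized) {               // fast path once setup has completed
         synchronized (LOCK) {
           if (!initialized) {           // re-check while holding the lock
             setup.run();                // e.g. start MiniDFSCluster / SparkSession
             initialized = true;
           }
         }
       }
     }
   }
   ```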

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/UtilitiesFunctionalTestSuite.java
##
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.functional;
+
+import org.junit.platform.runner.JUnitPlatform;
+import org.junit.platform.suite.api.IncludeTags;
+import org.junit.platform.suite.api.SelectPackages;
+import org.junit.runner.RunWith;
+
+@RunWith(JUnitPlatform.class)
+@SelectPackages("org.apache.hudi.utilities.functional")
+@IncludeTags("functional")
+public class UtilitiesFunctionalTestSuite {

Review comment:
   should this be abstract? 

##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/functional/UtilitiesFunctionalTestSuite.java
##
@@ -0,0 +1,32 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.utilities.functional;
+
+import org.junit.pl

[GitHub] [hudi] vinothchandar commented on a change in pull request #1779: [HUDI-1062]Remove unnecessary maxEvent check and add some log in KafkaOffsetGen

2020-07-03 Thread GitBox


vinothchandar commented on a change in pull request #1779:
URL: https://github.com/apache/hudi/pull/1779#discussion_r449718990



##
File path: 
hudi-utilities/src/test/java/org/apache/hudi/utilities/sources/TestKafkaSource.java
##
@@ -191,13 +191,13 @@ public void testJsonKafkaSourceWithDefaultUpperCap() {
  */
 testUtils.sendMessages(TEST_TOPIC_NAME, 
Helpers.jsonifyRecords(dataGenerator.generateInserts("000", 1000)));
 InputBatch> fetch1 = 
kafkaSource.fetchNewDataInAvroFormat(Option.empty(), Long.MAX_VALUE);
-assertEquals(500, fetch1.getBatch().get().count());
+assertEquals(1000, fetch1.getBatch().get().count());

Review comment:
   why exactly does this test have to change? could you please clarify





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #1779: [HUDI-1062]Remove unnecessary maxEvent check and add some log in KafkaOffsetGen

2020-07-03 Thread GitBox


vinothchandar commented on pull request #1779:
URL: https://github.com/apache/hudi/pull/1779#issuecomment-653694966


   there is an empty `git` file in this commit.. can you please remove this?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar merged pull request #1781: [MINOR] Relocate jetty during shading/packaging for Databricks runtime

2020-07-03 Thread GitBox


vinothchandar merged pull request #1781:
URL: https://github.com/apache/hudi/pull/1781


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated (37ea795 -> 574dcf9)

2020-07-03 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from 37ea795  [HUDI-539] Make HoodieROTablePathFilter implement 
Configurable (#1784)
 add 574dcf9  [MINOR] Relocate jetty during shading/packaging for 
Databricks runtime (#1781)

No new revisions were added by this update.

Summary of changes:
 packaging/hudi-spark-bundle/pom.xml | 27 +++
 1 file changed, 15 insertions(+), 12 deletions(-)



[jira] [Resolved] (HUDI-539) RO Path filter does not pick up hadoop configs from the spark context

2020-07-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-539.
-
Resolution: Fixed

Thanks!  Merged! 

> RO Path filter does not pick up hadoop configs from the spark context
> -
>
> Key: HUDI-539
> URL: https://issues.apache.org/jira/browse/HUDI-539
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.5.1
> Environment: Spark version : 2.4.4
> Hadoop version : 2.7.3
> Databricks Runtime: 6.1
>Reporter: Sam Somuah
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hi,
>  I'm trying to use hudi to write to one of the Azure storage container file 
> systems, ADLS Gen 2 (abfs://). ABFS:// is one of the whitelisted file 
> schemes. The issue I'm facing is that in {{HoodieROTablePathFilter}} it tries 
> to get a file path passing in a blank hadoop configuration. This manifests as 
> {{java.io.IOException: No FileSystem for scheme: abfss}} because it doesn't 
> have any of the configuration in the environment.
> The problematic line is
> [https://github.com/apache/incubator-hudi/blob/2bb0c21a3dd29687e49d362ed34f050380ff47ae/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L96]
>  
> {code:java}
>  Stacktrace
>  java.io.IOException: No FileSystem for scheme: abfss
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
>  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>  at 
> org.apache.hudi.hadoop.HoodieROTablePathFilter.accept(HoodieROTablePathFilter.java:96)
>  at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$16.apply(InMemoryFileIndex.scala:349){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar merged pull request #1784: [HUDI-539] Make HoodieROTablePathFilter implement Configurable

2020-07-03 Thread GitBox


vinothchandar merged pull request #1784:
URL: https://github.com/apache/hudi/pull/1784


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated: [HUDI-539] Make HoodieROTablePathFilter implement Configurable (#1784)

2020-07-03 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new 37ea795  [HUDI-539] Make HoodieROTablePathFilter implement 
Configurable (#1784)
37ea795 is described below

commit 37ea79566de190605d0a941d6b65e2f46196de88
Author: andreitaleanu 
AuthorDate: Fri Jul 3 23:39:53 2020 +0300

[HUDI-539] Make HoodieROTablePathFilter implement Configurable (#1784)

Co-authored-by: Andrei Taleanu 
---
 .../org/apache/hudi/hadoop/HoodieROTablePathFilter.java | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java
 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java
index d27d6ad..86199d2 100644
--- 
a/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java
+++ 
b/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java
@@ -18,6 +18,7 @@
 
 package org.apache.hudi.hadoop;
 
+import org.apache.hadoop.conf.Configurable;
 import org.apache.hudi.common.config.SerializableConfiguration;
 import org.apache.hudi.common.model.HoodieBaseFile;
 import org.apache.hudi.common.model.HoodiePartitionMetadata;
@@ -50,7 +51,7 @@ import java.util.stream.Collectors;
  * hadoopConf.setClass("mapreduce.input.pathFilter.class", 
org.apache.hudi.hadoop .HoodieROTablePathFilter.class,
  * org.apache.hadoop.fs.PathFilter.class)
  */
-public class HoodieROTablePathFilter implements PathFilter, Serializable {
+public class HoodieROTablePathFilter implements Configurable, PathFilter, 
Serializable {
 
   private static final long serialVersionUID = 1L;
   private static final Logger LOG = 
LogManager.getLogger(HoodieROTablePathFilter.class);
@@ -190,4 +191,14 @@ public class HoodieROTablePathFilter implements 
PathFilter, Serializable {
   throw new HoodieException(msg, e);
 }
   }
+
+  @Override
+  public void setConf(Configuration conf) {
+this.conf = new SerializableConfiguration(conf);
+  }
+
+  @Override
+  public Configuration getConf() {
+return conf.get();
+  }
 }



[jira] [Created] (HUDI-1068) HoodieGlobalBloomIndex does not correctly send deletes to older partition when partition path is updated

2020-07-03 Thread Vinoth Chandar (Jira)
Vinoth Chandar created HUDI-1068:


 Summary: HoodieGlobalBloomIndex does not correctly send deletes to 
older partition when partition path is updated
 Key: HUDI-1068
 URL: https://issues.apache.org/jira/browse/HUDI-1068
 Project: Apache Hudi
  Issue Type: Bug
Reporter: Vinoth Chandar


[https://github.com/apache/hudi/issues/1745]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1068) HoodieGlobalBloomIndex does not correctly send deletes to older partition when partition path is updated

2020-07-03 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151057#comment-17151057
 ] 

Vinoth Chandar commented on HUDI-1068:
--

https://github.com/vinothchandar/incubator-hudi/tree/issue-1745-debug

> HoodieGlobalBloomIndex does not correctly send deletes to older partition 
> when partition path is updated
> 
>
> Key: HUDI-1068
> URL: https://issues.apache.org/jira/browse/HUDI-1068
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinoth Chandar
>Priority: Blocker
>
> [https://github.com/apache/hudi/issues/1745]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1068) HoodieGlobalBloomIndex does not correctly send deletes to older partition when partition path is updated

2020-07-03 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1068:
-
Fix Version/s: 0.6.0

> HoodieGlobalBloomIndex does not correctly send deletes to older partition 
> when partition path is updated
> 
>
> Key: HUDI-1068
> URL: https://issues.apache.org/jira/browse/HUDI-1068
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1745]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1067) Replace the integer version field with HoodieLogBlockVersion data structure

2020-07-03 Thread vinoyang (Jira)
vinoyang created HUDI-1067:
--

 Summary: Replace the integer version field with 
HoodieLogBlockVersion data structure
 Key: HUDI-1067
 URL: https://issues.apache.org/jira/browse/HUDI-1067
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Common Core
Reporter: vinoyang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-539) RO Path filter does not pick up hadoop configs from the spark context

2020-07-03 Thread Andrei Taleanu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150787#comment-17150787
 ] 

Andrei Taleanu commented on HUDI-539:
-

[~vinoth] I have opened #1784 to address this issue, would you please take a 
look? Thanks

> RO Path filter does not pick up hadoop configs from the spark context
> -
>
> Key: HUDI-539
> URL: https://issues.apache.org/jira/browse/HUDI-539
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Affects Versions: 0.5.1
> Environment: Spark version : 2.4.4
> Hadoop version : 2.7.3
> Databricks Runtime: 6.1
>Reporter: Sam Somuah
>Assignee: Vinoth Chandar
>Priority: Major
>  Labels: bug-bash-0.6.0, pull-request-available
> Fix For: 0.6.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Hi,
>  I'm trying to use hudi to write to one of the Azure storage container file 
> systems, ADLS Gen 2 (abfs://). ABFS:// is one of the whitelisted file 
> schemes. The issue I'm facing is that in {{HoodieROTablePathFilter}} it tries 
> to get a file path passing in a blank hadoop configuration. This manifests as 
> {{java.io.IOException: No FileSystem for scheme: abfss}} because it doesn't 
> have any of the configuration in the environment.
> The problematic line is
> [https://github.com/apache/incubator-hudi/blob/2bb0c21a3dd29687e49d362ed34f050380ff47ae/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java#L96]
>  
> {code:java}
>  Stacktrace
>  java.io.IOException: No FileSystem for scheme: abfss
>  at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
>  at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
>  at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
>  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
>  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
>  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
>  at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
>  at 
> org.apache.hudi.hadoop.HoodieROTablePathFilter.accept(HoodieROTablePathFilter.java:96)
>  at 
> org.apache.spark.sql.execution.datasources.InMemoryFileIndex$$anonfun$16.apply(InMemoryFileIndex.scala:349){code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1068) HoodieGlobalBloomIndex does not correctly send deletes to older partition when partition path is updated

2020-07-03 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan reassigned HUDI-1068:
-

Assignee: sivabalan narayanan

> HoodieGlobalBloomIndex does not correctly send deletes to older partition 
> when partition path is updated
> 
>
> Key: HUDI-1068
> URL: https://issues.apache.org/jira/browse/HUDI-1068
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Vinoth Chandar
>Assignee: sivabalan narayanan
>Priority: Blocker
> Fix For: 0.6.0
>
>
> [https://github.com/apache/hudi/issues/1745]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] asheeshgarg commented on issue #1787: Exception During Insert

2020-07-03 Thread GitBox


asheeshgarg commented on issue #1787:
URL: https://github.com/apache/hudi/issues/1787#issuecomment-653641776


   @leesf after adding the options it works fine. Does setting the option to 
false have any impact?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zuyanton opened a new issue #1790: [SUPPORT] Querying MoR tables with DecimalType columns via Spark SQL fails

2020-07-03 Thread GitBox


zuyanton opened a new issue #1790:
URL: https://github.com/apache/hudi/issues/1790


   It looks like Hudi does not handle DecimalType properly. One of the symptoms 
is that when we try to use a decimal column as the partition column, Hudi 
creates folders that look like '[0, 0, 0, 0, 0, 0, 0, 0, 27, -63, 109, 103, 78, 
-56, 0, 0]' instead of the expected '2'. The other symptom is that querying MoR 
tables fails when the table contains Decimal columns. It looks like the failure 
happens when Hudi tries to digest decimals coming from the log (avro) files. 
When all the data is in the parquet file, Spark SQL works just fine.  
   
   **To Reproduce**
   Consider the following example:
   Step 1 - create table
   ```
   spark.sql("drop table if exists testTable_ro")
   spark.sql("drop table if exists testTable_rt")
   var df = Seq((1, 2, 3)).toDF("pk", "partition", "sort_key")
   df = df.withColumn("decimal_column", df.col("pk").cast("decimal(38,18)"))
   
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://someBucket/testTable")
   ```
   Since it's the first time we wrote to the table, there are no log files yet; 
Hudi created only one parquet file. Therefore a simple select query works as 
expected:
   ```
   scala> spark.sql("select * from testTable_rt").show
   
   
   +-------------------+--------------------+------------------+----------------------+--------------------+---+--------+----------------+---------+
   |_hoodie_commit_time|_hoodie_commit_seqno|_hoodie_record_key|_hoodie_partition_path|   _hoodie_file_name| pk|sort_key|  decimal_column|partition|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+--------+----------------+---------+
   |     20200703164509|  20200703164509_0_8|              pk:1|                     2|cd1293b4-3876-426...|  1|       3|            1.00|        2|
   +-------------------+--------------------+------------------+----------------------+--------------------+---+--------+----------------+---------+
   
   ```
   Step 2 - update the table with the same dataframe
   ```
   
df.write.format("org.apache.hudi").options(hudiOptions).mode(SaveMode.Append).save("s3://someBucket/testTable")
   ```  
   Since we are updating an existing record, Hudi this time creates a log 
(avro) file on top of the existing parquet file. Running the same query as in 
step one results in an exception:
   ```
   scala> spark.sql("select * from testTable_rt").show
   
   java.lang.ClassCastException: org.apache.hadoop.io.BytesWritable cannot be 
cast to org.apache.hadoop.hive.serde2.io.HiveDecimalWritable
at 
org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableHiveDecimalObjectInspector.getPrimitiveWritableObject(WritableHiveDecimalObjectInspector.java:41)
at 
org.apache.spark.sql.hive.HiveShim$.toCatalystDecimal(HiveShim.scala:107)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:414)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$11.apply(TableReader.scala:413)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:442)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:433)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:291)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:283)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.

[GitHub] [hudi] srsteinmetz commented on issue #1737: [SUPPORT]spark streaming create small parquet files

2020-07-03 Thread GitBox


srsteinmetz commented on issue #1737:
URL: https://github.com/apache/hudi/issues/1737#issuecomment-653619184


   When I was originally load testing this table I was sending almost 
exclusively inserts. According to this documentation it seems expected that 
inserts end up in new parquet files: 
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=135860485. 
When I changed my load generator to start sending updates I noticed that the 
parquet files were compressed as expected. Now with only updates being sent the 
.hoodie folder seems to show cleaning happening as expected.
   
   However, for our use case some of our tables will be almost exclusively 
inserts, so I'm worried the current behavior will result in many parquet files 
and degraded performance. From reading this thread it seems like this behavior 
might be related to 
https://hudi.apache.org/docs/configurations.html#logFileToParquetCompressionRatio
 but from the description it's still not clear to me how this property should 
be configured to get the desired behavior.
   
   For some reason GitHub is failing to upload my .hoodie folder screenshot. 
Will try again to upload in a bit.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] WaterKnight1998 commented on issue #1776: [SUPPORT] org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V

2020-07-03 Thread GitBox


WaterKnight1998 commented on issue #1776:
URL: https://github.com/apache/hudi/issues/1776#issuecomment-653611644


   > @WaterKnight1998 hudi is not yet fully supported on Hadoop 3. Will get 
this filed towards that jira
   
   Yes, I solved it by using Hadoop 2.10. What do you mean by "Will get this 
filed towards that jira"?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] WilliamWhispell opened a new issue #1789: [SUPPORT] What jars are needed to run on AWS Glue 1.0 ?

2020-07-03 Thread GitBox


WilliamWhispell opened a new issue #1789:
URL: https://github.com/apache/hudi/issues/1789


   **Describe the problem you faced**
   
   I'm trying to run a hudi write inside a glue job. My understanding is that 
Glue 1.0 uses Spark 2.4.3 and Hadoop 2.8.5.
   
   I've added hudi-spark-bundle_2.11-0.5.3.jar and spark-avro_2.11-2.4.3.jar as 
dependant jars on the Glue job.
   
   However, often the job fails with:
   
   class threw exception: java.lang.NoSuchMethodError: 
org.eclipse.jetty.util.thread.QueuedThreadPool.<init>(III)V
at 
io.javalin.core.util.JettyServerUtil.defaultServer(JettyServerUtil.kt:43)
at io.javalin.Javalin.(Javalin.java:94)
at io.javalin.Javalin.create(Javalin.java:107)
at 
org.apache.hudi.timeline.service.TimelineService.startService(TimelineService.java:102)
at 
org.apache.hudi.client.embedded.EmbeddedTimelineService.startServer(EmbeddedTimelineService.java:74)
at 
org.apache.hudi.client.AbstractHoodieClient.startEmbeddedServerView(AbstractHoodieClient.java:102)
at 
org.apache.hudi.client.AbstractHoodieClient.(AbstractHoodieClient.java:69)
at 
org.apache.hudi.client.AbstractHoodieWriteClient.(AbstractHoodieWriteClient.java:83)
at 
org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:137)
at 
org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:124)
at 
org.apache.hudi.client.HoodieWriteClient.(HoodieWriteClient.java:120)
at 
org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:195)
at 
org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:135)
at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:108)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
at GlueApp$.main(script_2020-07-03-14-45-41.scala:84)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
com.amazonaws.services.glue.util.GlueExceptionWrapper$$anonfun$1.apply$mcV$sp(GlueExceptionWrapper.scala:35)
at com.amazonaws.
   
   This makes me think I have some type of dependency issue.
   
   Reading over the release notes 
https://hudi.apache.org/releases.html#migration-guide-for-this-release-2 - the 
only requirement I could find for spark was: IMPORTANT This version requires 
your runtime spark version to be upgraded to 2.4+.
   
   So I would expect this to work on Spark 2.4.3 but I'm not sure if the two 
jars I added are all that is needed.
   
   Here is what my code looks like (Scala 2.11):
   
   object GlueApp {
 def main(sysArgs: Array[String]) {
   val sc: SparkContext = new SparkContext()
   val glueContext: GlueContext = new GlueContext(sc)
   val spark: Spark

[GitHub] [hudi] nsivabalan edited a comment on pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-07-03 Thread GitBox


nsivabalan edited a comment on pull request #1469:
URL: https://github.com/apache/hudi/pull/1469#issuecomment-653546494


   @lamber-ken @vinothchandar : I took a stab at the global bloom index V2. I 
don't have permissions to lamberken's repo and hence couldn't update his 
branch. Here is my 
[branch](https://github.com/nsivabalan/hudi/tree/bloomIndexV2) and 
[commit](https://github.com/nsivabalan/hudi/commit/7f59a67743bbeee162181e2a2ca725fe9656cb8f)
 link. And 
[here](https://github.com/nsivabalan/hudi/commit/7f59a67743bbeee162181e2a2ca725fe9656cb8f#diff-fa376d426f0652ffeb1e1f807795196e)
 is the link to the GlobalBloomIndexV2. Please check it out. I have added and 
fixed tests for the same. 
   
   Also, I have two questions/clarifications.
   1. With regular bloom index V2, why do we need to sort based on both 
partition path and record keys? Why wouldn't just the partition path suffice?
   2. Correct me if I am wrong, but there is one corner case where both bloom 
index V2 and the global version need to be fixed. The fix might incur an 
additional left outer join, so I wanted to confirm whether that is feasible.
   Let's say that for an incoming record, one or more files are returned after 
the range and bloom lookup, but in the key checker none of those files actually 
contains the record key. In this scenario, the output of tag location may not 
contain the record at all.
   
   If this is a feasible case, the fix I can think of is: do not return empty 
candidates from LazyRangeAndBloomChecker, so that the result after 
LazyKeyChecker will not contain such records. With this fix, LazyKeyChecker 
will return only records that exist in storage. Once we have the result from 
LazyKeyChecker, we might have to do a left outer join with the incoming records 
to find the non-existent records and add them back to the final tagged record 
list. 
   
   Similar fix needs to be done with global version as well. 
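   
   For what it's worth, a simplified, hedged sketch of that left outer join 
using the plain Spark Java API (placeholder string payloads stand in for Hudi's 
actual record and location types):
   
   ```java
   import org.apache.spark.api.java.JavaPairRDD;
   import org.apache.spark.api.java.JavaSparkContext;
   import scala.Tuple2;
   
   import java.util.Arrays;
   
   public class LeftOuterJoinSketch {
     public static void main(String[] args) {
       JavaSparkContext jsc = new JavaSparkContext("local[1]", "sketch");
       // Incoming batch keyed by record key.
       JavaPairRDD<String, String> incoming = jsc.parallelizePairs(Arrays.asList(
           new Tuple2<>("key1", "payload1"), new Tuple2<>("key2", "payload2")));
       // Suppose only key1 survived the range/bloom lookup plus key check.
       JavaPairRDD<String, String> keyChecked = jsc.parallelizePairs(Arrays.asList(
           new Tuple2<>("key1", "file-001")));
       // key2 comes back untagged instead of silently disappearing.
       incoming.leftOuterJoin(keyChecked)
           .mapValues(v -> v._2().isPresent() ? v._2().get() : "untagged")
           .collect()
           .forEach(t -> System.out.println(t._1() + " -> " + t._2()));
       jsc.stop();
     }
   }
   ```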
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nsivabalan commented on pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-07-03 Thread GitBox


nsivabalan commented on pull request #1469:
URL: https://github.com/apache/hudi/pull/1469#issuecomment-653546494


   @lamber-ken @vinothchandar : I took a stab at the global bloom index V2. I 
don't have permissions to lamberken's repo and hence couldn't update his 
branch. Here is my 
[branch](https://github.com/nsivabalan/hudi/tree/bloomIndexV2) and 
[commit](https://github.com/nsivabalan/hudi/commit/7f59a67743bbeee162181e2a2ca725fe9656cb8f)
 link. Please check it out. I have added and fixed tests for the same. 
   
   Also, I have two questions/clarifications.
   1. With regular bloom index V2, why do we need to sort based on both 
partition path and record keys? Why wouldn't just the partition path suffice?
   2. Correct me if I am wrong, but there is one corner case where both bloom 
index V2 and the global version need to be fixed. The fix might incur an 
additional left outer join, so I wanted to confirm whether that is feasible.
   Let's say that for an incoming record, one or more files are returned after 
the range and bloom lookup, but in the key checker none of those files actually 
contains the record key. In this scenario, the output of tag location may not 
contain the record at all.
   
   If this is a feasible case, the fix I can think of is: do not return empty 
candidates from LazyRangeAndBloomChecker, so that the result after 
LazyKeyChecker will not contain such records. With this fix, LazyKeyChecker 
will return only records that exist in storage. Once we have the result from 
LazyKeyChecker, we might have to do a left outer join with the incoming records 
to find the non-existent records and add them back to the final tagged record 
list. 
   
   Similar fix needs to be done with global version as well. 
   
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] nandurj commented on issue #1586: [SUPPORT] DMS with 2 key example

2020-07-03 Thread GitBox


nandurj commented on issue #1586:
URL: https://github.com/apache/hudi/issues/1586#issuecomment-653527159


   I am using multiple keys to create CoW tables with the properties below:
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator
   hoodie.datasource.write.recordkey.field=customer_id,product_id
   
   But the DeltaStreamer is not picking up the second key; it is only picking 
up the first key, customer_id. I have verified this: the _hoodie_record_key 
value on the table only shows customer_id.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] leesf commented on a change in pull request #1783: [HUDI-1064] Trim hoodie table name

2020-07-03 Thread GitBox


leesf commented on a change in pull request #1783:
URL: https://github.com/apache/hudi/pull/1783#discussion_r449544401



##
File path: hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
##
@@ -52,7 +52,7 @@ private[hudi] object HoodieSparkSqlWriter {
 
 val sparkContext = sqlContext.sparkContext
 val path = parameters.get("path")
-val tblName = parameters.get(HoodieWriteConfig.TABLE_NAME)
+val tblName = parameters.get(HoodieWriteConfig.TABLE_NAME).get.trim

Review comment:
   if parameters do not contain `HoodieWriteConfig.TABLE_NAME`, 
`parameters.get(HoodieWriteConfig.TABLE_NAME).get.trim` will throw a 
`java.util.NoSuchElementException: None.get` exception





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] shenh062326 commented on pull request #1732: [HUDI-1004] Support update metrics in HoodieDeltaStreamerMetrics

2020-07-03 Thread GitBox


shenh062326 commented on pull request #1732:
URL: https://github.com/apache/hudi/pull/1732#issuecomment-653458290


   @vinothchandar Can you take a look at this pull request?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org