[jira] [Updated] (HUDI-4040) Add customerColumnPartitionerRow support

2022-05-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4040:
-
Labels: pull-request-available  (was: )

> Add customerColumnPartitionerRow support
> 
>
> Key: HUDI-4040
> URL: https://issues.apache.org/jira/browse/HUDI-4040
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4040) Add customerColumnPartitionerRow support

2022-05-04 Thread Hui An (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hui An updated HUDI-4040:
-
Description: 
Like RDDCustomColumnsSortPartitioner, we can use this partitioner to sort by 
the custom columns that users specify.
For example:

{code:scala}
df.write.format("hudi")
.option(HoodieWriteConfig.TABLE_NAME, "test_table")
.option(OPERATION.key, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
.option(RECORDKEY_FIELD.key, "session_id")
.option(PARTITIONPATH_FIELD.key, "date")
.option("hoodie.bulkinsert.user.defined.partitioner.class", 
"org.apache.hudi.execution.bulkinsert.CustomColumnsSortPartitionerWithRows")
.option("hoodie.bulkinsert.user.defined.partitioner.sort.columns", "page_type")
.mode(SaveMode.Append)
.save("hdfs://test/test_table")
{code}


> Add customerColumnPartitionerRow support
> 
>
> Key: HUDI-4040
> URL: https://issues.apache.org/jira/browse/HUDI-4040
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Hui An
>Priority: Major
>  Labels: pull-request-available
>
> Like RDDCustomColumnsSortPartitioner, we can use this partitioner to sort by 
> the custom columns that users specify.
> For example:
> {code:scala}
> df.write.format("hudi")
> .option(HoodieWriteConfig.TABLE_NAME, "test_table")
> .option(OPERATION.key, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
> .option(RECORDKEY_FIELD.key, "session_id")
> .option(PARTITIONPATH_FIELD.key, "date")
> .option("hoodie.bulkinsert.user.defined.partitioner.class", 
> "org.apache.hudi.execution.bulkinsert.CustomColumnsSortPartitionerWithRows")
> .option("hoodie.bulkinsert.user.defined.partitioner.sort.columns", 
> "page_type")
> .mode(SaveMode.Append)
> .save("hdfs://test/test_table")
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4041) Support compact according to precombinekey in the RealtimeCompactedRecordReader class

2022-05-04 Thread zyp (Jira)
zyp created HUDI-4041:
-

 Summary: Support compact according to precombinekey in the 
RealtimeCompactedRecordReader class
 Key: HUDI-4041
 URL: https://issues.apache.org/jira/browse/HUDI-4041
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: zyp






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] boneanxs opened a new pull request, #5502: [HUDI-4040]Bulk insert: Add customColumnsSortpartitionerWithRows

2022-05-04 Thread GitBox


boneanxs opened a new pull request, #5502:
URL: https://github.com/apache/hudi/pull/5502

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   Like RDDCustomColumnsSortPartitioner, we can use this partitioner to sort by 
the custom columns that users specify.
   For example:
   
   ```scala
   df.write.format("hudi")
   .option(HoodieWriteConfig.TABLE_NAME, "test_table")
   .option(OPERATION.key, DataSourceWriteOptions.BULK_INSERT_OPERATION_OPT_VAL)
   .option(RECORDKEY_FIELD.key, "session_id")
   .option(PARTITIONPATH_FIELD.key, "date")
   .option("hoodie.bulkinsert.user.defined.partitioner.class", 
"org.apache.hudi.execution.bulkinsert.CustomColumnsSortPartitionerWithRows")
   .option("hoodie.bulkinsert.user.defined.partitioner.sort.columns", 
"page_type")
   .mode(SaveMode.Append)
   .save("hdfs://test/test_table")
   ```
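
   For context, a row-based custom-columns sort partitioner can be sketched as 
follows. This is a minimal illustration under assumptions, not the class added 
by this PR: it relies only on the standard Spark `Dataset` API, and the 
hypothetical `sortColumns` parameter stands in for the configured 
`hoodie.bulkinsert.user.defined.partitioner.sort.columns` value.
   
   ```java
   import org.apache.spark.sql.Column;
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.functions;
   
   // Hypothetical helper, not the PR's implementation: repartition the incoming
   // rows and sort each Spark partition by the user-specified columns, mirroring
   // what RDDCustomColumnsSortPartitioner does for RDDs.
   public class CustomColumnsRowSorter {
     public static Dataset<Row> repartitionAndSort(
         Dataset<Row> rows, String[] sortColumns, int outputPartitions) {
       Column[] cols = new Column[sortColumns.length];
       for (int i = 0; i < sortColumns.length; i++) {
         cols[i] = functions.col(sortColumns[i]);
       }
       // sortWithinPartitions avoids a global ordering (unlike orderBy) while
       // still clustering sort-column values together inside each output file.
       return rows.repartition(outputPartitions).sortWithinPartitions(cols);
     }
   }
   ```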
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (HUDI-3850) Column pruning when doing snapshot query

2022-05-04 Thread zyp (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zyp reassigned HUDI-3850:
-

Assignee: zyp

> Column pruning when doing snapshot query
> 
>
> Key: HUDI-3850
> URL: https://issues.apache.org/jira/browse/HUDI-3850
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zyp
>Assignee: zyp
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-04 Thread GitBox


hudi-bot commented on PR #5445:
URL: https://github.com/apache/hudi/pull/5445#issuecomment-1118228398

   
   ## CI report:
   
   * 1e9b3ac4c34f97f5ccf3a639cc74b7081eeaab37 UNKNOWN
   * a5669a78b314a5dc4166bcc4d41d2a377653da75 UNKNOWN
   * b0e9a8c8abd79fab2add18176b20d8884cba90fc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8434)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4040) Add customerColumnPartitionerRow support

2022-05-04 Thread Hui An (Jira)
Hui An created HUDI-4040:


 Summary: Add customerColumnPartitionerRow support
 Key: HUDI-4040
 URL: https://issues.apache.org/jira/browse/HUDI-4040
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Hui An






--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3933) Adds unit tests for partition value handling with different key generators

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3933:
-
Reviewers: Raymond Xu

> Adds unit tests for partition value handling with different key generators
> --
>
> Key: HUDI-3933
> URL: https://issues.apache.org/jira/browse/HUDI-3933
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-3667) Unit tests in hudi-integ-tests are not executed in CI

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-3667.

Resolution: Fixed

> Unit tests in hudi-integ-tests are not executed in CI
> -
>
> Key: HUDI-3667
> URL: https://issues.apache.org/jira/browse/HUDI-3667
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: tests-ci
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.11.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[hudi] branch master updated: [HUDI-3667] Run unit tests of hudi-integ-tests in CI (#5078)

2022-05-04 Thread xushiyan
This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new f66e83dc65 [HUDI-3667] Run unit tests of hudi-integ-tests in CI (#5078)
f66e83dc65 is described below

commit f66e83dc6521a57a52385d61d7905fd34ff0e2f2
Author: Y Ethan Guo 
AuthorDate: Wed May 4 23:39:18 2022 -0700

[HUDI-3667] Run unit tests of hudi-integ-tests in CI (#5078)
---
 azure-pipelines.yml| 18 ++
 .../integ/testsuite/configuration/DeltaConfig.java |  6 ++--
 .../apache/hudi/integ/testsuite/dag/DagUtils.java  | 40 +++---
 .../TestDFSHoodieTestSuiteWriterAdapter.java   | 27 +--
 .../testsuite/converter/TestDeleteConverter.java   |  3 +-
 .../hudi/integ/testsuite/dag/TestDagUtils.java | 13 ---
 .../TestGenericRecordPayloadEstimator.java | 11 +++---
 .../testsuite/job/TestHoodieTestSuiteJob.java  |  3 ++
 .../reader/TestDFSHoodieDatasetInputReader.java| 19 ++
 9 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/azure-pipelines.yml b/azure-pipelines.yml
index 6c01321004..8a2d7f0de0 100644
--- a/azure-pipelines.yml
+++ b/azure-pipelines.yml
@@ -150,6 +150,24 @@ stages:
 displayName: IT modules
 timeoutInMinutes: '120'
 steps:
+  - task: Maven@3
+displayName: maven install
+inputs:
+  mavenPomFile: 'pom.xml'
+  goals: 'clean install'
+  options: -T 2.5C -Pintegration-tests -DskipTests
+  publishJUnitResults: false
+  jdkVersionOption: '1.8'
+  mavenOptions: '-Xmx4g $(MAVEN_OPTS)'
+  - task: Maven@3
+displayName: UT integ-test
+inputs:
+  mavenPomFile: 'pom.xml'
+  goals: 'test'
+  options: -Pintegration-tests -DskipUTs=false -DskipITs=true -pl 
hudi-integ-test test
+  publishJUnitResults: false
+  jdkVersionOption: '1.8'
+  mavenOptions: '-Xmx4g $(MAVEN_OPTS)'
   - task: AzureCLI@2
 displayName: Prepare for IT
 inputs:
diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/configuration/DeltaConfig.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/configuration/DeltaConfig.java
index d7280402d2..581cce954a 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/configuration/DeltaConfig.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/configuration/DeltaConfig.java
@@ -18,14 +18,15 @@
 
 package org.apache.hudi.integ.testsuite.configuration;
 
-import com.fasterxml.jackson.databind.ObjectMapper;
-import org.apache.hadoop.conf.Configuration;
 import org.apache.hudi.common.config.SerializableConfiguration;
 import org.apache.hudi.common.util.Option;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.integ.testsuite.reader.DeltaInputType;
 import org.apache.hudi.integ.testsuite.writer.DeltaOutputMode;
 
+import com.fasterxml.jackson.databind.ObjectMapper;
+import org.apache.hadoop.conf.Configuration;
+
 import java.io.Serializable;
 import java.util.ArrayList;
 import java.util.HashMap;
@@ -69,6 +70,7 @@ public class DeltaConfig implements Serializable {
 public static final String TYPE = "type";
 public static final String NODE_NAME = "name";
 public static final String DEPENDENCIES = "deps";
+public static final String NO_DEPENDENCY_VALUE = "none";
 public static final String CHILDREN = "children";
 public static final String HIVE_QUERIES = "hive_queries";
 public static final String HIVE_PROPERTIES = "hive_props";
diff --git 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/DagUtils.java
 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/DagUtils.java
index 999bc43661..789d7e3423 100644
--- 
a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/DagUtils.java
+++ 
b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/DagUtils.java
@@ -18,25 +18,28 @@
 
 package org.apache.hudi.integ.testsuite.dag;
 
-import com.fasterxml.jackson.core.JsonGenerator;
-import com.fasterxml.jackson.core.JsonToken;
-import com.fasterxml.jackson.databind.DeserializationContext;
-import com.fasterxml.jackson.databind.JsonDeserializer;
-import com.fasterxml.jackson.databind.JsonSerializer;
-import com.fasterxml.jackson.databind.SerializerProvider;
-import com.fasterxml.jackson.databind.module.SimpleModule;
 import org.apache.hudi.common.util.ReflectionUtils;
 import org.apache.hudi.common.util.StringUtils;
 import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.integ.testsuite.configuration.DeltaConfig;
 import org.apache.hudi.integ.testsuite.dag.nodes.DagNode;
 
+import com.faster

[GitHub] [hudi] xushiyan merged pull request #5078: [HUDI-3667] Run unit tests of hudi-integ-tests in CI

2022-05-04 Thread GitBox


xushiyan merged PR #5078:
URL: https://github.com/apache/hudi/pull/5078


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5464: [HUDI-3980] Suport kerberos hbase index

2022-05-04 Thread GitBox


hudi-bot commented on PR #5464:
URL: https://github.com/apache/hudi/pull/5464#issuecomment-1118220174

   
   ## CI report:
   
   * 578d03bf52a55c34f5f162728db64ab3c6cf129e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8387)
 
   * 636d10e4d61c76f04dae08da247651fcd019b7ad Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8436)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5464: [HUDI-3980] Suport kerberos hbase index

2022-05-04 Thread GitBox


hudi-bot commented on PR #5464:
URL: https://github.com/apache/hudi/pull/5464#issuecomment-1118218555

   
   ## CI report:
   
   * 578d03bf52a55c34f5f162728db64ab3c6cf129e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8387)
 
   * 636d10e4d61c76f04dae08da247651fcd019b7ad UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] zhilinli123 commented on issue #5460: org.apache.hudi.exception.HoodieRemoteException: status code: 500, reason phrase: Server Error

2022-05-04 Thread GitBox


zhilinli123 commented on issue #5460:
URL: https://github.com/apache/hudi/issues/5460#issuecomment-1118217639


   https://user-images.githubusercontent.com/76689593/166872145-73004006-f46d-4b93-a646-2b234604cbb5.png
   
[20220505133151119.tar.gz](https://github.com/apache/hudi/files/8629071/20220505133151119.tar.gz)
   [temp.tar.gz](https://github.com/apache/hudi/files/8629072/temp.tar.gz)
   @yihua @danny0405  You guys, I've downloaded the timeline 20220505133151119. 
Does that help?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5078: [HUDI-3667] Run unit tests of hudi-integ-tests in CI

2022-05-04 Thread GitBox


hudi-bot commented on PR #5078:
URL: https://github.com/apache/hudi/pull/5078#issuecomment-1118216623

   
   ## CI report:
   
   * 9fac106587c2652d77d4753875e8759952781a55 UNKNOWN
   * 30f755ce655f7f7da22b80526f6c9ad7aabbcbf4 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8431)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on issue #5382: [SUPPORT] org.apache.hudi.hadoop.hive.HoodieCombineRealtimeFileSplit cannot be cast to org.apache.hadoop.hive.shims.HadoopShimsSecure$InputSplitShim

2022-05-04 Thread GitBox


danny0405 commented on issue #5382:
URL: https://github.com/apache/hudi/issues/5382#issuecomment-1118203393

   Can you show the create table descriptor in the Hive CLI to see the table input 
format? What input format does the table use?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] JerryYue-M commented on a diff in pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-04 Thread GitBox


JerryYue-M commented on code in PR #5445:
URL: https://github.com/apache/hudi/pull/5445#discussion_r865575505


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1217,6 +1234,42 @@ void testBuiltinFunctionWithCatalog(String operation) {
 assertRowsEquals(partitionResult, "[+I[1, 2022-02-02]]");
   }
 
+  @Test
+  public void testSource() throws Exception {
+
+Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_NAME, "t1");
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");
+
+// write 3 batches of data set
+TestData.writeData(TestData.dataSetInsert(1, 2), conf);
+TestData.writeData(TestData.dataSetInsert(3, 4), conf);
+TestData.writeData(TestData.dataSetInsert(5, 6), conf);
+
+Map<String, String> options = new HashMap<>();
+String latestCommit = 
TestUtils.getLastCompleteInstant(tempFile.getAbsolutePath());
+
+options.clear();

Review Comment:
   OK, I will move it to ITTestDataStreamWrite.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] JerryYue-M commented on a diff in pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-04 Thread GitBox


JerryYue-M commented on code in PR #5445:
URL: https://github.com/apache/hudi/pull/5445#discussion_r865574428


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/HoodiePipeline.java:
##
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.util;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.table.HoodieTableFactory;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.datastream.DataStreamSink;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.table.api.EnvironmentSettings;
+import org.apache.flink.table.api.internal.TableEnvironmentImpl;
+import org.apache.flink.table.catalog.Catalog;
+import org.apache.flink.table.catalog.ObjectIdentifier;
+import org.apache.flink.table.catalog.ObjectPath;
+import org.apache.flink.table.catalog.ResolvedCatalogTable;
+import org.apache.flink.table.catalog.exceptions.TableNotExistException;
+import org.apache.flink.table.connector.sink.DataStreamSinkProvider;
+import org.apache.flink.table.connector.source.DataStreamScanProvider;
+import org.apache.flink.table.connector.source.ScanTableSource;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.factories.FactoryUtil;
+import 
org.apache.flink.table.runtime.connector.sink.SinkRuntimeProviderContext;
+import 
org.apache.flink.table.runtime.connector.source.ScanRuntimeProviderContext;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ *  A tool class to construct hoodie flink pipeline.
+ *
+ *  How to use?
+ *  Method {@link #builder(String)} returns a pipeline builder. The builder
+ *  can then define the hudi table columns, primary keys and partitions.
+ *
+ *  An example:
+ *  <pre>
+ *    HoodiePipeline.Builder builder = HoodiePipeline.builder("myTable");
+ *    DataStreamSink<?> sinkStream = builder
+ *        .column("f0 int")
+ *        .column("f1 varchar(10)")
+ *        .column("f2 varchar(20)")
+ *        .pk("f0,f1")
+ *        .partition("f2")
+ *        .sink(input, false);
+ *  </pre>
+ */
+public class HoodiePipeline {
+
+  private static final Logger LOG = LogManager.getLogger(HoodiePipeline.class);
+
+  /**
+   * Returns the builder for hoodie pipeline construction.
+   */
+  public static Builder builder(String tableName) {
+return new Builder(tableName);
+  }
+
+  /**
+   * Builder for hudi source/sink pipeline construction.
+   */
+  public static class Builder {
+private final String tableName;
+private final List<String> columns;
+private final Map<String, String> options;
+
+private String pk;
+private List<String> partitions;
+
+public Builder self() {
+  return this;

Review Comment:
   Indeed. I will change it to return `this` for a more straightforward view.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-04 Thread GitBox


danny0405 commented on code in PR #5445:
URL: https://github.com/apache/hudi/pull/5445#discussion_r865571243


##
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/HoodiePipeline.java:
##
@@ -0,0 +1,263 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.util;
+
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.table.HoodieTableFactory;
+
+import org.apache.flink.configuration.ConfigOption;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.streaming.api.datastream.DataStream;
+import org.apache.flink.streaming.api.datastream.DataStreamSink;
+import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
+import org.apache.flink.table.api.EnvironmentSettings;
+import org.apache.flink.table.api.internal.TableEnvironmentImpl;
+import org.apache.flink.table.catalog.Catalog;
+import org.apache.flink.table.catalog.ObjectIdentifier;
+import org.apache.flink.table.catalog.ObjectPath;
+import org.apache.flink.table.catalog.ResolvedCatalogTable;
+import org.apache.flink.table.catalog.exceptions.TableNotExistException;
+import org.apache.flink.table.connector.sink.DataStreamSinkProvider;
+import org.apache.flink.table.connector.source.DataStreamScanProvider;
+import org.apache.flink.table.connector.source.ScanTableSource;
+import org.apache.flink.table.data.RowData;
+import org.apache.flink.table.factories.FactoryUtil;
+import 
org.apache.flink.table.runtime.connector.sink.SinkRuntimeProviderContext;
+import 
org.apache.flink.table.runtime.connector.source.ScanRuntimeProviderContext;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+/**
+ *  A tool class to construct hoodie flink pipeline.
+ *
+ *  How to use?
+ *  Method {@link #builder(String)} returns a pipeline builder. The builder
+ *  can then define the hudi table columns, primary keys and partitions.
+ *
+ *  An example:
+ *  <pre>
+ *    HoodiePipeline.Builder builder = HoodiePipeline.builder("myTable");
+ *    DataStreamSink<?> sinkStream = builder
+ *        .column("f0 int")
+ *        .column("f1 varchar(10)")
+ *        .column("f2 varchar(20)")
+ *        .pk("f0,f1")
+ *        .partition("f2")
+ *        .sink(input, false);
+ *  </pre>
+ */
+public class HoodiePipeline {
+
+  private static final Logger LOG = LogManager.getLogger(HoodiePipeline.class);
+
+  /**
+   * Returns the builder for hoodie pipeline construction.
+   */
+  public static Builder builder(String tableName) {
+return new Builder(tableName);
+  }
+
+  /**
+   * Builder for hudi source/sink pipeline construction.
+   */
+  public static class Builder {
+private final String tableName;
+private final List<String> columns;
+private final Map<String, String> options;
+
+private String pk;
+private List<String> partitions;
+
+public Builder self() {
+  return this;
+}
+
+private Builder(String tableName) {
+  this.tableName = tableName;
+  this.columns = new ArrayList<>();
+  this.options = new HashMap<>();
+  this.partitions = new ArrayList<>();
+}
+
+/**
+ * Add a table column definition.
+ *
+ * @param column the column format should be in the form like 'f0 int'
+ */
+public Builder column(String column) {
+  this.columns.add(column);
+  return self();
+}
+
+/**
+ * Add primary keys.
+ */
+public Builder pk(String... pks) {
+  this.pk = String.join(",", pks);
+  return self();
+}
+
+/**
+ * Add partition fields.
+ */
+public Builder partition(String... partitions) {
+  this.partitions = new ArrayList<>(Arrays.asList(partitions));
+  return self();
+}
+
+/**
+ * Add a config option.
+ */
+public Builder option(ConfigOption<?> option, Object val) {
+  this.options.put(option.key(), val.toString());
+  return self();
+}
+
+public Builder option(String key, Object val) {
+  this.options.put(key, val.toString());
+  return self();
+}
+
+public Builder options(Ma

[GitHub] [hudi] danny0405 commented on a diff in pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-04 Thread GitBox


danny0405 commented on code in PR #5445:
URL: https://github.com/apache/hudi/pull/5445#discussion_r865567582


##
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/table/ITTestHoodieDataSource.java:
##
@@ -1217,6 +1234,42 @@ void testBuiltinFunctionWithCatalog(String operation) {
 assertRowsEquals(partitionResult, "[+I[1, 2022-02-02]]");
   }
 
+  @Test
+  public void testSource() throws Exception {
+
+Configuration conf = 
TestConfigurations.getDefaultConf(tempFile.getAbsolutePath());
+conf.setString(FlinkOptions.TABLE_NAME, "t1");
+conf.setString(FlinkOptions.TABLE_TYPE, "MERGE_ON_READ");
+
+// write 3 batches of data set
+TestData.writeData(TestData.dataSetInsert(1, 2), conf);
+TestData.writeData(TestData.dataSetInsert(3, 4), conf);
+TestData.writeData(TestData.dataSetInsert(5, 6), conf);
+
+Map<String, String> options = new HashMap<>();
+String latestCommit = 
TestUtils.getLastCompleteInstant(tempFile.getAbsolutePath());
+
+options.clear();

Review Comment:
   Can this test also be moved to `ITTestDataStreamWrite`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-04 Thread GitBox


hudi-bot commented on PR #5501:
URL: https://github.com/apache/hudi/pull/5501#issuecomment-1118186504

   
   ## CI report:
   
   * 2d627024cd13ca2389008a649b3defd9fba3b04c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8435)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-04 Thread GitBox


hudi-bot commented on PR #5501:
URL: https://github.com/apache/hudi/pull/5501#issuecomment-1118185285

   
   ## CI report:
   
   * 2d627024cd13ca2389008a649b3defd9fba3b04c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] wf19970425 commented on issue #5500: metadata table function[SUPPORT]

2022-05-04 Thread GitBox


wf19970425 commented on issue #5500:
URL: https://github.com/apache/hudi/issues/5500#issuecomment-1118177780

   If the metadata table is enabled, the scanning time increases. Is that 
reasonable?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4000) Docs around DBT

2022-05-04 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-4000:
--
Status: Patch Available  (was: In Progress)

> Docs around DBT
> ---
>
> Key: HUDI-4000
> URL: https://issues.apache.org/jira/browse/HUDI-4000
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> [https://github.com/apache/hudi/issues/5367]
> Do we have any step-by-step document on how to use Hudi in conjunction with 
> DBT? 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4000) Docs around DBT

2022-05-04 Thread Vinoth Govindarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Govindarajan updated HUDI-4000:
--
Status: In Progress  (was: Open)

> Docs around DBT
> ---
>
> Key: HUDI-4000
> URL: https://issues.apache.org/jira/browse/HUDI-4000
> Project: Apache Hudi
>  Issue Type: Task
>  Components: docs
>Reporter: Ethan Guo
>Assignee: Vinoth Govindarajan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> [https://github.com/apache/hudi/issues/5367]
> Do we have any step-by-step document on how to use Hudi in conjunction with 
> DBT? 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4018) Prepare minimal set of yamls to be tested against any write mode and against any query engine

2022-05-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4018:
-
Labels: pull-request-available  (was: )

> Prepare minimal set of yamls to be tested against any write mode and against 
> any query engine
> -
>
> Key: HUDI-4018
> URL: https://issues.apache.org/jira/browse/HUDI-4018
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] nsivabalan opened a new pull request, #5501: [HUDI-4018][HUDI-4027] Adding integ test yamls for immutable use-cases. Added delete partition support to integ tests

2022-05-04 Thread GitBox


nsivabalan opened a new pull request, #5501:
URL: https://github.com/apache/hudi/pull/5501

   ## What is the purpose of the pull request
   
   - Added pure immutable test yamls to integ test framework.
   - Added delete_partition support to integ test framework.
   
   ## Brief change log
   
   - Added pure immutable test yamls to integ test framework.
   - Added delete_partition support to integ test framework using 
spark-datasource. 
   
   ## Verify this pull request
   
   - Tested it locally for all new yamls and new nodes added. 
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-3850) Column pruning when doing snapshot query

2022-05-04 Thread zyp (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zyp closed HUDI-3850.
-
Resolution: Fixed

> Column pruning when doing snapshot query
> 
>
> Key: HUDI-3850
> URL: https://issues.apache.org/jira/browse/HUDI-3850
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: zyp
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] wf19970425 opened a new issue, #5500: metadata table function[SUPPORT]

2022-05-04 Thread GitBox


wf19970425 opened a new issue, #5500:
URL: https://github.com/apache/hudi/issues/5500

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
dev-subscr...@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   A clear and concise description of the problem.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. After the metadata table is enabled, the number of files scanned in a 
single session decreases, but why does the scanning time increase?
   
   **Expected behavior**
   
   A clear and concise description of what you expected to happen.
   
   **Environment Description**
   
   * Hudi version : 0.11
   
   * Flink version : 1.14
   
   * Hive version : -
   
   * Hadoop version : 2.x
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Add any other context about the problem here.
   
   **Stacktrace**
   
   ```Add the stack trace of the error.```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] danny0405 commented on a diff in pull request #4739: [HUDI-3365] Make sure Metadata Table records are updated appropriately on HDFS

2022-05-04 Thread GitBox


danny0405 commented on code in PR #4739:
URL: https://github.com/apache/hudi/pull/4739#discussion_r86553


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##
@@ -1301,4 +1359,33 @@ public void close() {
 this.heartbeatClient.stop();
 this.txnManager.close();
   }
+
+  private void setWriteTimer(HoodieTable table) {
+String commitType = table.getMetaClient().getCommitActionType();
+if (commitType.equals(HoodieTimeline.COMMIT_ACTION)) {
+  writeTimer = metrics.getCommitCtx();
+} else if (commitType.equals(HoodieTimeline.DELTA_COMMIT_ACTION)) {
+  writeTimer = metrics.getDeltaCommitCtx();
+}
+  }
+
+  private void tryUpgrade(HoodieTableMetaClient metaClient, Option<String> 
instantTime) {
+UpgradeDowngrade upgradeDowngrade =
+new UpgradeDowngrade(metaClient, config, context, 
upgradeDowngradeHelper);
+
+if 
(upgradeDowngrade.needsUpgradeOrDowngrade(HoodieTableVersion.current())) {
+  // Ensure no inflight commits by setting EAGER policy and explicitly 
cleaning all failed commits
+  List<String> instantsToRollback = getInstantsToRollback(metaClient, 
HoodieFailedWritesCleaningPolicy.EAGER, instantTime);
+
+  Map<String, Option<HoodiePendingRollbackInfo>> pendingRollbacks = 
getPendingRollbackInfos(metaClient);

Review Comment:
   May I know why we roll back the failed commits before doing the upgrade? We 
already try to do that when starting a commit: 
https://github.com/apache/hudi/blob/1562bb658f8f29f57763eaa6f9bd5a2ed7e80a7c/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java#L935



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5286: HUDI-3836 Improve the way of fetching metadata partitions from table

2022-05-04 Thread GitBox


hudi-bot commented on PR #5286:
URL: https://github.com/apache/hudi/pull/5286#issuecomment-1118136698

   
   ## CI report:
   
   * 1acbc822bb1a3e8051539beabf640977f7d8fe6a Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8427)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-04 Thread GitBox


hudi-bot commented on PR #5445:
URL: https://github.com/apache/hudi/pull/5445#issuecomment-1118136806

   
   ## CI report:
   
   * 1e9b3ac4c34f97f5ccf3a639cc74b7081eeaab37 UNKNOWN
   * 8e4612d5be8de2fdcb1c53be5828bcc0c6da9b5c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8388)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8395)
 
   * a5669a78b314a5dc4166bcc4d41d2a377653da75 UNKNOWN
   * b0e9a8c8abd79fab2add18176b20d8884cba90fc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8434)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-04 Thread GitBox


hudi-bot commented on PR #5445:
URL: https://github.com/apache/hudi/pull/5445#issuecomment-1118133994

   
   ## CI report:
   
   * 1e9b3ac4c34f97f5ccf3a639cc74b7081eeaab37 UNKNOWN
   * 8e4612d5be8de2fdcb1c53be5828bcc0c6da9b5c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8388)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8395)
 
   * a5669a78b314a5dc4166bcc4d41d2a377653da75 UNKNOWN
   * b0e9a8c8abd79fab2add18176b20d8884cba90fc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5445: [HUDI-3953]Flink Hudi module should support low-level source and sink…

2022-05-04 Thread GitBox


hudi-bot commented on PR #5445:
URL: https://github.com/apache/hudi/pull/5445#issuecomment-1118133074

   
   ## CI report:
   
   * 1e9b3ac4c34f97f5ccf3a639cc74b7081eeaab37 UNKNOWN
   * 8e4612d5be8de2fdcb1c53be5828bcc0c6da9b5c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8388)
 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8395)
 
   * a5669a78b314a5dc4166bcc4d41d2a377653da75 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5499: [MINOR] Optimize code logic

2022-05-04 Thread GitBox


hudi-bot commented on PR #5499:
URL: https://github.com/apache/hudi/pull/5499#issuecomment-1118132044

   
   ## CI report:
   
   * 3e492e1ade8e33a36fd0c5cf880f17786e6601d3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8433)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5199: [HUDI-3770] Delete the remaining temporary marker files after the syn…

2022-05-04 Thread GitBox


hudi-bot commented on PR #5199:
URL: https://github.com/apache/hudi/pull/5199#issuecomment-1118131844

   
   ## CI report:
   
   * 27e94c1167b9ccf04e5ad97d5f8a29fecfadb516 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8183)
 
   * ae7e668eb24de9605953282172f4b25ff2ca6ee4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8432)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5499: [MINOR] Optimize code logic

2022-05-04 Thread GitBox


hudi-bot commented on PR #5499:
URL: https://github.com/apache/hudi/pull/5499#issuecomment-1118130925

   
   ## CI report:
   
   * 3e492e1ade8e33a36fd0c5cf880f17786e6601d3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5199: [HUDI-3770] Delete the remaining temporary marker files after the syn…

2022-05-04 Thread GitBox


hudi-bot commented on PR #5199:
URL: https://github.com/apache/hudi/pull/5199#issuecomment-1118130743

   
   ## CI report:
   
   * 27e94c1167b9ccf04e5ad97d5f8a29fecfadb516 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8183)
 
   * ae7e668eb24de9605953282172f4b25ff2ca6ee4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5078: [HUDI-3667] Run unit tests of hudi-integ-tests in CI

2022-05-04 Thread GitBox


hudi-bot commented on PR #5078:
URL: https://github.com/apache/hudi/pull/5078#issuecomment-1118130667

   
   ## CI report:
   
   * 9fac106587c2652d77d4753875e8759952781a55 UNKNOWN
   * 5078d29eb429d7eca46c3d5c3aa72d94e088d43e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8068)
 
   * 30f755ce655f7f7da22b80526f6c9ad7aabbcbf4 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8431)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

2022-05-04 Thread GitBox


hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118129741

   
   ## CI report:
   
   * 9c7e7eacfd743264be7c0d2e6bc18165722358f9 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4039) Make sure builtin key-generators can efficiently fetch record-key, partition-path

2022-05-04 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-4039:
-

 Summary: Make sure builtin key-generators can efficiently fetch 
record-key, partition-path
 Key: HUDI-4039
 URL: https://issues.apache.org/jira/browse/HUDI-4039
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin


With the introduction of row-writing, Key Generators become aware of low-level 
Spark components like `Row` and `InternalRow`, so they can fetch the record key 
and partition path without converting the received object into Avro.

To further improve their performance, we should also avoid converting Spark's 
native data types (for example, `UTF8String`) into JVM-native ones (like 
`String`), as this can entail unnecessary encoding/decoding as well as 
underlying buffers being copied for no good reason.
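
As a rough illustration of the intent (a sketch under assumptions, not the 
planned implementation), a key generator that is aware of `InternalRow` can read 
the key field directly as a `UTF8String` instead of round-tripping through 
`String`; `RowKeyFetcher` below is a hypothetical helper:

{code:java}
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.unsafe.types.UTF8String;

// Sketch: fetch the record key straight from an InternalRow without
// materializing a java.lang.String.
public class RowKeyFetcher {
  private final int keyOrdinal;

  public RowKeyFetcher(StructType schema, String keyField) {
    // Resolve the field ordinal once from the schema and reuse it per row.
    this.keyOrdinal = schema.fieldIndex(keyField);
  }

  public UTF8String getRecordKey(InternalRow row) {
    // getUTF8String returns Spark's native string type; no UTF-8
    // decoding/encoding or buffer copy into a JVM String happens here.
    return row.getUTF8String(keyOrdinal);
  }
}
{code}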



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] qianchutao opened a new pull request, #5499: [MINOR] Optimize code logic

2022-05-04 Thread GitBox


qianchutao opened a new pull request, #5499:
URL: https://github.com/apache/hudi/pull/5499

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] qianchutao closed pull request #5498: [Minor] Optimize code logic

2022-05-04 Thread GitBox


qianchutao closed pull request #5498: [Minor] Optimize code logic
URL: https://github.com/apache/hudi/pull/5498


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4038) Avoid invoking `getDataSize` in the hot-path

2022-05-04 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-4038:
-
Labels: pull-request-available  (was: )

> Avoid invoking `getDataSize` in the hot-path
> 
>
> Key: HUDI-4038
> URL: https://issues.apache.org/jira/browse/HUDI-4038
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.11.1, 0.12.0
>
>
> `getDataSize` has non-trivial overhead of traversing already encoded Column 
> Groups stored in memory. We should sample its invocations to amortize its 
> costs.
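
To make the sampling idea concrete, here is a minimal sketch (an assumed wrapper 
around Parquet's `ParquetWriter#getDataSize`, not Hudi's actual code): the 
expensive probe runs only every N records, and the last measurement is 
extrapolated in between.

{code:java}
import org.apache.parquet.hadoop.ParquetWriter;

// Sketch: amortize ParquetWriter#getDataSize by sampling it every
// SAMPLE_INTERVAL records and extrapolating from the last sample in between.
public class SampledSizeEstimator {
  private static final long SAMPLE_INTERVAL = 100;

  private long recordsWritten = 0;
  private long lastSampledSize = 0;
  private long avgRecordSize = 0;

  public long estimateSize(ParquetWriter<?> writer) {
    recordsWritten++;
    if (recordsWritten % SAMPLE_INTERVAL == 0) {
      // Expensive call: traverses the already-encoded column groups in memory.
      lastSampledSize = writer.getDataSize();
      avgRecordSize = lastSampledSize / recordsWritten;
    }
    // Cheap path: extrapolate from the last sample using the average record size.
    return lastSampledSize + avgRecordSize * (recordsWritten % SAMPLE_INTERVAL);
  }
}
{code}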



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5498: [Minor] Optimize code logic

2022-05-04 Thread GitBox


hudi-bot commented on PR #5498:
URL: https://github.com/apache/hudi/pull/5498#issuecomment-1118113743

   
   ## CI report:
   
   * 14b72be4bb7c82f23a29957a1ad0c7a09d438bda Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8430)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5497: [HUDI-4038] Avoid calling `getDataSize` after every record written

2022-05-04 Thread GitBox


hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1118113724

   
   ## CI report:
   
   * b3a35b349472047355f36405abf442434e73f66c Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8428)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

2022-05-04 Thread GitBox


hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118113709

   
   ## CI report:
   
   * dba2edf235b1bc51a170145bef25424abb2c80dd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426)
 
   * 9c7e7eacfd743264be7c0d2e6bc18165722358f9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8429)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] qianchutao commented on pull request #5498: [Minor] Optimize code logic

2022-05-04 Thread GitBox


qianchutao commented on PR #5498:
URL: https://github.com/apache/hudi/pull/5498#issuecomment-1118113422

   @hudi-bot run azure


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5498: [Minor] Optimize code logic

2022-05-04 Thread GitBox


hudi-bot commented on PR #5498:
URL: https://github.com/apache/hudi/pull/5498#issuecomment-1118112669

   
   ## CI report:
   
   * 14b72be4bb7c82f23a29957a1ad0c7a09d438bda UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

2022-05-04 Thread GitBox


hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118112640

   
   ## CI report:
   
   * dba2edf235b1bc51a170145bef25424abb2c80dd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426)
 
   * 9c7e7eacfd743264be7c0d2e6bc18165722358f9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] qianchutao opened a new pull request, #5498: [Minor] Optimize code logic

2022-05-04 Thread GitBox


qianchutao opened a new pull request, #5498:
URL: https://github.com/apache/hudi/pull/5498

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   *(or)*
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Created] (HUDI-4038) Avoid invoking `getDataSize` in the hot-path

2022-05-04 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-4038:
-

 Summary: Avoid invoking `getDataSize` in the hot-path
 Key: HUDI-4038
 URL: https://issues.apache.org/jira/browse/HUDI-4038
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: Alexey Kudinkin
Assignee: Alexey Kudinkin
 Fix For: 0.11.1, 0.12.0


`getDataSize` has non-trivial overhead of traversing already encoded Column 
Groups stored in memory. We should sample its invocations to amortize its costs.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5497: [WIP] Avoid calling `getDataSize` after every record written

2022-05-04 Thread GitBox


hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1118110328

   
   ## CI report:
   
   * b3a35b349472047355f36405abf442434e73f66c Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8428)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-3249) Performance Improvements

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3249:
--
Summary: Performance Improvements  (was: Performance improvements)

> Performance Improvements
> 
>
> Key: HUDI-3249
> URL: https://issues.apache.org/jira/browse/HUDI-3249
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3249) Performance Improvements

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3249:
--
Epic Name: Performance Improvements  (was: Improve perf)

> Performance Improvements
> 
>
> Key: HUDI-3249
> URL: https://issues.apache.org/jira/browse/HUDI-3249
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5497: [WIP] Avoid calling `getDataSize` after every record written

2022-05-04 Thread GitBox


hudi-bot commented on PR #5497:
URL: https://github.com/apache/hudi/pull/5497#issuecomment-1118109102

   
   ## CI report:
   
   * b3a35b349472047355f36405abf442434e73f66c UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin opened a new pull request, #5497: [WIP] Avoid calling `getDataSize` after every record written

2022-05-04 Thread GitBox


alexeykudinkin opened a new pull request, #5497:
URL: https://github.com/apache/hudi/pull/5497

   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   `getDataSize` has non-trivial overhead in the current `ParquetWriter` impl, 
requiring traversal of the already composed Column Groups in memory. Instead, 
we can sample the calls to `getDataSize` to amortize their cost.
   
   ## Brief change log
   
- Sample memory checks of the currently written output size to avoid 
excessive block traversals
- Extracted HoodieBaseParquetWriter encapsulating shared functionality b/w 
`ParquetWriter` impls
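
   For illustration, here is a minimal sketch of the sampling idea (a hedged 
sketch only, not the actual Hudi implementation; the class and method names 
are illustrative assumptions): instead of probing the writer size on every 
record, a cached estimate is refreshed only once every N records.
   
   ```java
   /**
    * Hedged sketch of amortizing an expensive size probe by sampling it.
    * All names here are illustrative assumptions, not Hudi's actual API.
    */
   public class SampledSizeCheck {
     private static final long RECORDS_PER_SIZE_CHECK = 1000;
   
     private final long maxFileSizeBytes;
     private long recordsSinceLastCheck = 0;
     private long lastSampledSize = 0;
   
     public SampledSizeCheck(long maxFileSizeBytes) {
       this.maxFileSizeBytes = maxFileSizeBytes;
     }
   
     /** Stand-in for the expensive probe (e.g. ParquetWriter#getDataSize()). */
     private long expensiveGetDataSize() {
       // A real writer would traverse its buffered column groups here.
       return lastSampledSize + 512 * RECORDS_PER_SIZE_CHECK; // simulated growth
     }
   
     /** Called once per record; pays the probe cost only every N records. */
     public boolean canWrite() {
       if (recordsSinceLastCheck++ >= RECORDS_PER_SIZE_CHECK) {
         lastSampledSize = expensiveGetDataSize();
         recordsSinceLastCheck = 0;
       }
       return lastSampledSize < maxFileSizeBytes;
     }
   }
   ```
   
   The trade-off is that the writer may overshoot the size limit by up to N 
records' worth of data, which is usually acceptable for large file size 
targets.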
   
   ## Verify this pull request
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5286: HUDI-3836 Improve the way of fetching metadata partitions from table

2022-05-04 Thread GitBox


hudi-bot commented on PR #5286:
URL: https://github.com/apache/hudi/pull/5286#issuecomment-1118107528

   
   ## CI report:
   
   * 78205ae18b1f7a67e39656cb4ad1ddfe4a52f3b3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8384)
 
   * 1acbc822bb1a3e8051539beabf640977f7d8fe6a Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8427)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5286: HUDI-3836 Improve the way of fetching metadata partitions from table

2022-05-04 Thread GitBox


hudi-bot commented on PR #5286:
URL: https://github.com/apache/hudi/pull/5286#issuecomment-1118106121

   
   ## CI report:
   
   * 78205ae18b1f7a67e39656cb4ad1ddfe4a52f3b3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8384)
 
   * 1acbc822bb1a3e8051539beabf640977f7d8fe6a UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BalaMahesh commented on issue #5494: [SUPPORT] Hudi 0.11.0 HoodieDeltaStreamer failing to start with error : java.lang.NoSuchFieldError: DROP_PARTITION_COLUMNS

2022-05-04 Thread GitBox


BalaMahesh commented on issue #5494:
URL: https://github.com/apache/hudi/issues/5494#issuecomment-1118106033

   @alexeykudinkin Apologies for the trouble. I accidentally placed 
hudi-hadoop-mr-bundle-0.10.1.jar in my Spark classpath earlier, so 
HoodieTableConfig was being picked up from that location and could not find 
the latest compiled class. I am closing this issue. Thanks for your comments.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BalaMahesh closed issue #5494: [SUPPORT] Hudi 0.11.0 HoodieDeltaStreamer failing to start with error : java.lang.NoSuchFieldError: DROP_PARTITION_COLUMNS

2022-05-04 Thread GitBox


BalaMahesh closed issue #5494: [SUPPORT] Hudi 0.11.0 HoodieDeltaStreamer 
failing to start with error : java.lang.NoSuchFieldError: DROP_PARTITION_COLUMNS
URL: https://github.com/apache/hudi/issues/5494


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5078: [HUDI-3667] Run unit tests of hudi-integ-tests in CI

2022-05-04 Thread GitBox


hudi-bot commented on PR #5078:
URL: https://github.com/apache/hudi/pull/5078#issuecomment-1118105998

   
   ## CI report:
   
   * 9fac106587c2652d77d4753875e8759952781a55 UNKNOWN
   * 5078d29eb429d7eca46c3d5c3aa72d94e088d43e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8068)
 
   * 30f755ce655f7f7da22b80526f6c9ad7aabbcbf4 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] xiarixiaoyao commented on issue #5452: Schema Evolution: Missing column for previous records when new entry does not have the same while upsert.

2022-05-04 Thread GitBox


xiarixiaoyao commented on issue #5452:
URL: https://github.com/apache/hudi/issues/5452#issuecomment-1118100800

   @santoshsb 
   createNewDF cannot support rewriting a DataFrame with a nested schema change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BalaMahesh commented on issue #5494: [SUPPORT] Hudi 0.11.0 HoodieDeltaStreamer failing to start with error : java.lang.NoSuchFieldError: DROP_PARTITION_COLUMNS

2022-05-04 Thread GitBox


BalaMahesh commented on issue #5494:
URL: https://github.com/apache/hudi/issues/5494#issuecomment-1118084714

   I am trying to run 0.11.0 with the same config used for 0.10.1. Earlier we 
used to run hudi-utilities-bundle, but now it has changed to 
hudi-utilities-slim-bundle along with hudi-sparkx.y-bundle. My feeling is that 
the code recompilation or library versions are mixed up somewhere, but it is 
giving me a hard time to figure that out.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] BalaMahesh commented on issue #5494: [SUPPORT] Hudi 0.11.0 HoodieDeltaStreamer failing to start with error : java.lang.NoSuchFieldError: DROP_PARTITION_COLUMNS

2022-05-04 Thread GitBox


BalaMahesh commented on issue #5494:
URL: https://github.com/apache/hudi/issues/5494#issuecomment-1118079346

   @alexeykudinkin  This is the full command and logs
   
   ./spark-submit  --jars 
packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.11.0.jar  
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer 
packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-0.11.0.jar
 --props file:hudi/properties/gl_started.properties --schemaprovider-class 
org.apache.hudi.utilities.schema.FilebasedSchemaProvider --source-class 
org.apache.hudi.utilities.sources.JsonKafkaSource --target-base-path 
gs://xx/hudi/gl_cow/ --target-table hudi.gl_cow --op INSERT --table-type 
COPY_ON_WRITE --source-ordering-field time --continuous  --transformer-class 
org.apache.hudi.utilities.transform.AddDateHourColumnTransformer --source-limit 
150
   ```
   22/05/04 10:21:25 WARN NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
   log4j:WARN No appenders could be found for logger 
(org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator).
   log4j:WARN Please initialize the log4j system properly.
   log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for 
more info.
   Using Spark's default log4j profile: 
org/apache/spark/log4j-defaults.properties
   22/05/04 10:21:26 INFO SparkContext: Running Spark version 3.2.1
   22/05/04 10:21:26 INFO ResourceUtils: 
==
   22/05/04 10:21:26 INFO ResourceUtils: No custom resources configured for 
spark.driver.
   22/05/04 10:21:26 INFO ResourceUtils: 
==
   22/05/04 10:21:26 INFO SparkContext: Submitted application: 
delta-streamer-hudi.gl_cow
   22/05/04 10:21:26 INFO ResourceProfile: Default ResourceProfile created, 
executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: 
offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: 
cpus, amount: 1.0)
   22/05/04 10:21:26 INFO ResourceProfile: Limiting resource is cpu
   22/05/04 10:21:26 INFO ResourceProfileManager: Added ResourceProfile id: 0
   22/05/04 10:21:26 INFO SecurityManager: Changing view acls to: xx
   22/05/04 10:21:26 INFO SecurityManager: Changing modify acls to: xx
   22/05/04 10:21:26 INFO SecurityManager: Changing view acls groups to: 
   22/05/04 10:21:26 INFO SecurityManager: Changing modify acls groups to: 
   22/05/04 10:21:26 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(xx); groups with 
view permissions: Set(); users  with modify permissions: Set(xx); groups with 
modify permissions: Set()
   22/05/04 10:21:26 INFO deprecation: mapred.output.compression.codec is 
deprecated. Instead, use mapreduce.output.fileoutputformat.compress.codec
   22/05/04 10:21:26 INFO deprecation: mapred.output.compress is deprecated. 
Instead, use mapreduce.output.fileoutputformat.compress
   22/05/04 10:21:26 INFO deprecation: mapred.output.compression.type is 
deprecated. Instead, use mapreduce.output.fileoutputformat.compress.type
   22/05/04 10:21:26 INFO Utils: Successfully started service 'sparkDriver' on 
port 49423.
   22/05/04 10:21:26 INFO SparkEnv: Registering MapOutputTracker
   22/05/04 10:21:26 INFO SparkEnv: Registering BlockManagerMaster
   22/05/04 10:21:26 INFO BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
   22/05/04 10:21:26 INFO BlockManagerMasterEndpoint: 
BlockManagerMasterEndpoint up
   22/05/04 10:21:26 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
   22/05/04 10:21:26 INFO DiskBlockManager: Created local directory at 
/private/var/folders/6w/4y9hyhmj4d15hdqlnd74rp5cgp/T/blockmgr-2c16d6dd-7a31-44da-b728-bc8bfe8c11e7
   22/05/04 10:21:26 INFO MemoryStore: MemoryStore started with capacity 366.3 
MiB
   22/05/04 10:21:26 INFO SparkEnv: Registering OutputCommitCoordinator
   22/05/04 10:21:27 INFO Utils: Successfully started service 'SparkUI' on port 
8090.
   22/05/04 10:21:27 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
http://xx-x0-2:8090
   22/05/04 10:21:27 INFO SparkContext: Added JAR 
file://packaging/hudi-spark-bundle/target/hudi-spark3.2-bundle_2.12-0.11.0.jar 
at spark://xxx-x0-2:49423/jars/hudi-spark3.2-bundle_2.12-0.11.0.jar with 
timestamp 1651639886168
   22/05/04 10:21:27 INFO SparkContext: Added JAR 
file:packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-0.11.0.jar
 at spark://xx-x0-2:49423/jars/hudi-utilities-slim-bundle_2.12-0.11.0.jar with 
timestamp 1651639886168
   22/05/04 10:21:27 INFO Executor: Starting executor ID driver on host xx-x0-2
   22/05/04 10:21:27 INFO Executor: Fetching 
spark://xx-x0-2:49423/jars/hudi-spark3.2-bundle_2.1

[jira] [Updated] (HUDI-4034) Improve log merging performance for Metadata Table

2022-05-04 Thread Ethan Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ethan Guo updated HUDI-4034:

Description: For the Metadata Table, which is a MOR table, there is a separate 
log record reader, "HoodieMetadataMergedLogRecordReader". We want to see if we 
can further improve the merging latency on different file systems, including S3.

> Improve log merging performance for Metadata Table
> --
>
> Key: HUDI-4034
> URL: https://issues.apache.org/jira/browse/HUDI-4034
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata, performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.0
>
>
> For the Metadata Table, which is a MOR table, there is a separate log record 
> reader, "HoodieMetadataMergedLogRecordReader". We want to see if we can 
> further improve the merging latency on different file systems, including S3.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HUDI-413) Use ColumnIndex in parquet to speed up scans

2022-05-04 Thread Ethan Guo (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531988#comment-17531988
 ] 

Ethan Guo commented on HUDI-413:


This is no longer needed as we have column stats index in Metadata Table.

> Use ColumnIndex in parquet to speed up scans
> 
>
> Key: HUDI-413
> URL: https://issues.apache.org/jira/browse/HUDI-413
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: help-requested
>
> [https://github.com/apache/parquet-format/blob/master/PageIndex.md]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-2673) Add integration/e2e test for kafka-connect functionality

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-2673:


Assignee: Raymond Xu  (was: Ethan Guo)

> Add integration/e2e test for kafka-connect functionality
> 
>
> Key: HUDI-2673
> URL: https://issues.apache.org/jira/browse/HUDI-2673
> Project: Apache Hudi
>  Issue Type: Task
>  Components: kafka-connect, tests-ci
>Reporter: Ethan Guo
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.12.0
>
>
> The integration test should use the bundle jar and run in the docker setup. 
> This can catch any issue in the bundle, like HUDI-3903, that is not covered 
> by unit and functional tests.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Closed] (HUDI-413) Use ColumnIndex in parquet to speed up scans

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu closed HUDI-413.
---
Resolution: Won't Do

> Use ColumnIndex in parquet to speed up scans
> 
>
> Key: HUDI-413
> URL: https://issues.apache.org/jira/browse/HUDI-413
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: help-requested
>
> [https://github.com/apache/parquet-format/blob/master/PageIndex.md]



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4034) Improve log merging performance for Metadata Table

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4034:
-
Status: In Progress  (was: Open)

> Improve log merging performance for Metadata Table
> --
>
> Key: HUDI-4034
> URL: https://issues.apache.org/jira/browse/HUDI-4034
> Project: Apache Hudi
>  Issue Type: Task
>  Components: metadata, performance
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4017) Spark sql tests as part of github actions for diff spark versions

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-4017:
-
Status: In Progress  (was: Open)

> Spark sql tests as part of github actions for diff spark versions
> -
>
> Key: HUDI-4017
> URL: https://issues.apache.org/jira/browse/HUDI-4017
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark-sql
>Reporter: sivabalan narayanan
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Assigned] (HUDI-3856) Upgrade maven surefire to M5

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu reassigned HUDI-3856:


Assignee: Raymond Xu

> Upgrade maven surefire to M5
> 
>
> Key: HUDI-3856
> URL: https://issues.apache.org/jira/browse/HUDI-3856
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3687) Make sure CI run tests against all Spark versions

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-3687:
-
Status: In Progress  (was: Open)

> Make sure CI run tests against all Spark versions
> -
>
> Key: HUDI-3687
> URL: https://issues.apache.org/jira/browse/HUDI-3687
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Alexey Kudinkin
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.11.1
>
>
> Currently, CI only runs tests against Spark 2.4.4. Since we pledge to support 
> all patch versions of Spark w/in a particular supported minor version branch 
> of Spark (3.1, 3.2, etc), we need to run all Spark-related tests for all 
> Spark versions we're pledging support for.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-2516) Upgrade to Junit 5.8.1

2022-05-04 Thread Raymond Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raymond Xu updated HUDI-2516:
-
Status: Open  (was: Patch Available)

> Upgrade to Junit 5.8.1
> --
>
> Key: HUDI-2516
> URL: https://issues.apache.org/jira/browse/HUDI-2516
> Project: Apache Hudi
>  Issue Type: Task
>  Components: Testing, tests-ci
>Reporter: Raymond Xu
>Assignee: Raymond Xu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

2022-05-04 Thread GitBox


hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118061811

   
   ## CI report:
   
   * dba2edf235b1bc51a170145bef25424abb2c80dd Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4016) Prepare a document to list all tests to be done as part of release certification

2022-05-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4016:
--
Status: In Progress  (was: Open)

> Prepare a document to list all tests to be done as part of release 
> certification
> 
>
> Key: HUDI-4016
> URL: https://issues.apache.org/jira/browse/HUDI-4016
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4027) add support to test non-core write operations (insert overwrite, delete partitions) to integ test framework

2022-05-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-4027:
--
Status: In Progress  (was: Open)

> add support to test non-core write operations (insert overwrite, delete 
> partitions) to integ test framework
> ---
>
> Key: HUDI-4027
> URL: https://issues.apache.org/jira/browse/HUDI-4027
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
> Fix For: 0.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3957) Support spark2 and scala12 testing w/ integ test bundle

2022-05-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3957:
--
Status: In Progress  (was: Open)

> Support spark2 and scala12 testing w/ integ test bundle
> ---
>
> Key: HUDI-3957
> URL: https://issues.apache.org/jira/browse/HUDI-3957
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.12.0
>
>
> Currently, the integ test bundle does not work for spark2 and scala12. Spark 
> session initialization (enableHiveSupport()) expects some Hive classes which 
> are missing. 
>  
> {code:java}
> 22/04/22 20:26:08 WARN testsuite.HoodieTestSuiteJob: Running spark job w/ app 
> id local-1650673568081
> 22/04/22 20:26:08 INFO fs.FSUtils: Resolving file /tmp/test.propertiesto be a 
> remote file.
> Exception in thread "main" java.lang.IllegalArgumentException: Unable to 
> instantiate SparkSession with Hive support because Hive classes are not found.
>   at 
> org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:871)
>   at 
> org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.(HoodieTestSuiteJob.java:110)
>   at 
> org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:180)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:855)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 22/04/22 20:26:08 INFO spark.SparkContext: Invoking stop() from shutdown hook
> 22/04/22 20:26:08 INFO server.AbstractConnector: Stopped 
> Spark@889d9e8{HTTP/1.1, (http/1.1)}{0.0.0.0:8090} {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3957) Evaluate Support for spark2 and scala12

2022-05-04 Thread sivabalan narayanan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-3957:
--
Summary: Evaluate Support for spark2 and scala12   (was: Support spark2 and 
scala12 testing w/ integ test bundle)

> Evaluate Support for spark2 and scala12 
> 
>
> Key: HUDI-3957
> URL: https://issues.apache.org/jira/browse/HUDI-3957
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Minor
> Fix For: 0.12.0
>
>
> Currently, the integ test bundle does not work for spark2 and scala12. Spark 
> session initialization (enableHiveSupport()) expects some Hive classes which 
> are missing. 
>  
> {code:java}
> 22/04/22 20:26:08 WARN testsuite.HoodieTestSuiteJob: Running spark job w/ app 
> id local-1650673568081
> 22/04/22 20:26:08 INFO fs.FSUtils: Resolving file /tmp/test.propertiesto be a 
> remote file.
> Exception in thread "main" java.lang.IllegalArgumentException: Unable to 
> instantiate SparkSession with Hive support because Hive classes are not found.
>   at 
> org.apache.spark.sql.SparkSession$Builder.enableHiveSupport(SparkSession.scala:871)
>   at 
> org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.(HoodieTestSuiteJob.java:110)
>   at 
> org.apache.hudi.integ.testsuite.HoodieTestSuiteJob.main(HoodieTestSuiteJob.java:180)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:855)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:930)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:939)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 22/04/22 20:26:08 INFO spark.SparkContext: Invoking stop() from shutdown hook
> 22/04/22 20:26:08 INFO server.AbstractConnector: Stopped 
> Spark@889d9e8{HTTP/1.1, (http/1.1)}{0.0.0.0:8090} {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

2022-05-04 Thread GitBox


hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118058171

   
   ## CI report:
   
   * 1489d3759dfe86f9625ee533ea4ea710b32b18c3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403)
 
   * dba2edf235b1bc51a170145bef25424abb2c80dd Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8426)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] hudi-bot commented on pull request #5470: [WIP][HUDI-3993][Perf] Replacing UDF in Bulk Insert w/ RDD transformation

2022-05-04 Thread GitBox


hudi-bot commented on PR #5470:
URL: https://github.com/apache/hudi/pull/5470#issuecomment-1118056892

   
   ## CI report:
   
   * 1489d3759dfe86f9625ee533ea4ea710b32b18c3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8403)
 
   * dba2edf235b1bc51a170145bef25424abb2c80dd UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5462: [HUDI-3995] Making pref optimizations for bulk insert row writer path

2022-05-04 Thread GitBox


alexeykudinkin commented on code in PR #5462:
URL: https://github.com/apache/hudi/pull/5462#discussion_r865419555


##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##
@@ -97,87 +101,69 @@ public String getPartitionPath(Row row) {
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public String getPartitionPath(InternalRow internalRow, StructType 
structType) {
 try {
-  initDeserializer(structType);
-  Row row = sparkRowSerDe.deserializeRow(internalRow);
-  return getPartitionPath(row);
+  buildFieldSchemaInfoIfNeeded(structType);
+  return 
RowKeyGeneratorHelper.getPartitionPathFromInternalRow(internalRow, 
getPartitionPathFields(),
+  hiveStylePartitioning, partitionPathSchemaInfo);
 } catch (Exception e) {
   throw new HoodieIOException("Conversion of InternalRow to Row failed 
with exception " + e);
 }
   }
 
-  private void initDeserializer(StructType structType) {
-if (sparkRowSerDe == null) {
-  sparkRowSerDe = HoodieSparkUtils.getDeserializer(structType);
-}
-  }
-
-  void buildFieldPositionMapIfNeeded(StructType structType) {
+  void buildFieldSchemaInfoIfNeeded(StructType structType) {
 if (this.structType == null) {
-  // parse simple fields
-  getRecordKeyFields().stream()
-  .filter(f -> !(f.contains(".")))
+  getRecordKeyFields()
+  .stream().filter(f -> !f.isEmpty())
   .forEach(f -> {
-if (structType.getFieldIndex(f).isDefined()) {
-  recordKeyPositions.put(f, Collections.singletonList((Integer) 
(structType.getFieldIndex(f).get(;
+if (f.contains(DOT_STRING)) {

Review Comment:
   We don't need this conditional -- simple field ref is a special case of 
nested field-ref and so should be handled by the same path that handles nested 
fields
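
   To make this concrete, here is a hedged sketch (the helper name and shape 
are assumptions for illustration, not the actual Hudi code) of a single 
resolution routine that treats a simple field reference as a nested reference 
with one path segment:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   import org.apache.spark.sql.types.StructType;
   
   // Illustrative helper, not Hudi's API: resolves "a.b.c" (or plain "a")
   // to the ordinal positions at each nesting level of the schema.
   public class FieldPathResolver {
     public static List<Integer> resolvePositions(StructType schema, String fieldRef) {
       String[] parts = fieldRef.split("\\."); // a simple ref yields one part
       List<Integer> positions = new ArrayList<>(parts.length);
       StructType current = schema;
       for (int i = 0; i < parts.length; i++) {
         // getFieldIndex returns a scala.Option; unbox it the way the PR does
         int idx = (Integer) current.getFieldIndex(parts[i]).get();
         positions.add(idx);
         if (i < parts.length - 1) {
           // every non-terminal segment must itself be a struct
           current = (StructType) current.fields()[idx].dataType();
         }
       }
       return positions;
     }
   }
   ```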



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##
@@ -18,48 +18,51 @@
 
 package org.apache.hudi.keygen;
 
-import org.apache.avro.generic.GenericRecord;
 import org.apache.hudi.ApiMaturityLevel;
 import org.apache.hudi.AvroConversionUtils;
-import org.apache.hudi.HoodieSparkUtils;
 import org.apache.hudi.PublicAPIMethod;
-import org.apache.hudi.client.utils.SparkRowSerDe;
 import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.util.collection.Pair;
 import org.apache.hudi.exception.HoodieIOException;
 import org.apache.hudi.exception.HoodieKeyException;
+
+import org.apache.avro.generic.GenericRecord;
 import org.apache.spark.sql.Row;
 import org.apache.spark.sql.catalyst.InternalRow;
 import org.apache.spark.sql.types.DataType;
 import org.apache.spark.sql.types.StructType;
-import scala.Function1;
 
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
+import java.util.concurrent.atomic.AtomicBoolean;
+
+import scala.Function1;
 
 /**
  * Base class for the built-in key generators. Contains methods structured for
  * code reuse amongst them.
  */
 public abstract class BuiltinKeyGenerator extends BaseKeyGenerator implements 
SparkKeyGeneratorInterface {
 
+  private static final String DOT_STRING = ".";
   private static final String STRUCT_NAME = "hoodieRowTopLevelField";
   private static final String NAMESPACE = "hoodieRow";
-  private transient Function1 converterFn = null;
-  private SparkRowSerDe sparkRowSerDe;
+  private Function1 converterFn = null;
   protected StructType structType;
+  private static AtomicBoolean validatePartitionFields = new 
AtomicBoolean(false);

Review Comment:
   Why is this static?



##
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/keygen/BuiltinKeyGenerator.java:
##
@@ -97,87 +101,69 @@ public String getPartitionPath(Row row) {
   @PublicAPIMethod(maturity = ApiMaturityLevel.EVOLVING)
   public String getPartitionPath(InternalRow internalRow, StructType 
structType) {
 try {
-  initDeserializer(structType);
-  Row row = sparkRowSerDe.deserializeRow(internalRow);
-  return getPartitionPath(row);
+  buildFieldSchemaInfoIfNeeded(structType);
+  return 
RowKeyGeneratorHelper.getPartitionPathFromInternalRow(internalRow, 
getPartitionPathFields(),
+  hiveStylePartitioning, partitionPathSchemaInfo);
 } catch (Exception e) {
   throw new HoodieIOException("Conversion of InternalRow to Row failed 
with exception " + e);
 }
   }
 
-  private void initDeserializer(StructType structType) {
-if (sparkRowSerDe == null) {
-  sparkRowSerDe = HoodieSparkUtils.getDeserializer(structType);
-}
-  }
-
-  void buildFieldPositionMapIfNeeded(StructType structType) {
+  void buildFieldSchemaInfoIfNeeded(StructType structType) {
 if (this.structType == null) {
-  // parse simple fields
-  getRecordKeyFields().stream()
-  .filter(f -> !(f.contains(".")))
+  getRecordKeyFields()
+   

[GitHub] [hudi] alexeykudinkin commented on issue #5494: [SUPPORT] Hudi 0.11.0 HoodieDeltaStreamer failing to start with error : java.lang.NoSuchFieldError: DROP_PARTITION_COLUMNS

2022-05-04 Thread GitBox


alexeykudinkin commented on issue #5494:
URL: https://github.com/apache/hudi/issues/5494#issuecomment-1117987755

   @BalaMahesh I've tried to reproduce this issue using your steps, and I'm not 
able to -- DeltaStreamer is starting as expected. Can you please provide the 
full Spark Driver log (feel free to strip out all sensitive info from it)?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [hudi] brysd opened a new issue, #5496: [SUPPORT] java.lang.ClassNotFoundException: org.apache.hudi.org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$TokenIdentifier

2022-05-04 Thread GitBox


brysd opened a new issue, #5496:
URL: https://github.com/apache/hudi/issues/5496

   **Spark submit fails immediately with hudi-spark3.2-bundle_2.12:0.11.0 and 
Kerberos authentication**
   
   Executing the following on our environment results in the above-mentioned 
error:
   ``` shell
   
   /usr/bin/spark3-submit --packages 
org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 --conf 
"spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf 
"spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension" 
--conf 
"spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog"
 --num-executors 4 --principal v...@bda2.vdab.be --keytab vdp2.keytab 
test_hudi_schema_evolution.py
   ```
   
   code in python script:
   
   ``` python
   import pyspark
   
   from pyspark.sql import SparkSession
   from pyspark.sql.types import StructType, StructField, StringType, 
IntegerType, BooleanType
   
   spark = SparkSession.builder.appName('testHudiSchemaEvolution') \
   .getOrCreate()
   
   ```
   
   Maybe we need something extra, and this may be related to Kerberos 
authentication. In the logs, however, we can see that we are correctly 
authenticated.
   
   
   **To Reproduce**
   
   Not sure how easy it is to reproduce this - we also apply Kerberos 
authentication through a keytab file, as you can see in the spark3-submit 
command, but basically we don't get past the basic session getOrCreate.
   
   
   **Expected behavior**
   
   No exceptions.
   
   **Environment Description**
   
   * Hudi version : 0.11.0
   
   * Spark version : 3.2
   
   * Hive version : 3.1.3000
   
   * Hadoop version : 3.1.1.7
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   Running Kerberos authentication with a keytab file.
   
   **Stacktrace**
   
   Exception thrown:
   
   ``` shell
   Traceback (most recent call last):
 File "/home/dbrys1/test_hudi_schema_evolution.py", line 22, in 
   spark = SparkSession.builder.appName('testHudiSchemaEvolution') \
 File 
"/opt/cloudera/parcels/SPARK3-3.2.0.3.2.7170.0-49-1.p0.18822714/lib/spark3/python/lib/pyspark.zip/pyspark/sql/session.py",
 line 228, in getOrCreate
 File 
"/opt/cloudera/parcels/SPARK3-3.2.0.3.2.7170.0-49-1.p0.18822714/lib/spark3/python/lib/pyspark.zip/pyspark/context.py",
 line 392, in getOrCreate
 File 
"/opt/cloudera/parcels/SPARK3-3.2.0.3.2.7170.0-49-1.p0.18822714/lib/spark3/python/lib/pyspark.zip/pyspark/context.py",
 line 147, in __init__
 File 
"/opt/cloudera/parcels/SPARK3-3.2.0.3.2.7170.0-49-1.p0.18822714/lib/spark3/python/lib/pyspark.zip/pyspark/context.py",
 line 209, in _do_init
 File 
"/opt/cloudera/parcels/SPARK3-3.2.0.3.2.7170.0-49-1.p0.18822714/lib/spark3/python/lib/pyspark.zip/pyspark/context.py",
 line 329, in _initialize_context
 File 
"/opt/cloudera/parcels/SPARK3-3.2.0.3.2.7170.0-49-1.p0.18822714/lib/spark3/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py",
 line 1574, in __call__
 File 
"/opt/cloudera/parcels/SPARK3-3.2.0.3.2.7170.0-49-1.p0.18822714/lib/spark3/python/lib/py4j-0.10.9.2-src.zip/py4j/protocol.py",
 line 328, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling 
None.org.apache.spark.api.java.JavaSparkContext.
   : java.lang.NoClassDefFoundError: 
org/apache/hudi/org/apache/hadoop/hbase/protobuf/generated/AuthenticationProtos$TokenIdentifier
   at 
org.apache.hudi.org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier.readFields(AuthenticationTokenIdentifier.java:142)
   at 
org.apache.hadoop.security.token.Token.decodeIdentifier(Token.java:192)
   at 
org.apache.hadoop.security.token.Token.identifierToString(Token.java:444)
   at org.apache.hadoop.security.token.Token.toString(Token.java:464)
   at 
org.apache.spark.deploy.security.HBaseDelegationTokenProvider.$anonfun$obtainDelegationTokens$2(HBaseDelegationTokenProvider.scala:52)
   at org.apache.spark.internal.Logging.logInfo(Logging.scala:57)
   at org.apache.spark.internal.Logging.logInfo$(Logging.scala:56)
   at 
org.apache.spark.deploy.security.HBaseDelegationTokenProvider.logInfo(HBaseDelegationTokenProvider.scala:34)
   at 
org.apache.spark.deploy.security.HBaseDelegationTokenProvider.obtainDelegationTokens(HBaseDelegationTokenProvider.scala:52)
   at 
org.apache.spark.deploy.security.HadoopDelegationTokenManager.$anonfun$obtainDelegationTokens$2(HadoopDelegationTokenManager.scala:164)
   at 
scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
   at scala.collection.Iterator.foreach(Iterator.scala:941)
   at scala.collection.Iterator.foreach$(Iterator.scala:941)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   at 
scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:213)
   at 
scala.col

[jira] [Updated] (HUDI-4002) HoodieWrapperFileSystem class cast issue

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4002:
--
Summary: HoodieWrapperFileSystem class cast issue  (was: 
org.apache.hudi.common.fs.HoodieWrapperFileSystem class cast issue)

> HoodieWrapperFileSystem class cast issue
> 
>
> Key: HUDI-4002
> URL: https://issues.apache.org/jira/browse/HUDI-4002
> Project: Apache Hudi
>  Issue Type: Task
>  Components: flink, writer-core
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.11.1
>
>
> https://github.com/apache/hudi/issues/5457



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #5473: [HUDI-4003] Try to read all the log file to parse schema

2022-05-04 Thread GitBox


alexeykudinkin commented on code in PR #5473:
URL: https://github.com/apache/hudi/pull/5473#discussion_r865347231


##
hudi-common/src/main/java/org/apache/hudi/common/table/TableSchemaResolver.java:
##
@@ -109,13 +110,18 @@ private MessageType getTableParquetSchemaFromDataFile() {
   // Determine the file format based on the file name, and then 
extract schema from it.
   if (instantAndCommitMetadata.isPresent()) {
 HoodieCommitMetadata commitMetadata = 
instantAndCommitMetadata.get().getRight();
-String filePath = 
commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().stream().findAny().get();
-if 
(filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
-  // this is a log file
-  return readSchemaFromLogFile(new Path(filePath));
-} else {
-  return readSchemaFromBaseFile(filePath);
+Iterator filePaths = 
commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
+MessageType type = null;
+while (filePaths.hasNext() && type == null) {

Review Comment:
   Let's unify this behavior across both MOR/COW



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Updated] (HUDI-4003) Flink offline compaction may cause NPE when log file only contains delete operations

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4003:
--
Reviewers: Alexey Kudinkin

> Flink offline compaction may cause NPE when log file only contains delete 
> operations
> ---
>
> Key: HUDI-4003
> URL: https://issues.apache.org/jira/browse/HUDI-4003
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Affects Versions: 0.11.0
>Reporter: lanyuanxiaoyao
>Priority: Critical
>  Labels: pull-request-available
>
> Environment: Hudi 0.12.0 (Latest master), Flink 1.13.3, JDK 8
> My test:
>  # Two partitions: p1, p2
>  # Write data such that p1 only has delete records and p2 only has update 
> records
>  # Run offline compaction and it causes an NPE
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
>     at 
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:264)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.convertParquetSchemaToAvro(TableSchemaResolver.java:341)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaFromDataFile(TableSchemaResolver.java:148)
>     at 
> org.apache.hudi.util.CompactionUtil.inferChangelogMode(CompactionUtil.java:131)
>     at 
> org.apache.hudi.sink.compact.HoodieFlinkCompactor$AsyncCompactionService.(HoodieFlinkCompactor.java:173)
>     at com.lanyuanxiaoyao.Compactor.main(Compactor.java:25) {code}
> Reason & Resolution:
>  # Flink offline compaction gets the schema from the latest data file to 
> check whether the '_hoodie_operation' field has been set or not 
> (org.apache.hudi.util.CompactionUtil#inferChangelogMode).
>  # For a MOR table, it may get the schema from a log file chosen at random. 
> But if it chooses a log file that only contains delete operations, the code 
> will get NULL as the result. 
> (org.apache.hudi.common.table.TableSchemaResolver#readSchemaFromLogFile(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path))
>  # Finally, it throws an NPE when the code wants to get the schema name. 
> (org.apache.parquet.avro.AvroSchemaConverter#convert(org.apache.parquet.schema.MessageType))
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log 
> file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract 
> schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
> HoodieCommitMetadata commitMetadata = 
> instantAndCommitMetadata.get().getRight();
> String filePath = 
> commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().stream().findAny().get();
> if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
>   // this is a log file
>   return readSchemaFromLogFile(new Path(filePath));
> } else {
>   return readSchemaFromBaseFile(filePath);
> }
>   } {code}
> I think the code can try another log file to parse the schema when it gets 
> NULL from a log file.
> My solution is to make the code scan all the file paths to parse the schema 
> until one succeeds.
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log 
> file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract 
> schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
> HoodieCommitMetadata commitMetadata = 
> instantAndCommitMetadata.get().getRight();
> Iterator filePaths = 
> commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
> MessageType type = null;
> while (filePaths.hasNext() && type == null) {
>   String filePath = filePaths.next();
>   if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
> // this is a log file
> type = readSchemaFromLogFile(new Path(filePath));
>   } else {
> type = readSchemaFromBaseFile(filePath);
>   }
> }
> return type;
>   } {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4003) Flink offline compaction may cause NPE when log file only contains delete operation

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4003:
--
Affects Version/s: 0.11.0
   (was: 0.11.1)

> Flink offline compaction may cause NPE when log file only contains delete 
> operation
> ---
>
> Key: HUDI-4003
> URL: https://issues.apache.org/jira/browse/HUDI-4003
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: compaction, flink
>Affects Versions: 0.11.0
>Reporter: lanyuanxiaoyao
>Priority: Critical
>  Labels: pull-request-available
>
> Environment: Hudi 0.12.0 (Latest master), Flink 1.13.3, JDK 8
> My test:
>  # Two partitions: p1, p2
>  # Write data so that p1 contains only delete records and p2 contains only 
> update records
>  # Run offline compaction, which causes an NPE
> {code:java}
> Exception in thread "main" java.lang.NullPointerException
>     at 
> org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:264)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.convertParquetSchemaToAvro(TableSchemaResolver.java:341)
>     at 
> org.apache.hudi.common.table.TableSchemaResolver.getTableAvroSchemaFromDataFile(TableSchemaResolver.java:148)
>     at 
> org.apache.hudi.util.CompactionUtil.inferChangelogMode(CompactionUtil.java:131)
>     at 
> org.apache.hudi.sink.compact.HoodieFlinkCompactor$AsyncCompactionService.<init>(HoodieFlinkCompactor.java:173)
>     at com.lanyuanxiaoyao.Compactor.main(Compactor.java:25) {code}
> Reason & Resolution:
>  # Flink offline compaction gets the schema from the latest data file to 
> check whether the '_hoodie_operation' field has been set 
> (org.apache.hudi.util.CompactionUtil#inferChangelogMode).
>  # For a MOR table, it may read the schema from a log file chosen at random. 
> If it picks a log file that only contains delete operations, the code gets 
> NULL as the result 
> (org.apache.hudi.common.table.TableSchemaResolver#readSchemaFromLogFile(org.apache.hadoop.fs.FileSystem,
>  org.apache.hadoop.fs.Path)).
>  # Finally, it throws an NPE when the code tries to get the schema name 
> (org.apache.parquet.avro.AvroSchemaConverter#convert(org.apache.parquet.schema.MessageType)).
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
>     HoodieCommitMetadata commitMetadata = instantAndCommitMetadata.get().getRight();
>     String filePath = commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().stream().findAny().get();
>     if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
>       // this is a log file
>       return readSchemaFromLogFile(new Path(filePath));
>     } else {
>       return readSchemaFromBaseFile(filePath);
>     }
>   } {code}
> I think the code can try another log file to parse the schema when it gets 
> NULL from a log file.
> My solution is to make the code scan all of the file paths and try to parse 
> the schema from each one until it succeeds.
> {code:java}
> case MERGE_ON_READ:
>   // For MOR table, the file has data written may be a parquet file, .log file, orc file or hfile.
>   // Determine the file format based on the file name, and then extract schema from it.
>   if (instantAndCommitMetadata.isPresent()) {
>     HoodieCommitMetadata commitMetadata = instantAndCommitMetadata.get().getRight();
>     Iterator<String> filePaths = commitMetadata.getFileIdAndFullPaths(metaClient.getBasePath()).values().iterator();
>     MessageType type = null;
>     while (filePaths.hasNext() && type == null) {
>       String filePath = filePaths.next();
>       if (filePath.contains(HoodieFileFormat.HOODIE_LOG.getFileExtension())) {
>         // this is a log file
>         type = readSchemaFromLogFile(new Path(filePath));
>       } else {
>         type = readSchemaFromBaseFile(filePath);
>       }
>     }
>     return type;
>   } {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HUDI-4006) Fail on data loss semantics for deltastreamer Kafka sources

2022-05-04 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531931#comment-17531931
 ] 

Alexey Kudinkin commented on HUDI-4006:
---

[~jqi] can you please summarize here what your proposed solution is? 

> Fail on data loss semantics for deltastreamer Kafka sources
> ---
>
> Key: HUDI-4006
> URL: https://issues.apache.org/jira/browse/HUDI-4006
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: deltastreamer
>Reporter: Ji Qi
>Assignee: Ji Qi
>Priority: Minor
> Fix For: 0.12.0
>
>
> See https://github.com/apache/hudi/issues/5400 for more details
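> For background, the analogous knob in Spark's own Kafka source (shown only 
> to illustrate the requested semantics; this is not Hudi's API) is 
> failOnDataLoss, a sketch of which follows:
> {code:java}
> // Sketch: 'spark' is an existing SparkSession. Spark's Kafka source fails
> // the query when requested offsets are no longer available (e.g. aged out
> // by retention) instead of silently skipping ahead.
> Dataset<Row> kafkaDf = spark.readStream()
>     .format("kafka")
>     .option("kafka.bootstrap.servers", "host:9092")
>     .option("subscribe", "some_topic")
>     .option("failOnDataLoss", "true")
>     .load();
> {code}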



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[GitHub] [hudi] hudi-bot commented on pull request #5462: [HUDI-3995] Making pref optimizations for bulk insert row writer path

2022-05-04 Thread GitBox


hudi-bot commented on PR #5462:
URL: https://github.com/apache/hudi/pull/5462#issuecomment-1117870868

   
   ## CI report:
   
   * 97e6a89457322f13235d8cbe011357fab5ae67a4 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=8424)
 
   
   
   Bot commands: @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Closed] (HUDI-4008) Delta Streamer does not respect hoodie.datasource.write.drop.partition.columns config

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin closed HUDI-4008.
-
Fix Version/s: 0.11.0
   Resolution: Duplicate

> Delta Streamer does not respect 
> hoodie.datasource.write.drop.partition.columns config
> -
>
> Key: HUDI-4008
> URL: https://issues.apache.org/jira/browse/HUDI-4008
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS EMR 6.5 / HUDI v0.10.1
>Reporter: Istvan Darvas
>Priority: Major
> Fix For: 0.11.0
>
>
> Hi Guys!
>  
> I have set this parameter in DeltaStreamer
>     hoodie.datasource.write.drop.partition.columns=true
> but the parquet files that were written by DeltaStreamer still contain the 
> partition columns.
>  
> I also have to say that the Spark API respects this parameter, so what I 
> wrote with spark.write was OK.
>  
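> For reference, a minimal sketch (the table name, path, and field names are 
> placeholders) of the datasource write that does honor the config:
> {code:java}
> // 'df' is an existing Dataset<Row>; this is the spark.write path that
> // drops the partition columns from the written parquet files.
> df.write().format("hudi")
>   .option("hoodie.table.name", "test_table")
>   .option("hoodie.datasource.write.recordkey.field", "id")
>   .option("hoodie.datasource.write.partitionpath.field", "date")
>   .option("hoodie.datasource.write.drop.partition.columns", "true")
>   .mode(SaveMode.Append)
>   .save("hdfs://test/test_table");
> {code}
>  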
> Darvi
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HUDI-4008) Delta Streamer does not respect hoodie.datasource.write.drop.partition.columns config

2022-05-04 Thread Alexey Kudinkin (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-4008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531925#comment-17531925
 ] 

Alexey Kudinkin commented on HUDI-4008:
---

[~Darvi77] please try this out w/ 0.11

> Delta Streamer does not respect 
> hoodie.datasource.write.drop.partition.columns config
> -
>
> Key: HUDI-4008
> URL: https://issues.apache.org/jira/browse/HUDI-4008
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.10.1
> Environment: AWS EMR 6.5 / HUDI v0.10.1
>Reporter: Istvan Darvas
>Priority: Major
> Fix For: 0.11.0
>
>
> Hi Guys!
>  
> I have set this parameter in DeltaStreamer
>     hoodie.datasource.write.drop.partition.columns=true
> but the parquet files that were written by DeltaStreamer still contain the 
> partition columns.
>  
> I also have to say that the Spark API respects this parameter, so what I 
> wrote with spark.write was OK.
>  
> Darvi
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4037) Packaging

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4037:
--
Fix Version/s: 0.12.0

> Packaging
> -
>
> Key: HUDI-4037
> URL: https://issues.apache.org/jira/browse/HUDI-4037
> Project: Apache Hudi
>  Issue Type: Epic
>Reporter: Alexey Kudinkin
>Priority: Major
> Fix For: 0.12.0
>
>
> This is an epic dedicated to the work on packaging various Hudi artifacts 
> (including bundling, compatibility, etc.)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4023) de-couple spark from utilities bundle

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4023:
--
Epic Link: HUDI-4037

> de-couple spark from utilities bundle
> -
>
> Key: HUDI-4023
> URL: https://issues.apache.org/jira/browse/HUDI-4023
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: dependencies
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Major
> Fix For: 0.12.0
>
>
> we should be able to combine 
> utilities-slim + any of the spark/presto/trino/kafka bundles + any of the 
> sync bundles. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4011) Add a Hudi AWS bundle

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4011:
--
Epic Link: HUDI-4037

> Add a Hudi AWS bundle
> -
>
> Key: HUDI-4011
> URL: https://issues.apache.org/jira/browse/HUDI-4011
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Udit Mehrotra
>Assignee: Wenning Ding
>Priority: Major
> Fix For: 0.12.0
>
>
> As was raised in [https://github.com/apache/hudi/issues/5451], the Hudi AWS 
> jars were moved out of hudi-spark-bundle. Hence, customers need to manually 
> pass jars like the DynamoDb lock client, the DynamoDb AWS SDK, etc. to be 
> able to use the DynamoDb lock provider implementation.
> We need an AWS-specific bundle that packages these dependencies to make it 
> easier for customers. They can use this bundle along with hudi-spark-bundle 
> when they need to use the DynamoDb lock provider.
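>  
> For context, a sketch (the lock table name, region, and other values are 
> placeholders, and it assumes the AWS jars are on the classpath) of the write 
> configs that engage the DynamoDb lock provider:
> {code:java}
> // 'df' is an existing Dataset<Row>; these options enable optimistic
> // concurrency control backed by a DynamoDb lock table.
> df.write().format("hudi")
>   .option("hoodie.table.name", "test_table")
>   .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
>   .option("hoodie.write.lock.provider",
>       "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider")
>   .option("hoodie.write.lock.dynamodb.table", "hudi_locks")
>   .option("hoodie.write.lock.dynamodb.partition_key", "test_table")
>   .option("hoodie.write.lock.dynamodb.region", "us-east-1")
>   .mode(SaveMode.Append)
>   .save("hdfs://test/test_table");
> {code}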



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HUDI-4037) Packaging

2022-05-04 Thread Alexey Kudinkin (Jira)
Alexey Kudinkin created HUDI-4037:
-

 Summary: Packaging
 Key: HUDI-4037
 URL: https://issues.apache.org/jira/browse/HUDI-4037
 Project: Apache Hudi
  Issue Type: Epic
Reporter: Alexey Kudinkin


This is an epic dedicated to the work on packaging various Hudi artifacts 
(including bundling, compatibility, etc.)



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-3303) CI Improvements

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-3303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-3303:
--
Summary: CI Improvements  (was: CI test Improvements)

> CI Improvements
> ---
>
> Key: HUDI-3303
> URL: https://issues.apache.org/jira/browse/HUDI-3303
> Project: Apache Hudi
>  Issue Type: Epic
>  Components: tests-ci
>Reporter: Raymond Xu
>Priority: Blocker
> Fix For: 0.11.1, 0.12.0
>
>
> Automate tests that need to be manually performed before releases.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4012) Evaluate support for spark2 and scala12

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4012:
--
Epic Link: HUDI-3303  (was: HUDI-1250)

> Evaluate support for spark2 and scala12
> ---
>
> Key: HUDI-4012
> URL: https://issues.apache.org/jira/browse/HUDI-4012
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: spark
>Reporter: sivabalan narayanan
>Priority: Major
>
> Evaluate support for spark2 and scala12, and maybe deprecate the usage going 
> forward. 
>  
> Check stats from Nexus. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4013) Document all manual tests done as part of 0.11 release certification

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4013:
--
Epic Link: HUDI-3303  (was: HUDI-1250)

> Document all manual tests done as part of 0.11 release certification 
> -
>
> Key: HUDI-4013
> URL: https://issues.apache.org/jira/browse/HUDI-4013
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (HUDI-4014) Document all backwards compatibility testing w/ scripts and commands

2022-05-04 Thread Alexey Kudinkin (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4014:
--
Epic Link: HUDI-3303  (was: HUDI-1250)

> Document all backwards compatibility testing w/ scripts and commands
> 
>
> Key: HUDI-4014
> URL: https://issues.apache.org/jira/browse/HUDI-4014
> Project: Apache Hudi
>  Issue Type: Test
>  Components: tests-ci
>Reporter: sivabalan narayanan
>Assignee: Ethan Guo
>Priority: Major
>
> As part of 0.11 release testing, we did a bunch of backwards compatibility 
> testing. Let's document all of it with steps and commands where feasible; 
> until we automate these, we do not want to spend time researching the same 
> things again. 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

