[GitHub] [hudi] hudi-bot edited a comment on pull request #3764: [MINOR] Fix typo,'paritition' corrected to 'partition'

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3764:
URL: https://github.com/apache/hudi/pull/3764#issuecomment-938381930


   
   ## CI report:
   
   * 264e8f48602d37c833edddab90df9cd534fb2277 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2540)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot commented on pull request #3764: [MINOR] Fix typo,'paritition' corrected to 'partition'

2021-10-07 Thread GitBox


hudi-bot commented on pull request #3764:
URL: https://github.com/apache/hudi/pull/3764#issuecomment-938381930


   
   ## CI report:
   
   * 264e8f48602d37c833edddab90df9cd534fb2277 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] dongkelun opened a new pull request #3764: [MINOR] Fix typo,'paritition' corrected to 'partition'

2021-10-07 Thread GitBox


dongkelun opened a new pull request #3764:
URL: https://github.com/apache/hudi/pull/3764


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   *Fix typo,'paritition' corrected to 'partition'*
   
   ## Brief change log
   
   *(for example:)*
 - *Fix typo,'paritition' corrected to 'partition'*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] zhangyue19921010 commented on a change in pull request #3719: [HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requ

2021-10-07 Thread GitBox


zhangyue19921010 commented on a change in pull request #3719:
URL: https://github.com/apache/hudi/pull/3719#discussion_r724726381



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieROTablePathFilter.java
##
@@ -175,8 +181,12 @@ public boolean accept(Path path) {
 metaClientCache.put(baseDir.toString(), metaClient);
   }
 
-  fsView = FileSystemViewManager.createInMemoryFileSystemView(engineContext,
-  metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf()));
+  fsView = hoodieTableFileSystemViewCache.get(baseDir.toString());
+  if (null == fsView) {
+fsView = FileSystemViewManager.createInMemoryFileSystemView(engineContext, metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf()));
+hoodieTableFileSystemViewCache.put(baseDir.toString(), fsView);

Review comment:
   Sure thing. Changed.
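
   For reference, a minimal sketch of the get-or-create caching pattern discussed in this diff. It assumes `hoodieTableFileSystemViewCache` is a plain map keyed by the table base path; the `ConcurrentHashMap`/`computeIfAbsent` choice and the helper method name are assumptions for illustration, not necessarily what the PR does:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative class fragment; engineContext, metaClient and getConf() are the
// existing members/methods of HoodieROTablePathFilter referenced in the diff.
private final Map<String, HoodieTableFileSystemView> hoodieTableFileSystemViewCache =
    new ConcurrentHashMap<>();

private HoodieTableFileSystemView getOrCreateFsView(String baseDir, HoodieTableMetaClient metaClient) {
  // Create the in-memory view only on the first access for this base path and
  // reuse it afterwards, avoiding repeated list/get calls against storage.
  return hoodieTableFileSystemViewCache.computeIfAbsent(baseDir, key ->
      FileSystemViewManager.createInMemoryFileSystemView(
          engineContext, metaClient, HoodieInputFormatUtils.buildMetadataConfig(getConf())));
}
```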




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinishjail97 commented on pull request #3763: [MINOR] - Fixed typo in docker demo docs for kafkacat -> kcat

2021-10-07 Thread GitBox


vinishjail97 commented on pull request #3763:
URL: https://github.com/apache/hudi/pull/3763#issuecomment-938368030


   Thanks Kyle, the change LGTM. I am not part of the Hudi reviewers group; 
@vinothchandar will stamp it, I guess?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] kywe665 opened a new pull request #3763: [MINOR] - Fixed typo in docker demo docs for kafkacat -> kcat

2021-10-07 Thread GitBox


kywe665 opened a new pull request #3763:
URL: https://github.com/apache/hudi/pull/3763


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contribute/how-to-contribute before 
opening a pull request.*
   
   ## What is the purpose of the pull request
   
   The [Docker Demo documentation](https://hudi.apache.org/docs/docker_demo) is 
out of date. "kafkacat" no longer works as of Aug 2021 and needs to be replaced 
with "kcat". See this link for why: 
https://github.com/edenhill/kcat#what-happened-to-kafkacat
   
   ## Brief change log
   
 - *replaced all instances of kafkacat with kcat in .docker-demo.md*
   
   ## Verify this pull request
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3623: [WIP][HUDI-2409] Using HBase shaded jars in Hudi presto bundle

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3623:
URL: https://github.com/apache/hudi/pull/3623#issuecomment-915056982


   
   ## CI report:
   
   * 44b255665f688477279fce5d07bf29c5537b7f05 UNKNOWN
   * 20c9cfdb70b3652c80fc4339789285473f6a7cbc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2539)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3623: [WIP][HUDI-2409] Using HBase shaded jars in Hudi presto bundle

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3623:
URL: https://github.com/apache/hudi/pull/3623#issuecomment-915056982


   
   ## CI report:
   
   * 78577241f38f2021052fb62c8c19ed67d0db012e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2538)
 
   * 44b255665f688477279fce5d07bf29c5537b7f05 UNKNOWN
   * 20c9cfdb70b3652c80fc4339789285473f6a7cbc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2539)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3623: [WIP][HUDI-2409] Using HBase shaded jars in Hudi presto bundle

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3623:
URL: https://github.com/apache/hudi/pull/3623#issuecomment-915056982


   
   ## CI report:
   
   * 78577241f38f2021052fb62c8c19ed67d0db012e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2538)
 
   * 44b255665f688477279fce5d07bf29c5537b7f05 UNKNOWN
   * 20c9cfdb70b3652c80fc4339789285473f6a7cbc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3623: [WIP][HUDI-2409] Using HBase shaded jars in Hudi presto bundle

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3623:
URL: https://github.com/apache/hudi/pull/3623#issuecomment-915056982


   
   ## CI report:
   
   * 78577241f38f2021052fb62c8c19ed67d0db012e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2538)
 
   * 44b255665f688477279fce5d07bf29c5537b7f05 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3719: [HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3719:
URL: https://github.com/apache/hudi/pull/3719#issuecomment-927270024


   
   ## CI report:
   
   * 82b6fa38d0f2e8fca6a7804a550b89a4328644f2 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2537)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3623: [WIP][HUDI-2409] Using HBase shaded jars in Hudi presto bundle

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3623:
URL: https://github.com/apache/hudi/pull/3623#issuecomment-915056982


   
   ## CI report:
   
   * 72fc50ea33d6267ebdc9a0ecd81cb4df3c833814 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2320)
 
   * 78577241f38f2021052fb62c8c19ed67d0db012e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2538)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3623: [WIP][HUDI-2409] Using HBase shaded jars in Hudi presto bundle

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3623:
URL: https://github.com/apache/hudi/pull/3623#issuecomment-915056982


   
   ## CI report:
   
   * 72fc50ea33d6267ebdc9a0ecd81cb4df3c833814 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2320)
 
   * 78577241f38f2021052fb62c8c19ed67d0db012e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3719: [HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3719:
URL: https://github.com/apache/hudi/pull/3719#issuecomment-927270024


   
   ## CI report:
   
   * f4d2fc3d664279975d143494274b787c6a6d5db1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2401)
 
   * 82b6fa38d0f2e8fca6a7804a550b89a4328644f2 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2537)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3719: [HUDI-2489]Tuning HoodieROTablePathFilter by caching hoodieTableFileSystemView, aiming to reduce unnecessary list/get requests

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3719:
URL: https://github.com/apache/hudi/pull/3719#issuecomment-927270024


   
   ## CI report:
   
   * f4d2fc3d664279975d143494274b787c6a6d5db1 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2401)
 
   * 82b6fa38d0f2e8fca6a7804a550b89a4328644f2 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #3203: [HUDI-2086] Redo the logical of mor_incremental_view for hive

2021-10-07 Thread GitBox


danny0405 commented on a change in pull request #3203:
URL: https://github.com/apache/hudi/pull/3203#discussion_r724668250



##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergedLogReader.java
##
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.realtime;
+
+import java.io.IOException;
+import java.text.MessageFormat;
+import java.util.Iterator;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+/**
+ * Record Reader implementation to read avro data, to support inc queries.
+ */
+public class HoodieMergedLogReader extends AbstractRealtimeRecordReader
+implements RecordReader {
+  private static final Logger LOG = 
LogManager.getLogger(AbstractRealtimeRecordReader.class);
+  private final HoodieMergedLogRecordScanner logRecordScanner;
+  private final Iterator> 
logRecordsKeyIterator;
+  private ArrayWritable valueObj;

Review comment:
   `logRecordsKeyIterator ` => `iterator `

##
File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/HoodieMergedLogReader.java
##
@@ -0,0 +1,144 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.hadoop.realtime;
+
+import java.io.IOException;
+import java.text.MessageFormat;
+import java.util.Iterator;
+
+import org.apache.avro.generic.GenericRecord;
+import org.apache.hadoop.io.ArrayWritable;
+import org.apache.hadoop.io.NullWritable;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.RecordReader;
+import org.apache.hudi.avro.HoodieAvroUtils;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.log.HoodieMergedLogRecordScanner;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.hadoop.utils.HoodieRealtimeRecordReaderUtils;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+
+/**
+ * Record Reader implementation to read avro data, to support inc queries.
+ */
+public class HoodieMergedLogReader extends AbstractRealtimeRecordReader
+implements RecordReader {
+  private static final Logger LOG = 
LogManager.getLogger(AbstractRealtimeRecordReader.class);
+  private final HoodieMergedLogRecordScanner logRecordScanner;
+  private final Iterator> 
logRecordsKeyIterator;
+  private ArrayWritable valueObj;
+
+  private int end;
+  private int offset;
+
+  public HoodieMergedLogReader(RealtimeSplit split, JobConf job, 
HoodieMergedLogRecordScanner logRecordScanner) {
+super(split, job);
+this.logRecordScanner = logRecordScanner;
+this.end = logRecordScanner.getRecords().size();
+this.logRecordsKeyIterator = logRecordScanner.iterator();
+this.valueObj = new ArrayW

[GitHub] [hudi] nsivabalan commented on pull request #3203: [HUDI-2086] Redo the logical of mor_incremental_view for hive

2021-10-07 Thread GitBox


nsivabalan commented on pull request #3203:
URL: https://github.com/apache/hudi/pull/3203#issuecomment-938298379


   Sorry for the long delay. I will review this by this weekend. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] xushiyan commented on a change in pull request #3416: [HUDI-2362] Add external config file support

2021-10-07 Thread GitBox


xushiyan commented on a change in pull request #3416:
URL: https://github.com/apache/hudi/pull/3416#discussion_r724653630



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/config/DFSPropertiesConfiguration.java
##
@@ -43,70 +45,88 @@
 
   private static final Logger LOG = 
LogManager.getLogger(DFSPropertiesConfiguration.class);
 
+  private static final String DEFAULT_PROPERTIES_FILE = "hudi-defaults.conf";
+
+  public static final String CONF_FILE_DIR_ENV_NAME = "HUDI_CONF_DIR";
+
+  // props read from hudi-defaults.conf
+  private static final TypedProperties HUDI_CONF_PROPS = loadGlobalProps();
+
   private final FileSystem fs;
 
-  private final Path rootFile;
+  private Path currentFilePath;
 
+  // props read from user defined configuration file or input stream
   private final TypedProperties props;
 
   // Keep track of files visited, to detect loops
-  private final Set visitedFiles;
+  private final Set visitedFilePaths;
 
-  public DFSPropertiesConfiguration(FileSystem fs, Path rootFile, 
TypedProperties defaults) {
+  public DFSPropertiesConfiguration(FileSystem fs, Path filePath) {
 this.fs = fs;
-this.rootFile = rootFile;
-this.props = defaults;
-this.visitedFiles = new HashSet<>();
-visitFile(rootFile);
-  }
-
-  public DFSPropertiesConfiguration(FileSystem fs, Path rootFile) {
-this(fs, rootFile, new TypedProperties());
+this.currentFilePath = filePath;
+this.props = new TypedProperties();
+this.visitedFilePaths = new HashSet<>();
+addPropsFromFile(filePath);
   }
 
   public DFSPropertiesConfiguration() {
 this.fs = null;
-this.rootFile = null;
+this.currentFilePath = null;
 this.props = new TypedProperties();
-this.visitedFiles = new HashSet<>();
+this.visitedFilePaths = new HashSet<>();
   }
 
-  private String[] splitProperty(String line) {
-int ind = line.indexOf('=');
-String k = line.substring(0, ind).trim();
-String v = line.substring(ind + 1).trim();
-return new String[] {k, v};
+  /**
+   * Load global props from hudi-defaults.conf which is under 
CONF_FILE_DIR_ENV_NAME.
+   * @return Typed Properties
+   */
+  public static TypedProperties loadGlobalProps() {
+DFSPropertiesConfiguration conf = new DFSPropertiesConfiguration();
+Path defaultConfPath = getDefaultConfPath();
+if (defaultConfPath != null) {
+  conf.addPropsFromFile(defaultConfPath);
+}
+return conf.getConfig();
   }
 
-  private void visitFile(Path file) {
+  /**
+   * Add properties from external configuration files.
+   *
+   * @param filePath File path for configuration file
+   */
+  public void addPropsFromFile(Path filePath) {
 try {
-  if (visitedFiles.contains(file.getName())) {
-throw new IllegalStateException("Loop detected; file " + file + " 
already referenced");
+  if (visitedFilePaths.contains(filePath.toString())) {
+throw new IllegalStateException("Loop detected; file " + filePath + " 
already referenced");
   }
-  visitedFiles.add(file.getName());
-  BufferedReader reader = new BufferedReader(new 
InputStreamReader(fs.open(file)));
-  addProperties(reader);
+  visitedFilePaths.add(filePath.toString());
+  currentFilePath = filePath;
+  FileSystem fileSystem = fs != null ? fs : filePath.getFileSystem(new 
Configuration());
+  BufferedReader reader = new BufferedReader(new 
InputStreamReader(fileSystem.open(filePath)));

Review comment:
   use try with resources to close?
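
   A minimal sketch of that suggestion for the lines above (names as in the diff; the surrounding try/catch of `addPropsFromFile` is assumed to stay in place):

```java
// Hedged sketch: try-with-resources closes the reader even if addProperties()
// throws, instead of leaving the stream open on the error path.
FileSystem fileSystem = fs != null ? fs : filePath.getFileSystem(new Configuration());
try (BufferedReader reader =
         new BufferedReader(new InputStreamReader(fileSystem.open(filePath)))) {
  addProperties(reader);
}
```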

##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/config/DFSPropertiesConfiguration.java
##
@@ -117,7 +137,49 @@ public void addProperties(BufferedReader reader) throws 
IOException {
 }
   }
 
+  public static TypedProperties getGlobalConfig() {
+final TypedProperties globalProps = new TypedProperties();
+globalProps.putAll(HUDI_CONF_PROPS);
+return globalProps;
+  }
+
   public TypedProperties getConfig() {
-return props;
+return getConfig(false);
+  }
+
+  public TypedProperties getConfig(Boolean includeHudiConf) {
+if (includeHudiConf) {
+  TypedProperties mergedProps = new TypedProperties();
+  mergedProps.putAll(HUDI_CONF_PROPS);
+  mergedProps.putAll(props);
+  return mergedProps;
+} else {
+  return props;
+}
+  }
+
+  private static Path getDefaultConfPath() {
+String confDir = System.getenv(CONF_FILE_DIR_ENV_NAME);
+if (confDir == null) {
+  LOG.warn("Cannot find " + CONF_FILE_DIR_ENV_NAME + ", please set it as 
the dir of " + DEFAULT_PROPERTIES_FILE);
+  return null;
+}
+return new Path("file://" + confDir + File.separator + 
DEFAULT_PROPERTIES_FILE);
+  }
+
+  private String[] splitProperty(String line) {
+line = line.replaceAll("\\s+"," ");
+String delimiter = line.contains("=") ? "=" : " ";
+int ind = line.indexOf(delimiter);
+String k = line.substring(0, ind).trim();
+String v = line.substrin

[GitHub] [hudi] danny0405 commented on a change in pull request #3203: [HUDI-2086] Redo the logical of mor_incremental_view for hive

2021-10-07 Thread GitBox


danny0405 commented on a change in pull request #3203:
URL: https://github.com/apache/hudi/pull/3203#discussion_r724660978



##
File path: hudi-common/src/main/java/org/apache/hudi/common/fs/FSUtils.java
##
@@ -336,6 +336,11 @@ public static String getFileExtensionFromLog(Path logPath) 
{
 return matcher.group(3);
   }
 
+  public static String getLogFileExtension(String fullName) {
+Matcher matcher = LOG_FILE_PATTERN.matcher(fullName);

Review comment:
   I think using `FSUtils.isLogFile` directly is more suitable.
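
   For illustration, the call site could look roughly like the sketch below (the exact `FSUtils.isLogFile` overload is not shown in this thread, so the `Path`-based form is an assumption):

```java
// Reuse the existing helper instead of adding a new extension-extracting method.
if (FSUtils.isLogFile(new Path(fullName))) {
  // handle the log file case
}
```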




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto edited a comment on issue #3751: [SUPPORT] Slow Write Speeds to Hudi

2021-10-07 Thread GitBox


rubenssoto edited a comment on issue #3751:
URL: https://github.com/apache/hudi/issues/3751#issuecomment-938290542


   @MikeBuh, since July 16 Athena has had full support for MoR tables:
   
   https://docs.aws.amazon.com/athena/latest/ug/release-note-2021-07-16.html
   
   https://docs.aws.amazon.com/athena/latest/ug/querying-hudi.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] rubenssoto commented on issue #3751: [SUPPORT] Slow Write Speeds to Hudi

2021-10-07 Thread GitBox


rubenssoto commented on issue #3751:
URL: https://github.com/apache/hudi/issues/3751#issuecomment-938290542


   @MikeBuh, since July 16 Athena has had full support for MoR tables:
   
   https://docs.aws.amazon.com/athena/latest/ug/release-note-2021-07-16.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3762: [HUDI-1294] Adding inline read and seekable read for hfile log blocks in metadata table

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#issuecomment-938271221


   
   ## CI report:
   
   * 5fb7a2afa196fd75ada005d26a0fb9fce5472545 UNKNOWN
   * cb7e9cea8fa966437a892be1e0917443c034e21e Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2536)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3761: [HUDI-2513] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3761:
URL: https://github.com/apache/hudi/pull/3761#issuecomment-938264265


   
   ## CI report:
   
   * cf20d97ab77a55797f1bcb4ee7dcb614681e8ae3 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2535)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] guanziyue commented on issue #3755: [Delta Streamer] file name mismatch with meta when compaction running

2021-10-07 Thread GitBox


guanziyue commented on issue #3755:
URL: https://github.com/apache/hudi/issues/3755#issuecomment-938281439


   It seems that the file left in the reconcile stage is different from the 
commit metadata. Could you kindly share the relevant logs and the file status of the marker files?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2275) HoodieDeltaStreamerException when using OCC and a second concurrent writer

2021-10-07 Thread Nishith Agarwal (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425912#comment-17425912
 ] 

Nishith Agarwal commented on HUDI-2275:
---

[~dave_hagman] To ensure that the checkpoints from deltastreamer commits are 
carried over when a concurrent datasource Spark job is running, one needs to 
enable the following configuration: 
[https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java#L371]

Can you please check if you have enabled this config?

 

 

> HoodieDeltaStreamerException when using OCC and a second concurrent writer
> --
>
> Key: HUDI-2275
> URL: https://issues.apache.org/jira/browse/HUDI-2275
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Affects Versions: 0.9.0
>Reporter: Dave Hagman
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
>  I am trying to utilize [Optimistic Concurrency 
> Control|https://hudi.apache.org/docs/concurrency_control] in order to allow 
> two writers to update a single table simultaneously. The two writers are:
>  * Writer A: Deltastreamer job consuming continuously from Kafka
>  * Writer B: A spark datasource-based writer that is consuming parquet files 
> out of S3
>  * Table Type: Copy on Write
>  
> After a few commits from each writer the deltastreamer will fail with the 
> following exception:
>  
> {code:java}
> org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to find 
> previous checkpoint. Please double check if this table was indeed built via 
> delta streamer. Last Commit :Option{val=[20210803165741__commit__COMPLETED]}, 
> Instants :[[20210803165741__commit__COMPLETED]], CommitMetadata={
>  "partitionToWriteStats" : {
>  ...{code}
>  
> What appears to be happening is a lack of commit isolation between the two 
> writers.
>  Writer B (spark datasource writer) will land commits which are eventually 
> picked up by Writer A (Delta Streamer). This is an issue because the Delta 
> Streamer needs checkpoint information which the spark datasource of course 
> does not include in its commits. My understanding was that OCC was built for 
> this very purpose (among others). 
> OCC config for Delta Streamer:
> {code:java}
> hoodie.write.concurrency.mode=optimistic_concurrency_control
>  hoodie.cleaner.policy.failed.writes=LAZY
>  
> hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
>  hoodie.write.lock.zookeeper.url=
>  hoodie.write.lock.zookeeper.port=2181
>  hoodie.write.lock.zookeeper.lock_key=writer_lock
>  hoodie.write.lock.zookeeper.base_path=/hudi-write-locks{code}
>  
> OCC config for spark datasource:
> {code:java}
> // Multi-writer concurrency
>  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
>  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
>  .option(
>  "hoodie.write.lock.provider",
>  
> org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider.class.getCanonicalName()
>  )
>  .option("hoodie.write.lock.zookeeper.url", jobArgs.zookeeperHost)
>  .option("hoodie.write.lock.zookeeper.port", jobArgs.zookeeperPort)
>  .option("hoodie.write.lock.zookeeper.lock_key", "writer_lock")
>  .option("hoodie.write.lock.zookeeper.base_path", "/hudi-write-locks"){code}
> h3. Steps to Reproduce:
>  * Start a deltastreamer job against some table Foo
>  * In parallel, start writing to the same table Foo using spark datasource 
> writer
>  * Note that after a few commits from each the deltastreamer is likely to 
> fail with the above exception when the datasource writer creates non-isolated 
> inflight commits
> NOTE: I have not tested this with two of the same datasources (ex. two 
> deltastreamer jobs)
> NOTE 2: Another detail that may be relevant is that the two writers are on 
> completely different spark clusters but I assumed this shouldn't be an issue 
> since we're locking using Zookeeper



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2531:
-
Fix Version/s: (was: 0.10.0)

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: hudi-umbrellas, sev:critical, user-support-issues
>
> To make use of Dataset APIs in writer paths instead of RDD.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2531) [UMBRELLA] Support Dataset APIs in writer paths

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2531:
-
Fix Version/s: 0.10.0

> [UMBRELLA] Support Dataset APIs in writer paths
> ---
>
> Key: HUDI-2531
> URL: https://issues.apache.org/jira/browse/HUDI-2531
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: hudi-umbrellas, sev:critical, user-support-issues
> Fix For: 0.10.0
>
>
> To make use of Dataset APIs in writer paths instead of RDD.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3762: [HUDI-1294] Adding inline read and seekable read for hfile log blocks in metadata table

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#issuecomment-938271221


   
   ## CI report:
   
   * 5fb7a2afa196fd75ada005d26a0fb9fce5472545 UNKNOWN
   * cb7e9cea8fa966437a892be1e0917443c034e21e Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2536)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-1854) Corrupt blocks in GCS log files

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-1854.
--
Resolution: Cannot Reproduce

> Corrupt blocks in GCS log files
> ---
>
> Key: HUDI-1854
> URL: https://issues.apache.org/jira/browse/HUDI-1854
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: Nishith Agarwal
>Priority: Major
>  Labels: sev:critical, sev:triage
> Attachments: Screen Shot 2021-04-28 at 10.42.50 AM.png
>
>
> Details on how to reproduce this can be found here -> 
> [https://github.com/apache/hudi/issues/2692]
>  
> We need a GCS, google data proc environment to reproduce this. 
>  
> [~vburenin] Would you be able to help try out hudi 0.7 and follow the steps 
> mentioned in this ticket to help reproduce this issue and find the root cause 
> ?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1834) Please delete old releases from mirroring system

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1834:
-
Priority: Blocker  (was: Major)

> Please delete old releases from mirroring system
> 
>
> Key: HUDI-1834
> URL: https://issues.apache.org/jira/browse/HUDI-1834
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sebb
>Priority: Blocker
>
> To reduce the load on the ASF mirrors, projects are required to archive old 
> releases [1]
> It's unfair to expect the 3rd party mirrors to carry old releases.
> Please can you archive all non-current releases?
> [Remember to update the download page before dropping the files from 
> dist.apache.org!]
> The following releases appear to be non-current:
> 0.5.0-incubating/
> 0.5.1-incubating/
> 0.5.2-incubating/
> 0.5.3/
> 0.6.0/
> 0.7.0/
> These can be removed as follows:
> svn rm -m"Archiving old release" 
> https://dist.apache.org/repos/dist/release/hudi/0.5.0-incubating
> To avoid breaking any links, first please ensure that any links are either 
> removed from the website or updated to point to the archive server.
> If necessary, please also update your release procedures to ensure old 
> releases are archived
> once a new release has been published.
> Thanks!
> [1] http://www.apache.org/dev/release.html#when-to-archive



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1834) Please delete old releases from mirroring system

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1834:
-
Fix Version/s: 0.10.0

> Please delete old releases from mirroring system
> 
>
> Key: HUDI-1834
> URL: https://issues.apache.org/jira/browse/HUDI-1834
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Sebb
>Priority: Blocker
> Fix For: 0.10.0
>
>
> To reduce the load on the ASF mirrors, projects are required to archive old 
> releases [1]
> It's unfair to expect the 3rd party mirrors to carry old releases.
> Please can you archive all non-current releases?
> [Remember to update the download page before dropping the files from 
> dist.apache.org!]
> The following releases appear to be non-current:
> 0.5.0-incubating/
> 0.5.1-incubating/
> 0.5.2-incubating/
> 0.5.3/
> 0.6.0/
> 0.7.0/
> These can be removed as follows:
> svn rm -m"Archiving old release" 
> https://dist.apache.org/repos/dist/release/hudi/0.5.0-incubating
> To avoid breaking any links, first please ensure that any links are either 
> removed from the website or updated to point to the archive server.
> If necessary, please also update your release procedures to ensure old 
> releases are archived
> once a new release has been published.
> Thanks!
> [1] http://www.apache.org/dev/release.html#when-to-archive



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2003) Auto Compute Compression ratio for input data to output parquet/orc file size

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2003:
-
Labels: user-support-issues  (was: )

> Auto Compute Compression ratio for input data to output parquet/orc file size
> -
>
> Key: HUDI-2003
> URL: https://issues.apache.org/jira/browse/HUDI-2003
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: Vinay
>Priority: Major
>  Labels: user-support-issues
>
> Context: 
> Submitted a Spark job to read 3-4B ORC records and wrote them out in Hudi 
> format. The following table captures all the runs I carried out with 
> different options:
>  
> ||CONFIG ||Number of Files Created||Size of each file||
> |PARQUET_FILE_MAX_BYTES=DEFAULT|30K|21MB|
> |PARQUET_FILE_MAX_BYTES=1GB|3700|178MB|
> |PARQUET_FILE_MAX_BYTES=1GB
> COPY_ON_WRITE_TABLE_INSERT_SPLIT_SIZE=110|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=1GB
> BULKINSERT_PARALLELISM=100|Same as before|Same as before|
> |PARQUET_FILE_MAX_BYTES=4GB|1600|675MB|
> |PARQUET_FILE_MAX_BYTES=6GB|669|1012MB|
> Based on these runs, it feels that the compression ratio is off. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on a change in pull request #3762: [HUDI-1294] Adding inline read and seekable read for hfile log blocks in metadata table

2021-10-07 Thread GitBox


nsivabalan commented on a change in pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#discussion_r724642330



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
##
@@ -132,18 +149,31 @@ protected AbstractHoodieLogRecordScanner(FileSystem fs, 
String basePath, List keys) {
+currentInstantLogBlocks = new ArrayDeque<>();

Review comment:
   One thing to be cautious about with the seek-based approach vs. a full scan: in a 
full scan, we do a one-time full scan and prepare a hashmap of records, so any 
number of lookups can be done without additional cost. 
   But with the seek-based approach, if a user calls 
   scan(list of 3 keys)
   scan(list of 5 keys)
   we might have to read/parse through the log blocks twice, since every time we 
are looking only for the keys of interest. So we should be cautious about using 
the seek-based read for the metadata table. 
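
   To illustrate the trade-off, a hypothetical fragment (the `scanAll`/`scanByKeys` names below are illustrative only, not the actual AbstractHoodieLogRecordReader API):

```java
// Full scan: log blocks are read and parsed once, and the resulting map serves
// any number of key lookups with no further I/O.
Map<String, HoodieRecord> allRecords = reader.scanAll();
HoodieRecord r1 = allRecords.get("key-1");
HoodieRecord r2 = allRecords.get("key-2");

// Seek-based: each call materializes only the requested keys, so back-to-back
// calls re-read and re-parse the same log blocks.
Map<String, HoodieRecord> batch1 = reader.scanByKeys(Arrays.asList("k1", "k2", "k3"));
Map<String, HoodieRecord> batch2 = reader.scanByKeys(Arrays.asList("k4", "k5"));
```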
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #3762: [HUDI-1294] Adding inline read and seekable read for hfile log blocks in metadata table

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#issuecomment-938271221


   
   ## CI report:
   
   * 5fb7a2afa196fd75ada005d26a0fb9fce5472545 UNKNOWN
   * cb7e9cea8fa966437a892be1e0917443c034e21e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2056) Spark speculation may produce dirty parquet files

2021-10-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425906#comment-17425906
 ] 

Vinoth Chandar commented on HUDI-2056:
--

This should be solved using the marker file mechanism, no?

> Spark speculation may produce dirty parquet files
> -
>
> Key: HUDI-2056
> URL: https://issues.apache.org/jira/browse/HUDI-2056
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Gary Li
>Assignee: Gary Li
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan commented on a change in pull request #3762: [HUDI-1294] Adding inline read and seekable read for hfile log blocks in metadata table

2021-10-07 Thread GitBox


nsivabalan commented on a change in pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#discussion_r724642330



##
File path: 
hudi-common/src/main/java/org/apache/hudi/common/table/log/AbstractHoodieLogRecordReader.java
##
@@ -132,18 +149,31 @@ protected AbstractHoodieLogRecordScanner(FileSystem fs, 
String basePath, List keys) {
+currentInstantLogBlocks = new ArrayDeque<>();

Review comment:
   One thing to be cautious about with the seek-based approach vs. a full scan: in a 
full scan, we do a one-time full scan and prepare a hashmap of records, so any 
number of lookups can be done without additional cost. 
   But with the seek-based approach, if a user calls 
   scan(list of 3 keys)
   scan(list of 5 keys)
   we might have to read/parse through the log blocks twice, since every time we 
are looking only for the keys of interest. So we should be cautious about using 
the seek-based read for the metadata table. 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-860:

Parent: HUDI-1628
Issue Type: Sub-task  (was: Bug)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.10.0
>
>
> As of now, in upsert path,
>  * hudi builds a workloadProfile to understand total inserts and updates(with 
> location info) 
>  * Following which, small files info are populated
>  * Then buckets are populated with above info. 
>  * These buckets are later used when getPartition(Object key) is invoked in 
> UpsertPartitioner.
> In step1: to build global workload profile, we had to do an action on entire 
> JavaRDDs in the driver and hudi does save the workload profile 
> as well. 
> For large write intensive batch jobs(COW types), caching this incurs 
> additional overhead. So, this effort is trying to see if we can avoid doing 
> this by some means. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1628) Improve data locality during ingestion

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1628:
-
Labels: hudi-umbrellas  (was: )

> Improve data locality during ingestion
> --
>
> Key: HUDI-1628
> URL: https://issues.apache.org/jira/browse/HUDI-1628
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Writer Core
>Reporter: satish
>Assignee: Thirumalai Raj R
>Priority: Major
>  Labels: hudi-umbrellas
> Fix For: 0.10.0
>
>
> Today the upsert partitioner does the file sizing/bin-packing etc for
> inserts and then sends some inserts over to existing file groups to
> maintain file size.
> We can abstract all of this into strategies and some kind of pipeline
> abstractions and have it also consider "affinity" to an existing file group
> based
> on say information stored in the metadata table?
> See http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/browser
>  for more details



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-860) Ability to do small file handling without need for caching

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-860:

Parent: (was: HUDI-538)
Issue Type: Bug  (was: Sub-task)

> Ability to do small file handling without need for caching
> --
>
> Key: HUDI-860
> URL: https://issues.apache.org/jira/browse/HUDI-860
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: Vinoth Chandar
>Assignee: Ethan Guo
>Priority: Blocker
> Fix For: 0.10.0
>
>
> As of now, in upsert path,
>  * hudi builds a workloadProfile to understand total inserts and updates(with 
> location info) 
>  * Following which, small files info are populated
>  * Then buckets are populated with above info. 
>  * These buckets are later used when getPartition(Object key) is invoked in 
> UpsertPartitioner.
> In step1: to build global workload profile, we had to do an action on entire 
> JavaRDDs in the driver and hudi does save the workload profile 
> as well. 
> For large write intensive batch jobs(COW types), caching this incurs 
> additional overhead. So, this effort is trying to see if we can avoid doing 
> this by some means. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2172) upserts failing due to _hoodie_record_key being null in the hudi table

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2172:
-
Labels: user-support-issues  (was: )

> upserts failing due to _hoodie_record_key being null in the hudi table
> --
>
> Key: HUDI-2172
> URL: https://issues.apache.org/jira/browse/HUDI-2172
> Project: Apache Hudi
>  Issue Type: Bug
>Affects Versions: 0.6.0
> Environment: AWS EMR emr-5.32.0 , spark version 2.4.7
>Reporter: Varun
>Priority: Major
>  Labels: user-support-issues
>
> Exception:
>  
> {code:java}
> java.lang.NullPointerException  
> at 
> org.apache.hudi.common.util.ParquetUtils.fetchRecordKeyPartitionPathFromParquet(ParquetUtils.java:146)
>   
> at 
> org.apache.hudi.io.HoodieKeyLocationFetchHandle.locations(HoodieKeyLocationFetchHandle.java:53)
>   
> at 
> org.apache.hudi.index.simple.HoodieSimpleIndex.lambda$fetchRecordLocations$c57f549b$1(HoodieSimpleIndex.java:179)
>   
> at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
>   
> at 
> org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$3$1.apply(JavaRDDLike.scala:143)
>   
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)  
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)  
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:188)
>   
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
>  
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) 
>  
> at org.apache.spark.scheduler.Task.run(Task.scala:123)  
> at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)  
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)  
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
>  
> We are using hudi as our storage engine for the output of our Spark jobs. We 
> use AWS EMR to run the jobs. Recently we started observing that some of the 
> upsert commits are leaving the table in an inconsistent state i.e. 
> _hoodie_record_key is observed to be null for a record which is updated 
> during that commit.
>  
> *How are we checking that __hoodie_record__key is null?*
> {code:java}
> val df = spark.read
> .format("org.apache.hudi")
> .load("s3://myLocation/my-table" + "/*/*/*/*")
> 
> df.filter($"_hoodie_record_key".isNull).show(false)
> // Output
> +--+--+--+
> |_hoodie_record_key|_hoodie_partition_path|primaryKey|
> +--+--+--+
> |null  |2021/07/01|xx|
> +--+--+--+
> {code}
>  
>  
> One thing to note here is that the record which has null for _hoodie_record_key 
> was already present in the Hudi table and was updated during the commit.
>  
>  What is even stranger for us is that there is only a single record in the Hudi 
> table with _hoodie_record_key as null; all other records are fine.
>  
>  We have verified that the column used as the record key 
> (RECORDKEY_FIELD_OPT_KEY) is present in the record and is NOT NULL.
>  
>  After rolling back the faulty commit which introduced that record, rerunning 
> the job works fine, i.e. there are no records with a null _hoodie_record_key.
> *HoodieWriter config*
> {code:java}
> val hudiOptions = Map[String, String](
>   RECORDKEY_FIELD_OPT_KEY -> "primaryKey",
>   PARTITIONPATH_FIELD_OPT_KEY -> "partitionKey",
>   PRECOMBINE_FIELD_OPT_KEY -> "updateTime",
>   KEYGENERATOR_CLASS_OPT_KEY -> classOf[ComplexKeyGenerator].getName,
>   CLEANER_COMMITS_RETAINED_PROP -> "5"
> )
>
> dataframe.write.format("org.apache.hudi")
>   .option(HoodieWriteConfig.TABLE_NAME, "myTable")
>   .options(hudiOptions)
>   .option(HoodieIndexConfig.INDEX_TYPE_PROP, "SIMPLE")
>   .mode(SaveMode.Append)
>   .save("s3://mylocation/")
> {code}
>  
> We are using a custom RecordPayload class which inherits from 
> *OverwriteWithLatestAvroPayload*
>  
>  
>  
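As an added diagnostic (not part of the original report), the metadata columns Hudi stamps on every row can be used to locate the exact commit and base file that hold the bad record; a minimal sketch, assuming the same table path as in the query above:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read.format("org.apache.hudi").load("s3://myLocation/my-table/*/*/*/*")

// Hudi's standard metadata columns reveal which commit and which base file contain the row.
df.filter(df("_hoodie_record_key").isNull)
  .select("_hoodie_commit_time", "_hoodie_partition_path", "_hoodie_file_name", "primaryKey")
  .show(false)
{code}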



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2091) Add Uber's grafana dashboard to OSS

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-2091.
--
Resolution: Fixed

> Add Uber's grafana dashboard to OSS
> ---
>
> Key: HUDI-2091
> URL: https://issues.apache.org/jira/browse/HUDI-2091
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: metrics
>Reporter: Nishith Agarwal
>Assignee: Prashant Wason
>Priority: Major
>
> cc [~vinoth]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot commented on pull request #3762: [HUDI-1294] Adding inline read and seekable read for hfile blocks in metadata table

2021-10-07 Thread GitBox


hudi-bot commented on pull request #3762:
URL: https://github.com/apache/hudi/pull/3762#issuecomment-938271221


   
   ## CI report:
   
   * 5fb7a2afa196fd75ada005d26a0fb9fce5472545 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425905#comment-17425905
 ] 

Vinoth Chandar commented on HUDI-2287:
--

[~rxu] could you triage and close?

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Hi, we have created a Hudi dataset which has two-level partitioning like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ are of type string.
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI, ~9000 tasks (roughly the total number of files in the ENTIRE 
> dataset s3://somes3bucket) are used for the computation. It seems Spark is 
> reading the entire dataset instead of doing *partition pruning*, and only then 
> filtering the dataset based on the where clause.
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the Spark UI, only 1361 files are scanned (i.e. 1361 tasks, vis-a-vis ~9000 
> files with Hudi) and the scan takes only 15 seconds.
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}
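One quick way to confirm where the time goes (an added suggestion, not something from the ticket itself) is to compare the physical plans of the two reads in spark-shell and check whether the FileScan node shows the partition filters being pushed down; a minimal sketch:

{code:scala}
// For the parquet read, the FileScan node should list PartitionFilters; if the Hudi read
// (0.7.0 in this report) does not, that would explain the full-dataset listing.
val hudiDf = spark.read.format("hudi").load("s3://somes3bucket")
  .where("partition1 = 'somevalue' and partition2 = 'somevalue'")
hudiDf.explain(true)

val parquetDf = spark.read.parquet("s3://somes3bucket")
  .where("partition1 = 'somevalue' and partition2 = 'somevalue'")
parquetDf.explain(true)
{code}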



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2287:
-
Priority: Blocker  (was: Major)

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Blocker
> Fix For: 0.10.0
>
>
> Hi, we have created  a Hudi dataset which has two level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ is of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI ~9000 tasks (which is approximately equivalent to the total 
> no of files in the ENTIRE dataset s3://somes3bucket) are used for 
> computation. Seems like spark is reading the entire dataset instead of 
> *partition pruning.*...and then filtering the dataset based on the where 
> clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the spark UI, only 1361 (ie 1361 tasks) files are scanned (vis-a-vis ~9000 
> files in Hudi) and takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2287:
-
Fix Version/s: 0.10.0

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Major
> Fix For: 0.10.0
>
>
> Hi, we have created  a Hudi dataset which has two level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ is of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI ~9000 tasks (which is approximately equivalent to the total 
> no of files in the ENTIRE dataset s3://somes3bucket) are used for 
> computation. Seems like spark is reading the entire dataset instead of 
> *partition pruning.*...and then filtering the dataset based on the where 
> clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the spark UI, only 1361 (ie 1361 tasks) files are scanned (vis-a-vis ~9000 
> files in Hudi) and takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] fuyun2024 commented on pull request #3722: HUDI-2491 hoodie.datasource.hive_sync.mode=hms mode is supported in s…

2021-10-07 Thread GitBox


fuyun2024 commented on pull request #3722:
URL: https://github.com/apache/hudi/pull/3722#issuecomment-938270825


   @nsivabalan  Thank you for your comments, but I don't know where the mistake 
is. I'm a novice 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-2287:


Assignee: Raymond Xu

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Assignee: Raymond Xu
>Priority: Major
>
> Hi, we have created  a Hudi dataset which has two level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ is of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI ~9000 tasks (which is approximately equivalent to the total 
> no of files in the ENTIRE dataset s3://somes3bucket) are used for 
> computation. Seems like spark is reading the entire dataset instead of 
> *partition pruning.*...and then filtering the dataset based on the where 
> clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the spark UI, only 1361 (ie 1361 tasks) files are scanned (vis-a-vis ~9000 
> files in Hudi) and takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2287) Partition pruning not working on Hudi dataset

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2287:
-
Parent: HUDI-1297
Issue Type: Sub-task  (was: Bug)

> Partition pruning not working on Hudi dataset
> -
>
> Key: HUDI-2287
> URL: https://issues.apache.org/jira/browse/HUDI-2287
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Performance
>Reporter: Rajkumar Gunasekaran
>Priority: Major
>
> Hi, we have created  a Hudi dataset which has two level partition like this
> {code:java}
> s3://somes3bucket/partition1=value/partition2=value
> {code}
> where _partition1_ and _partition2_ is of type string
> When running a simple count query using Hudi format in spark-shell, it takes 
> almost 3 minutes to complete
>   
> {code:scala}
> spark.read.format("hudi").load("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
>  
> res1: Long = 
> attempt 1: 3.2 minutes
>  attempt 2: 2.5 minutes
> {code}
> In the Spark UI ~9000 tasks (which is approximately equivalent to the total 
> no of files in the ENTIRE dataset s3://somes3bucket) are used for 
> computation. Seems like spark is reading the entire dataset instead of 
> *partition pruning.*...and then filtering the dataset based on the where 
> clause
> Whereas, if I use the parquet format to read the dataset, the query only 
> takes ~30 seconds (vis-a-vis 3 minutes with Hudi format)
> {code:scala}
> spark.read.parquet("s3://somes3bucket").
>  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
>  count()
> res2: Long = 
> ~ 30 seconds
> {code}
> In the spark UI, only 1361 (ie 1361 tasks) files are scanned (vis-a-vis ~9000 
> files in Hudi) and takes only 15 seconds
> Any idea why partition pruning is not working when using Hudi format? 
> Wondering if I am missing any configuration during the creation of the 
> dataset?
> PS: I ran this query in emr-6.3.0 which has Hudi version 0.7.0 and here is 
> the configuration I have used for creating the dataset
> {code:scala}
> df.writeStream
>  .trigger(Trigger.ProcessingTime(s"${param.triggerTimeInSeconds} seconds"))
>  .partitionBy("partition1","partition2")
>  .format("org.apache.hudi")
>  .option(HoodieWriteConfig.TABLE_NAME, param.hiveNHudiTableName.get)
>  //--
>  .option(HoodieStorageConfig.PARQUET_COMPRESSION_CODEC, "snappy")
>  .option(HoodieStorageConfig.PARQUET_FILE_MAX_BYTES, 
> param.expectedFileSizeInBytes)
>  .option(HoodieStorageConfig.PARQUET_BLOCK_SIZE_BYTES, 
> HoodieStorageConfig.DEFAULT_PARQUET_BLOCK_SIZE_BYTES)
>  //--
>  .option(HoodieCompactionConfig.PARQUET_SMALL_FILE_LIMIT_BYTES, 
> (param.expectedFileSizeInBytes / 100) * 80)
>  .option(HoodieCompactionConfig.INLINE_COMPACT_PROP, "true")
>  .option(HoodieCompactionConfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP, 
> param.runCompactionAfterNDeltaCommits.get)
>  //--
>  .option(DataSourceWriteOptions.TABLE_TYPE_OPT_KEY, 
> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL)
>  .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY, "record_key_id")
>  .option(DataSourceWriteOptions.KEYGENERATOR_CLASS_OPT_KEY, 
> classOf[CustomKeyGenerator].getName)
>  .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY, 
> "partition1:SIMPLE,partition2:SIMPLE")
>  .option(DataSourceWriteOptions.OPERATION_OPT_KEY, 
> DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
>  .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY, 
> hudiTablePrecombineKey)
>  .option(DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY, "true")
>  //.option(DataSourceWriteOptions.HIVE_USE_JDBC_OPT_KEY, "false")
>  .option(DataSourceWriteOptions.HIVE_STYLE_PARTITIONING_OPT_KEY, "true")
>  .option(DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY, 
> "partition1,partition2")
>  .option(DataSourceWriteOptions.HIVE_DATABASE_OPT_KEY, param.hiveDb.get)
>  .option(DataSourceWriteOptions.HIVE_TABLE_OPT_KEY, 
> param.hiveNHudiTableName.get)
>  .option(DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY, 
> classOf[MultiPartKeysValueExtractor].getName)
>  .outputMode(OutputMode.Append())
>  .queryName(s"${param.hiveDb}_${param.hiveNHudiTableName}_query"){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2363) COW : Listing leaf files and directories twice

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2363:
-
Labels: user-support-issues  (was: )

> COW : Listing leaf files and directories twice
> --
>
> Key: HUDI-2363
> URL: https://issues.apache.org/jira/browse/HUDI-2363
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: selvaraj
>Priority: Major
>  Labels: user-support-issues
> Attachments: Screen Shot 2021-08-25 at 5.36.52 PM.png
>
>
> Team,
> In our organization we are still using Hudi 0.5.0. We will upgrade to the 
> latest version in a couple of quarters.
> Problem scenario:
> Many use cases in our project use COW, and hive sync is disabled. One of 
> the Hudi tables contains two years' worth of data, partitioned by date. 
> For every write on this table, I notice that the "Listing leaf files and 
> directories" job is triggered twice; normally it is triggered only once. 
> The screenshot is attached.
>  
> Once the first listing of leaf files and directories is done, the logs for 
> another listing of leaf files and directories are rolled out.
> I spent some time investigating the source code but couldn't trace where 
> exactly it is being invoked.
>  
> How can it be avoided here? Unfortunately it is adding more latency to 
> our flow.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2363) COW : Listing leaf files and directories twice

2021-10-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425904#comment-17425904
 ] 

Vinoth Chandar commented on HUDI-2363:
--

I think these have long been fixed in recent releases, 0.7.0 IIRC. Are you able to 
try out a newer version?

> COW : Listing leaf files and directories twice
> --
>
> Key: HUDI-2363
> URL: https://issues.apache.org/jira/browse/HUDI-2363
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Writer Core
>Reporter: selvaraj
>Priority: Major
> Attachments: Screen Shot 2021-08-25 at 5.36.52 PM.png
>
>
> Team,
> In our organization we are still using Hudi 0.5.0. We will upgrade to the 
> latest version in a couple of quarters.
> Problem scenario:
> Many use cases in our project use COW, and hive sync is disabled. One of 
> the Hudi tables contains two years' worth of data, partitioned by date. 
> For every write on this table, I notice that the "Listing leaf files and 
> directories" job is triggered twice; normally it is triggered only once. 
> The screenshot is attached.
>  
> Once the first listing of leaf files and directories is done, the logs for 
> another listing of leaf files and directories are rolled out.
> I spent some time investigating the source code but couldn't trace where 
> exactly it is being invoked.
>  
> How can it be avoided here? Unfortunately it is adding more latency to 
> our flow.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] nsivabalan opened a new pull request #3762: [HUDI-1294] Adding inline read and seekable read for hfile blocks in metadata table

2021-10-07 Thread GitBox


nsivabalan opened a new pull request #3762:
URL: https://github.com/apache/hudi/pull/3762


   ## What is the purpose of the pull request
   
   - Added support to read HFile log blocks via the inline FileSystem in the 
metadata table.
   - Also added support to read a list of keys (batch get) rather than doing a 
full scan of the metadata table. 
   
   ## Brief change log
   - Added two new configs to HoodieMetadataConfig. 
`hoodie.metadata.enable.inline.reading.log.files` and 
`hoodie.metadata.enable.full.scan.log.files`. 
   - Since we are adding support for seek-based reads, renamed 
AbstractHoodieLogRecordScanner to AbstractHoodieLogRecordReader, and have 
correspondingly renamed HoodieMetadataMergedLogRecordReader. 
   - Added a new method to HoodieMetadataMergedLogRecordReader to support this 
purpose (i.e. reading records for a list of keys) without doing a full scan. 
   ```
   public List<Pair<String, Option<HoodieRecord<HoodieMetadataPayload>>>> getRecordsByKeys(List<String> keys) {
   
   }
   ```
   - Added a new method to HoodieDataBlock for the new requirement. The base class 
does not have any implementation; HoodieHFileDataBlock overrides it and gives a 
concrete implementation where records are read via the inline FileSystem with a 
seek-based approach. 
   ```
   public List<IndexedRecord> getRecords(List<String> keys) throws IOException {
   }
   ```
   - HoodieDataBlock also adheres to the enableInline config even when not doing a 
batch get. Basically three options are possible: a) full scan without inlining, 
b) full scan with inlining, c) batch get (with inlining); see the sketch after 
this list. 
   - Have fixed the metadata reader (HoodieBackedTableMetadata) to leverage the new 
APIs based on the config values. 
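   A rough, hypothetical sketch of that three-way choice (illustrative only, not the actual HoodieBackedTableMetadata code; the method and parameter names below are made up):
   ```scala
   // Dispatch implied by the two configs above: full scan vs. inline vs. batched key lookup.
   def readMode(fullScan: Boolean, inlineReading: Boolean): String =
     (fullScan, inlineReading) match {
       case (true, false) => "full scan of the log blocks without inline reads"
       case (true, true)  => "full scan, with HFile blocks opened via the inline FileSystem"
       case (false, _)    => "batch get: seek into the HFile blocks and read only the requested keys"
     }
   ```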
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - Added tests to TestHoodieRealtimeRecordReader to verify the change.
 - Found some gaps in testing HFileWriter and Reader especially around seek 
based read and have added TestHoodieHFileReaderWriter to test these cases.
 - Enabled inline and batch get reads to 1 test in 
TestHoodieBackedMetadata. 
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2509) OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2509:
-
Labels: sev:critical user-support-issues  (was: )

> OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with 
> some null value column
> ---
>
> Key: HUDI-2509
> URL: https://issues.apache.org/jira/browse/HUDI-2509
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Priority: Major
>  Labels: sev:critical, user-support-issues
>
> https://github.com/apache/hudi/issues/3735



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2416) Move FAQs to website

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-2416.
--
Resolution: Fixed

> Move FAQs to website
> 
>
> Key: HUDI-2416
> URL: https://issues.apache.org/jira/browse/HUDI-2416
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs, Usability
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
>
> We intend to move all the docs from cWiki to website. FAQs is a good starting 
> point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1194) Reorganize HoodieHiveClient and make it fully support Hive Metastore API

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1194:
-
Parent: HUDI-2519
Issue Type: Sub-task  (was: Improvement)

> Reorganize HoodieHiveClient and make it fully support Hive Metastore API
> 
>
> Key: HUDI-1194
> URL: https://issues.apache.org/jira/browse/HUDI-1194
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
>
> Currently there are three ways in HoodieHiveClient to perform Hive 
> functionalities. One is through Hive JDBC, one is through Hive Metastore API. 
> One is through Hive Driver.
>  
>  There’s a parameter called +{{hoodie.datasource.hive_sync.use_jdbc}}+ to 
> control whether to use Hive JDBC or not. However, this parameter does not 
> accurately describe the situation.
>  Basically, current logic is when set +*use_jdbc*+ to true, most of the 
> methods in HoodieHiveClient will use JDBC, and few methods in 
> HoodieHiveClient will use Hive Metastore API.
>  When set +*use_jdbc*+ to false, most of the methods in HoodieHiveClient will 
> use Hive Driver, and few methods in HoodieHiveClient will use Hive Metastore 
> API.
> Here is a table showing what will actually be used when setting use_jdbc 
> to true/false.
> |Method|use_jdbc=true|use_jdbc=false|
> |{{addPartitionsToTable}}|JDBC|Hive Driver|
> |{{updatePartitionsToTable}}|JDBC|Hive Driver|
> |{{scanTablePartitions}}|Metastore API|Metastore API|
> |{{updateTableDefinition}}|JDBC|Hive Driver|
> |{{createTable}}|JDBC|Hive Driver|
> |{{getTableSchema}}|JDBC|Metastore API|
> |{{doesTableExist}}|Metastore API|Metastore API|
> |getLastCommitTimeSynced|Metastore API|Metastore API|
> [~bschell] and I developed several Metastore API implementations for 
> {{createTable}}, {{addPartitionsToTable}}, {{updatePartitionsToTable}} and 
> {{updateTableDefinition}}, which will be helpful for several issues: e.g. 
> resolving the null-partition hive sync issue and supporting ALTER_TABLE 
> cascade with the AWS Glue catalog.
> But it seems hard to organize three implementations within the current 
> config. So we plan to separate HoodieHiveClient into three classes:
>  # {{HoodieHiveClient}}, which implements all the APIs through the Metastore API.
>  # {{HoodieHiveJDBCClient}}, which extends HoodieHiveClient and overrides 
> several of the APIs through Hive JDBC.
>  # {{HoodieHiveDriverClient}}, which extends HoodieHiveClient and overrides 
> several of the APIs through Hive Driver.
> And we introduce a new parameter 
> +*hoodie.datasource.hive_sync.hive_client_class*+ which lets you choose 
> which Hive client class to use.
>  
>  
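To make the proposed split concrete, here is a minimal, illustrative sketch of the class hierarchy described above; the class names come from the ticket, but the method signatures and bodies are assumptions for illustration only.

{code:scala}
// Illustrative only: the real sync client is Java and exposes many more operations.
class HoodieHiveClient {
  // Default path: every operation goes through the Hive Metastore API.
  def createTable(tableName: String): Unit = { /* metastore API call */ }
  def addPartitionsToTable(tableName: String, partitions: Seq[String]): Unit = { /* metastore API call */ }
}

// Overrides selected operations to run DDL through Hive JDBC.
class HoodieHiveJDBCClient extends HoodieHiveClient {
  override def createTable(tableName: String): Unit = { /* JDBC DDL statement */ }
}

// Overrides selected operations to run DDL through the Hive Driver.
class HoodieHiveDriverClient extends HoodieHiveClient {
  override def createTable(tableName: String): Unit = { /* Hive Driver DDL statement */ }
}

// The proposed hoodie.datasource.hive_sync.hive_client_class config would pick one of these.
{code}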



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2416) Move FAQs to website

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2416:
-
Status: Closed  (was: Patch Available)

> Move FAQs to website
> 
>
> Key: HUDI-2416
> URL: https://issues.apache.org/jira/browse/HUDI-2416
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs, Usability
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
>
> We intend to move all the docs from cWiki to website. FAQs is a good starting 
> point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-2416) Move FAQs to website

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-2416:
--

> Move FAQs to website
> 
>
> Key: HUDI-2416
> URL: https://issues.apache.org/jira/browse/HUDI-2416
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Docs, Usability
>Reporter: Pratyaksh Sharma
>Assignee: Pratyaksh Sharma
>Priority: Major
>  Labels: pull-request-available
>
> We intend to move all the docs from cWiki to website. FAQs is a good starting 
> point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-6) Support for Hive 3.x

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-6.
---
Resolution: Duplicate

> Support for Hive 3.x
> 
>
> Key: HUDI-6
> URL: https://issues.apache.org/jira/browse/HUDI-6
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> https://github.com/uber/hudi/issues/579



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-6) Support for Hive 3.x

2021-10-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425903#comment-17425903
 ] 

Vinoth Chandar commented on HUDI-6:
---

HUDI-2519 is tracking everything under this, planned out for 0.10.0. Hive 3.x 
already works but has some issues with syncing, IIUC. We can follow up on that JIRA.

> Support for Hive 3.x
> 
>
> Key: HUDI-6
> URL: https://issues.apache.org/jira/browse/HUDI-6
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> https://github.com/uber/hudi/issues/579



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1194) Reorganize HoodieHiveClient and make it fully support Hive Metastore API

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1194:
-
Status: Open  (was: New)

> Reorganize HoodieHiveClient and make it fully support Hive Metastore API
> 
>
> Key: HUDI-1194
> URL: https://issues.apache.org/jira/browse/HUDI-1194
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Wenning Ding
>Priority: Major
>  Labels: pull-request-available
>
> Currently there are three ways in HoodieHiveClient to perform Hive 
> functionalities. One is through Hive JDBC, one is through Hive Metastore API. 
> One is through Hive Driver.
>  
>  There’s a parameter called +{{hoodie.datasource.hive_sync.use_jdbc}}+ to 
> control whether to use Hive JDBC or not. However, this parameter does not 
> accurately describe the situation.
>  Basically, current logic is when set +*use_jdbc*+ to true, most of the 
> methods in HoodieHiveClient will use JDBC, and few methods in 
> HoodieHiveClient will use Hive Metastore API.
>  When set +*use_jdbc*+ to false, most of the methods in HoodieHiveClient will 
> use Hive Driver, and few methods in HoodieHiveClient will use Hive Metastore 
> API.
> Here is a table showing what will actually be used when setting use_jdbc 
> to true/false.
> |Method|use_jdbc=true|use_jdbc=false|
> |{{addPartitionsToTable}}|JDBC|Hive Driver|
> |{{updatePartitionsToTable}}|JDBC|Hive Driver|
> |{{scanTablePartitions}}|Metastore API|Metastore API|
> |{{updateTableDefinition}}|JDBC|Hive Driver|
> |{{createTable}}|JDBC|Hive Driver|
> |{{getTableSchema}}|JDBC|Metastore API|
> |{{doesTableExist}}|Metastore API|Metastore API|
> |getLastCommitTimeSynced|Metastore API|Metastore API|
> [~bschell] and I developed several Metastore API implementations for 
> {{createTable}}, {{addPartitionsToTable}}, {{updatePartitionsToTable}} and 
> {{updateTableDefinition}}, which will be helpful for several issues: e.g. 
> resolving the null-partition hive sync issue and supporting ALTER_TABLE 
> cascade with the AWS Glue catalog.
> But it seems hard to organize three implementations within the current 
> config. So we plan to separate HoodieHiveClient into three classes:
>  # {{HoodieHiveClient}}, which implements all the APIs through the Metastore API.
>  # {{HoodieHiveJDBCClient}}, which extends HoodieHiveClient and overrides 
> several of the APIs through Hive JDBC.
>  # {{HoodieHiveDriverClient}}, which extends HoodieHiveClient and overrides 
> several of the APIs through Hive Driver.
> And we introduce a new parameter 
> +*hoodie.datasource.hive_sync.hive_client_class*+ which lets you choose 
> which Hive client class to use.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-6) Support for Hive 3.x

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6:
--
Issue Type: New Feature  (was: Improvement)

> Support for Hive 3.x
> 
>
> Key: HUDI-6
> URL: https://issues.apache.org/jira/browse/HUDI-6
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> https://github.com/uber/hudi/issues/579



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-6) Support for Hive 3.x

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6:
--
Priority: Blocker  (was: Major)

> Support for Hive 3.x
> 
>
> Key: HUDI-6
> URL: https://issues.apache.org/jira/browse/HUDI-6
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Blocker
> Fix For: 0.10.0
>
>
> https://github.com/uber/hudi/issues/579



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-6) Support for Hive 3.x

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-6?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-6:
--
Fix Version/s: 0.10.0

> Support for Hive 3.x
> 
>
> Key: HUDI-6
> URL: https://issues.apache.org/jira/browse/HUDI-6
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Hive Integration
>Reporter: Vinoth Chandar
>Priority: Major
> Fix For: 0.10.0
>
>
> https://github.com/uber/hudi/issues/579



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] peanut-chenzhong commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread GitBox


peanut-chenzhong commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-938266751


   BTW, could you help add me to the HUDI JIRA group so that I can assign the task to 
myself? @nsivabalan @codope 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2339) Create Table If Not Exists Failed After Alter Table

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2339:
-
Parent: HUDI-1658
Issue Type: Sub-task  (was: Bug)

> Create Table If Not Exists Failed After Alter Table
> ---
>
> Key: HUDI-2339
> URL: https://issues.apache.org/jira/browse/HUDI-2339
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: pengzhiwei
>Priority: Major
>  Labels: pull-request-available
>
> An Exception throw out after alter table for create table if not exists.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot edited a comment on pull request #3761: [HUDI-2513] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread GitBox


hudi-bot edited a comment on pull request #3761:
URL: https://github.com/apache/hudi/pull/3761#issuecomment-938264265


   
   ## CI report:
   
   * cf20d97ab77a55797f1bcb4ee7dcb614681e8ae3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=2535)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (HUDI-2532) Set right default value for max delta commits for compaction in metadata table

2021-10-07 Thread sivabalan narayanan (Jira)
sivabalan narayanan created HUDI-2532:
-

 Summary: Set right default value for max delta commits for 
compaction in metadata table 
 Key: HUDI-2532
 URL: https://issues.apache.org/jira/browse/HUDI-2532
 Project: Apache Hudi
  Issue Type: Improvement
Reporter: sivabalan narayanan
 Fix For: 0.10.0


Set the right default value of 10 for max delta commits before compaction in the 
metadata table. As of now, it is set to 24, which is too high. 
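A minimal sketch of how a writer could override this today (the config key below is an assumption based on HoodieMetadataConfig; verify it against the Hudi release you run):

{code:scala}
// Write options to enable the metadata table and tighten its compaction trigger.
val metadataCompactionOpts = Map(
  "hoodie.metadata.enable" -> "true",
  // Assumed key: compact the metadata table every 10 delta commits instead of 24.
  "hoodie.metadata.compact.max.delta.commits" -> "10"
)
// usage: dataframe.write.format("hudi").options(metadataCompactionOpts)...save(basePath)
{code}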



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2389) Translate ByteDance Hudi Blog from Chinese to English

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-2389.
--
Fix Version/s: 0.9.0
   Resolution: Fixed

> Translate ByteDance Hudi Blog from Chinese to English
> -
>
> Key: HUDI-2389
> URL: https://issues.apache.org/jira/browse/HUDI-2389
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2347) Write a blog for marker mechanisms

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-2347.
--
Fix Version/s: 0.9.0
   Resolution: Fixed

> Write a blog for marker mechanisms
> --
>
> Key: HUDI-2347
> URL: https://issues.apache.org/jira/browse/HUDI-2347
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Ethan Guo
>Assignee: Ethan Guo
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.9.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1870) Move spark avro serialization class into hudi repo

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1870:
-
Fix Version/s: 0.10.0

> Move spark avro serialization class into hudi repo
> --
>
> Key: HUDI-1870
> URL: https://issues.apache.org/jira/browse/HUDI-1870
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Gary Li
>Assignee: XiaoyuGeng
>Priority: Blocker
> Fix For: 0.10.0
>
>
> In Spark 3.1.1, the avro serialization-related classes became private. We need to 
> move those classes into Hudi's repo.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1870) Move spark avro serialization class into hudi repo

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1870:
-
Priority: Blocker  (was: Major)

> Move spark avro serialization class into hudi repo
> --
>
> Key: HUDI-1870
> URL: https://issues.apache.org/jira/browse/HUDI-1870
> Project: Apache Hudi
>  Issue Type: Improvement
>Reporter: Gary Li
>Assignee: XiaoyuGeng
>Priority: Blocker
>
> In Spark 3.1.1, the avro serialization-related classes became private. We need to 
> move those classes into Hudi's repo.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] hudi-bot commented on pull request #3761: [HUDI-2513] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread GitBox


hudi-bot commented on pull request #3761:
URL: https://github.com/apache/hudi/pull/3761#issuecomment-938264265


   
   ## CI report:
   
   * cf20d97ab77a55797f1bcb4ee7dcb614681e8ae3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (HUDI-2509) OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread Adam Z CHEN (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425900#comment-17425900
 ] 

Adam Z CHEN commented on HUDI-2509:
---

[https://github.com/apache/hudi/pull/3761] PR raised

> OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with 
> some null value column
> ---
>
> Key: HUDI-2509
> URL: https://issues.apache.org/jira/browse/HUDI-2509
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Sagar Sumit
>Priority: Major
>
> https://github.com/apache/hudi/issues/3735



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] peanut-chenzhong commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread GitBox


peanut-chenzhong commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-938263412


   https://github.com/apache/hudi/pull/3761 PR raised


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] peanut-chenzhong opened a new pull request #3761: [HUDI-2513] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread GitBox


peanut-chenzhong opened a new pull request #3761:
URL: https://github.com/apache/hudi/pull/3761


   [HUDI-2513] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when 
upsert data with some null value column
   
   ## Committer checklist
   
- [Y ] Has a corresponding JIRA in PR title & commit

- [Y ] Commit message is descriptive of the change

- [Y ] CI is green
   
- [ N] Necessary doc changes done or have another open PR
  
- [N ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2424) Error checking bloom filter index (NPE)

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2424:
-
Labels: user-support-issues  (was: )

> Error checking bloom filter index (NPE)
> ---
>
> Key: HUDI-2424
> URL: https://issues.apache.org/jira/browse/HUDI-2424
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Jakub Kubala
>Priority: Major
>  Labels: user-support-issues
>
> Hi,
> Recently we have encountered an issue with Hudi where an NPE is thrown seemingly 
> out of nowhere while processing the content.
> As we have over 100k content items to process, I cannot easily narrow down 
> which piece is the troublesome one.
> We are using the configurations that come with AWS EMR v5.30 (Hudi 0.5.2) and 
> v5.33 (Hudi 0.7.0).
>  
> {code:java}
> 21/09/10 18:31:14 WARN TaskSetManager: Lost task 1.0 in stage 38.0 (TID 
> 23804, ip-10-208-160-140.eu-central-1.compute.internal, executor 2): 
> java.lang.RuntimeException: org.apache.hudi.exception.HoodieIndexException: 
> Error checking bloom filter index. at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) at 
> scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:154)
>  at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55) at 
> org.apache.spark.scheduler.Task.run(Task.scala:123) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: 
> org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter 
> index. at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:110)
>  at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:60)
>  at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:119)
>  ... 15 more Caused by: java.lang.NullPointerException at 
> org.apache.hudi.io.HoodieKeyLookupHandle.addKey(HoodieKeyLookupHandle.java:99)
>  at 
> org.apache.hudi.index.bloom.HoodieBloomIndexCheckFunction$LazyKeyCheckIterator.computeNext(HoodieBloomIndexCheckFunction.java:97)
>  ... 17 more21/09/10 18:31:14 INFO TaskSetManager: Starting task 1.1 in stage 
> 38.0 (TID 23805, ip-10-208-160-140.eu-central-1.compute.internal, executor 1, 
> partition 1, NODE_LOCAL, 7662 bytes) 21/09/10 18:31:18 INFO TaskSetManager: 
> Lost task 1.1 in stage 38.0 (TID 23805) on 
> ip-10-208-160-140.eu-central-1.compute.internal, executor 1: 
> java.lang.RuntimeException (org.apache.hudi.exception.HoodieIndexException: 
> Error checking bloom filter index. ) [duplicate 1] 21/09/10 18:31:18 INFO 
> TaskSetManager: Starting task 1.2 in stage 38.0 (TID 23806, 
> ip-10-208-160-140.eu-central-1.compute.internal, executor 1, partition 1, 
> NODE_LOCAL, 7662 bytes) 21/09/10 18:31:21 INFO TaskSetManager: Lost task 1.2 
> in stage 38.0 (TID 23806) on ip-10-208-160-140.eu-central-1.compute.internal, 
> executor 1: java.lang.RuntimeException 
> (org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter 
> index. ) [duplicate 2] 21/09/10 18:31:21 INFO TaskSetManager: Starting task 
> 1.3 in stage 38.0 (TID 23807, 
> ip-10-208-160-140.eu-central-1.compute.internal, executor 2, partition 1, 
> NODE_LOCAL, 7662 bytes) 21/09/10 18:31:25 WARN TaskSetManager: Lost task 1.3 
> in stage 38.0 (TID 23807, ip-10-208-160-140.eu-central-1.compute.internal, 
> executor 2): java.lang.RuntimeException: 
> org.apache.hudi.exception.HoodieIndexException: Error checking bloom filter 
> index. at 
> org.apache.hudi.client.utils.LazyIterableIterator.next(LazyIterableIterator.java:121)
>  at 
> scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) at 
> scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435) at 
> scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441) at 
> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:

[jira] [Updated] (HUDI-2427) SQL stmt broken with spark 3.1.x

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2427:
-
Labels: sev:high user-support-issues  (was: )

> SQL stmt broken with spark 3.1.x
> 
>
> Key: HUDI-2427
> URL: https://issues.apache.org/jira/browse/HUDI-2427
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Common Core
>Reporter: nicolas paris
>Priority: Major
>  Labels: sev:high, user-support-issues
>
> In my experiments, the new SQL stmt features of hudi 0.9 do not work with 
> spark 3.1.x, only with spark 3.0.x.
> step to reproduce:
>  {{spark-3.1.2-bin-hadoop2.7/bin/spark-shell \
> --packages 
> org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.1.2
>  \
> --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
> --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
> spark.sql("""
> create table h3 using hudi
> as
> select 1 as id, 'a1' as name, 10 as price
> """)
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.catalyst.expressions.Alias.<init>(Lorg/apache/spark/sql/catalyst/expressions/Expression;Ljava/lang/String;Lorg/apache/spark/sql/catalyst/expressions/ExprId;Lscala/collection/Seq;Lscala/Option;)V
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.$anonfun$alignOutputFields$6(InsertIntoHoodieTableCommand.scala:152)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.alignOutputFields(InsertIntoHoodieTableCommand.scala:148)
>   at 
> org.apache.spark.sql.hudi.command.InsertIntoHoodieTableCommand$.run(InsertIntoHoodieTableCommand.scala:95)
>   at 
> org.apache.spark.sql.hudi.command.CreateHoodieTableAsSelectCommand.run(CreateHoodieTableAsSelectCommand.scala:84)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
>   at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
>   at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
>   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:618)
>   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:613)
>   ... 60 elided}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2426) spark sql extensions breaks read.table from metastore

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2426:
-
Labels: user-support-issues  (was: )

> spark sql extensions breaks read.table from metastore
> -
>
> Key: HUDI-2426
> URL: https://issues.apache.org/jira/browse/HUDI-2426
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Hive Integration
>Reporter: nicolas paris
>Priority: Major
>  Labels: user-support-issues
>
> When adding the hudi spark sql support, this breaks the ability to read a 
> hudi table from the metastore with spark:
>  bash-4.2$ ./spark3.0.2/bin/spark-shell --packages 
> org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0,org.apache.spark:spark-avro_2.12:3.1.2
>  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf 
> 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
>  
> scala> spark.table("default.test_hudi_table").show
> java.lang.UnsupportedOperationException: Unsupported parseMultipartIdentifier 
> method
>  at 
> org.apache.spark.sql.parser.HoodieCommonSqlParser.parseMultipartIdentifier(HoodieCommonSqlParser.scala:65)
>  at org.apache.spark.sql.SparkSession.table(SparkSession.scala:581)
>  ... 47 elided
>  
> Removing the config makes the hive table readable again from spark.
> This affects at least spark 3.0.x and 3.1.x.
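
A possible workaround sketch, assuming the table can also be read by its base path (the path below is hypothetical): going through the DataSource API bypasses HoodieCommonSqlParser, so it may not hit the same error.

{code:scala}
// Sketch only: read the Hudi table by base path instead of spark.table.
// "/user/hive/warehouse/test_hudi_table" is a hypothetical location.
val df = spark.read.format("hudi").load("/user/hive/warehouse/test_hudi_table")
df.show()
{code}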



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2470) use commit_time in the WHERE STATEMENT to optimize the incremental query

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2470:
-
Parent: HUDI-1658
Issue Type: Sub-task  (was: Improvement)

> use commit_time in the WHERE STATEMENT to optimize the  incremental query
> -
>
> Key: HUDI-2470
> URL: https://issues.apache.org/jira/browse/HUDI-2470
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Incremental Pull, Performance
>Reporter: David_Liang
>Assignee: David_Liang
>Priority: Major
>
> In the DeltaStreamer module, the options QUERY_TYPE_OPT_KEY and 
> BEGIN_INSTANTTIME_OPT_KEY are used to tell the DeltaStreamer to query data 
> after a specific time.
> Such a method is not very convenient for users. If we can let users set 
> QUERY_TYPE_OPT_KEY and BEGIN_INSTANTTIME_OPT_KEY directly in SQL, that would 
> be not only more convenient for users but also a more elegant implementation.
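
For context, a minimal sketch of how such an incremental query is expressed today through the DataSource read options named above; the base path and instant time are made up, and the SQL predicate at the end only illustrates the direction this issue proposes, not an existing feature.

{code:scala}
import org.apache.hudi.DataSourceReadOptions._

// Today: the incremental query is driven by read options (hypothetical begin instant).
val incr = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL).
  option(BEGIN_INSTANTTIME_OPT_KEY, "20211001000000").
  load("/path/to/hudi_table")

// Proposed direction (sketch): let a plain SQL predicate on the commit time express the same thing.
incr.createOrReplaceTempView("hudi_incr")
spark.sql("select * from hudi_incr where _hoodie_commit_time > '20211001000000'").show(false)
{code}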



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-2470) use commit_time in the WHERE STATEMENT to optimize the incremental query

2021-10-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425888#comment-17425888
 ] 

Vinoth Chandar commented on HUDI-2470:
--

Assigned it to you! Looking forward to the PR.

> use commit_time in the WHERE STATEMENT to optimize the  incremental query
> -
>
> Key: HUDI-2470
> URL: https://issues.apache.org/jira/browse/HUDI-2470
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Incremental Pull, Performance
>Reporter: David_Liang
>Assignee: David_Liang
>Priority: Major
>
> In the DeltaStreamer module, the options QUERY_TYPE_OPT_KEY and 
> BEGIN_INSTANTTIME_OPT_KEY are used to tell the DeltaStreamer to query data 
> after a specific time.
> Such a method is not very convenient for users. If we can let users set 
> QUERY_TYPE_OPT_KEY and BEGIN_INSTANTTIME_OPT_KEY directly in SQL, that would 
> be not only more convenient for users but also a more elegant implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-2470) use commit_time in the WHERE STATEMENT to optimize the incremental query

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reassigned HUDI-2470:


Assignee: David_Liang

> use commit_time in the WHERE STATEMENT to optimize the  incremental query
> -
>
> Key: HUDI-2470
> URL: https://issues.apache.org/jira/browse/HUDI-2470
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Incremental Pull, Performance
>Reporter: David_Liang
>Assignee: David_Liang
>Priority: Major
>
> In the DeltaStreamer module, the options QUERY_TYPE_OPT_KEY and 
> BEGIN_INSTANTTIME_OPT_KEY are used to tell the DeltaStreamer to query data 
> after a specific time.
> Such a method is not very convenient for users. If we can let users set 
> QUERY_TYPE_OPT_KEY and BEGIN_INSTANTTIME_OPT_KEY directly in SQL, that would 
> be not only more convenient for users but also a more elegant implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1951) Hash Index for HUDI

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1951:
-
Parent: HUDI-1822
Issue Type: Sub-task  (was: New Feature)

> Hash Index for HUDI
> ---
>
> Key: HUDI-1951
> URL: https://issues.apache.org/jira/browse/HUDI-1951
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: XiaoyuGeng
>Assignee: XiaoyuGeng
>Priority: Major
>  Labels: pull-request-available
>
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2489) Tuning HoodieROTablePathFilter by caching, aiming to reduce unnecessary list/get requests

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2489:
-
Priority: Blocker  (was: Major)

> Tuning HoodieROTablePathFilter by caching, aiming to reduce unnecessary 
> list/get requests
> -
>
> Key: HUDI-2489
> URL: https://issues.apache.org/jira/browse/HUDI-2489
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Yue Zhang
>Priority: Blocker
>  Labels: pull-request-available
>
> Cache HoodieTableFileSystemView at the baseDir level, the same as is already 
> done for HoodieTableMetaClient.
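
A rough sketch of the caching idea as described (an interpretation, not the actual patch): keep one cached view per table base path, in the same way the meta client is reused.

{code:scala}
import scala.collection.concurrent.TrieMap

// Hypothetical sketch: memoize one value per base path; V stands in for
// HoodieTableFileSystemView, whose real construction is not shown here.
class PerBasePathCache[V](build: String => V) {
  private val byBasePath = TrieMap.empty[String, V]
  def get(basePath: String): V = byBasePath.getOrElseUpdate(basePath, build(basePath))
}
{code}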



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2409) Using HBase shaded jars in Hudi presto bundle

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2409:
-
Priority: Blocker  (was: Major)

> Using HBase shaded jars in Hudi presto bundle 
> --
>
> Key: HUDI-2409
> URL: https://issues.apache.org/jira/browse/HUDI-2409
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Yue Zhang
>Priority: Blocker
>  Labels: pull-request-available
>
> Exclude the {{hbase-server}} and {{hbase-client}} dependencies in 
> Hudi-presto-bundle.
> Add {{hbase-shaded-client}} and {{hbase-shaded-server}} in Hudi-presto-bundle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2409) Using HBase shaded jars in Hudi presto bundle

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2409:
-
Fix Version/s: 0.10.0

> Using HBase shaded jars in Hudi presto bundle 
> --
>
> Key: HUDI-2409
> URL: https://issues.apache.org/jira/browse/HUDI-2409
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Yue Zhang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Exclude the {{hbase-server}} and {{hbase-client}} dependencies in 
> Hudi-presto-bundle.
> Add {{hbase-shaded-client}} and {{hbase-shaded-server}} in Hudi-presto-bundle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2489) Tuning HoodieROTablePathFilter by caching, aiming to reduce unnecessary list/get requests

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2489:
-
Fix Version/s: 0.10.0

> Tuning HoodieROTablePathFilter by caching, aiming to reduce unnecessary 
> list/get requests
> -
>
> Key: HUDI-2489
> URL: https://issues.apache.org/jira/browse/HUDI-2489
> Project: Apache Hudi
>  Issue Type: Task
>Reporter: Yue Zhang
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Cache HoodieTableFileSystemView at the baseDir level, the same as is already 
> done for HoodieTableMetaClient.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] peanut-chenzhong commented on issue #3735: [SUPPORT] OverwriteNonDefaultsWithLatestAvroPayload doesn`t work when upsert data with some null value column

2021-10-07 Thread GitBox


peanut-chenzhong commented on issue #3735:
URL: https://github.com/apache/hudi/issues/3735#issuecomment-938254613


   @codope sure, will raise PR soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Updated] (HUDI-2199) DynamoDB based external index implementation

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2199:
-
Parent: HUDI-1822
Issue Type: Sub-task  (was: New Feature)

> DynamoDB based external index implementation
> 
>
> Key: HUDI-2199
> URL: https://issues.apache.org/jira/browse/HUDI-2199
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Index
>Reporter: Vinoth Chandar
>Assignee: Biswajit mohapatra
>Priority: Major
>
> We have an HBaseIndex that provides users with the ability to store fileID <=> 
> recordKey mappings in an external KV store, for fast lookups during upsert 
> operations. We can potentially create a similar one for DynamoDB. 
> We use just a single column family in HBase, so we should be able to largely 
> re-use the code/key-value schema across them.
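
To make the mapping concrete, a hypothetical sketch of the entry such an index stores per record key; the field names are illustrative, not Hudi's actual schema.

{code:scala}
// Hypothetical layout mirroring the single-column-family HBase index:
// record key -> (commit instant, partition path, file group id).
case class IndexEntry(
  recordKey: String,      // lookup key (HBase row key / DynamoDB partition key)
  commitTs: String,       // instant that last wrote the record
  partitionPath: String,  // Hudi partition containing the record
  fileId: String          // file group the record belongs to
)
{code}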



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-686) Implement BloomIndexV2 that does not depend on memory caching

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-686:

Parent: HUDI-1822
Issue Type: Sub-task  (was: Improvement)

> Implement BloomIndexV2 that does not depend on memory caching
> -
>
> Key: HUDI-686
> URL: https://issues.apache.org/jira/browse/HUDI-686
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Index, Performance
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Rajesh Mahindra
>Priority: Major
>  Labels: pull-request-available
> Attachments: Screen Shot 2020-03-19 at 10.15.10 AM.png, Screen Shot 
> 2020-03-19 at 10.15.10 AM.png, Screen Shot 2020-03-19 at 10.15.10 AM.png, 
> image-2020-03-19-10-17-43-048.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> The main goal here is to provide a much simpler index, without advanced 
> optimizations like auto-tuned parallelism/skew handling, but with a better 
> out-of-the-box experience for small workloads. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2510) QuickStart html page is showing 404

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2510:
-
Status: Closed  (was: Patch Available)

> QuickStart html page is showing 404
> ---
>
> Key: HUDI-2510
> URL: https://issues.apache.org/jira/browse/HUDI-2510
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available
>
> Some external entities such as GCP are linking to 
> [https://hudi.apache.org/quickstart.html] for quick start. 
>  
> [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc]
>  
> Can we create an alias to the actual quick start link?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (HUDI-2510) QuickStart html page is showing 404

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar resolved HUDI-2510.
--
Resolution: Fixed

> QuickStart html page is showing 404
> ---
>
> Key: HUDI-2510
> URL: https://issues.apache.org/jira/browse/HUDI-2510
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available
>
> Some external entities such as GCP are linking to 
> [https://hudi.apache.org/quickstart.html] for quick start. 
>  
> [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc]
>  
> Can we create an alias to the actual quick start link?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (HUDI-2510) QuickStart html page is showing 404

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar reopened HUDI-2510:
--

> QuickStart html page is showing 404
> ---
>
> Key: HUDI-2510
> URL: https://issues.apache.org/jira/browse/HUDI-2510
> Project: Apache Hudi
>  Issue Type: Bug
>Reporter: Rajesh Mahindra
>Assignee: Vinoth Govindarajan
>Priority: Major
>  Labels: pull-request-available
>
> Some external entities such as GCP are linking to 
> [https://hudi.apache.org/quickstart.html] for quick start. 
>  
> [https://cloud.google.com/blog/products/data-analytics/getting-started-with-new-table-formats-on-dataproc]
>  
> Can we create an alias to the actual quick start link?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1297) [Umbrella] Spark Datasource Support

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1297:
-
Summary: [Umbrella] Spark Datasource Support  (was: [Umbrella] Revamp Spark 
Datasource support using Spark 3 APIs)

> [Umbrella] Spark Datasource Support
> ---
>
> Key: HUDI-1297
> URL: https://issues.apache.org/jira/browse/HUDI-1297
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: Vinoth Chandar
>Assignee: Vinoth Chandar
>Priority: Critical
>  Labels: hudi-umbrellas
>
> Yet to be fully scoped out.
> But at a high level, we want to:
>  * First class support for streaming reads/writes via structured streaming
>  * Row based reader/writers all the way
>  * Support for File/Partition pruning using Hudi metadata tables



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2526) Make spark.sql.parquet.writeLegacyFormat configurable

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2526:
-
Parent: HUDI-1297
Issue Type: Sub-task  (was: Improvement)

> Make spark.sql.parquet.writeLegacyFormat configurable
> -
>
> Key: HUDI-2526
> URL: https://issues.apache.org/jira/browse/HUDI-2526
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: Sagar Sumit
>Assignee: sivabalan narayanan
>Priority: Major
>
> From the community: 
> "I am observing that HUDI bulk insert in 0.9.0 version is not honoring the
> spark.sql.parquet.writeLegacyFormat=true
> config. Can you suggest a way to set this config.
> Reason to use this config:
> The current bulk insert uses the Spark DataFrame writer and doesn't do Avro 
> conversion. The decimal columns in my DF are written as INT32 type in parquet.
> The upsert functionality, which uses Avro conversion, generates a fixed-length 
> byte array for decimal types, which fails with a datatype mismatch." 
> The main reason is that the [config is 
> hardcoded|https://github.com/apache/hudi/blob/46808dcb1fe22491326a9e831dd4dde4c70796fb/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/io/storage/row/HoodieRowParquetWriteSupport.java#L48].
>  We can make it configurable.
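
A sketch of the reported behaviour (table name, fields, and path below are hypothetical): the session-level Spark setting is expected to control how decimals are encoded, but the bulk_insert row writer hardcodes the value in HoodieRowParquetWriteSupport, so it is effectively ignored.

{code:scala}
// Sketch only: what a user sets today and expects bulk_insert to honor.
spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

val df = spark.sql("select 1 as id, cast(1.23 as decimal(10, 2)) as price, '2021-10-07' as dt")

df.write.format("hudi").
  option("hoodie.table.name", "decimal_table").                    // hypothetical table name
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.datasource.write.operation", "bulk_insert").
  mode("overwrite").
  save("/tmp/decimal_table")                                       // hypothetical path
{code}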



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2071) Support Reading Bootstrap MOR RT Table In Spark DataSource Table

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2071:
-
Parent: HUDI-1297
Issue Type: Sub-task  (was: Improvement)

> Support Reading Bootstrap MOR RT Table  In Spark DataSource Table
> -
>
> Key: HUDI-2071
> URL: https://issues.apache.org/jira/browse/HUDI-2071
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Reporter: pengzhiwei
>Priority: Critical
>
> Currently spark datasource table use the HoodieBootstrapRelation to read 
> bootstrap table.
> However, for bootstrap mor rt table, we have not support yet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2505) [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2505:
-
Labels: hudi-umbrellas sev:critical  (was: sev:critical)

> [UMBRELLA] Spark DataSource APIs and Spark SQL discrepancies
> 
>
> Key: HUDI-2505
> URL: https://issues.apache.org/jira/browse/HUDI-2505
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: Raymond Xu
>Priority: Critical
>  Labels: hudi-umbrellas, sev:critical
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2362) Hudi external configuration file support

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2362:
-
Priority: Blocker  (was: Major)

> Hudi external configuration file support
> 
>
> Key: HUDI-2362
> URL: https://issues.apache.org/jira/browse/HUDI-2362
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
>
> Many big data applications like Hadoop and Hive have an XML configuration file 
> that gives users a central place to set all the parameters.
> Also, to support Spark SQL, it might be easier for Hudi to have a 
> configuration file, which could avoid setting Hudi parameters inside the Hive 
> CLI or Spark SQL CLI.
> Here is an example:
> {code:java}
> # Enable optimistic concurrency control by default, to disable it, remove the 
> following two configs
> hoodie.write.concurrency.mode   optimistic_concurrency_control
> hoodie.cleaner.policy.failed.writes LAZY
> hoodie.write.lock.provider  
> org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
> hoodie.write.lock.zookeeper.url ip-192-168-1-239.ec2.internal
> hoodie.write.lock.zookeeper.port    2181
> hoodie.write.lock.zookeeper.base_path   hudi_occ_lock
> hoodie.index.type   BLOOM
> # Only applies if index type is HBASE
> hoodie.index.hbase.zkquorum ip-192-168-1-239.ec2.internal
> hoodie.index.hbase.zkport   2181
> # Only applies if Hive sync is enabled
> hoodie.datasource.hive_sync.jdbcurl 
> jdbc:hive2://ip-192-168-1-239.ec2.internal:1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2362) Hudi external configuration file support

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2362:
-
Fix Version/s: 0.10.0

> Hudi external configuration file support
> 
>
> Key: HUDI-2362
> URL: https://issues.apache.org/jira/browse/HUDI-2362
> Project: Apache Hudi
>  Issue Type: New Feature
>Reporter: Wenning Ding
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.10.0
>
>
> Many big data applications like Hadoop and Hive have an XML configuration file 
> that gives users a central place to set all the parameters.
> Also, to support Spark SQL, it might be easier for Hudi to have a 
> configuration file, which could avoid setting Hudi parameters inside the Hive 
> CLI or Spark SQL CLI.
> Here is an example:
> {code:java}
> # Enable optimistic concurrency control by default, to disable it, remove the 
> following two configs
> hoodie.write.concurrency.mode   optimistic_concurrency_control
> hoodie.cleaner.policy.failed.writes LAZY
> hoodie.write.lock.provider  
> org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
> hoodie.write.lock.zookeeper.url ip-192-168-1-239.ec2.internal
> hoodie.write.lock.zookeeper.port    2181
> hoodie.write.lock.zookeeper.base_path   hudi_occ_lock
> hoodie.index.type   BLOOM
> # Only applies if index type is HBASE
> hoodie.index.hbase.zkquorum ip-192-168-1-239.ec2.internal
> hoodie.index.hbase.zkport   2181
> # Only applies if Hive sync is enabled
> hoodie.datasource.hive_sync.jdbcurl 
> jdbc:hive2://ip-192-168-1-239.ec2.internal:1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1608) MOR fetches all records for read optimized query w/ spark sql

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-1608:
-
Labels: pull-request-available sev:high user-support-issues  (was: 
pull-request-available sev:high)

> MOR fetches all records for read optimized query w/ spark sql
> -
>
> Key: HUDI-1608
> URL: https://issues.apache.org/jira/browse/HUDI-1608
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.7.0
>Reporter: sivabalan narayanan
>Priority: Critical
>  Labels: pull-request-available, sev:high, user-support-issues
>
> Script to reproduce in local spark:
>  
> [https://gist.github.com/nsivabalan/7250b794788516f1aec35650c2632364]
>  
> ```
> scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, 
> _hoodie_partition_path, id, __op from hudi_trips_snapshot order by 
> _hoodie_record_key").show(false)
> +-------------------+------------------+----------------------+---+----+
> |_hoodie_commit_time|_hoodie_record_key|_hoodie_partition_path|id |__op|
> +-------------------+------------------+----------------------+---+----+
> |20210210070347     |1                 |1970-01-01            |1  |null|
> |20210210070347     |2                 |1970-01-01            |2  |null|
> |20210210070347     |3                 |2020-01-04            |3  |D   |
> |20210210070347     |4                 |1998-04-13            |4  |I   |
> |20210210070347     |5                 |2020-01-01            |5  |I   |
> |*20210210070445*   |*6*               |*1998-04-13*          |*6*|*I* |
> +-------------------+------------------+----------------------+---+----+
> ```
> After an upsert, the read optimized query returns records from both C1 and C2. 
> Also, I don't find any log files in the partitions; all of them are parquet 
> files. 
>  
> ls /tmp/hudi_trips_cow/1998-04-13/
> 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-23-12025_20210210065058.parquet
> 0d1e6a84-d036-42e9-806e-a3075b6bc677-0_1-61-25595_20210210065127.parquet
> ls /tmp/hudi_trips_cow/1970-01-01/
> 7b836833-a656-485d-967a-871bdc653dc3-0_2-61-25596_20210210065127.parquet
> 7b836833-a656-485d-967a-871bdc653dc3-0_3-23-12027_20210210065058.parquet
>  
> Source of the issue: [https://github.com/apache/hudi/issues/2255]
>  
>  
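
For reference, a sketch of issuing the read-optimized query in question explicitly through the DataSource API; the base path comes from the report above and the glob may need adjusting to the partition layout. This only restates the query being discussed, not a fix.

{code:scala}
import org.apache.hudi.DataSourceReadOptions._

// Read-optimized view: per the report it should only expose compacted base files,
// yet records from both commits come back.
val roDf = spark.read.format("hudi").
  option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL).
  load("/tmp/hudi_trips_cow/*/*")

roDf.createOrReplaceTempView("hudi_trips_ro")
spark.sql("select _hoodie_commit_time, _hoodie_record_key, id, __op from hudi_trips_ro").show(false)
{code}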



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2390) KeyGenerator discrepancy between DataFrame writer and SQL

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2390:
-
Labels: sev:critical user-support-issues  (was: sev:critical)

> KeyGenerator discrepancy between DataFrame writer and SQL
> -
>
> Key: HUDI-2390
> URL: https://issues.apache.org/jira/browse/HUDI-2390
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Spark Integration
>Affects Versions: 0.9.0
>Reporter: renhao
>Assignee: Yann Byron
>Priority: Critical
>  Labels: sev:critical, user-support-issues
>
> Test Case:
> {code:java}
>  import org.apache.hudi.QuickstartUtils._
>  import scala.collection.JavaConversions._
>  import org.apache.spark.sql.SaveMode._
>  import org.apache.hudi.DataSourceReadOptions._
>  import org.apache.hudi.DataSourceWriteOptions._
>  import org.apache.hudi.config.HoodieWriteConfig._{code}
> 1. Prepare the data
>  
> {code:java}
> spark.sql("create table test1(a int,b string,c string) using hudi partitioned 
> by(b) options(primaryKey='a')")
> spark.sql("insert into table test1 select 1,2,3")
> {code}
>  
> 2. Create hudi table test2
> {code:java}
> spark.sql("create table test2(a int,b string,c string) using hudi partitioned 
> by(b) options(primaryKey='a')"){code}
> 3. Write data into test2 via the datasource
>  
> {code:java}
> val base_data=spark.sql("select * from testdb.test1")
> base_data.write.format("hudi").
> option(TABLE_TYPE_OPT_KEY, COW_TABLE_TYPE_OPT_VAL).      
> option(RECORDKEY_FIELD_OPT_KEY, "a").      
> option(PARTITIONPATH_FIELD_OPT_KEY, "b").      
> option(KEYGENERATOR_CLASS_OPT_KEY, 
> "org.apache.hudi.keygen.SimpleKeyGenerator"). 
> option(OPERATION_OPT_KEY, "bulk_insert").      
> option(HIVE_SYNC_ENABLED_OPT_KEY, "true").      
> option(HIVE_PARTITION_FIELDS_OPT_KEY, "b").   
> option(HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY,"org.apache.hudi.hive.MultiPartKeysValueExtractor").
>       
> option(HIVE_DATABASE_OPT_KEY, "testdb").      
> option(HIVE_TABLE_OPT_KEY, "test2").      
> option(HIVE_USE_JDBC_OPT_KEY, "true").      
> option("hoodie.bulkinsert.shuffle.parallelism", 4).
> option("hoodie.datasource.write.hive_style_partitioning", "true").      
> option(TABLE_NAME, 
> "test2").mode(Append).save(s"/user/hive/warehouse/testdb.db/test2")
> {code}
>  
> At this point, the query result is as follows:
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
> 4. Delete one record
> {code:java}
> spark.sql("delete from testdb.test2 where a=1"){code}
> 5. Run the query; the record with a=1 has not been deleted
> {code:java}
> spark.sql("select a,b,c from testdb.test2").show{code}
> {code:java}
> +---+---+---+
> | a| b| c|
> +---+---+---+
> | 1| 3| 2|
> +---+---+---+{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-2275) HoodieDeltaStreamerException when using OCC and a second concurrent writer

2021-10-07 Thread Vinoth Chandar (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-2275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinoth Chandar updated HUDI-2275:
-
Parent: HUDI-1456
Issue Type: Sub-task  (was: Bug)

> HoodieDeltaStreamerException when using OCC and a second concurrent writer
> --
>
> Key: HUDI-2275
> URL: https://issues.apache.org/jira/browse/HUDI-2275
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: DeltaStreamer, Spark Integration, Writer Core
>Affects Versions: 0.9.0
>Reporter: Dave Hagman
>Assignee: Sagar Sumit
>Priority: Critical
>  Labels: sev:critical
> Fix For: 0.10.0
>
>
>  I am trying to utilize [Optimistic Concurrency 
> Control|https://hudi.apache.org/docs/concurrency_control] in order to allow 
> two writers to update a single table simultaneously. The two writers are:
>  * Writer A: Deltastreamer job consuming continuously from Kafka
>  * Writer B: A spark datasource-based writer that is consuming parquet files 
> out of S3
>  * Table Type: Copy on Write
>  
> After a few commits from each writer the deltastreamer will fail with the 
> following exception:
>  
> {code:java}
> org.apache.hudi.exception.HoodieDeltaStreamerException: Unable to find 
> previous checkpoint. Please double check if this table was indeed built via 
> delta streamer. Last Commit :Option{val=[20210803165741__commit__COMPLETED]}, 
> Instants :[[20210803165741__commit__COMPLETED]], CommitMetadata={
>  "partitionToWriteStats" : {
>  ...{code}
>  
> What appears to be happening is a lack of commit isolation between the two 
> writers.
>  Writer B (spark datasource writer) will land commits which are eventually 
> picked up by Writer A (Delta Streamer). This is an issue because the Delta 
> Streamer needs checkpoint information which the spark datasource of course 
> does not include in its commits. My understanding was that OCC was built for 
> this very purpose (among others). 
> OCC config for Delta Streamer:
> {code:java}
> hoodie.write.concurrency.mode=optimistic_concurrency_control
>  hoodie.cleaner.policy.failed.writes=LAZY
>  
> hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
>  hoodie.write.lock.zookeeper.url=
>  hoodie.write.lock.zookeeper.port=2181
>  hoodie.write.lock.zookeeper.lock_key=writer_lock
>  hoodie.write.lock.zookeeper.base_path=/hudi-write-locks{code}
>  
> OCC config for spark datasource:
> {code:java}
> // Multi-writer concurrency
>  .option("hoodie.cleaner.policy.failed.writes", "LAZY")
>  .option("hoodie.write.concurrency.mode", "optimistic_concurrency_control")
>  .option(
>  "hoodie.write.lock.provider",
>  
> org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider.class.getCanonicalName()
>  )
>  .option("hoodie.write.lock.zookeeper.url", jobArgs.zookeeperHost)
>  .option("hoodie.write.lock.zookeeper.port", jobArgs.zookeeperPort)
>  .option("hoodie.write.lock.zookeeper.lock_key", "writer_lock")
>  .option("hoodie.write.lock.zookeeper.base_path", "/hudi-write-locks"){code}
> h3. Steps to Reproduce:
>  * Start a deltastreamer job against some table Foo
>  * In parallel, start writing to the same table Foo using spark datasource 
> writer
>  * Note that after a few commits from each the deltastreamer is likely to 
> fail with the above exception when the datasource writer creates non-isolated 
> inflight commits
> NOTE: I have not tested this with two of the same datasources (ex. two 
> deltastreamer jobs)
> NOTE 2: Another detail that may be relevant is that the two writers are on 
> completely different spark clusters but I assumed this shouldn't be an issue 
> since we're locking using Zookeeper



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

