[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"
dm-tran edited a comment on issue #2020: URL: https://github.com/apache/hudi/issues/2020#issuecomment-682314989

The file that isn't found is `s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4957-299294_20200827155539.parquet`.

The available files in S3 that start with "9dee1248-c972-4ed3-80f5-15545ac4c534-0_2" are:

```
2020-08-27 10:26   33525767   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-3850-231917_20200827102526.parquet
2020-08-27 10:33   33526574   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-3891-234401_20200827103318.parquet
2020-08-27 16:17   33545224   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet
2020-08-27 11:13   33530132   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4096-246791_20200827111254.parquet
2020-08-27 11:22   33530880   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4137-249295_20200827112139.parquet
2020-08-27 12:00   3353       s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4301-259277_20200827115949.parquet
2020-08-27 12:20   33534377   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4383-264271_20200827121947.parquet
2020-08-27 12:42   33535631   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4465-269277_20200827124204.parquet
2020-08-27 12:54   33536084   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4506-271786_20200827125338.parquet
2020-08-27 13:07   33536635   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4547-274289_20200827130640.parquet
2020-08-27 13:20   33537444   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4588-276783_20200827131919.parquet
2020-08-27 13:32   33538151   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4629-279284_20200827133143.parquet
2020-08-27 13:46   33539531   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4670-281782_20200827134536.parquet
2020-08-27 14:14   33541130   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4752-286756_20200827141258.parquet
2020-08-27 14:30   33541913   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4793-289269_20200827142922.parquet
2020-08-27 14:49   33542820   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4834-291776_20200827144807.parquet
2020-08-27 15:08   33543459   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4875-294286_20200827150653.parquet
2020-08-27 15:30   33544369   s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4916-296786_20200827152840.parquet
```

Contents of `s3://my-bucket/my-table/.hoodie/20200827155539.commit`:

```
"9dee1248-c972-4ed3-80f5-15545ac4c534-0" : "daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet",
```

Contents of `s3://my-bucket/my-table/.hoodie/20200827155539.compaction.requested`:

```
[20200827152840, [.9dee1248-c972-4ed3-80f5-15545ac4c534-0_20200827152840.log.1_32-4949-299212], 9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4916-296786_20200827152840.parquet, 9dee1248-c972-4ed3-80f5-15545ac4c534-0, daas_date=2020, [TOTAL_LOG_FILES -> 1.0, TOTAL_IO_READ_MB -> 32.0, TOTAL_LOG_FILES_SIZE -> 121966.0, TOTAL_IO_WRITE_MB -> 31.0, TOTAL_IO_MB -> 63.0, TOTAL_LOG_FILE_SIZE -> 121966.0]],
```

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
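One detail worth noting in the listings above: the commit metadata for instant `20200827155539` and the file actually present in S3 both carry write token `2-39-2458`, while the FileNotFoundException references token `2-4957-299294` for the same file group and the same instant. That pattern suggests two different task attempts of the same write. As a rough illustration only (the regex below is an assumption about the `<fileId>_<writeToken>_<instantTime>.parquet` naming convention, not Hudi's actual parser), the base-file name can be split into its parts to compare the two:

```python
import re

# Assumed Hudi base-file naming: <fileId>_<writeToken>_<instantTime>.parquet,
# where the write token identifies the Spark task attempt that wrote the file.
NAME = re.compile(r"(?P<file_id>.+)_(?P<write_token>\d+-\d+-\d+)_(?P<instant>\d+)\.parquet")

def parse(name):
    """Return (file_id, write_token, instant) or None if the name doesn't match."""
    m = NAME.fullmatch(name)
    return (m["file_id"], m["write_token"], m["instant"]) if m else None

missing = parse("9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4957-299294_20200827155539.parquet")
present = parse("9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet")

# Same file group and same commit instant, different write token:
same_slice = missing[0] == present[0] and missing[2] == present[2]
```

Under that reading, the reader/compactor expected one task attempt's output while a different attempt's file was the one committed.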
[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"
dm-tran edited a comment on issue #2020: URL: https://github.com/apache/hudi/issues/2020#issuecomment-682311268

@bvaradar The exception was raised after running the structured streaming job for a while. Please find attached the driver logs with INFO level logging.

- [stderr_01.log](https://github.com/apache/hudi/files/5139921/stderr_01.log): the structured streaming job fails with `org.apache.hudi.exception.HoodieIOException: Consistency check failed to ensure all files APPEAR`
- [stderr_02.log](https://github.com/apache/hudi/files/5139922/stderr_02.log): the structured streaming job is retried by YARN and compaction fails with a `java.io.FileNotFoundException`
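For context: S3 was eventually consistent at the time, and Hudi ships a consistency guard that retries listings until newly written files become visible; the `Consistency check failed to ensure all files APPEAR` error means those retries were exhausted. A hedged sketch of the writer options involved follows (the option names are taken from the Hudi 0.6 configuration docs; the interval and retry values are illustrative guesses, not recommendations):

```python
# Illustrative Hudi writer options for eventually-consistent stores.
# Larger intervals/check counts trade slower failure for more tolerance
# of S3 listing lag; tune for your workload.
hudi_consistency_opts = {
    "hoodie.consistency.check.enabled": "true",
    "hoodie.consistency.check.initial_interval_ms": "2000",
    "hoodie.consistency.check.max_interval_ms": "300000",
    "hoodie.consistency.check.max_checks": "7",
}

# Sketch of how these would be passed to a Spark writer:
# df.write.format("hudi").options(**hudi_consistency_opts)...
```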
[jira] [Closed] (HUDI-1222) Introduce MergeHelper.UpdateHandler as independent class
[ https://issues.apache.org/jira/browse/HUDI-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangxianghu closed HUDI-1222.
------------------------------
    Resolution: Invalid

> Introduce MergeHelper.UpdateHandler as independent class
> --------------------------------------------------------
>
>                 Key: HUDI-1222
>                 URL: https://issues.apache.org/jira/browse/HUDI-1222
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: wangxianghu
>            Assignee: wangxianghu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.1
>
> Making UpdateHandler class independent helps reduce the workload of refactoring hudi-client

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Updated] (HUDI-1222) Introduce MergeHelper.UpdateHandler as independent class
[ https://issues.apache.org/jira/browse/HUDI-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangxianghu updated HUDI-1222:
------------------------------
    Status: Open  (was: New)

> Introduce MergeHelper.UpdateHandler as independent class
> --------------------------------------------------------
>
>                 Key: HUDI-1222
>                 URL: https://issues.apache.org/jira/browse/HUDI-1222
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: wangxianghu
>            Assignee: wangxianghu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.1
>
> Making UpdateHandler class independent helps reduce the workload of refactoring hudi-client
[GitHub] [hudi] zherenyu831 closed issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream
zherenyu831 closed issue #2043: URL: https://github.com/apache/hudi/issues/2043
[GitHub] [hudi] zherenyu831 commented on issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream
zherenyu831 commented on issue #2043: URL: https://github.com/apache/hudi/issues/2043#issuecomment-682305478

@bvaradar Thank you for the reply. I also saw your blog PR before, and it works with the pure structured streaming API. Noted, will try to avoid this issue when batch writing.
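For readers hitting the same issue: with `foreachBatch`, compaction generally has to run inline within each micro-batch rather than asynchronously. A hedged sketch of the MERGE_ON_READ writer options involved (option names from the Hudi 0.6 docs; the delta-commit threshold is an illustrative value, not a recommendation):

```python
# Illustrative Hudi writer options for a MOR table written from foreachBatch.
# Inline compaction runs synchronously after every N delta commits, which
# lengthens some micro-batches; whether that fits the latency budget is
# workload-dependent.
mor_inline_compaction_opts = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",
}
```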
[GitHub] [hudi] wangxianghu closed pull request #2033: [HUDI-1222] Introduce MergeHelper.UpdateHandler as independent class
wangxianghu closed pull request #2033: URL: https://github.com/apache/hudi/pull/2033
[GitHub] [hudi] wangxianghu commented on pull request #2033: [HUDI-1222] Introduce MergeHelper.UpdateHandler as independent class
wangxianghu commented on pull request #2033: URL: https://github.com/apache/hudi/pull/2033#issuecomment-682303627

Let's keep it in HUDI-1089; closing now.
[GitHub] [hudi] nsivabalan commented on issue #1751: [SUPPORT] Hudi not working with Spark 3.0.0
nsivabalan commented on issue #1751: URL: https://github.com/apache/hudi/issues/1751#issuecomment-682301333

@bschell is driving this. Ref PR: https://github.com/apache/hudi/pull/1760. @bschell: any rough timelines?
[GitHub] [hudi] nsivabalan commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching
nsivabalan commented on a change in pull request #1469: URL: https://github.com/apache/hudi/pull/1469#discussion_r478805908

File path: `hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java`

```
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.utils.LazyIterableIterator;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.io.HoodieBloomRangeInfoHandle;
+import org.apache.hudi.io.HoodieKeyLookupHandle;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import scala.Tuple2;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+/**
+ * Simplified re-implementation of {@link HoodieBloomIndex} that does not rely on caching, or
+ * incurs the overhead of auto-tuning parallelism.
+ */
+public class HoodieBloomIndexV2 extends HoodieIndex {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieBloomIndexV2.class);
+
+  public HoodieBloomIndexV2(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public JavaRDD tagLocation(JavaRDD recordRDD,
+      JavaSparkContext jsc,
+      HoodieTable hoodieTable) {
+    return recordRDD
+        .sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()),
+            true, config.getBloomIndexV2Parallelism())
+        .mapPartitions((itr) -> new LazyRangeAndBloomChecker(itr, hoodieTable)).flatMap(List::iterator)
+        .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
+        .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
+        .filter(Option::isPresent)
```

Review comment:

@vinothchandar: guess there could be a bug here. If, for a record, a few files were matched from the range and bloom lookup, but in LazyKeyChecker none of the files had the record, the current code may not include this record in the final JavaRDD returned. But we have to return this record with an empty current location.
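The invariant behind this review comment can be shown with a toy tagging pass (plain Python, not the Hudi API): every input record must be emitted, with an empty location when no candidate file actually contains its key; filtering those records out silently drops the inserts.

```python
def tag_locations(records, file_index):
    """Toy tagging pass. `file_index` maps record key -> file id for keys that
    actually exist in some file. Every input record must come back out,
    tagged or not; a None location means "treat as a new insert"."""
    tagged = []
    for key in records:
        location = file_index.get(key)  # None: no candidate file held the key
        tagged.append((key, location))  # keep the record even when untagged
    return tagged

# A record whose candidate files turn out not to contain it is still returned:
out = tag_locations(["k1", "k2"], {"k1": "file-a"})
```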
[jira] [Updated] (HUDI-531) Add java doc for hudi test suite general classes
[ https://issues.apache.org/jira/browse/HUDI-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vinoyang updated HUDI-531:
--------------------------
    Fix Version/s: 0.5.1

> Add java doc for hudi test suite general classes
> ------------------------------------------------
>
>                 Key: HUDI-531
>                 URL: https://issues.apache.org/jira/browse/HUDI-531
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: vinoyang
>            Assignee: wangxianghu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.1
>
> Currently, the general classes (under src/main dir) has no java docs. We
> should add doc for those classes.
[jira] [Closed] (HUDI-531) Add java doc for hudi test suite general classes
[ https://issues.apache.org/jira/browse/HUDI-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

vinoyang closed HUDI-531.
-------------------------
    Resolution: Done

Done via master branch: fa812482473f0cc8c2f34e2db07366cc3e5f7066

> Add java doc for hudi test suite general classes
> ------------------------------------------------
>
>                 Key: HUDI-531
>                 URL: https://issues.apache.org/jira/browse/HUDI-531
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Testing
>            Reporter: vinoyang
>            Assignee: wangxianghu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.1
>
> Currently, the general classes (under src/main dir) has no java docs. We
> should add doc for those classes.
[hudi] branch master updated: [HUDI-531] Add java doc for hudi test suite general classes (#1900)
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new fa81248  [HUDI-531] Add java doc for hudi test suite general classes (#1900)

fa81248 is described below

commit fa812482473f0cc8c2f34e2db07366cc3e5f7066
Author: Mathieu
AuthorDate: Fri Aug 28 08:44:40 2020 +0800

    [HUDI-531] Add java doc for hudi test suite general classes (#1900)

````
 .../org/apache/hudi/client/HoodieWriteClient.java  |  2 +-
 hudi-integ-test/README.md                          |  6 +--
 .../hudi/integ/testsuite/converter/Converter.java  |  6 +++
 .../integ/testsuite/dag/nodes/BulkInsertNode.java  |  3 ++
 .../hudi/integ/testsuite/dag/nodes/CleanNode.java  |  4 ++
 .../integ/testsuite/dag/nodes/CompactNode.java     | 10
 .../hudi/integ/testsuite/dag/nodes/DagNode.java    |  8 ++-
 .../integ/testsuite/dag/nodes/HiveQueryNode.java   |  3 ++
 .../integ/testsuite/dag/nodes/HiveSyncNode.java    |  3 ++
 .../hudi/integ/testsuite/dag/nodes/InsertNode.java |  3 ++
 .../integ/testsuite/dag/nodes/RollbackNode.java    |  9
 .../testsuite/dag/nodes/ScheduleCompactNode.java   |  3 ++
 .../testsuite/dag/nodes/SparkSQLQueryNode.java     |  9
 .../hudi/integ/testsuite/dag/nodes/UpsertNode.java |  3 ++
 .../integ/testsuite/dag/nodes/ValidateNode.java    |  9
 .../testsuite/dag/scheduler/DagScheduler.java      | 21
 .../integ/testsuite/generator/DeltaGenerator.java  |  2 +-
 .../GenericRecordFullPayloadGenerator.java         | 58 ++
 .../GenericRecordFullPayloadSizeEstimator.java     | 12 +
 .../generator/UpdateGeneratorIterator.java         |  3 ++
 .../integ/testsuite/writer/DeltaWriterAdapter.java |  3 ++
 21 files changed, 174 insertions(+), 6 deletions(-)

diff --git a/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java b/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
index 9f6df7b..142ff33 100644
--- a/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
+++ b/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
@@ -557,7 +557,7 @@ public class HoodieWriteClient extends AbstractHo
     metrics.updateCleanMetrics(durationMs, metadata.getTotalFilesDeleted());
     LOG.info("Cleaned " + metadata.getTotalFilesDeleted() + " files"
         + " Earliest Retained Instant :" + metadata.getEarliestCommitToRetain()
-        + " cleanerElaspsedMs" + durationMs);
+        + " cleanerElapsedMs" + durationMs);
   }
   return metadata;
 }
diff --git a/hudi-integ-test/README.md b/hudi-integ-test/README.md
index d87fec3..a497ad9 100644
--- a/hudi-integ-test/README.md
+++ b/hudi-integ-test/README.md
@@ -41,7 +41,7 @@ Depending on the type of workload generated, data is either ingested into the ta
 dataset or the corresponding workload operation is executed. For example compaction does not
 necessarily need a workload to be generated/ingested but can require an execution.
-## Other actions/operatons
+## Other actions/operations
 The test suite supports different types of operations besides ingestion such as Hive Query execution,
 Clean action etc.
@@ -66,9 +66,9 @@ link#HudiDeltaStreamer page to learn about all the available configs applicable
 There are 2 ways to generate a workload pattern
- 1.Programatically
+ 1.Programmatically
-Choose to write up the entire DAG of operations programatically, take a look at `WorkflowDagGenerator` class.
+Choose to write up the entire DAG of operations programmatically, take a look at `WorkflowDagGenerator` class.
 Once you're ready with the DAG you want to execute, simply pass the class name as follows:
 ```
diff --git a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/converter/Converter.java b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/converter/Converter.java
index e4ad0a7..89f3b88 100644
--- a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/converter/Converter.java
+++ b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/converter/Converter.java
@@ -29,5 +29,11 @@ import org.apache.spark.api.java.JavaRDD;
  */
 public interface Converter extends Serializable {
+  /**
+   * Convert data from one format to another.
+   *
+   * @param inputRDD Input data
+   * @return Data in target format
+   */
   JavaRDD convert(JavaRDD inputRDD);
 }
\ No newline at end of file
diff --git a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BulkInsertNode.java b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BulkInsertNode.java
index 7a8f405..bdf57f8 100644
--- a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BulkInsertNode.java
+++ b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BulkInsertNode.java
@@ -24,6 +
````
[GitHub] [hudi] yanghua merged pull request #1900: [HUDI-531] Add java doc for hudi test suite general classes
yanghua merged pull request #1900: URL: https://github.com/apache/hudi/pull/1900
[jira] [Updated] (HUDI-1104) Bulk insert Dataset - UserDefinedPartitioner
[ https://issues.apache.org/jira/browse/HUDI-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-1104:
---------------------------------
    Labels: pull-request-available  (was: )

> Bulk insert Dataset - UserDefinedPartitioner
> --------------------------------------------
>
>                 Key: HUDI-1104
>                 URL: https://issues.apache.org/jira/browse/HUDI-1104
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.6.1
[GitHub] [hudi] nsivabalan opened a new pull request #2049: [HUDI-1104] [WIP] Bulk insert dedup
nsivabalan opened a new pull request #2049: URL: https://github.com/apache/hudi/pull/2049

## *Tips*
- *Thank you very much for contributing to Apache Hudi.*
- *Please review https://hudi.apache.org/contributing.html before opening a pull request.*

## What is the purpose of the pull request

*(For example: This pull request adds quick-start document.)*

## Brief change log

*(for example:)*
  - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*

## Verify this pull request

*(Please pick either of the following options)*

This pull request is a trivial rework / code cleanup without any test coverage.

*(or)*

This pull request is already covered by existing tests, such as *(please describe tests)*.

(or)

This change added tests and can be verified as follows:

*(example:)*
  - *Added integration tests for end-to-end.*
  - *Added HoodieClientWriteTest to verify the change.*
  - *Manually verified the change by running a job locally.*

## Committer checklist

 - [ ] Has a corresponding JIRA in PR title & commit
 - [ ] Commit message is descriptive of the change
 - [ ] CI is green
 - [ ] Necessary doc changes done or have another open PR
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
[GitHub] [hudi] satishkotha commented on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action
satishkotha commented on pull request #2048: URL: https://github.com/apache/hudi/pull/2048#issuecomment-682247412

@vinothchandar @bvaradar FYI. There are a few things that I'm not fully happy with, but I would like to get initial feedback and agreement on the high-level approach.
[GitHub] [hudi] satishkotha opened a new pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action
satishkotha opened a new pull request #2048: URL: https://github.com/apache/hudi/pull/2048

## What is the purpose of the pull request

I am following up on feedback in apache#1859; we want to make replace a top-level action. This is WIP, but I am publishing it to give a high-level idea of the changes required.

## Brief change log

There are multiple challenges in making replace a top-level action:

1. In the write path, we create `.commit` in 2 places: BaseActionCommitExecutor or HoodieSparkSqlWriter. We need to change both places to create a `.replace` file.
2. All post-commit actions work on top of HoodieCommitMetadata. For now, replace is also using the same class to keep the change simple. We can split HoodieCommitMetadata into a class hierarchy (not sure how JSON serialization works with inheritance) or discuss other alternatives.
3. There are many assumptions in the code that commit action type is tied to table type. For example, action type can only be either '.commit' or '.deltacommit' ([here](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java#L474)). Changing this to add replace seemed error-prone and tedious; we may have to add better abstractions here.
4. The way we identify whether a parquet file is valid is by checking if there's a corresponding '.commit' file. If we just create a `.replace` file, we have to change a lot of places to make sure new files created by replace are used.
5. Everywhere we invoke 'getCommitsTimeline'/'filterCommits', we need to review and make sure the caller can handle replace actions, OR create new methods and refactor all invocations of getCommitsTimeline to call the new methods.

I made most of the changes for #1, 2, 3 above. Need to discuss if this is the right approach and extend it to 4, 5.

## Verify this pull request

This change added tests. Verified basic actions using quick start and docker setup. Adding more tests is in progress, but I wanted to get feedback on the high-level approach.

This is an example `.hoodie` folder from the quick start setup:

```
-rw-r--r--  1 satishkotha  wheel  1933 Aug 27 16:39 20200827163904.commit
-rw-r--r--  1 satishkotha  wheel     0 Aug 27 16:39 20200827163904.commit.requested
-rw-r--r--  1 satishkotha  wheel  1015 Aug 27 16:39 20200827163904.inflight
-rw-r--r--  1 satishkotha  wheel  2610 Aug 27 16:39 20200827163927.replace
-rw-r--r--  1 satishkotha  wheel  1024 Aug 27 16:39 20200827163927.replace.inflight
-rw-r--r--  1 satishkotha  wheel     0 Aug 27 16:39 20200827163927.replace.requested
drwxr-xr-x  2 satishkotha  wheel    64 Aug 27 16:39 archived
-rw-r--r--  1 satishkotha  wheel   235 Aug 27 16:39 hoodie.properties
```

## Committer checklist

 - [ ] Has a corresponding JIRA in PR title & commit
 - [ ] Commit message is descriptive of the change
 - [ ] CI is green
 - [ ] Necessary doc changes done or have another open PR
 - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
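Challenge 4 above can be illustrated with a toy model (plain Python, not Hudi's actual timeline API): any base-file visibility check that today looks only for a completed `.commit` instant would have to accept `.replace` as well.

```python
# Toy model of base-file validity under a new top-level "replace" action.
# A base file written at some instant is visible only if the timeline holds
# a completed instant for it, of either accepted action type.
COMPLETED_ACTIONS = {"commit", "replace"}

def is_valid_base_file(file_instant, timeline):
    """timeline: iterable of (instant_time, action, state) tuples."""
    return any(
        t == file_instant and action in COMPLETED_ACTIONS and state == "COMPLETED"
        for t, action, state in timeline
    )

# Mirrors the example .hoodie folder above: one commit, one replace.
timeline = [
    ("20200827163904", "commit", "COMPLETED"),
    ("20200827163927", "replace", "COMPLETED"),
]
```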
[GitHub] [hudi] prashantwason edited a comment on pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
prashantwason edited a comment on pull request #1804: URL: https://github.com/apache/hudi/pull/1804#issuecomment-682213033

> @prashantwason if you broadly agree, I will make the change and land this, so you can focus on rfc-15 more :)

Sure @vinothchandar. Thanks for all the help. Let's get this rolling soon. I will look into the comments too, but you can do the needful.
[GitHub] [hudi] prashantwason commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
prashantwason commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r478655201 ## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.io; + +import org.apache.hudi.client.SparkTaskContextSupplier; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieUpsertException; +import org.apache.hudi.table.HoodieTable; + +import org.apache.avro.generic.GenericRecord; + +import java.io.IOException; +import java.util.Iterator; +import java.util.Map; +import java.util.PriorityQueue; +import java.util.Queue; + +/** + * Hoodie merge handle which writes records (new inserts or updates) sorted by their key. + * + * The implementation performs a merge-sort by comparing the key of the record being written to the list of + * keys in newRecordKeys (sorted in-memory). 
```
+ */
+public class HoodieSortedMergeHandle<T extends HoodieRecordPayload> extends HoodieMergeHandle<T> {
+
+  private Queue<String> newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable,
+      Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, sparkTaskContextSupplier);
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable,
+      Map<String, HoodieRecord<T>> keyToNewRecordsOrig, String partitionPath, String fileId,
+      HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, keyToNewRecordsOrig, partitionPath, fileId, dataFileToBeMerged,
+        sparkTaskContextSupplier);
+
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Go through an old record. Here if we detect a newer version shows up, we write the new one to the file.
+   */
+  @Override
+  public void write(GenericRecord oldRecord) {
+    String key = oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+
+    // To maintain overall sorted order across updates and inserts, write any new inserts whose keys are less than
+    // the oldRecord's key.
+    while (!newRecordKeysSorted.isEmpty() && newRecordKeysSorted.peek().compareTo(key) <= 0) {
+      String keyToPreWrite = newRecordKeysSorted.remove();
```

Review comment: If the inputItr is sorted, then yes, all this overhead can be removed.
[GitHub] [hudi] satishkotha commented on a change in pull request #2044: [HUDI-1228] Add utility method to query extra metadata
satishkotha commented on a change in pull request #2044: URL: https://github.com/apache/hudi/pull/2044#discussion_r478620546 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -98,4 +101,31 @@ }).distinct().filter(s -> !s.isEmpty()).collect(Collectors.toList()); } + + /** + * Get extra metadata for specified key from latest commit/deltacommit instant. + */ + public static Option<String> getExtraMetadataFromLatest(HoodieTableMetaClient metaClient, String extraMetadataKey) { +return metaClient.getCommitsTimeline().filterCompletedInstants().getReverseOrderedInstants().findFirst().map(instant -> +getMetadataValue(metaClient, extraMetadataKey, instant)).orElse(Option.empty()); + } + + /** + * Get extra metadata for specified key from all active commit/deltacommit instants. + */ + public static Map<String, Option<String>> getExtraMetadataTimeline(HoodieTableMetaClient metaClient, String extraMetadataKey) { Review comment: This is primarily for debugging. It helps with seeing how an extra metadata value changed over time. If you have a better suggestion for a name, let me know.
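The distinction between the two utilities under review can be sketched with plain collections. The class and method names below are hypothetical stand-ins, not the Hudi API: commits are ordered by instant time, and each carries an extra-metadata map; `fromLatest` reads one key from the newest commit only, while `timeline` returns that key's value at every commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class ExtraMetadataLookup {

    // Commits keyed by instant time; TreeMap keeps them in timeline order.
    private final TreeMap<String, Map<String, String>> commits = new TreeMap<>();

    void addCommit(String instantTime, Map<String, String> extraMetadata) {
        commits.put(instantTime, extraMetadata);
    }

    // Analogue of getExtraMetadataFromLatest: the key's value in the most recent commit only.
    Optional<String> fromLatest(String key) {
        return commits.isEmpty()
            ? Optional.empty()
            : Optional.ofNullable(commits.lastEntry().getValue().get(key));
    }

    // Analogue of getExtraMetadataTimeline: the key's value at every commit,
    // keyed by instant time -- useful for debugging how a value changed over time.
    Map<String, Optional<String>> timeline(String key) {
        Map<String, Optional<String>> out = new LinkedHashMap<>();
        commits.forEach((t, md) -> out.put(t, Optional.ofNullable(md.get(key))));
        return out;
    }

    public static void main(String[] args) {
        ExtraMetadataLookup lookup = new ExtraMetadataLookup();
        lookup.addCommit("20200827103318", Map.of("checkpoint", "offset-10"));
        lookup.addCommit("20200827155539", Map.of("checkpoint", "offset-42"));
        System.out.println(lookup.fromLatest("checkpoint")); // Optional[offset-42]
        System.out.println(lookup.timeline("checkpoint"));
    }
}
```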
[jira] [Assigned] (HUDI-619) Investigate and implement mechanism to have hive/presto/sparksql queries avoid stitching and return null values for hoodie columns
[ https://issues.apache.org/jira/browse/HUDI-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balaji Varadarajan reassigned HUDI-619: --- Assignee: Balaji Varadarajan > Investigate and implement mechanism to have hive/presto/sparksql queries > avoid stitching and return null values for hoodie columns > --- > > Key: HUDI-619 > URL: https://issues.apache.org/jira/browse/HUDI-619 > Project: Apache Hudi > Issue Type: Sub-task > Components: Hive Integration, Presto Integration, Spark Integration >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Fix For: 0.6.1 > > > This idea is suggested by Vinoth during RFC review. This ticket is to track > the feasibility and implementation of it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] n3nash commented on a change in pull request #2044: [HUDI-1228] Add utility method to query extra metadata
n3nash commented on a change in pull request #2044: URL: https://github.com/apache/hudi/pull/2044#discussion_r478612897 ## File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java ## @@ -98,4 +101,31 @@ }).distinct().filter(s -> !s.isEmpty()).collect(Collectors.toList()); } + + /** + * Get extra metadata for specified key from latest commit/deltacommit instant. + */ + public static Option<String> getExtraMetadataFromLatest(HoodieTableMetaClient metaClient, String extraMetadataKey) { +return metaClient.getCommitsTimeline().filterCompletedInstants().getReverseOrderedInstants().findFirst().map(instant -> +getMetadataValue(metaClient, extraMetadataKey, instant)).orElse(Option.empty()); + } + + /** + * Get extra metadata for specified key from all active commit/deltacommit instants. + */ + public static Map<String, Option<String>> getExtraMetadataTimeline(HoodieTableMetaClient metaClient, String extraMetadataKey) { Review comment: @satishkotha can you describe what this method is used for? The naming is a little confusing.
[GitHub] [hudi] nsivabalan edited a comment on issue #2029: Records seen with _hoodie_is_deleted set to true on non-existing record
nsivabalan edited a comment on issue #2029: URL: https://github.com/apache/hudi/issues/2029#issuecomment-682107789 Can you try 0.6.0? We had a release recently, and you should be able to use the mvn artifacts.
[GitHub] [hudi] bvaradar commented on issue #2042: org.apache.hudi.exception.HoodieIOException: IOException when reading log file
bvaradar commented on issue #2042: URL: https://github.com/apache/hudi/issues/2042#issuecomment-682102033 @n3nash : Can you help take a look at this? @sam-wmt : Can you please provide the full stack trace of the corrupted log file exception?
[GitHub] [hudi] bvaradar commented on issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream
bvaradar commented on issue #2043: URL: https://github.com/apache/hudi/issues/2043#issuecomment-682100271 @zherenyu831 : Can you model your query using pure structured streaming APIs and avoid foreachBatch? It looks like foreachBatch is triggering the batch sink, not the streaming sink APIs. We will have a blog shortly on the usage, but you can reference the PR: https://github.com/apache/hudi/pull/1996/files#diff-cb5b78d0c2deafe117b643f5de250a17R50 Also, please note that we have discovered an issue related to batch writes: https://issues.apache.org/jira/browse/HUDI-1230 I have sent an email to the dev@ and users@ mailing lists on the config change to work around it.
[GitHub] [hudi] bhasudha commented on a change in pull request #1984: [HUDI-1200] Fix NullPointerException, CustomKeyGenerator does not work
bhasudha commented on a change in pull request #1984: URL: https://github.com/apache/hudi/pull/1984#discussion_r478582473 ## File path: hudi-spark/src/main/java/org/apache/hudi/keygen/KeyGenerator.java ## @@ -41,7 +41,7 @@ private static final String STRUCT_NAME = "hoodieRowTopLevelField"; private static final String NAMESPACE = "hoodieRow"; - protected transient TypedProperties config; + protected TypedProperties config; Review comment: I think so too. @liujinhui1994 can you try with Hudi 0.6.0 and see if that helps? We fixed some serialization issues there wrt KeyGenerators.
[GitHub] [hudi] bhasudha commented on pull request #1597: [WIP] Added a MultiFormatTimestampBasedKeyGenerator that allows for multipl…
bhasudha commented on pull request #1597: URL: https://github.com/apache/hudi/pull/1597#issuecomment-682087937 Closing this PR in favor of https://github.com/apache/hudi/pull/1433
[GitHub] [hudi] bhasudha closed pull request #1597: [WIP] Added a MultiFormatTimestampBasedKeyGenerator that allows for multipl…
bhasudha closed pull request #1597: URL: https://github.com/apache/hudi/pull/1597
[GitHub] [hudi] bvaradar commented on issue #2019: Leak in DiskBasedMap
bvaradar commented on issue #2019: URL: https://github.com/apache/hudi/issues/2019#issuecomment-682063177 Closing this issue as we have a JIRA to track it.
[GitHub] [hudi] bvaradar closed issue #2019: Leak in DiskBasedMap
bvaradar closed issue #2019: URL: https://github.com/apache/hudi/issues/2019
[GitHub] [hudi] bvaradar closed issue #2031: [SUPPORT] java.lang.NoSuchMethodError: ExpressionEncoder.fromRow
bvaradar closed issue #2031: URL: https://github.com/apache/hudi/issues/2031
[GitHub] [hudi] bvaradar commented on issue #2031: [SUPPORT] java.lang.NoSuchMethodError: ExpressionEncoder.fromRow
bvaradar commented on issue #2031: URL: https://github.com/apache/hudi/issues/2031#issuecomment-682062820 @vinothsiva1989 : I am assuming this issue is resolved with the Scala version change. Please reopen if this is a different issue.
[GitHub] [hudi] bvaradar closed issue #2034: [SUPPORT] DateType can't be transformed to right data by kafka avro
bvaradar closed issue #2034: URL: https://github.com/apache/hudi/issues/2034
[GitHub] [hudi] bvaradar commented on issue #2034: [SUPPORT] DateType can't be transformed to right data by kafka avro
bvaradar commented on issue #2034: URL: https://github.com/apache/hudi/issues/2034#issuecomment-682062239 Thanks. Closing this issue as it is tracked in Jira.
[GitHub] [hudi] bvaradar closed issue #1979: [SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?
bvaradar closed issue #1979: URL: https://github.com/apache/hudi/issues/1979
[GitHub] [hudi] bvaradar commented on issue #1979: [SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?
bvaradar commented on issue #1979: URL: https://github.com/apache/hudi/issues/1979#issuecomment-682061419 Will close the ticket for now. Please reopen if we need to discuss more on this topic.
[GitHub] [hudi] prashanthvg89 commented on issue #2029: Records seen with _hoodie_is_deleted set to true on non-existing record
prashanthvg89 commented on issue #2029: URL: https://github.com/apache/hudi/issues/2029#issuecomment-682045840 I am using 0.5.2. Master is 0.6.1, right? What is the latest version in which this is fixed?
[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.
vinothchandar commented on a change in pull request #1804: URL: https://github.com/apache/hudi/pull/1804#discussion_r478483860 ## File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java ## @@ -0,0 +1,126 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.hudi.io; + +import org.apache.hudi.client.SparkTaskContextSupplier; +import org.apache.hudi.client.WriteStatus; +import org.apache.hudi.common.model.HoodieBaseFile; +import org.apache.hudi.common.model.HoodieRecord; +import org.apache.hudi.common.model.HoodieRecordPayload; +import org.apache.hudi.config.HoodieWriteConfig; +import org.apache.hudi.exception.HoodieUpsertException; +import org.apache.hudi.table.HoodieTable; + +import org.apache.avro.generic.GenericRecord; + +import java.io.IOException; +import java.util.Iterator; +import java.util.Map; +import java.util.PriorityQueue; +import java.util.Queue; + +/** + * Hoodie merge handle which writes records (new inserts or updates) sorted by their key. + * + * The implementation performs a merge-sort by comparing the key of the record being written to the list of + * keys in newRecordKeys (sorted in-memory). 
```
+ */
+public class HoodieSortedMergeHandle<T extends HoodieRecordPayload> extends HoodieMergeHandle<T> {
+
+  private Queue<String> newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable,
+      Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, sparkTaskContextSupplier);
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable,
+      Map<String, HoodieRecord<T>> keyToNewRecordsOrig, String partitionPath, String fileId,
+      HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, keyToNewRecordsOrig, partitionPath, fileId, dataFileToBeMerged,
+        sparkTaskContextSupplier);
+
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Go through an old record. Here if we detect a newer version shows up, we write the new one to the file.
+   */
+  @Override
+  public void write(GenericRecord oldRecord) {
+    String key = oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+
+    // To maintain overall sorted order across updates and inserts, write any new inserts whose keys are less than
+    // the oldRecord's key.
+    while (!newRecordKeysSorted.isEmpty() && newRecordKeysSorted.peek().compareTo(key) <= 0) {
+      String keyToPreWrite = newRecordKeysSorted.remove();
```

Review comment: I am thinking we don't need the map in HoodieMergeHandle or the PriorityQueue. The records which have changed, i.e. the input iterator, are already sorted; let's call it `inputItr`. So we can just compare the recordBeingWritten with inputItr.next() and write out the smaller one; if equal, we call the payload to merge.
This will avoid any kind of memory overhead. ## File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java ## @@ -94,6 +100,16 @@ public Builder parquetPageSize(int pageSize) { return this; } +public Builder hfileMaxFileSize(long maxFileSize) { + props.setProperty(HFILE_FILE_MAX_BYTES, String.valueOf(maxFileSize)); Review comment: Not following, sorry. Are you suggesting having a single config, or two? We need to have a config per usage of HFile, so we can control the base file size for data, metadata, and the record index separately. We cannot have a generic base.file.size or hfile.size config here, at this level, IMO. cc @prashantwason ## File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java ## @@ -94,6 +100,16 @@ public Builder parquetPageSize(int pageSize) { return this; } +public Builder hfileMaxFileSize(long maxFileSize) { + props.setPr
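The reviewer's suggestion amounts to a classic two-way merge of already-sorted iterators (old records on storage vs. incoming records), with equal keys resolved by merging payloads. A minimal sketch, simplified to plain `String` keys and with the new value standing in for the payload merge; `SortedMerge` and its `merge` method are hypothetical, not Hudi APIs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SortedMerge {

    // Merge two sorted iterators into one sorted list, writing the smaller
    // key first; on equal keys the "new" record wins (payload merge stand-in).
    static List<String> merge(Iterator<String> oldItr, Iterator<String> newItr) {
        List<String> out = new ArrayList<>();
        String o = oldItr.hasNext() ? oldItr.next() : null;
        String n = newItr.hasNext() ? newItr.next() : null;
        while (o != null || n != null) {
            if (n == null || (o != null && o.compareTo(n) < 0)) {
                out.add(o);                                   // old record, unchanged
                o = oldItr.hasNext() ? oldItr.next() : null;
            } else if (o == null || n.compareTo(o) < 0) {
                out.add(n);                                   // brand-new insert
                n = newItr.hasNext() ? newItr.next() : null;
            } else {                                          // equal keys: merged record
                out.add(n);
                o = oldItr.hasNext() ? oldItr.next() : null;
                n = newItr.hasNext() ? newItr.next() : null;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(
            Arrays.asList("a", "c", "e").iterator(),
            Arrays.asList("b", "c", "d").iterator()));        // [a, b, c, d, e]
    }
}
```

Because only the heads of the two iterators are held at any time, no in-memory `PriorityQueue` of new-record keys is needed, which is the memory saving the comment refers to.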
[GitHub] [hudi] wangxianghu commented on a change in pull request #1946: [HUDI-1176]Support log4j2 config
wangxianghu commented on a change in pull request #1946: URL: https://github.com/apache/hudi/pull/1946#discussion_r47858 ## File path: hudi-utilities/src/test/resources/log4j2-surefire.properties ## @@ -0,0 +1,51 @@ +### +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +### +status = warn +name = PropertiesConfig +
+# CONSOLE is set to be a ConsoleAppender. +appender.console.type = Console +appender.console.name = consoleLogger +# CONSOLE uses PatternLayout. +appender.console.layout.type = PatternLayout +appender.console.layout.pattern = %-4r [%t] %-5p %c %x - %m%n Review comment: Hi @hddong, thanks for your contribution. Can you explain why you configured two formats? If there is no special reason, keeping them in the same format might be better. The rest LGTM. cc @yanghua
[GitHub] [hudi] wangxianghu commented on a change in pull request #1900: [HUDI-531]Add java doc for hudi test suite general classes
wangxianghu commented on a change in pull request #1900: URL: https://github.com/apache/hudi/pull/1900#discussion_r478423255 ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/GenericRecordFullPayloadGenerator.java ## @@ -43,22 +44,39 @@ */ public class GenericRecordFullPayloadGenerator implements Serializable { - public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10; // 10 KB + /** + * 10 KB. + */ + public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10; private static Logger log = LoggerFactory.getLogger(GenericRecordFullPayloadGenerator.class); protected final Random random = new Random(); - // The source schema used to generate a payload + /** + * The source schema used to generate a payload. Review comment: > Why we should change these comment styles for fields? My bad, that's the coding guideline of Alibaba. Rolled it back already :)
[jira] [Assigned] (HUDI-1225) Avro Date logical type not handled correctly when converting to Spark Row
[ https://issues.apache.org/jira/browse/HUDI-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cdmikechen reassigned HUDI-1225: Assignee: Balaji Varadarajan (was: cdmikechen) Please review the code > Avro Date logical type not handled correctly when converting to Spark Row > - > > Key: HUDI-1225 > URL: https://issues.apache.org/jira/browse/HUDI-1225 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Labels: pull-request-available > Fix For: 0.6.1 > > > [https://github.com/apache/hudi/issues/2034]
[jira] [Updated] (HUDI-1225) Avro Date logical type not handled correctly when converting to Spark Row
[ https://issues.apache.org/jira/browse/HUDI-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cdmikechen updated HUDI-1225: - Status: In Progress (was: Open) > Avro Date logical type not handled correctly when converting to Spark Row > - > > Key: HUDI-1225 > URL: https://issues.apache.org/jira/browse/HUDI-1225 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Balaji Varadarajan >Assignee: Balaji Varadarajan >Priority: Major > Labels: pull-request-available > Fix For: 0.6.1 > > > [https://github.com/apache/hudi/issues/2034]
[jira] [Updated] (HUDI-1225) Avro Date logical type not handled correctly when converting to Spark Row
[ https://issues.apache.org/jira/browse/HUDI-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1225: - Labels: pull-request-available (was: ) > Avro Date logical type not handled correctly when converting to Spark Row > - > > Key: HUDI-1225 > URL: https://issues.apache.org/jira/browse/HUDI-1225 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Reporter: Balaji Varadarajan >Assignee: cdmikechen >Priority: Major > Labels: pull-request-available > Fix For: 0.6.1 > > > [https://github.com/apache/hudi/issues/2034]
[GitHub] [hudi] cdmikechen opened a new pull request #2047: [HUDI-1225] Fix: Avro Date logical type not handled correctly when converting to Spark Row
cdmikechen opened a new pull request #2047: URL: https://github.com/apache/hudi/pull/2047 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request Fix: Avro Date logical type not handled correctly when converting to Spark Row jira: https://issues.apache.org/jira/browse/HUDI-1225 and issue: https://github.com/apache/hudi/issues/2034 ## Brief change log - *Modify `org.apache.hudi.AvroConversionHelper` to cast int to date type correctly* ## Verify this pull request This change added tests and can be verified as follows: - *Added `org.apache.hudi.TestAvroConversionHelper` to verify the change* ## Committer checklist - [x] Has a corresponding JIRA in PR title & commit - [x] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.
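For context on the class of bug this PR addresses: Avro's `date` logical type stores a date as an `int` counting days since the Unix epoch, so a conversion to a Spark `Row` has to interpret that int as epoch days rather than, say, epoch milliseconds. A minimal sketch of that interpretation (illustrative only; `fromEpochDays` is a hypothetical helper, not the actual `AvroConversionHelper` change):

```java
import java.sql.Date;
import java.time.LocalDate;

public class AvroDateConversion {

    // Avro `date` logical type: an int of days since 1970-01-01.
    // Treating the int as milliseconds (new Date(days)) would collapse any
    // recent date onto 1970-01-01; interpret it as epoch days instead.
    static Date fromEpochDays(int days) {
        return Date.valueOf(LocalDate.ofEpochDay(days));
    }

    public static void main(String[] args) {
        System.out.println(fromEpochDays(0));      // 1970-01-01
        System.out.println(fromEpochDays(18500)); // 2020-08-26
    }
}
```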
[GitHub] [hudi] yanghua commented on a change in pull request #1900: [HUDI-531]Add java doc for hudi test suite general classes
yanghua commented on a change in pull request #1900: URL: https://github.com/apache/hudi/pull/1900#discussion_r478341019 ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/configuration/DeltaConfig.java ## @@ -35,7 +35,13 @@ */ public class DeltaConfig implements Serializable { + /** + * Output destination type. Review comment: IMO, we do not need this comment. ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/GenericRecordFullPayloadGenerator.java ## @@ -43,22 +44,39 @@ */ public class GenericRecordFullPayloadGenerator implements Serializable { - public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10; // 10 KB + /** + * 10 KB. + */ + public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10; private static Logger log = LoggerFactory.getLogger(GenericRecordFullPayloadGenerator.class); protected final Random random = new Random(); - // The source schema used to generate a payload + /** + * The source schema used to generate a payload. Review comment: Why we should change these comment styles for fields? ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/configuration/DeltaConfig.java ## @@ -35,7 +35,13 @@ */ public class DeltaConfig implements Serializable { + /** + * Output destination type. + */ private final DeltaOutputMode deltaOutputMode; + /** + * Input data type. Review comment: ditto ## File path: hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/GenericRecordFullPayloadGenerator.java ## @@ -43,22 +44,39 @@ */ public class GenericRecordFullPayloadGenerator implements Serializable { - public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10; // 10 KB + /** + * 10 KB. Review comment: we may not change this.
[jira] [Issue Comment Deleted] (HUDI-1233) deltastreamer Kafka consumption delay reporting indicators
[ https://issues.apache.org/jira/browse/HUDI-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liujinhui updated HUDI-1233: Comment: was deleted (was: Please help me to see, I think this function is quite good, can you give advice [~vinoth]) > deltastreamer Kafka consumption delay reporting indicators > -- > > Key: HUDI-1233 > URL: https://issues.apache.org/jira/browse/HUDI-1233 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: liujinhui >Assignee: liujinhui >Priority: Minor > > currently hudi-deltastreamer does not report the indicator of Kafka data > consumption delay, I suggest that this function can be added
[jira] [Commented] (HUDI-1233) deltastreamer Kafka consumption delay reporting indicators
[ https://issues.apache.org/jira/browse/HUDI-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185697#comment-17185697 ] liujinhui commented on HUDI-1233: - Please help me to see, I think this function is quite good, can you give advice [~vinoth] > deltastreamer Kafka consumption delay reporting indicators > -- > > Key: HUDI-1233 > URL: https://issues.apache.org/jira/browse/HUDI-1233 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: liujinhui >Assignee: liujinhui >Priority: Minor > > currently hudi-deltastreamer does not report the indicator of Kafka data > consumption delay, I suggest that this function can be added
[jira] [Created] (HUDI-1233) deltastreamer Kafka consumption delay reporting indicators
liujinhui created HUDI-1233: --- Summary: deltastreamer Kafka consumption delay reporting indicators Key: HUDI-1233 URL: https://issues.apache.org/jira/browse/HUDI-1233 Project: Apache Hudi Issue Type: Improvement Components: DeltaStreamer Reporter: liujinhui Currently, hudi-deltastreamer does not report an indicator of Kafka consumption delay; I suggest that this function be added.
[jira] [Assigned] (HUDI-1233) deltastreamer Kafka consumption delay reporting indicators
[ https://issues.apache.org/jira/browse/HUDI-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liujinhui reassigned HUDI-1233: --- Assignee: liujinhui > deltastreamer Kafka consumption delay reporting indicators > -- > > Key: HUDI-1233 > URL: https://issues.apache.org/jira/browse/HUDI-1233 > Project: Apache Hudi > Issue Type: Improvement > Components: DeltaStreamer >Reporter: liujinhui >Assignee: liujinhui >Priority: Minor > > currently hudi-deltastreamer does not report the indicator of Kafka data > consumption delay, I suggest that this function can be added