[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-08-27 Thread GitBox


dm-tran edited a comment on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-682314989


   The file that isn't found is 
`'s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4957-299294_20200827155539.parquet'`.
   
   The available files in s3 that start with 
"9dee1248-c972-4ed3-80f5-15545ac4c534-0_2" are: 
   ```
   2020-08-27 10:26 33525767 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-3850-231917_20200827102526.parquet
   2020-08-27 10:33 33526574 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-3891-234401_20200827103318.parquet
   2020-08-27 16:17 33545224 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet
   2020-08-27 11:13 33530132 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4096-246791_20200827111254.parquet
   2020-08-27 11:22 33530880 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4137-249295_20200827112139.parquet
   2020-08-27 12:00 3353 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4301-259277_20200827115949.parquet
   2020-08-27 12:20 33534377 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4383-264271_20200827121947.parquet
   2020-08-27 12:42 33535631 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4465-269277_20200827124204.parquet
   2020-08-27 12:54 33536084 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4506-271786_20200827125338.parquet
   2020-08-27 13:07 33536635 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4547-274289_20200827130640.parquet
   2020-08-27 13:20 33537444 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4588-276783_20200827131919.parquet
   2020-08-27 13:32 33538151 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4629-279284_20200827133143.parquet
   2020-08-27 13:46 33539531 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4670-281782_20200827134536.parquet
   2020-08-27 14:14 33541130 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4752-286756_20200827141258.parquet
   2020-08-27 14:30 33541913 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4793-289269_20200827142922.parquet
   2020-08-27 14:49 33542820 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4834-291776_20200827144807.parquet
   2020-08-27 15:08 33543459 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4875-294286_20200827150653.parquet
   2020-08-27 15:30 33544369 s3://my-bucket/my-table/daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4916-296786_20200827152840.parquet
   ```
   
   Contents of s3://my-bucket/my-table/.hoodie/20200827155539.commit
   
   ```
"9dee1248-c972-4ed3-80f5-15545ac4c534-0" : 
"daas_date=2020/9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-39-2458_20200827155539.parquet",
   ```
   
   Contents of s3://my-bucket/my-table/.hoodie/20200827155539.compaction.requested
   
   ```
   [20200827152840, [.9dee1248-c972-4ed3-80f5-15545ac4c534-0_20200827152840.log.1_32-4949-299212], 9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4916-296786_20200827152840.parquet, 9dee1248-c972-4ed3-80f5-15545ac4c534-0, daas_date=2020, [TOTAL_LOG_FILES -> 1.0, TOTAL_IO_READ_MB -> 32.0, TOTAL_LOG_FILES_SIZE -> 121966.0, TOTAL_IO_WRITE_MB -> 31.0, TOTAL_IO_MB -> 63.0, TOTAL_LOG_FILE_SIZE -> 121966.0]],
   ```
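   
   For context, Hudi base file names follow the `<fileId>_<writeToken>_<instantTime>.parquet` pattern, where the write token identifies the Spark task attempt that produced the file. A tiny sketch (a hypothetical helper, not Hudi's own utility) that splits the missing name into those parts:
   
   ```java
   import java.util.regex.Matcher;
   import java.util.regex.Pattern;
   
   // Hypothetical helper: split a Hudi base-file name into fileId, writeToken
   // and instantTime, per the <fileId>_<writeToken>_<instantTime>.parquet layout.
   public class BaseFileNameParser {
     private static final Pattern NAME = Pattern.compile("(.+)_(\\d+-\\d+-\\d+)_(\\d+)\\.parquet");
   
     public static void main(String[] args) {
       Matcher m = NAME.matcher("9dee1248-c972-4ed3-80f5-15545ac4c534-0_2-4957-299294_20200827155539.parquet");
       if (m.matches()) {
         System.out.println("fileId      = " + m.group(1)); // file group id
         System.out.println("writeToken  = " + m.group(2)); // partition-stage-attempt
         System.out.println("instantTime = " + m.group(3)); // commit timestamp
       }
     }
   }
   ```
   
   Read this way, the missing file and the file recorded in the commit share the same fileId and instantTime but differ in the write token (`2-4957-299294` vs `2-39-2458`), which suggests two different task attempts wrote the same file slice.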
   








[GitHub] [hudi] dm-tran edited a comment on issue #2020: [SUPPORT] Compaction fails with "java.io.FileNotFoundException"

2020-08-27 Thread GitBox


dm-tran edited a comment on issue #2020:
URL: https://github.com/apache/hudi/issues/2020#issuecomment-682311268


   @bvaradar The exception was raised after running the structured streaming job for a while.
   
   Please find attached the driver logs with INFO level logging.
   
   [stderr_01.log](https://github.com/apache/hudi/files/5139921/stderr_01.log): the structured streaming job fails with error `org.apache.hudi.exception.HoodieIOException: Consistency check failed to ensure all files APPEAR`
   [stderr_02.log](https://github.com/apache/hudi/files/5139922/stderr_02.log): the structured streaming job is retried by YARN and compaction fails with a `java.io.FileNotFoundException`
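   
   For readers landing here: the first failure comes from Hudi's S3 consistency guard. A minimal sketch of the writer options that control it (config keys assumed from the 0.6.x ConsistencyGuardConfig; values illustrative only):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   
   class ConsistencyGuardSketch {
     // Sketch only: the guard re-checks S3 listings until newly written files
     // become visible, which is what "ensure all files APPEAR" refers to.
     static void write(Dataset<Row> df) {
       df.write()
           .format("hudi")
           .option("hoodie.table.name", "my_table")
           .option("hoodie.consistency.check.enabled", "true")
           .option("hoodie.consistency.check.max_checks", "7")
           .mode(SaveMode.Append)
           .save("s3://my-bucket/my-table");
     }
   }
   ```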
   










[jira] [Closed] (HUDI-1222) Introduce MergeHelper.UpdateHandler as independent class

2020-08-27 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu closed HUDI-1222.
-
Resolution: Invalid

> Introduce  MergeHelper.UpdateHandler as independent class 
> --
>
> Key: HUDI-1222
> URL: https://issues.apache.org/jira/browse/HUDI-1222
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Making UpdateHandler class independent helps reduce the workload of 
> refactoring hudi-client





[jira] [Updated] (HUDI-1222) Introduce MergeHelper.UpdateHandler as independent class

2020-08-27 Thread wangxianghu (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangxianghu updated HUDI-1222:
--
Status: Open  (was: New)

> Introduce  MergeHelper.UpdateHandler as independent class 
> --
>
> Key: HUDI-1222
> URL: https://issues.apache.org/jira/browse/HUDI-1222
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: wangxianghu
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Making UpdateHandler class independent helps reduce the workload of 
> refactoring hudi-client





[GitHub] [hudi] zherenyu831 closed issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream

2020-08-27 Thread GitBox


zherenyu831 closed issue #2043:
URL: https://github.com/apache/hudi/issues/2043


   







[GitHub] [hudi] zherenyu831 commented on issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream

2020-08-27 Thread GitBox


zherenyu831 commented on issue #2043:
URL: https://github.com/apache/hudi/issues/2043#issuecomment-682305478


   @bvaradar 
   Thank you for the reply. I also saw your blog PR before, and it works with the pure structured streaming API.
   Noted, will try to avoid this issue when batch writing.







[GitHub] [hudi] wangxianghu closed pull request #2033: [HUDI-1222] Introduce MergeHelper.UpdateHandler as independent class

2020-08-27 Thread GitBox


wangxianghu closed pull request #2033:
URL: https://github.com/apache/hudi/pull/2033


   







[GitHub] [hudi] wangxianghu commented on pull request #2033: [HUDI-1222] Introduce MergeHelper.UpdateHandler as independent class

2020-08-27 Thread GitBox


wangxianghu commented on pull request #2033:
URL: https://github.com/apache/hudi/pull/2033#issuecomment-682303627


   Let's keep it in HUDI-1089.
   Closing now.







[GitHub] [hudi] nsivabalan commented on issue #1751: [SUPPORT] Hudi not working with Spark 3.0.0

2020-08-27 Thread GitBox


nsivabalan commented on issue #1751:
URL: https://github.com/apache/hudi/issues/1751#issuecomment-682301333


   @bschell is driving this. Ref PR: https://github.com/apache/hudi/pull/1760. @bschell: any rough timelines?
   
   







[GitHub] [hudi] nsivabalan commented on a change in pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

2020-08-27 Thread GitBox


nsivabalan commented on a change in pull request #1469:
URL: https://github.com/apache/hudi/pull/1469#discussion_r478805908



##
File path: hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java
##
@@ -0,0 +1,354 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.index.bloom;
+
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.client.utils.LazyIterableIterator;
+import org.apache.hudi.common.bloom.BloomFilter;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieKey;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordLocation;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.DefaultSizeEstimator;
+import org.apache.hudi.common.util.HoodieTimer;
+import org.apache.hudi.common.util.Option;
+import org.apache.hudi.common.util.collection.ExternalSpillableMap;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieIOException;
+import org.apache.hudi.index.HoodieIndex;
+import org.apache.hudi.io.HoodieBloomRangeInfoHandle;
+import org.apache.hudi.io.HoodieKeyLookupHandle;
+import org.apache.hudi.table.HoodieTable;
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+import scala.Tuple2;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Objects;
+import java.util.Set;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+
+/**
+ * Simplified re-implementation of {@link HoodieBloomIndex} that does not rely on caching, or
+ * incurs the overhead of auto-tuning parallelism.
+ */
+public class HoodieBloomIndexV2<T extends HoodieRecordPayload> extends HoodieIndex<T> {
+
+  private static final Logger LOG = LogManager.getLogger(HoodieBloomIndexV2.class);
+
+  public HoodieBloomIndexV2(HoodieWriteConfig config) {
+    super(config);
+  }
+
+  @Override
+  public JavaRDD<HoodieRecord<T>> tagLocation(JavaRDD<HoodieRecord<T>> recordRDD,
+      JavaSparkContext jsc,
+      HoodieTable hoodieTable) {
+    return recordRDD
+        .sortBy((record) -> String.format("%s-%s", record.getPartitionPath(), record.getRecordKey()),
+            true, config.getBloomIndexV2Parallelism())
+        .mapPartitions((itr) -> new LazyRangeAndBloomChecker(itr, hoodieTable)).flatMap(List::iterator)
+        .sortBy(Pair::getRight, true, config.getBloomIndexV2Parallelism())
+        .mapPartitions((itr) -> new LazyKeyChecker(itr, hoodieTable))
+        .filter(Option::isPresent)

Review comment:
   @vinothchandar : guess there could be a bug here. If, for a record, a few files were matched from the range and bloom lookup, but in LazyKeyChecker none of the files actually had the record, the current code may not include this record in the final JavaRDD<HoodieRecord<T>> returned. But we have to return this record with an empty current location.
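   
   To make the concern concrete, a toy sketch (hypothetical types, not the actual Hudi code path) of the difference between filtering out lookup misses and emitting them untagged:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Optional;
   import java.util.function.Function;
   
   // Toy illustration: every input key must come back out, tagged or not.
   class TaggingSketch {
     static List<String> tagAll(List<String> keys, Function<String, Optional<String>> lookup) {
       List<String> out = new ArrayList<>();
       for (String key : keys) {
         // An empty lookup result means "new insert": keep the record with an
         // empty current location instead of filtering it away.
         out.add(key + " -> " + lookup.apply(key).orElse("<no current location>"));
       }
       return out;
     }
   }
   ```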









[jira] [Updated] (HUDI-531) Add java doc for hudi test suite general classes

2020-08-27 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang updated HUDI-531:
--
Fix Version/s: 0.5.1

> Add java doc for hudi test suite general classes
> 
>
> Key: HUDI-531
> URL: https://issues.apache.org/jira/browse/HUDI-531
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> Currently, the general classes (under the src/main dir) have no java docs. We
> should add docs for those classes.





[jira] [Closed] (HUDI-531) Add java doc for hudi test suite general classes

2020-08-27 Thread vinoyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vinoyang closed HUDI-531.
-
Resolution: Done

Done via master branch: fa812482473f0cc8c2f34e2db07366cc3e5f7066

> Add java doc for hudi test suite general classes
> 
>
> Key: HUDI-531
> URL: https://issues.apache.org/jira/browse/HUDI-531
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Testing
>Reporter: vinoyang
>Assignee: wangxianghu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.1
>
>
> Currently, the general classes (under the src/main dir) have no java docs. We
> should add docs for those classes.





[hudi] branch master updated: [HUDI-531] Add java doc for hudi test suite general classes (#1900)

2020-08-27 Thread vinoyang
This is an automated email from the ASF dual-hosted git repository.

vinoyang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new fa81248  [HUDI-531] Add java doc for hudi test suite general classes (#1900)
fa81248 is described below

commit fa812482473f0cc8c2f34e2db07366cc3e5f7066
Author: Mathieu 
AuthorDate: Fri Aug 28 08:44:40 2020 +0800

[HUDI-531] Add java doc for hudi test suite general classes (#1900)
---
 .../org/apache/hudi/client/HoodieWriteClient.java  |  2 +-
 hudi-integ-test/README.md  |  6 +--
 .../hudi/integ/testsuite/converter/Converter.java  |  6 +++
 .../integ/testsuite/dag/nodes/BulkInsertNode.java  |  3 ++
 .../hudi/integ/testsuite/dag/nodes/CleanNode.java  |  4 ++
 .../integ/testsuite/dag/nodes/CompactNode.java | 10 
 .../hudi/integ/testsuite/dag/nodes/DagNode.java|  8 ++-
 .../integ/testsuite/dag/nodes/HiveQueryNode.java   |  3 ++
 .../integ/testsuite/dag/nodes/HiveSyncNode.java|  3 ++
 .../hudi/integ/testsuite/dag/nodes/InsertNode.java |  3 ++
 .../integ/testsuite/dag/nodes/RollbackNode.java|  9 
 .../testsuite/dag/nodes/ScheduleCompactNode.java   |  3 ++
 .../testsuite/dag/nodes/SparkSQLQueryNode.java |  9 
 .../hudi/integ/testsuite/dag/nodes/UpsertNode.java |  3 ++
 .../integ/testsuite/dag/nodes/ValidateNode.java|  9 
 .../testsuite/dag/scheduler/DagScheduler.java  | 21 
 .../integ/testsuite/generator/DeltaGenerator.java  |  2 +-
 .../GenericRecordFullPayloadGenerator.java | 58 ++
 .../GenericRecordFullPayloadSizeEstimator.java | 12 +
 .../generator/UpdateGeneratorIterator.java |  3 ++
 .../integ/testsuite/writer/DeltaWriterAdapter.java |  3 ++
 21 files changed, 174 insertions(+), 6 deletions(-)

diff --git a/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java b/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
index 9f6df7b..142ff33 100644
--- a/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
+++ b/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
@@ -557,7 +557,7 @@ public class HoodieWriteClient extends AbstractHo
   metrics.updateCleanMetrics(durationMs, metadata.getTotalFilesDeleted());
   LOG.info("Cleaned " + metadata.getTotalFilesDeleted() + " files"
   + " Earliest Retained Instant :" + 
metadata.getEarliestCommitToRetain()
-  + " cleanerElaspsedMs" + durationMs);
+  + " cleanerElapsedMs" + durationMs);
 }
 return metadata;
   }
diff --git a/hudi-integ-test/README.md b/hudi-integ-test/README.md
index d87fec3..a497ad9 100644
--- a/hudi-integ-test/README.md
+++ b/hudi-integ-test/README.md
@@ -41,7 +41,7 @@ Depending on the type of workload generated, data is either ingested into the ta
 dataset or the corresponding workload operation is executed. For example compaction does not necessarily need a workload
 to be generated/ingested but can require an execution.
 
-## Other actions/operatons
+## Other actions/operations
 
 The test suite supports different types of operations besides ingestion such as Hive Query execution, Clean action etc.
 
@@ -66,9 +66,9 @@ link#HudiDeltaStreamer page to learn about all the available configs applicable
 
 There are 2 ways to generate a workload pattern
 
- 1.Programatically
+ 1.Programmatically
 
-Choose to write up the entire DAG of operations programatically, take a look at `WorkflowDagGenerator` class.
+Choose to write up the entire DAG of operations programmatically, take a look at `WorkflowDagGenerator` class.
 Once you're ready with the DAG you want to execute, simply pass the class name as follows:
 
 ```
diff --git a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/converter/Converter.java b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/converter/Converter.java
index e4ad0a7..89f3b88 100644
--- a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/converter/Converter.java
+++ b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/converter/Converter.java
@@ -29,5 +29,11 @@ import org.apache.spark.api.java.JavaRDD;
  */
 public interface Converter extends Serializable {
 
+  /**
+   * Convert data from one format to another.
+   *
+   * @param inputRDD Input data
+   * @return Data in target format
+   */
   JavaRDD convert(JavaRDD inputRDD);
 }
\ No newline at end of file
diff --git a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BulkInsertNode.java b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BulkInsertNode.java
index 7a8f405..bdf57f8 100644
--- a/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BulkInsertNode.java
+++ b/hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/dag/nodes/BulkInsertNode.java
@@ -24,6 

[GitHub] [hudi] yanghua merged pull request #1900: [HUDI-531] Add java doc for hudi test suite general classes

2020-08-27 Thread GitBox


yanghua merged pull request #1900:
URL: https://github.com/apache/hudi/pull/1900


   







[jira] [Updated] (HUDI-1104) Bulk insert Dataset - UserDefinedPartitioner

2020-08-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1104:
-
Labels: pull-request-available  (was: )

> Bulk insert Dataset - UserDefinedPartitioner
> 
>
> Key: HUDI-1104
> URL: https://issues.apache.org/jira/browse/HUDI-1104
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Writer Core
>Reporter: sivabalan narayanan
>Assignee: sivabalan narayanan
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>






[GitHub] [hudi] nsivabalan opened a new pull request #2049: [HUDI-1104] [WIP] Bulk insert dedup

2020-08-27 Thread GitBox


nsivabalan opened a new pull request #2049:
URL: https://github.com/apache/hudi/pull/2049


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   *(For example: This pull request adds quick-start document.)*
   
   ## Brief change log
   
   *(for example:)*
 - *Modify AnnotationLocation checkstyle rule in checkstyle.xml*
   
   ## Verify this pull request
   
   *(Please pick either of the following options)*
   
   This pull request is a trivial rework / code cleanup without any test 
coverage.
   
   *(or)*
   
   This pull request is already covered by existing tests, such as *(please 
describe tests)*.
   
   (or)
   
   This change added tests and can be verified as follows:
   
   *(example:)*
   
 - *Added integration tests for end-to-end.*
 - *Added HoodieClientWriteTest to verify the change.*
 - *Manually verified the change by running a job locally.*
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] satishkotha commented on pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-08-27 Thread GitBox


satishkotha commented on pull request #2048:
URL: https://github.com/apache/hudi/pull/2048#issuecomment-682247412


   @vinothchandar @bvaradar FYI. There are few things that I'm not fully happy 
with. But would like to get initial feedback and get agreement on high level 
approach.







[GitHub] [hudi] satishkotha opened a new pull request #2048: [HUDI-1072][WIP] Introduce REPLACE top level action

2020-08-27 Thread GitBox


satishkotha opened a new pull request #2048:
URL: https://github.com/apache/hudi/pull/2048


   
   ## What is the purpose of the pull request
   I am following up on feedback from apache#1859: we want to make replace a top-level action. This is WIP, but I am publishing it to give a high-level idea of the changes required.
   
   ## Brief change log
   
   There are multiple challenges in making replace a top-level action:
   
   1. In the write path, we create .commit in 2 places - BaseActionCommitExecutor or HoodieSparkSqlWriter. We need to change both places to create a .replace file.
   2. All post-commit actions work on top of HoodieCommitMetadata. For now replace is also using the same class to keep the change simple. We can split HoodieCommitMetadata into a class hierarchy (not sure how json serialization works with inheritance) or discuss other alternatives.
   3. There are many assumptions in the code that the commit action type is tied to the table type. For example, action type can only be either '.commit' or '.deltacommit', [here](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java#L474). Changing this to add replace seemed error-prone and tedious; we may have to add better abstractions here.
   4. The way we identify if a parquet file is valid is by checking if there's a corresponding '.commit' file. If we just create a .replace file, we have to change a lot of places to make sure new files created by replace are used.
   5. Everywhere we invoke 'getCommitsTimeline'/'filterCommits', we need to review and make sure the caller can handle replace actions, OR create new methods and refactor all invocations of getCommitsTimeline to call the new methods (a sketch of such a call site follows below).
   
   I made most of the changes for #1, 2, 3 above. Need to discuss if this is the right approach and extend it to 4 and 5.
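   
   For item 5, a rough sketch of the kind of call site in question (existing timeline APIs; the "replace" action string is hypothetical here, since this PR is what introduces it):
   
   ```java
   import org.apache.hudi.common.table.HoodieTableMetaClient;
   import org.apache.hudi.common.table.timeline.HoodieTimeline;
   
   class ReplaceAwareCallerSketch {
     // Sketch: each caller of the timeline would need to decide whether a
     // hypothetical "replace" action should be included or excluded.
     static long countReplaceInstants(HoodieTableMetaClient metaClient) {
       HoodieTimeline completed = metaClient.getActiveTimeline().filterCompletedInstants();
       return completed.getInstants()
           .filter(instant -> "replace".equals(instant.getAction()))
           .count();
     }
   }
   ```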
   
   ## Verify this pull request
   This change added tests. Verified basic actions using quick start and docker 
setup. Adding more tests in progress. But wanted to get feedback on high level 
approach.
   
   This is an example .hoodie folder from quick start setup:
   -rw-r--r--  1 satishkotha  wheel  1933 Aug 27 16:39 20200827163904.commit
   -rw-r--r--  1 satishkotha  wheel     0 Aug 27 16:39 20200827163904.commit.requested
   -rw-r--r--  1 satishkotha  wheel  1015 Aug 27 16:39 20200827163904.inflight
   -rw-r--r--  1 satishkotha  wheel  2610 Aug 27 16:39 20200827163927.replace
   -rw-r--r--  1 satishkotha  wheel  1024 Aug 27 16:39 20200827163927.replace.inflight
   -rw-r--r--  1 satishkotha  wheel     0 Aug 27 16:39 20200827163927.replace.requested
   drwxr-xr-x  2 satishkotha  wheel    64 Aug 27 16:39 archived
   -rw-r--r--  1 satishkotha  wheel   235 Aug 27 16:39 hoodie.properties
   
   
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[GitHub] [hudi] prashantwason edited a comment on pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-27 Thread GitBox


prashantwason edited a comment on pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#issuecomment-682213033


   > @prashantwason if you broadly agree, I will make the change and land this, 
so you can focus on rfc-15 more :)
   Sure @vinothchandar. Thanks for all the help. Let's get this rolling soon.
   
   I will look into the comments too but you can do the needful.











[GitHub] [hudi] prashantwason commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-27 Thread GitBox


prashantwason commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r478655201



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+/**
+ * Hoodie merge handle which writes records (new inserts or updates) sorted by their key.
+ *
+ * The implementation performs a merge-sort by comparing the key of the record being written to the list of
+ * keys in newRecordKeys (sorted in-memory).
+ */
+public class HoodieSortedMergeHandle<T extends HoodieRecordPayload> extends HoodieMergeHandle<T> {
+
+  private Queue<String> newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable,
+      Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, sparkTaskContextSupplier);
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable,
+      Map<String, HoodieRecord<T>> keyToNewRecordsOrig, String partitionPath, String fileId,
+      HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, keyToNewRecordsOrig, partitionPath, fileId, dataFileToBeMerged,
+        sparkTaskContextSupplier);
+
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Go through an old record. Here if we detect a newer version shows up, we write the new one to the file.
+   */
+  @Override
+  public void write(GenericRecord oldRecord) {
+    String key = oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+
+    // To maintain overall sorted order across updates and inserts, write any new inserts whose keys are less than
+    // the oldRecord's key.
+    while (!newRecordKeysSorted.isEmpty() && newRecordKeysSorted.peek().compareTo(key) <= 0) {
+      String keyToPreWrite = newRecordKeysSorted.remove();

Review comment:
   If the inputItr is sorted then yes all this overhead can be removed. 









[GitHub] [hudi] satishkotha commented on a change in pull request #2044: [HUDI-1228] Add utility method to query extra metadata

2020-08-27 Thread GitBox


satishkotha commented on a change in pull request #2044:
URL: https://github.com/apache/hudi/pull/2044#discussion_r478620546



##
File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -98,4 +101,31 @@
 
 }).distinct().filter(s -> !s.isEmpty()).collect(Collectors.toList());
   }
+
+  /**
+   * Get extra metadata for specified key from latest commit/deltacommit instant.
+   */
+  public static Option<String> getExtraMetadataFromLatest(HoodieTableMetaClient metaClient, String extraMetadataKey) {
+    return metaClient.getCommitsTimeline().filterCompletedInstants().getReverseOrderedInstants().findFirst().map(instant ->
+        getMetadataValue(metaClient, extraMetadataKey, instant)).orElse(Option.empty());
+  }
+
+  /**
+   * Get extra metadata for specified key from all active commit/deltacommit instants.
+   */
+  public static Map<String, Option<String>> getExtraMetadataTimeline(HoodieTableMetaClient metaClient, String extraMetadataKey) {

Review comment:
   This is primarily for debugging. It helps with seeing how the extra metadata value changed over time. If you have a better suggestion for a name, let me know.
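   
   For what it's worth, a usage sketch of the two helpers under discussion (signatures as reconstructed above; "checkpoint" is an arbitrary example key):
   
   ```java
   import java.util.Map;
   
   import org.apache.hudi.common.table.HoodieTableMetaClient;
   import org.apache.hudi.common.table.timeline.TimelineUtils;
   import org.apache.hudi.common.util.Option;
   
   class ExtraMetadataDebugSketch {
     // Sketch: dump how an example "checkpoint" metadata key evolved across commits.
     static void dump(HoodieTableMetaClient metaClient) {
       Option<String> latest = TimelineUtils.getExtraMetadataFromLatest(metaClient, "checkpoint");
       System.out.println("latest = " + (latest.isPresent() ? latest.get() : "<absent>"));
   
       Map<String, Option<String>> byInstant = TimelineUtils.getExtraMetadataTimeline(metaClient, "checkpoint");
       byInstant.forEach((instant, value) ->
           System.out.println(instant + " -> " + (value.isPresent() ? value.get() : "<absent>")));
     }
   }
   ```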









[jira] [Assigned] (HUDI-619) Investigate and implement mechanism to have hive/presto/sparksql queries avoid stitching and return null values for hoodie columns

2020-08-27 Thread Balaji Varadarajan (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balaji Varadarajan reassigned HUDI-619:
---

Assignee: Balaji Varadarajan

> Investigate and implement mechanism to have hive/presto/sparksql queries 
> avoid stitching and return null values for hoodie columns 
> ---
>
> Key: HUDI-619
> URL: https://issues.apache.org/jira/browse/HUDI-619
> Project: Apache Hudi
>  Issue Type: Sub-task
>  Components: Hive Integration, Presto Integration, Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
> Fix For: 0.6.1
>
>
> This idea was suggested by Vinoth during the RFC review. This ticket is to
> track the feasibility and implementation of it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] n3nash commented on a change in pull request #2044: [HUDI-1228] Add utility method to query extra metadata

2020-08-27 Thread GitBox


n3nash commented on a change in pull request #2044:
URL: https://github.com/apache/hudi/pull/2044#discussion_r478612897



##
File path: hudi-common/src/main/java/org/apache/hudi/common/table/timeline/TimelineUtils.java
##
@@ -98,4 +101,31 @@
 
 }).distinct().filter(s -> !s.isEmpty()).collect(Collectors.toList());
   }
+
+  /**
+   * Get extra metadata for specified key from latest commit/deltacommit instant.
+   */
+  public static Option<String> getExtraMetadataFromLatest(HoodieTableMetaClient metaClient, String extraMetadataKey) {
+    return metaClient.getCommitsTimeline().filterCompletedInstants().getReverseOrderedInstants().findFirst().map(instant ->
+        getMetadataValue(metaClient, extraMetadataKey, instant)).orElse(Option.empty());
+  }
+
+  /**
+   * Get extra metadata for specified key from all active commit/deltacommit instants.
+   */
+  public static Map<String, Option<String>> getExtraMetadataTimeline(HoodieTableMetaClient metaClient, String extraMetadataKey) {

Review comment:
   @satishkotha can you describe what this method is used for? The naming is a little confusing.









[GitHub] [hudi] nsivabalan edited a comment on issue #2029: Records seen with _hoodie_is_deleted set to true on non-existing record

2020-08-27 Thread GitBox


nsivabalan edited a comment on issue #2029:
URL: https://github.com/apache/hudi/issues/2029#issuecomment-682107789


   Can you try 0.6.0? We had a release recently and you should be able to use the mvn artifacts.







[GitHub] [hudi] bvaradar commented on issue #2042: org.apache.hudi.exception.HoodieIOException: IOException when reading log file

2020-08-27 Thread GitBox


bvaradar commented on issue #2042:
URL: https://github.com/apache/hudi/issues/2042#issuecomment-682102033


   @n3nash : Can you help take a look at this?
   
   @sam-wmt : Can you please provide the full stack trace of the corrupted log file exception?







[GitHub] [hudi] bvaradar commented on issue #2043: [SUPPORT] hudi 0.6.0 async compaction not working with foreachBatch of spark structured stream

2020-08-27 Thread GitBox


bvaradar commented on issue #2043:
URL: https://github.com/apache/hudi/issues/2043#issuecomment-682100271


   @zherenyu831 : Can you model your query using pure structured streaming APIs and avoid foreachBatch? It looks like foreachBatch is triggering the batch sink and not the streaming sink APIs. We will have a blog shortly on the usage, but you can reference the PR: https://github.com/apache/hudi/pull/1996/files#diff-cb5b78d0c2deafe117b643f5de250a17R50
   
   Also, please note that we have discovered an issue related to batch writes: https://issues.apache.org/jira/browse/HUDI-1230
   I have sent an email to the dev@ and users@ mailing lists about the config change to work around it.
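   
   For readers hitting the same thing, a minimal sketch of the pure structured streaming shape being suggested (table name and paths are placeholders; see the linked PR for the authoritative example):
   
   ```java
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.streaming.StreamingQuery;
   
   class HudiStreamingSinkSketch {
     // Sketch: write through writeStream (streaming sink) rather than
     // foreachBatch (batch sink), so the streaming sink code path is used.
     static StreamingQuery start(Dataset<Row> streamingDf) throws Exception {
       return streamingDf.writeStream()
           .format("hudi")
           .option("hoodie.table.name", "my_table")
           .option("checkpointLocation", "s3://my-bucket/checkpoints/my_table")
           .outputMode("append")
           .start("s3://my-bucket/my-table");
     }
   }
   ```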
   
   
   







[GitHub] [hudi] bhasudha commented on a change in pull request #1984: [HUDI-1200] Fix NullPointerException, CustomKeyGenerator does not work

2020-08-27 Thread GitBox


bhasudha commented on a change in pull request #1984:
URL: https://github.com/apache/hudi/pull/1984#discussion_r478582473



##
File path: hudi-spark/src/main/java/org/apache/hudi/keygen/KeyGenerator.java
##
@@ -41,7 +41,7 @@
   private static final String STRUCT_NAME = "hoodieRowTopLevelField";
   private static final String NAMESPACE = "hoodieRow";
 
-  protected transient TypedProperties config;
+  protected  TypedProperties config;

Review comment:
   I think so too. @liujinhui1994 can you try with Hudi 0.6.0 and see if that helps? We fixed some serialization issues there w.r.t. KeyGenerators.
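   
   As a tiny self-contained illustration of why `transient` bites here (generic Java, not Hudi's actual classes): a transient field is skipped during serialization and comes back null on the other side, which is exactly the NullPointerException mechanism suspected in this PR.
   
   ```java
   import java.io.ByteArrayInputStream;
   import java.io.ByteArrayOutputStream;
   import java.io.ObjectInputStream;
   import java.io.ObjectOutputStream;
   import java.io.Serializable;
   
   // Toy demo: the transient field deserializes as null after the round trip.
   public class TransientDemo {
     static class Holder implements Serializable {
       transient String config = "loaded";
     }
   
     public static void main(String[] args) throws Exception {
       ByteArrayOutputStream bos = new ByteArrayOutputStream();
       ObjectOutputStream oos = new ObjectOutputStream(bos);
       oos.writeObject(new Holder());
       oos.flush();
       Holder copy = (Holder) new ObjectInputStream(
           new ByteArrayInputStream(bos.toByteArray())).readObject();
       System.out.println(copy.config); // prints "null"
     }
   }
   ```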









[GitHub] [hudi] bhasudha commented on pull request #1597: [WIP] Added a MultiFormatTimestampBasedKeyGenerator that allows for multipl…

2020-08-27 Thread GitBox


bhasudha commented on pull request #1597:
URL: https://github.com/apache/hudi/pull/1597#issuecomment-682087937


   closing this PR in favor of https://github.com/apache/hudi/pull/1433







[GitHub] [hudi] bhasudha closed pull request #1597: [WIP] Added a MultiFormatTimestampBasedKeyGenerator that allows for multipl…

2020-08-27 Thread GitBox


bhasudha closed pull request #1597:
URL: https://github.com/apache/hudi/pull/1597


   







[GitHub] [hudi] bvaradar commented on issue #2019: Leak in DiskBasedMap

2020-08-27 Thread GitBox


bvaradar commented on issue #2019:
URL: https://github.com/apache/hudi/issues/2019#issuecomment-682063177


   Closing this issue as we have a Jira to track it.







[GitHub] [hudi] bvaradar closed issue #2019: Leak in DiskBasedMap

2020-08-27 Thread GitBox


bvaradar closed issue #2019:
URL: https://github.com/apache/hudi/issues/2019


   







[GitHub] [hudi] bvaradar closed issue #2031: [SUPPORT] java.lang.NoSuchMethodError: ExpressionEncoder.fromRow

2020-08-27 Thread GitBox


bvaradar closed issue #2031:
URL: https://github.com/apache/hudi/issues/2031


   







[GitHub] [hudi] bvaradar commented on issue #2031: [SUPPORT] java.lang.NoSuchMethodError: ExpressionEncoder.fromRow

2020-08-27 Thread GitBox


bvaradar commented on issue #2031:
URL: https://github.com/apache/hudi/issues/2031#issuecomment-682062820


   @vinothsiva1989 : I am assuming this issue is resolved with the Scala version fix. Please reopen if this is a different issue.







[GitHub] [hudi] bvaradar closed issue #2034: [SUPPORT] DateType can't be transformed to right data by kafka avro

2020-08-27 Thread GitBox


bvaradar closed issue #2034:
URL: https://github.com/apache/hudi/issues/2034


   







[GitHub] [hudi] bvaradar commented on issue #2034: [SUPPORT] DateType can't be transformed to right data by kafka avro

2020-08-27 Thread GitBox


bvaradar commented on issue #2034:
URL: https://github.com/apache/hudi/issues/2034#issuecomment-682062239


   Thanks. Closing this issue as it is tracked in Jira







[GitHub] [hudi] bvaradar closed issue #1979: [SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?

2020-08-27 Thread GitBox


bvaradar closed issue #1979:
URL: https://github.com/apache/hudi/issues/1979


   







[GitHub] [hudi] bvaradar commented on issue #1979: [SUPPORT]: Is it possible to incrementally read only upserted rows where a material change has occurred?

2020-08-27 Thread GitBox


bvaradar commented on issue #1979:
URL: https://github.com/apache/hudi/issues/1979#issuecomment-682061419


   Will close the ticket for now. Please reopen if we need to discuss more on 
this topic.







[GitHub] [hudi] prashanthvg89 commented on issue #2029: Records seen with _hoodie_is_deleted set to true on non-existing record

2020-08-27 Thread GitBox


prashanthvg89 commented on issue #2029:
URL: https://github.com/apache/hudi/issues/2029#issuecomment-682045840


   I am using 0.5.2. Master is 0.6.1, right? What is the latest released version where this is fixed?







[GitHub] [hudi] vinothchandar commented on a change in pull request #1804: [HUDI-960] Implementation of the HFile base and log file format.

2020-08-27 Thread GitBox


vinothchandar commented on a change in pull request #1804:
URL: https://github.com/apache/hudi/pull/1804#discussion_r478483860



##
File path: hudi-client/src/main/java/org/apache/hudi/io/HoodieSortedMergeHandle.java
##
@@ -0,0 +1,126 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.io;
+
+import org.apache.hudi.client.SparkTaskContextSupplier;
+import org.apache.hudi.client.WriteStatus;
+import org.apache.hudi.common.model.HoodieBaseFile;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.model.HoodieRecordPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieUpsertException;
+import org.apache.hudi.table.HoodieTable;
+
+import org.apache.avro.generic.GenericRecord;
+
+import java.io.IOException;
+import java.util.Iterator;
+import java.util.Map;
+import java.util.PriorityQueue;
+import java.util.Queue;
+
+/**
+ * Hoodie merge handle which writes records (new inserts or updates) sorted by their key.
+ *
+ * The implementation performs a merge-sort by comparing the key of the record being written to the list of
+ * keys in newRecordKeys (sorted in-memory).
+ */
+public class HoodieSortedMergeHandle<T extends HoodieRecordPayload> extends HoodieMergeHandle<T> {
+
+  private Queue<String> newRecordKeysSorted = new PriorityQueue<>();
+
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable,
+      Iterator<HoodieRecord<T>> recordItr, String partitionPath, String fileId, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, recordItr, partitionPath, fileId, sparkTaskContextSupplier);
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Called by compactor code path.
+   */
+  public HoodieSortedMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTable<T> hoodieTable,
+      Map<String, HoodieRecord<T>> keyToNewRecordsOrig, String partitionPath, String fileId,
+      HoodieBaseFile dataFileToBeMerged, SparkTaskContextSupplier sparkTaskContextSupplier) {
+    super(config, instantTime, hoodieTable, keyToNewRecordsOrig, partitionPath, fileId, dataFileToBeMerged,
+        sparkTaskContextSupplier);
+
+    newRecordKeysSorted.addAll(keyToNewRecords.keySet());
+  }
+
+  /**
+   * Go through an old record. Here if we detect a newer version shows up, we write the new one to the file.
+   */
+  @Override
+  public void write(GenericRecord oldRecord) {
+    String key = oldRecord.get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+
+    // To maintain overall sorted order across updates and inserts, write any new inserts whose keys are less than
+    // the oldRecord's key.
+    while (!newRecordKeysSorted.isEmpty() && newRecordKeysSorted.peek().compareTo(key) <= 0) {
+      String keyToPreWrite = newRecordKeysSorted.remove();

Review comment:
   I am thinking we don't need the map in HoodieMergeHandle or the priority queue. The records which have changed, i.e. the input iterator, are already sorted; let's call it `inputItr`.
   
   So we can just compare the recordBeingWritten with inputItr.next() and write out the smallest one; if equal, we call the payload to merge.
   
   This will avoid any kind of memory overhead.
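   
   A compact sketch of that streaming merge (toy String keys; the real handle would invoke the payload's combine logic on the equal-key branch):
   
   ```java
   import java.util.Iterator;
   import java.util.function.Consumer;
   
   // Toy two-way merge over two already-sorted key streams: no map, no
   // priority queue, just compare heads and emit in order.
   class SortedMergeSketch {
     static void merge(Iterator<String> oldRecords, Iterator<String> inputItr, Consumer<String> writer) {
       String pending = inputItr.hasNext() ? inputItr.next() : null;
       while (oldRecords.hasNext()) {
         String oldKey = oldRecords.next();
         while (pending != null && pending.compareTo(oldKey) < 0) {
           writer.accept(pending); // pure insert, no old version
           pending = inputItr.hasNext() ? inputItr.next() : null;
         }
         if (pending != null && pending.equals(oldKey)) {
           writer.accept(pending); // equal keys: payload merge happens here
           pending = inputItr.hasNext() ? inputItr.next() : null;
         } else {
           writer.accept(oldKey); // unchanged old record
         }
       }
       while (pending != null) { // drain remaining inserts
         writer.accept(pending);
         pending = inputItr.hasNext() ? inputItr.next() : null;
       }
     }
   }
   ```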

##
File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##
@@ -94,6 +100,16 @@ public Builder parquetPageSize(int pageSize) {
   return this;
 }
 
+public Builder hfileMaxFileSize(long maxFileSize) {
+  props.setProperty(HFILE_FILE_MAX_BYTES, String.valueOf(maxFileSize));

Review comment:
   Not following, sorry. Are you suggesting having a single config or two?
   We need to have a config per usage of HFile, so we can control the base file size for data, metadata, and record index separately.
   
   We cannot have a generic base.file.size or hfile.size config here, at this level IMO. cc @prashantwason

##
File path: hudi-client/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##
@@ -94,6 +100,16 @@ public Builder parquetPageSize(int pageSize) {
   return this;
 }
 
+public Builder hfileMaxFileSize(long maxFileSize) {
+  

[GitHub] [hudi] wangxianghu commented on a change in pull request #1946: [HUDI-1176]Support log4j2 config

2020-08-27 Thread GitBox


wangxianghu commented on a change in pull request #1946:
URL: https://github.com/apache/hudi/pull/1946#discussion_r47858



##
File path: hudi-utilities/src/test/resources/log4j2-surefire.properties
##
@@ -0,0 +1,51 @@
+###
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#  http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+###
+status = warn
+name = PropertiesConfig
+
+# CONSOLE is set to be a ConsoleAppender.
+appender.console.type = Console
+appender.console.name = consoleLogger
+# CONSOLE uses PatternLayout.
+appender.console.layout.type = PatternLayout
+appender.console.layout.pattern = %-4r [%t] %-5p %c %x - %m%n

Review comment:
   Hi @hddong, thanks for your contribution.
   Can you explain why you configured two different layout patterns? If there is no special reason, keeping them in the same format might be better.
   The rest LGTM. cc @yanghua
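   A minimal sketch of what a unified format could look like, assuming the second appender is a file appender; the `fileLogger` name and `target/unit-tests.log` path below are hypothetical:
   
   ```properties
   # Both appenders share a single PatternLayout pattern.
   appender.console.type = Console
   appender.console.name = consoleLogger
   appender.console.layout.type = PatternLayout
   appender.console.layout.pattern = %-4r [%t] %-5p %c %x - %m%n
   
   appender.file.type = File
   appender.file.name = fileLogger
   appender.file.fileName = target/unit-tests.log
   appender.file.layout.type = PatternLayout
   appender.file.layout.pattern = %-4r [%t] %-5p %c %x - %m%n
   ```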
   





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wangxianghu commented on a change in pull request #1900: [HUDI-531]Add java doc for hudi test suite general classes

2020-08-27 Thread GitBox


wangxianghu commented on a change in pull request #1900:
URL: https://github.com/apache/hudi/pull/1900#discussion_r478423255



##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/GenericRecordFullPayloadGenerator.java
##
@@ -43,22 +44,39 @@
  */
 public class GenericRecordFullPayloadGenerator implements Serializable {
 
-  public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10; // 10 KB
+  /**
+   * 10 KB.
+   */
+  public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10;
   private static Logger log = 
LoggerFactory.getLogger(GenericRecordFullPayloadGenerator.class);
   protected final Random random = new Random();
-  // The source schema used to generate a payload
+  /**
+   * The source schema used to generate a payload.

Review comment:
   > Why should we change the comment style for these fields?
   
   My bad, that's from Alibaba's coding guidelines.
   Rolled back already :)





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Assigned] (HUDI-1225) Avro Date logical type not handled correctly when converting to Spark Row

2020-08-27 Thread cdmikechen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cdmikechen reassigned HUDI-1225:


Assignee: Balaji Varadarajan  (was: cdmikechen)

Please review the code.

> Avro Date logical type not handled correctly when converting to Spark Row
> -
>
> Key: HUDI-1225
> URL: https://issues.apache.org/jira/browse/HUDI-1225
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> [https://github.com/apache/hudi/issues/2034]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1225) Avro Date logical type not handled correctly when converting to Spark Row

2020-08-27 Thread cdmikechen (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cdmikechen updated HUDI-1225:
-
Status: In Progress  (was: Open)

> Avro Date logical type not handled correctly when converting to Spark Row
> -
>
> Key: HUDI-1225
> URL: https://issues.apache.org/jira/browse/HUDI-1225
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: Balaji Varadarajan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> [https://github.com/apache/hudi/issues/2034]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HUDI-1225) Avro Date logical type not handled correctly when converting to Spark Row

2020-08-27 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1225:
-
Labels: pull-request-available  (was: )

> Avro Date logical type not handled correctly when converting to Spark Row
> -
>
> Key: HUDI-1225
> URL: https://issues.apache.org/jira/browse/HUDI-1225
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Reporter: Balaji Varadarajan
>Assignee: cdmikechen
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.1
>
>
> [https://github.com/apache/hudi/issues/2034]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] cdmikechen opened a new pull request #2047: [HUDI-1225] Fix: Avro Date logical type not handled correctly when converting to Spark Row

2020-08-27 Thread GitBox


cdmikechen opened a new pull request #2047:
URL: https://github.com/apache/hudi/pull/2047


   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   Fix: Avro Date logical type not handled correctly when converting to Spark 
Row
   jira: https://issues.apache.org/jira/browse/HUDI-1225
   and issue: https://github.com/apache/hudi/issues/2034
   
   ## Brief change log
   
 - *Modify `org.apache.hudi.AvroConversionHelper` to cast the int-encoded Avro `date` logical type to a date correctly (see the sketch below)*
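   
   A hedged sketch of the kind of conversion involved, not the exact patch: Avro's `date` logical type stores the value as an int counting days since the Unix epoch, so the converter has to build a `java.sql.Date` from that count instead of passing the raw int through to the Spark Row. The helper name below is illustrative:
   
   ```java
   import java.time.LocalDate;
   
   // Illustrative only: turn Avro's days-since-epoch int into the
   // java.sql.Date that a Spark Row expects for a DateType column.
   static java.sql.Date avroDaysToSqlDate(int daysSinceEpoch) {
     return java.sql.Date.valueOf(LocalDate.ofEpochDay(daysSinceEpoch));
   }
   ```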
   
   ## Verify this pull request
   
   This change added tests and can be verified as follows:
   
 - *Added `org.apache.hudi.TestAvroConversionHelper` to verify the change*
   
   ## Committer checklist
   
- [x] Has a corresponding JIRA in PR title & commit

- [x] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #1900: [HUDI-531]Add java doc for hudi test suite general classes

2020-08-27 Thread GitBox


yanghua commented on a change in pull request #1900:
URL: https://github.com/apache/hudi/pull/1900#discussion_r478341019



##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/configuration/DeltaConfig.java
##
@@ -35,7 +35,13 @@
  */
 public class DeltaConfig implements Serializable {
 
+  /**
+   * Output destination type.

Review comment:
   IMO, we do not need this comment.

##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/GenericRecordFullPayloadGenerator.java
##
@@ -43,22 +44,39 @@
  */
 public class GenericRecordFullPayloadGenerator implements Serializable {
 
-  public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10; // 10 KB
+  /**
+   * 10 KB.
+   */
+  public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10;
   private static Logger log = 
LoggerFactory.getLogger(GenericRecordFullPayloadGenerator.class);
   protected final Random random = new Random();
-  // The source schema used to generate a payload
+  /**
+   * The source schema used to generate a payload.

Review comment:
   Why should we change the comment style for these fields?

##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/configuration/DeltaConfig.java
##
@@ -35,7 +35,13 @@
  */
 public class DeltaConfig implements Serializable {
 
+  /**
+   * Output destination type.
+   */
   private final DeltaOutputMode deltaOutputMode;
+  /**
+   * Input data type.

Review comment:
   ditto

##
File path: 
hudi-integ-test/src/main/java/org/apache/hudi/integ/testsuite/generator/GenericRecordFullPayloadGenerator.java
##
@@ -43,22 +44,39 @@
  */
 public class GenericRecordFullPayloadGenerator implements Serializable {
 
-  public static final int DEFAULT_PAYLOAD_SIZE = 1024 * 10; // 10 KB
+  /**
+   * 10 KB.

Review comment:
   we may not need to change this.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Issue Comment Deleted] (HUDI-1233) deltastreamer Kafka consumption delay reporting indicators

2020-08-27 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui updated HUDI-1233:

Comment: was deleted

(was: Please take a look. I think this feature would be quite useful; could you give advice? [~vinoth])

> deltastreamer Kafka consumption delay reporting indicators
> --
>
> Key: HUDI-1233
> URL: https://issues.apache.org/jira/browse/HUDI-1233
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Minor
>
> Currently hudi-deltastreamer does not report a metric for Kafka consumption
> delay (consumer lag); I suggest adding this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HUDI-1233) deltastreamer Kafka consumption delay reporting indicators

2020-08-27 Thread liujinhui (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17185697#comment-17185697
 ] 

liujinhui commented on HUDI-1233:
-

Please take a look. I think this feature would be quite useful; could you give advice? [~vinoth]

> deltastreamer Kafka consumption delay reporting indicators
> --
>
> Key: HUDI-1233
> URL: https://issues.apache.org/jira/browse/HUDI-1233
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Minor
>
> Currently hudi-deltastreamer does not report a metric for Kafka consumption
> delay (consumer lag); I suggest adding this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HUDI-1233) deltastreamer Kafka consumption delay reporting indicators

2020-08-27 Thread liujinhui (Jira)
liujinhui created HUDI-1233:
---

 Summary: deltastreamer Kafka consumption delay reporting indicators
 Key: HUDI-1233
 URL: https://issues.apache.org/jira/browse/HUDI-1233
 Project: Apache Hudi
  Issue Type: Improvement
  Components: DeltaStreamer
Reporter: liujinhui


Currently hudi-deltastreamer does not report a metric for Kafka consumption delay (consumer lag); I suggest adding this feature. A sketch of the idea follows.
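
A hedged sketch of how such a lag metric could be computed with the plain Kafka consumer API; illustrative only, not DeltaStreamer code:

```java
import java.util.Map;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Illustrative only: total consumer lag is the sum over the assigned
// partitions of (log-end offset - current consumer position).
static long totalLag(KafkaConsumer<?, ?> consumer, Set<TopicPartition> partitions) {
  Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
  long lag = 0L;
  for (TopicPartition tp : partitions) {
    lag += endOffsets.get(tp) - consumer.position(tp);
  }
  return lag;
}
```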



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1233) deltastreamer Kafka consumption delay reporting indicators

2020-08-27 Thread liujinhui (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liujinhui reassigned HUDI-1233:
---

Assignee: liujinhui

> deltastreamer Kafka consumption delay reporting indicators
> --
>
> Key: HUDI-1233
> URL: https://issues.apache.org/jira/browse/HUDI-1233
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: DeltaStreamer
>Reporter: liujinhui
>Assignee: liujinhui
>Priority: Minor
>
> Currently hudi-deltastreamer does not report a metric for Kafka consumption
> delay (consumer lag); I suggest adding this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)