[GitHub] [hudi] danny0405 commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…

2021-03-14 Thread GitBox


danny0405 commented on a change in pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#discussion_r594086761



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
##
@@ -16,14 +16,14 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.operator;
+package org.apache.hudi.configuration;
 
 import org.apache.hudi.common.model.HoodieTableType;
-import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.streamer.FlinkStreamerConfig;
 import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.streamer.FlinkStreamerConfig;

Review comment:
   `FlinkStreamerConfig` is only used by the streamer; keeping it under the 
`streamer` package is more reasonable.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…

2021-03-14 Thread GitBox


danny0405 commented on a change in pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#discussion_r594085950



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamerV2.java
##
@@ -19,10 +19,10 @@
 package org.apache.hudi.streamer;
 
 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.operator.FlinkOptions;
-import org.apache.hudi.operator.StreamWriteOperatorFactory;
-import org.apache.hudi.operator.partitioner.BucketAssignFunction;
-import org.apache.hudi.operator.transform.RowDataToHoodieFunction;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.sink.StreamWriteOperatorFactory;
+import org.apache.hudi.sink.partitioner.BucketAssignFunction;
+import org.apache.hudi.sink.transform.RowDataToHoodieFunction;
 import org.apache.hudi.util.AvroSchemaConverter;

Review comment:
   No, this class references Flink; I would prefer to keep the name.









[GitHub] [hudi] yanghua commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…

2021-03-14 Thread GitBox


yanghua commented on a change in pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#discussion_r594077871



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java
##
@@ -16,14 +16,14 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.operator;
+package org.apache.hudi.configuration;
 
 import org.apache.hudi.common.model.HoodieTableType;
-import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.streamer.FlinkStreamerConfig;
 import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.streamer.FlinkStreamerConfig;

Review comment:
   We have a `configuration` subpackage; can we put `FlinkStreamerConfig` into it?

##
File path: 
hudi-flink/src/test/java/org/apache/hudi/sink/StreamWriteOperatorCoordinatorTest.java
##
@@ -16,16 +16,16 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.operator;
+package org.apache.hudi.sink;
 
 import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.HoodieWriteStat;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.operator.event.BatchWriteSuccessEvent;
-import org.apache.hudi.operator.utils.TestConfigurations;
+import org.apache.hudi.sink.event.BatchWriteSuccessEvent;
 import org.apache.hudi.util.StreamerUtil;
+import org.apache.hudi.utils.TestConfigurations;
 

Review comment:
   Would `TestStreamWriteOperatorCoordinator` sound better?

##
File path: 
hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamerV2.java
##
@@ -19,10 +19,10 @@
 package org.apache.hudi.streamer;
 
 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.operator.FlinkOptions;
-import org.apache.hudi.operator.StreamWriteOperatorFactory;
-import org.apache.hudi.operator.partitioner.BucketAssignFunction;
-import org.apache.hudi.operator.transform.RowDataToHoodieFunction;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.sink.StreamWriteOperatorFactory;
+import org.apache.hudi.sink.partitioner.BucketAssignFunction;
+import org.apache.hudi.sink.transform.RowDataToHoodieFunction;
 import org.apache.hudi.util.AvroSchemaConverter;

Review comment:
   We have some classes that follow these patterns, e.g. `Converter`, 
`Converters`. Can we choose one?









[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-03-14 Thread GitBox


codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr&el=h1) Report
   > Merging 
[#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr&el=desc) (d477189) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/2fdae6835ce3fcad3111205d2373a69b34788483?el=desc)
 (2fdae68) will **decrease** coverage by `42.34%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2374       +/-   ##
   =============================================
   - Coverage     51.87%    9.52%     -42.35%
   + Complexity     3556       48       -3508
   =============================================
     Files           465       53        -412
     Lines         22165     1963      -20202
     Branches       2357      235       -2122
   =============================================
   - Hits          11498      187      -11311
   + Misses         9667     1763       -7904
   + Partials       1000       13        -987
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.52% <0.00%> (-59.96%)` | `0.00 <0.00> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `0.00% <0.00%> (-70.00%)` | `0.00 <0.00> (-52.00)` | |
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFy

[GitHub] [hudi] codecov-io edited a comment on pull request #2673: [HUDI-1688] hudi write should uncache rdd, when the write operation is finished

2021-03-14 Thread GitBox


codecov-io edited a comment on pull request #2673:
URL: https://github.com/apache/hudi/pull/2673#issuecomment-799061373


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=h1) Report
   > Merging 
[#2673](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=desc) (a993277) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/e93c6a569310ce55c5a0fc0655328e7fd32a9da2?el=desc)
 (e93c6a5) will **increase** coverage by `17.44%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2673/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2673       +/-   ##
   =============================================
   + Coverage     51.99%   69.43%     +17.44%
   + Complexity     3580      363       -3217
   =============================================
     Files           466       53        -413
     Lines         22275     1963      -20312
     Branches       2374      235       -2139
   =============================================
   - Hits          11581     1363      -10218
   + Misses         9686      466       -9220
   + Partials       1008      134        -874
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=)
 | `70.00% <0.00%> (-0.35%)` | `52.00% <0.00%> (-1.00%)` | |
   | 
[...che/hudi/common/model/HoodiePartitionMetadata.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVBhcnRpdGlvbk1ldGFkYXRhLmphdmE=)
 | | | |
   | 
[...apache/hudi/common/fs/inline/InLineFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9JbkxpbmVGaWxlU3lzdGVtLmphdmE=)
 | | | |
   | 
[.../hadoop/utils/HoodieRealtimeRecordReaderUtils.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lUmVjb3JkUmVhZGVyVXRpbHMuamF2YQ==)
 | | | |
   | 
[...di-cli/src/main/java/org/apache/hudi/cli/Main.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL01haW4uamF2YQ==)
 | | | |
   | 
[...n/java/org/apache/hudi/common/model/HoodieKey.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUtleS5qYXZh)
 | | | |
   | 
[.../common/table/log/block/HoodieLogBlockVersion.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVMb2dCbG9ja1ZlcnNpb24uamF2YQ==)
 | | | |
   | 
[...e/hudi/exception/HoodieSerializationException.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVNlcmlhbGl6YXRpb25FeGNlcHRpb24uamF2YQ==)
 | | | |
   | 
[...e/hudi/exception/HoodieCorruptedDataException.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUNvcnJ1cHRlZERhdGFFeGNlcHRpb24uamF2YQ==)
 | | | |
   | 
[.../org/apache/hudi/common/engine/EngineProperty.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9FbmdpbmVQcm9wZXJ0eS5qYXZh)
 | | | |
   | ... and [403 
more](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree-more) | |
   




[GitHub] [hudi] maxiaoniu commented on issue #2639: [SUPPORT] Spark 3.0.1 upgrade cause severe increase in Hudi write time

2021-03-14 Thread GitBox


maxiaoniu commented on issue #2639:
URL: https://github.com/apache/hudi/issues/2639#issuecomment-799088028


   Might be related to this:
   ```
   Important
   Amazon EMR 6.1.0 and 6.2.0 include a performance issue that can critically 
affect all Hudi insert, upsert, and delete operations. If you plan to use Hudi 
with Amazon EMR 6.1.0 or 6.2.0, you should contact AWS support to obtain a 
patched Hudi RPM.
   ```
   https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html
   







[jira] [Updated] (HUDI-1690) Fix StackOverflowError while running clustering with large number of partitions

2021-03-14 Thread Rong Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rong Ma updated HUDI-1690:
--
Description: 
We are testing clustering on a hudi table with about 3000 partitions. The spark 
driver throws StackOverflowError before all the partitions are sorted:

21/03/11 19:51:20 ERROR [main] UtilHelpers: Cluster failed
 java.lang.StackOverflowError
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
 at 
org.apache.spark.RangePartitioner.$anonfun$writeObject$1(Partitioner.scala:261)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
 at org.apache.spark.RangePartitioner.writeObject(Partitioner.scala:254)
 at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
 at 
scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:477)
 at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)

...

 

I see a similar issue here:

[https://stackoverflow.com/questions/30522564/spark-when-union-a-lot-of-rdd-throws-stack-overflow-error]

Setting the driver's stack size to 100M still hits this error, so this is 
probably because rdd.union has been called too many times and the resulting 
rdd lineage is too large. I think we should use JavaSparkContext.union instead 
of RDD.union here:
[https://github.com/apache/hudi/blob/e93c6a569310ce55c5a0fc0655328e7fd32a9da2/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/cluster/SparkExecuteClusteringCommitActionExecutor.java#L96]
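
A minimal sketch of the difference between the two union styles, assuming Spark's Java API (the varargs `JavaSparkContext.union` of Spark 3.x); the class and method names below are illustrative, not from the Hudi code:

```java
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

class UnionSketch {
  // Chained RDD.union nests one UnionRDD per call, so the lineage depth grows
  // with the number of unions (~3000 here); serializing the RangePartitioner
  // walks that whole chain recursively and overflows the stack.
  static <T> JavaRDD<T> chainedUnion(List<JavaRDD<T>> rdds) {
    JavaRDD<T> result = rdds.get(0);
    for (int i = 1; i < rdds.size(); i++) {
      result = result.union(rdds.get(i));
    }
    return result;
  }

  // JavaSparkContext.union builds a single UnionRDD over all inputs,
  // keeping the lineage flat no matter how many RDDs are combined.
  @SuppressWarnings("unchecked")
  static <T> JavaRDD<T> flatUnion(JavaSparkContext jsc, List<JavaRDD<T>> rdds) {
    return jsc.union(rdds.toArray(new JavaRDD[0]));
  }
}
```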

  was:
We are testing clustering on a hudi table with about 3000 partitions. The spark 
driver throws StackOverflowError before all the partitions sorted:

21/03/11 19:51:20 ERROR [main] UtilHelpers: Cluster failed
java.lang.StackOverflowError
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
 at 
org.apache.spark.RangePartitioner.$anonfun$writeObject$1(Partitioner.scala:261)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
 at org.apache.spark.RangePartitioner.writeObject(Partitioner.scala:254)
 at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Met

[jira] [Created] (HUDI-1690) Fix StackOverflowError while running clustering with large number of partitions

2021-03-14 Thread Rong Ma (Jira)
Rong Ma created HUDI-1690:
-

 Summary: Fix StackOverflowError while running clustering with 
large number of partitions
 Key: HUDI-1690
 URL: https://issues.apache.org/jira/browse/HUDI-1690
 Project: Apache Hudi
  Issue Type: Bug
  Components: Spark Integration
Reporter: Rong Ma
 Fix For: 0.8.0


We are testing clustering on a hudi table with about 3000 partitions. The spark 
driver throws StackOverflowError before all the partitions are sorted:

21/03/11 19:51:20 ERROR [main] UtilHelpers: Cluster failed
java.lang.StackOverflowError
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
 at 
org.apache.spark.RangePartitioner.$anonfun$writeObject$1(Partitioner.scala:261)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
 at org.apache.spark.RangePartitioner.writeObject(Partitioner.scala:254)
 at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
 at 
scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:477)
 at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
 at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
 at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
 at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
 at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)

...

 

I see a similar issue here:

https://stackoverflow.com/questions/30522564/spark-when-union-a-lot-of-rdd-throws-stack-overflow-error

Setting the driver's stack size to 100M still hits this error.


[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-03-14 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-799071055


   > @pengzhiwei2018 first of all, thanks for these great contributions.
   > 
   > Wrt inputSchema vs writeSchema, I actually feel writeSchema already stands for inputSchema; input is what is being written, right? We can probably just leave it as is and introduce the new `tableSchema` variables as you have in the `HoodieWriteHandle` class?
   > 
   > Like someone else pointed out as well, so far we are using read and write schemas consistently. Love to not introduce a new input schema, unless it's absolutely necessary.
   
   Hi @vinothchandar, thanks for your reply on this issue.
   Yes, in most cases the `writeSchema` is the same as the `inputSchema`, so it can stand for the `inputSchema`. But in the case in this PR (test case in 
[TestCOWDataSource](https://github.com/apache/hudi/pull/2334/files#diff-9429f5bc432f70ea4801e306dd817416b76e6ab68d41a278e222c989ce5c9824))
 we write the table twice:
   First, we write an "id: long" to the table. The input schema is "a:long", and the table schema is "a:long".
   Second, we write an "id:int" to the table. The input schema is "a:int", but the table schema is "a:long" from the previous write. The write schema must be the same as the table schema, or else an exception is thrown, which is the problem we want to solve in this PR.
   So in this case we need to distinguish between the `inputSchema` and the `writeSchema`. The `inputSchema` is the incoming records' schema, while the `writeSchema` is always the `tableSchema`.
   
   - The `inputSchema` is used to parse the records from the incoming data.
   - The `tableSchema` is used to write and read records from the table. Whenever we want to write or read records, we use the `tableSchema`.
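
   A minimal Avro sketch of the distinction (the record and field names are illustrative, not taken from the PR): records are parsed with the input schema, while writing sticks to the wider table schema, into which Avro can promote the int values.

   ```java
   import org.apache.avro.Schema;
   import org.apache.avro.SchemaBuilder;
   import org.apache.avro.SchemaCompatibility;

   public class SchemaSketch {
     public static void main(String[] args) {
       // tableSchema: what the table was created with ("id" stored as long).
       Schema tableSchema = SchemaBuilder.record("Rec").fields()
           .requiredLong("id").endRecord();
       // inputSchema: what the second batch of incoming records carries ("id" as int).
       Schema inputSchema = SchemaBuilder.record("Rec").fields()
           .requiredInt("id").endRecord();

       // int promotes to long, so data parsed with inputSchema can be read
       // back through the wider tableSchema...
       System.out.println(SchemaCompatibility
           .checkReaderWriterCompatibility(tableSchema, inputSchema).getType()); // COMPATIBLE
       // ...but the reverse direction fails: long does not narrow to int.
       System.out.println(SchemaCompatibility
           .checkReaderWriterCompatibility(inputSchema, tableSchema).getType()); // INCOMPATIBLE
     }
   }
   ```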
   
   
   







[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-03-14 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-799071055


   > @pengzhiwei2018 first of all, thanks for these great contributions.
   > 
   > Wrt inputSchema vs writeSchema, I actually feel writeSchema already stands for inputSchema; input is what is being written, right? We can probably just leave it as is and introduce the new `tableSchema` variables as you have in the `HoodieWriteHandle` class?
   > 
   > Like someone else pointed out as well, so far we are using read and write schemas consistently. Love to not introduce a new input schema, unless it's absolutely necessary.
   
   Hi @vinothchandar, thanks for your reply on this issue.
   Yes, in most cases the `writeSchema` is the same as the `inputSchema`, so it can stand for the `inputSchema`. But in the case in this PR (test case in TestCOWDataSource) we write the table twice:
   First, we write an "id: long" to the table. The input schema is "a:long", and the table schema is "a:long".
   Second, we write an "id:int" to the table. The input schema is "a:int", but the table schema is "a:long" from the previous write. The write schema must be the same as the table schema, or else an exception is thrown, which is the problem we want to solve in this PR.
   So in this case we need to distinguish between the `inputSchema` and the `writeSchema`. The `inputSchema` is the incoming records' schema, while the `writeSchema` is always the `tableSchema`.
   
   - The `inputSchema` is used to parse the records from the incoming data.
   - The `tableSchema` is used to write and read records from the table. Whenever we want to write or read records, we use the `tableSchema`.
   
   
   







[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-03-14 Thread GitBox


pengzhiwei2018 edited a comment on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-799071055


   > @pengzhiwei2018 first of all, thanks for these great contributions.
   > 
   > Wrt inputSchema vs writeSchema, I actually feel writeSchema already stands for inputSchema; input is what is being written, right? We can probably just leave it as is and introduce the new `tableSchema` variables as you have in the `HoodieWriteHandle` class?
   > 
   > Like someone else pointed out as well, so far we are using read and write schemas consistently. Love to not introduce a new input schema, unless it's absolutely necessary.
   
   Hi @vinothchandar, thanks for your reply on this issue.
   Yes, in most cases the `writeSchema` is the same as the `inputSchema`, so it can stand for the `inputSchema`. But in the case in this PR, we write the table twice:
   First, we write an "id: long" to the table. The input schema is "a:long", and the table schema is "a:long".
   Second, we write an "id:int" to the table. The input schema is "a:int", but the table schema is "a:long" from the previous write. The write schema must be the same as the table schema, or else an exception is thrown, which is the problem we want to solve in this PR.
   So in this case we need to distinguish between the `inputSchema` and the `writeSchema`. The `inputSchema` is the incoming records' schema, while the `writeSchema` is always the `tableSchema`.
   
   - The `inputSchema` is used to parse the records from the incoming data.
   - The `tableSchema` is used to write and read records from the table. Whenever we want to write or read records, we use the `tableSchema`.
   
   
   







[GitHub] [hudi] pengzhiwei2018 commented on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-03-14 Thread GitBox


pengzhiwei2018 commented on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-799071055


   > @pengzhiwei2018 first of all, thanks for these great contributions.
   > 
   > Wrt inputSchema vs writeSchema, I actually feel writeSchema already stands for inputSchema; input is what is being written, right? We can probably just leave it as is and introduce the new `tableSchema` variables as you have in the `HoodieWriteHandle` class?
   > 
   > Like someone else pointed out as well, so far we are using read and write schemas consistently. Love to not introduce a new input schema, unless it's absolutely necessary.
   
   Hi @vinothchandar, thanks for your reply on this issue.
   Yes, in most cases the `writeSchema` is the same as the `inputSchema`, so it can stand for the `inputSchema`. But in the case in this PR, we write the table twice:
   First, we write an "id: long" to the table. The input schema is "a:long", and the table schema is "a:long".
   Second, we write an "id:int" to the table. The input schema is "a:int", but the table schema is "a:long" from the previous write. The write schema must be the same as the table schema, or else an exception is thrown, which is the problem we want to solve in this PR.
   So in this case we need to distinguish between the `inputSchema` and the `writeSchema`. The `inputSchema` is the incoming records' schema, while the `writeSchema` is always the `tableSchema`.
   **Here's the summary**
   - The `inputSchema` is used to parse the records from the incoming data.
   - The `tableSchema` is used to write and read records from the table. Whenever we want to write or read records, we use the `tableSchema`.
   
   
   







[GitHub] [hudi] codecov-io commented on pull request #2673: [HUDI-1688] hudi write should uncache rdd, when the write operation is finished

2021-03-14 Thread GitBox


codecov-io commented on pull request #2673:
URL: https://github.com/apache/hudi/pull/2673#issuecomment-799061373


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=h1) Report
   > Merging 
[#2673](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=desc) (e391c24) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/e93c6a569310ce55c5a0fc0655328e7fd32a9da2?el=desc)
 (e93c6a5) will **decrease** coverage by `42.46%`.
   > The diff coverage is `n/a`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2673/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2673       +/-   ##
   =============================================
   - Coverage     51.99%    9.52%     -42.47%
   + Complexity     3580       48       -3532
   =============================================
     Files           466       53        -413
     Lines         22275     1963      -20312
     Branches       2374      235       -2139
   =============================================
   - Hits          11581      187      -11394
   + Misses         9686     1763       -7923
   + Partials       1008       13        -995
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `?` | `?` | |
   | hudiclient | `?` | `?` | |
   | hudicommon | `?` | `?` | |
   | hudiflink | `?` | `?` | |
   | hudihadoopmr | `?` | `?` | |
   | hudisparkdatasource | `?` | `?` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `9.52% <ø> (-59.96%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | |
   | 
[...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | |
   | 
[...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | |
   | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | |
   | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | |
   | 
[...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | |
   | 
[...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh)
 | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | |
   | 
[...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlc

[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…

2021-03-14 Thread GitBox


pengzhiwei2018 commented on a change in pull request #2651:
URL: https://github.com/apache/hudi/pull/2651#discussion_r594023224



##
File path: 
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala
##
@@ -79,39 +82,58 @@ class DefaultSource extends RelationProvider
 val allPaths = path.map(p => Seq(p)).getOrElse(Seq()) ++ readPaths
 
 val fs = FSUtils.getFs(allPaths.head, 
sqlContext.sparkContext.hadoopConfiguration)
-val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs)
-
-val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray)
+// Use the HoodieFileIndex only if 'path' is specified without any "*" glob
+// and READ_PATHS_OPT_KEY is not specified.
+// Otherwise we use the original way to read the hoodie table.

Review comment:
   A Jira has been opened: 
[HUDI-1689](https://issues.apache.org/jira/browse/HUDI-1689)









[GitHub] [hudi] xiarixiaoyao commented on pull request #2673: [HUDI-1688] hudi write should uncache rdd, when the write operation is finished

2021-03-14 Thread GitBox


xiarixiaoyao commented on pull request #2673:
URL: https://github.com/apache/hudi/pull/2673#issuecomment-799059684


   cc @garyli1019, could you help review this PR? Thanks.







[jira] [Updated] (HUDI-1688) hudi write should uncache rdd when the write operation is finished

2021-03-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-1688:
-
Labels: pull-request-available  (was: )

> hudi write should uncache rdd when the write operation is finished
> 
>
> Key: HUDI-1688
> URL: https://issues.apache.org/jira/browse/HUDI-1688
> Project: Apache Hudi
>  Issue Type: Bug
>  Components: Spark Integration
>Affects Versions: 0.7.0
>Reporter: tao meng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.8.0
>
>
> now, hudi improves write performance by caching necessary rdds; however, when 
> the write operation is finished, those cached rdds are not uncached, which 
> wastes lots of memory.
> [https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L115]
> https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L214
> In our environment:
> step1: insert 100GB of data into a hudi table by spark (ok)
> step2: insert another 100GB of data into the hudi table by spark again (oom)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] xiarixiaoyao opened a new pull request #2673: [HUDI-1688] hudi write should uncache rdd, when the write operation is finished

2021-03-14 Thread GitBox


xiarixiaoyao opened a new pull request #2673:
URL: https://github.com/apache/hudi/pull/2673


   
   ## *Tips*
   - *Thank you very much for contributing to Apache Hudi.*
   - *Please review https://hudi.apache.org/contributing.html before opening a 
pull request.*
   
   ## What is the purpose of the pull request
   
   fix the bug that hudi cannot uncache rdds correctly.
   Hudi improves write performance by caching necessary rdds; however, when the 
write operation is finished, those cached rdds are not uncached, which wastes 
lots of memory.
   
   
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L115
   
   
https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L214
   
   In our environment:
   
   step1: insert 100GB of data into a hudi table by spark (ok)
   
   step2: insert another 100GB of data into the hudi table by spark again (oom)
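   
   A minimal sketch of the intended fix, with hypothetical names (`writeWithCache`, `writeFn`); the actual change lands in BaseSparkCommitActionExecutor, linked above:
   
   ```java
   import java.util.function.Consumer;
   
   import org.apache.spark.api.java.JavaRDD;
   import org.apache.spark.storage.StorageLevel;
   
   class UncacheSketch {
     // writeFn stands in for the actual commit action; the point is only that
     // unpersist() must run once the write has finished, even on failure.
     static <T> void writeWithCache(JavaRDD<T> input, Consumer<JavaRDD<T>> writeFn) {
       JavaRDD<T> cached = input.persist(StorageLevel.MEMORY_AND_DISK_SER());
       try {
         writeFn.accept(cached);  // the cached RDD may be traversed several times here
       } finally {
         cached.unpersist();      // release the cached blocks; this was missing before
       }
     }
   }
   ```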
   ## Brief change log
   
   Uncache rdds when the write operation is finished.
   
   ## Verify this pull request
   
   Existing UT
   
   ## Committer checklist
   
- [ ] Has a corresponding JIRA in PR title & commit

- [ ] Commit message is descriptive of the change

- [ ] CI is green
   
- [ ] Necessary doc changes done or have another open PR
  
- [ ] For large changes, please consider breaking it into sub-tasks under 
an umbrella JIRA.







[jira] [Assigned] (HUDI-1689) Support Multipath query for HoodieFileIndex

2021-03-14 Thread pengzhiwei (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

pengzhiwei reassigned HUDI-1689:


Assignee: pengzhiwei

> Support Multipath query for HoodieFileIndex
> ---
>
> Key: HUDI-1689
> URL: https://issues.apache.org/jira/browse/HUDI-1689
> Project: Apache Hudi
>  Issue Type: Improvement
>  Components: Spark Integration
>Reporter: pengzhiwei
>Assignee: pengzhiwei
>Priority: Major
>
> Support Multipath query for the HoodieFileIndex to benefit from partition 
> pruning.





[jira] [Created] (HUDI-1689) Support Multipath query for HoodieFileIndex

2021-03-14 Thread pengzhiwei (Jira)
pengzhiwei created HUDI-1689:


 Summary: Support Multipath query for HoodieFileIndex
 Key: HUDI-1689
 URL: https://issues.apache.org/jira/browse/HUDI-1689
 Project: Apache Hudi
  Issue Type: Improvement
  Components: Spark Integration
Reporter: pengzhiwei


Support Multipath query for the HoodieFileIndex to benefit from partition 
pruning.





[GitHub] [hudi] codecov-io edited a comment on pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…

2021-03-14 Thread GitBox


codecov-io edited a comment on pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#issuecomment-797515929


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=h1) Report
   > Merging 
[#2669](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=desc) (f632e7a) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/20786ab8a2a1e7735ab846e92802fb9f4449adc9?el=desc)
 (20786ab) will **decrease** coverage by `0.04%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2669/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2669       +/-   ##
   =============================================
   - Coverage     52.00%   51.96%      -0.05%
   + Complexity     3579     3578          -1
   =============================================
     Files           465      466          +1
     Lines         22268    22275          +7
     Branches       2375     2374          -1
   =============================================
   - Hits          11581    11575          -6
   - Misses         9676     9690         +14
   + Partials       1011     1010          -1
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.44% <ø> (-0.08%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `53.57% <100.00%> (ø)` | `0.00 <2.00> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.84% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `49.62% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.43% <ø> (-0.11%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh)
 | `85.49% <ø> (ø)` | `6.00 <0.00> (?)` | |
   | 
[...rg/apache/hudi/schema/FilebasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zY2hlbWEvRmlsZWJhc2VkU2NoZW1hUHJvdmlkZXIuamF2YQ==)
 | `29.16% <ø> (ø)` | `2.00 <0.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/sink/CommitSink.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0NvbW1pdFNpbmsuamF2YQ==)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/sink/InstantGenerateOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0luc3RhbnRHZW5lcmF0ZU9wZXJhdG9yLmphdmE=)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...rg/apache/hudi/sink/KeyedWriteProcessFunction.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0tleWVkV3JpdGVQcm9jZXNzRnVuY3Rpb24uamF2YQ==)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...rg/apache/hudi/sink/KeyedWriteProcessOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0tleWVkV3JpdGVQcm9jZXNzT3BlcmF0b3IuamF2YQ==)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...java/org/apache/hudi/sink/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlRnVuY3Rpb24uamF2YQ==)
 | `84.00% <ø> (ø)` | `22.00 <0.00> (?)` | |
   | 
[...java/org/apache/hudi/sink/StreamWriteOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3IuamF2YQ==)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh)
 | `69.13% <ø> (ø)` | `32.00 <0.00> (?)` | |
   | 
[...g/apache/hudi/sink/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JGYWN0b3J5LmphdmE=)
 | 

[GitHub] [hudi] codecov-io edited a comment on pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…

2021-03-14 Thread GitBox


codecov-io edited a comment on pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#issuecomment-797515929


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=h1) Report
   > Merging 
[#2669](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=desc) (f632e7a) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/20786ab8a2a1e7735ab846e92802fb9f4449adc9?el=desc)
 (20786ab) will **decrease** coverage by `0.19%`.
   > The diff coverage is `100.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2669/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=tree)
   
   ```diff
   @@             Coverage Diff              @@
   ##             master    #2669       +/-   ##
   =============================================
   - Coverage     52.00%   51.81%      -0.20%
   + Complexity     3579     3388        -191
   =============================================
     Files           465      445         -20
     Lines         22268    20764       -1504
     Branches       2375     2229        -146
   =============================================
   - Hits          11581    10759        -822
   + Misses         9676     9070        -606
   + Partials       1011      935         -76
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.44% <ø> (-0.08%)` | `0.00 <ø> (ø)` | |
   | hudiflink | `53.57% <100.00%> (ø)` | `0.00 <2.00> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.84% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.43% <ø> (-0.11%)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh)
 | `85.49% <ø> (ø)` | `6.00 <0.00> (?)` | |
   | 
[...rg/apache/hudi/schema/FilebasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zY2hlbWEvRmlsZWJhc2VkU2NoZW1hUHJvdmlkZXIuamF2YQ==)
 | `29.16% <ø> (ø)` | `2.00 <0.00> (ø)` | |
   | 
[...src/main/java/org/apache/hudi/sink/CommitSink.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0NvbW1pdFNpbmsuamF2YQ==)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/sink/InstantGenerateOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0luc3RhbnRHZW5lcmF0ZU9wZXJhdG9yLmphdmE=)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...rg/apache/hudi/sink/KeyedWriteProcessFunction.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0tleWVkV3JpdGVQcm9jZXNzRnVuY3Rpb24uamF2YQ==)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...rg/apache/hudi/sink/KeyedWriteProcessOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0tleWVkV3JpdGVQcm9jZXNzT3BlcmF0b3IuamF2YQ==)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...java/org/apache/hudi/sink/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlRnVuY3Rpb24uamF2YQ==)
 | `84.00% <ø> (ø)` | `22.00 <0.00> (?)` | |
   | 
[...java/org/apache/hudi/sink/StreamWriteOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3IuamF2YQ==)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | 
[...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh)
 | `69.13% <ø> (ø)` | `32.00 <0.00> (?)` | |
   | 
[...g/apache/hudi/sink/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JGYWN0b3J5LmphdmE=)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | |
   | ...

[jira] [Created] (HUDI-1688) hudi write should uncache rdd when the write operation is finished

2021-03-14 Thread tao meng (Jira)
tao meng created HUDI-1688:
--

 Summary: hudi write should uncache rdd when the write operation 
is finished
 Key: HUDI-1688
 URL: https://issues.apache.org/jira/browse/HUDI-1688
 Project: Apache Hudi
  Issue Type: Bug
  Components: Spark Integration
Affects Versions: 0.7.0
Reporter: tao meng
 Fix For: 0.8.0


now, hudi improves write performance by caching necessary rdds; however, when the 
write operation is finished, those cached rdds are not uncached, which wastes 
lots of memory.

[https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L115]

https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L214

In our environment:

step 1: insert 100 GB of data into a hudi table via Spark (ok)

step 2: insert another 100 GB of data into the hudi table via Spark again (OOM)
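
In essence, the fix is to unpersist those cached RDDs once the write completes. 
A minimal sketch in plain Spark Java (illustrative only, not the actual Hudi 
patch; the class and variable names are made up):

```java
import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

// Cache an RDD that is reused across the write path, then unpersist it in a
// finally block so back-to-back writes do not accumulate cached blocks.
public class UnpersistAfterWrite {
  public static void main(String[] args) {
    JavaSparkContext jsc = new JavaSparkContext("local[2]", "unpersist-demo");
    JavaRDD<String> records = jsc.parallelize(Arrays.asList("r1", "r2", "r3"))
        .persist(StorageLevel.MEMORY_AND_DISK_SER());
    try {
      long written = records.count();            // stands in for the write
      long deduped = records.distinct().count(); // second action hits the cache
      System.out.println("written=" + written + ", deduped=" + deduped);
    } finally {
      records.unpersist(); // release executor memory once the write finishes
    }
    jsc.stop();
  }
}
```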



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] danny0405 commented on pull request #2309: [HUDI-1441] - HoodieAvroUtils - rewrite() is not handling evolution o…

2021-03-14 Thread GitBox


danny0405 commented on pull request #2309:
URL: https://github.com/apache/hudi/pull/2309#issuecomment-799047482


   > @n3nash @nbalajee @prashantwason @nsivabalan this PR sounds important, but 
can someone please summarize its state? also this needs a rebase with only the 
necessary changes.
   
   The changes overall look good from my side, but this PR needs a rebase 
because it introduces many conflicting commits from the master branch.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] danny0405 commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and re-organize the pa…

2021-03-14 Thread GitBox


danny0405 commented on a change in pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#discussion_r594008808



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/source/StreamWriteFunction.java
##
@@ -16,7 +16,7 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.operator;
+package org.apache.hudi.source;

Review comment:
   Nope, thanks for the reminder ~





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] root18039532923 commented on issue #2623: org.apache.hudi.exception.HoodieDependentSystemUnavailableException:System HBASE unavailable.

2021-03-14 Thread GitBox


root18039532923 commented on issue #2623:
URL: https://github.com/apache/hudi/issues/2623#issuecomment-799044094


   @n3nash 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] yanghua commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and re-organize the pa…

2021-03-14 Thread GitBox


yanghua commented on a change in pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#discussion_r593996516



##
File path: 
hudi-flink/src/main/java/org/apache/hudi/source/StreamWriteFunction.java
##
@@ -16,7 +16,7 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.operator;
+package org.apache.hudi.source;

Review comment:
   Should it be put into the `source` package? 





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci

2021-03-14 Thread GitBox


hudi-bot edited a comment on pull request #2643:
URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481


   
   ## CI report:
   
   * 9831a6c50e9f49f8a71c02fc6ac50ae1446f7c1f UNKNOWN
   * 7a11c73f54424450987d56a49caf42eea092fce0 Azure: 
[SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=125)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci

2021-03-14 Thread GitBox


hudi-bot edited a comment on pull request #2643:
URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481


   
   ## CI report:
   
   * c93712492faf19e818194762ec7a05976c79659e Azure: 
[SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=123)
 
   * 9831a6c50e9f49f8a71c02fc6ac50ae1446f7c1f UNKNOWN
   * 7a11c73f54424450987d56a49caf42eea092fce0 Azure: 
[PENDING](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=125)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci

2021-03-14 Thread GitBox


hudi-bot edited a comment on pull request #2643:
URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481


   
   ## CI report:
   
   * c93712492faf19e818194762ec7a05976c79659e Azure: 
[SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=123)
 
   * 9831a6c50e9f49f8a71c02fc6ac50ae1446f7c1f UNKNOWN
   * 7a11c73f54424450987d56a49caf42eea092fce0 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci

2021-03-14 Thread GitBox


hudi-bot edited a comment on pull request #2643:
URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481


   
   ## CI report:
   
   * c93712492faf19e818194762ec7a05976c79659e Azure: 
[SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=123)
 
   * 9831a6c50e9f49f8a71c02fc6ac50ae1446f7c1f UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] codecov-io edited a comment on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-03-14 Thread GitBox


codecov-io edited a comment on pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#issuecomment-767956391


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=h1) Report
   > Merging 
[#2494](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=desc) (5557e1d) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/d8af24d8a2fdbead4592a36df1bd9dda333f1513?el=desc)
 (d8af24d) will **increase** coverage by `0.38%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2494/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2494      +/-   ##
   ============================================
   + Coverage     51.53%   51.92%   +0.38%
   - Complexity     3491     3579      +88
   ============================================
     Files           462      466       +4
     Lines         21881    22295     +414
     Branches       2327     2378      +51
   ============================================
   + Hits          11277    11576     +299
   - Misses         9624     9710      +86
   - Partials        980     1009      +29
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.33% <0.00%> (-0.15%)` | `0.00 <0.00> (ø)` | |
   | hudiflink | `53.57% <ø> (+3.22%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.84% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `49.62% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudiutilities | `69.48% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../org/apache/hudi/cli/commands/MetadataCommand.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL01ldGFkYXRhQ29tbWFuZC5qYXZh)
 | `1.07% <0.00%> (ø)` | `1.00 <0.00> (ø)` | |
   | 
[...pache/hudi/common/config/HoodieMetadataConfig.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Ib29kaWVNZXRhZGF0YUNvbmZpZy5qYXZh)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../hudi/common/table/view/FileSystemViewManager.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvRmlsZVN5c3RlbVZpZXdNYW5hZ2VyLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/io/storage/HoodieHFileReader.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vc3RvcmFnZS9Ib29kaWVIRmlsZVJlYWRlci5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...pache/hudi/metadata/HoodieBackedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllQmFja2VkVGFibGVNZXRhZGF0YS5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/metadata/HoodieTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllVGFibGVNZXRhZGF0YS5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...hudi/source/format/cow/CopyOnWriteInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvZm9ybWF0L2Nvdy9Db3B5T25Xcml0ZUlucHV0Rm9ybWF0LmphdmE=)
 | `56.08% <0.00%> (-29.22%)` | `20.00% <0.00%> (+14.00%)` | :arrow_down: |
   | 
[...java/org/apache/hudi/source/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvSG9vZGllVGFibGVTb3VyY2UuamF2YQ==)
 | `67.54% <0.00%> (-8.11%)` | `28.00% <0.00%> (-6.00%)` | |
   | 
[...che/hudi/common/table/log/HoodieLogFileReader.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGaWxlUmVhZGVyLmphdmE=)
 | `66.09% <0.00%> (-1.77%)` | `23.00% <0.00%> (+1.00%)` | :arrow_down: |
   | 
[...ies/sources/helpers/DatePartitionPathSelector.java](https://codecov.io/

[GitHub] [hudi] codecov-io edited a comment on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-03-14 Thread GitBox


codecov-io edited a comment on pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#issuecomment-767956391


   # [Codecov](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=h1) Report
   > Merging 
[#2494](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=desc) (5557e1d) 
into 
[master](https://codecov.io/gh/apache/hudi/commit/d8af24d8a2fdbead4592a36df1bd9dda333f1513?el=desc)
 (d8af24d) will **increase** coverage by `0.23%`.
   > The diff coverage is `0.00%`.
   
   [![Impacted file tree 
graph](https://codecov.io/gh/apache/hudi/pull/2494/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=tree)
   
   ```diff
   @@              Coverage Diff              @@
   ##             master    #2494      +/-   ##
   ============================================
   + Coverage     51.53%   51.77%   +0.23%
   + Complexity     3491     3389     -102
   ============================================
     Files           462      445      -17
     Lines         21881    20784    -1097
     Branches       2327     2233      -94
   ============================================
   - Hits          11277    10760     -517
   + Misses         9624     9090     -534
   + Partials        980      934      -46
   ```
   
   | Flag | Coverage Δ | Complexity Δ | |
   |---|---|---|---|
   | hudicli | `37.01% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudicommon | `51.33% <0.00%> (-0.15%)` | `0.00 <0.00> (ø)` | |
   | hudiflink | `53.57% <ø> (+3.22%)` | `0.00 <ø> (ø)` | |
   | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisparkdatasource | `69.84% <ø> (ø)` | `0.00 <ø> (ø)` | |
   | hudisync | `?` | `?` | |
   | huditimelineservice | `?` | `?` | |
   | hudiutilities | `69.48% <ø> (ø)` | `0.00 <ø> (ø)` | |
   
   Flags with carried forward coverage won't be shown. [Click 
here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment)
 to find out more.
   
   | [Impacted 
Files](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=tree) | Coverage Δ 
| Complexity Δ | |
   |---|---|---|---|
   | 
[.../org/apache/hudi/cli/commands/MetadataCommand.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL01ldGFkYXRhQ29tbWFuZC5qYXZh)
 | `1.07% <0.00%> (ø)` | `1.00 <0.00> (ø)` | |
   | 
[...pache/hudi/common/config/HoodieMetadataConfig.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Ib29kaWVNZXRhZGF0YUNvbmZpZy5qYXZh)
 | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../hudi/common/table/view/FileSystemViewManager.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvRmlsZVN5c3RlbVZpZXdNYW5hZ2VyLmphdmE=)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/io/storage/HoodieHFileReader.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vc3RvcmFnZS9Ib29kaWVIRmlsZVJlYWRlci5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...pache/hudi/metadata/HoodieBackedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllQmFja2VkVGFibGVNZXRhZGF0YS5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[.../org/apache/hudi/metadata/HoodieTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllVGFibGVNZXRhZGF0YS5qYXZh)
 | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
   | 
[...hudi/source/format/cow/CopyOnWriteInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvZm9ybWF0L2Nvdy9Db3B5T25Xcml0ZUlucHV0Rm9ybWF0LmphdmE=)
 | `56.08% <0.00%> (-29.22%)` | `20.00% <0.00%> (+14.00%)` | :arrow_down: |
   | 
[...java/org/apache/hudi/source/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvSG9vZGllVGFibGVTb3VyY2UuamF2YQ==)
 | `67.54% <0.00%> (-8.11%)` | `28.00% <0.00%> (-6.00%)` | |
   | 
[...che/hudi/common/table/log/HoodieLogFileReader.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGaWxlUmVhZGVyLmphdmE=)
 | `66.09% <0.00%> (-1.77%)` | `23.00% <0.00%> (+1.00%)` | :arrow_down: |
   | 
[...ies/sources/helpers/DatePartitionPathSelector.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#dif

[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci

2021-03-14 Thread GitBox


hudi-bot edited a comment on pull request #2643:
URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481


   
   ## CI report:
   
   * c93712492faf19e818194762ec7a05976c79659e Azure: 
[SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=123)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-03-14 Thread GitBox


vinothchandar commented on pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#issuecomment-798971624


   @prashantwason We can remove the reuse configuration, i.e. there is no need 
to have this behavior be user-controlled.
   
   But ultimately, we still need to close everything out where the metadata table 
is opened from executors. I am going to just introduce a boolean variable 
within `HoodieBackedTableMetadata`.
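   
   A rough sketch of that idea (the names below are assumptions for 
illustration, not the actual `HoodieBackedTableMetadata` code): a boolean 
decides whether the readers opened for a lookup are kept for reuse or closed 
immediately, which is what executor-side usage needs.
   
   ```java
   // Hypothetical sketch -- not the real HoodieBackedTableMetadata API.
   public class MetadataReaderLifecycle implements AutoCloseable {
     private final boolean reuse;   // false when opened from an executor
     private AutoCloseable readers; // base-file + log readers, opened lazily

     public MetadataReaderLifecycle(boolean reuse) {
       this.reuse = reuse;
     }

     public String getRecordByKey(String key) throws Exception {
       if (readers == null) {
         readers = () -> { }; // placeholder for opening the HFile/log readers
       }
       try {
         return "value-for-" + key; // placeholder for the merged lookup
       } finally {
         if (!reuse) {
           close(); // executors close per lookup, so handles never leak
         }
       }
     }

     @Override
     public void close() throws Exception {
       if (readers != null) {
         readers.close();
         readers = null;
       }
     }
   }
   ```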



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci

2021-03-14 Thread GitBox


hudi-bot edited a comment on pull request #2643:
URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481


   
   ## CI report:
   
   * e6557a7ab08c1867dfee54360a0c76adfaf3d233 Azure: 
[SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=115)
 
   * c93712492faf19e818194762ec7a05976c79659e UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (HUDI-1673) Replace scala.Tuple2 with Pair in FlinkHoodieBloomIndex

2021-03-14 Thread shenh062326 (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenh062326 resolved HUDI-1673.
---
Resolution: Fixed

> Replace scala.Tuple2 with Pair in FlinkHoodieBloomIndex
> 
>
> Key: HUDI-1673
> URL: https://issues.apache.org/jira/browse/HUDI-1673
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: shenh062326
>Assignee: shenh062326
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HUDI-1673) Replace scala.Tuple2 with Pair in FlinkHoodieBloomIndex

2021-03-14 Thread shenh062326 (Jira)


 [ 
https://issues.apache.org/jira/browse/HUDI-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shenh062326 reassigned HUDI-1673:
-

Assignee: shenh062326

> Replace scala.Tuple2 with Pair in FlinkHoodieBloomIndex
> 
>
> Key: HUDI-1673
> URL: https://issues.apache.org/jira/browse/HUDI-1673
> Project: Apache Hudi
>  Issue Type: Sub-task
>Reporter: shenh062326
>Assignee: shenh062326
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [hudi] vinothchandar commented on pull request #2612: [HUDI-1563] Adding hudi file sizing/ small file management blog

2021-03-14 Thread GitBox


vinothchandar commented on pull request #2612:
URL: https://github.com/apache/hudi/pull/2612#issuecomment-798876972


   @nsivabalan my changes are in. Please feel free to land, once you take a 
pass. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-03-14 Thread GitBox


n3nash edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-798875296


   @vinothchandar It's possible to allow backfills using spark-sql but there 
are some corner cases. Consider the following:
   
   1. Ingestion job running with commit c4 (checkpoint = c3)
   2. Ingestion job finishes with commit c4 (checkpoint = c3)
   3. Someone runs a spark-sql job to backfill some data in an older partition, 
commit c5. Since this spark-sql job (unlike deltastreamer) does not handle 
checkpoint copying from prev metadata to next, it would be the client's job to 
do this. 
   4. If they fail to do this, the next deltastreamer ingestion (c6) will read no 
checkpoint from c5.
   
   I've made the following changes:
   
   1) To make this manageable, I've added the following config: 
`hoodie.write.meta.key.prefixes`. If this config is set, then during the 
critical section all metadata whose keys match the configured prefix is copied 
over from the latest metadata to the current commit (see the sketch below).
   2) Made changes and added these multi-writer tests to `HoodieDeltaStreamer` 
as well. Technically, one can do the backfill using `HoodieDeltaStreamer` or 
`Spark-SQL`. For `HoodieDeltaStreamer`, they would have to set a custom 
checkpoint or mark it null to ensure that the job picks up only the data from 
the backfill location; for Spark-SQL it would not matter.
   
   Yes, I am going to add documentation on best practices / things to watch out 
for in the other PR I opened for documentation. I will do that after resolving 
any further comments and landing this PR in the next couple of days.
   
   NOTE: The test may be failing because of some thread.sleep related code that 
I'm trying to remove. Will update tomorrow.  
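   
   To make that copy concrete, here is a small illustration (only 
`hoodie.write.meta.key.prefixes` itself is from the change above; the 
checkpoint key and values are assumptions):
   
   ```java
   import java.util.HashMap;
   import java.util.Map;
   import java.util.Properties;

   // Illustration of the prefix-scoped metadata copy described above; the
   // metadata key used here is an assumption, not necessarily what Hudi stores.
   public class CheckpointCarryOver {
     public static void main(String[] args) {
       Properties writeConfig = new Properties();
       writeConfig.setProperty("hoodie.write.meta.key.prefixes", "deltastreamer.checkpoint");

       // Extra metadata of the latest completed ingestion commit (c4).
       Map<String, String> latestCommitMetadata = new HashMap<>();
       latestCommitMetadata.put("deltastreamer.checkpoint.key", "c3");
       latestCommitMetadata.put("schema", "{...}");

       // During the critical section, copy every key matching the prefix into
       // the backfill commit (c5) so the next ingestion (c6) still sees it.
       String prefix = writeConfig.getProperty("hoodie.write.meta.key.prefixes");
       Map<String, String> backfillCommitMetadata = new HashMap<>();
       latestCommitMetadata.forEach((k, v) -> {
         if (k.startsWith(prefix)) {
           backfillCommitMetadata.put(k, v);
         }
       });
       System.out.println(backfillCommitMetadata); // {deltastreamer.checkpoint.key=c3}
     }
   }
   ```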
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-03-14 Thread GitBox


n3nash edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-798875296


   @vinothchandar It's possible to allow backfills using spark-sql but there 
are some corner cases. Consider the following:
   
   1. Ingestion job running with commit c4 (checkpoint = c3)
   2. Ingestion job finishes with commit c4 (checkpoint = c3)
   3. Someone runs a spark-sql job to backfill some data in an older partition, 
commit c5. Since this spark-sql job (unlike deltastreamer) does not handle 
checkpoint copying from prev metadata to next, it would be the client's job to 
do this. 
   4. If they fail to do this, the next deltastreamer ingestion (c6) will read no 
checkpoint from c5.
   
   I've made the following changes:
   
   1) To make this manageable, I've added the following config: 
`hoodie.write.meta.key.prefixes`. If this config is set, then during the 
critical section all metadata whose keys match the configured prefix is copied 
over from the latest metadata to the current commit.
   2) Made changes and added these multi-writer tests to `HoodieDeltaStreamer` 
as well.
   
   NOTE: The test may be failing because of some thread.sleep related code that 
I'm trying to remove. Will update tomorrow.  
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] n3nash commented on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers

2021-03-14 Thread GitBox


n3nash commented on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-798875296


   @vinothchandar It's possible to allow backfills using spark-sql but there 
are some corner cases. Consider the following:
   
   1. Ingestion job running with commit c4 (checkpoint = c3)
   2. Ingestion job finishes with commit c4 (checkpoint = c3)
   3. Someone runs a spark-sql job to backfill some data in an older partition, 
commit c5. Since this spark-sql job (unlike deltastreamer) does not handle 
checkpoint copying from prev metadata to next, it would be the client's job to 
do this. 
   4. If they fail to do this, the next deltastreamer ingestion (c6) will read no 
checkpoint from c5.
   
   I've made the following changes:
   
   1) To make this manageable, I've added the following config: 
`hoodie.write.meta.key.prefixes`. If this config is set, then during the 
critical section all metadata whose keys match the configured prefix is copied 
over from the latest metadata to the current commit.
   2) Added these multi-writer tests to `HoodieDeltaStreamer` as well.
   
   NOTE: The test may be failing because of some thread.sleep related code that 
I'm trying to remove. Will update tomorrow.  
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[hudi] branch master updated (f5e31be -> e93c6a5)

2021-03-14 Thread vinoth
This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git.


from f5e31be  [HUDI-1685] keep updating current date for every batch (#2671)
 add e93c6a5  [HUDI-1496] Fixing input stream detection of GCS FileSystem 
(#2500)

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/hudi/common/fs/FSUtils.java| 11 ++--
 ...uard.java => SchemeAwareFSDataInputStream.java} | 35 ++-
 .../hudi/common/table/log/HoodieLogFileReader.java | 71 --
 .../common/table/log/block/HoodieLogBlock.java | 26 +---
 4 files changed, 81 insertions(+), 62 deletions(-)
 copy 
hudi-common/src/main/java/org/apache/hudi/common/fs/{NoOpConsistencyGuard.java 
=> SchemeAwareFSDataInputStream.java} (52%)



[GitHub] [hudi] vinothchandar merged pull request #2500: [HUDI-1496] Fixing detection of GCS FileSystem

2021-03-14 Thread GitBox


vinothchandar merged pull request #2500:
URL: https://github.com/apache/hudi/pull/2500


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.

2021-03-14 Thread GitBox


vinothchandar commented on a change in pull request #2494:
URL: https://github.com/apache/hudi/pull/2494#discussion_r593866290



##
File path: 
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
##
@@ -147,82 +150,91 @@ private void initIfNeeded() {
 }
   }
   timings.add(timer.endTimer());
-  LOG.info(String.format("Metadata read for key %s took [open, 
baseFileRead, logMerge] %s ms", key, timings));
+  LOG.info(String.format("Metadata read for key %s took [baseFileRead, 
logMerge] %s ms", key, timings));
   return Option.ofNullable(hoodieRecord);
 } catch (IOException ioe) {
   throw new HoodieIOException("Error merging records from metadata table 
for key :" + key, ioe);
-} finally {

Review comment:
   This close is actually needed when opening the metadata table from the 
executors; otherwise, we will leak and suffer the same issues as before.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2580: [HUDI 1623] Introduce start & end commit times to timeline

2021-03-14 Thread GitBox


vinothchandar commented on pull request #2580:
URL: https://github.com/apache/hudi/pull/2580#issuecomment-798871487


   @n3nash Adding the transition time to the end of the current timeline file 
names, separated by a dot, i.e. t1.commit.requested.t2, seems okay. But I'm 
unsure how the code would handle the case where state=completed, i.e. we only 
have t1.commit.t2, and the parsing somehow has to intelligently recognize that 
t2 is a timestamp and not one of the states. I think we have to do an 
upgrade/downgrade step here for sure, whichever way we go.
   
   Did you have more high-level thoughts on this?
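   
   To illustrate the ambiguity, a toy parser (not Hudi's actual timeline code): 
the trailing token can only be treated as a transition time after it fails to 
match a known state name.
   
   ```java
   import java.util.Set;

   // Toy parser showing why "t1.commit.t2" is ambiguous: the last token must
   // be checked against the known state names before treating it as a
   // transition timestamp.
   public class TimelineNameSketch {
     private static final Set<String> STATES = Set.of("requested", "inflight");

     public static void main(String[] args) {
       String[] names = {"t1.commit.requested.t2", "t1.commit.t2", "t1.commit"};
       for (String name : names) {
         String[] parts = name.split("\\.");
         String last = parts[parts.length - 1];
         boolean hasTransitionTime = parts.length > 2 && !STATES.contains(last);
         System.out.println(name + " -> transitionTime=" + (hasTransitionTime ? last : "none"));
       }
     }
   }
   ```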



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on a change in pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-03-14 Thread GitBox


vinothchandar commented on a change in pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#discussion_r593862114



##
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java
##
@@ -152,9 +155,9 @@ public void write() {
 final String key = keyIterator.next();
 HoodieRecord record = recordMap.get(key);
 if (useWriterSchema) {
-  write(record, 
record.getData().getInsertValue(writerSchemaWithMetafields));
+  write(record, 
record.getData().getInsertValue(inputSchemaWithMetaFields));

Review comment:
   I actually disagree completely :). Leaking such a call hierarchy into a 
lower-level class will lead to more confusion if, say, one more code path uses 
this code.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wosow commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

2021-03-14 Thread GitBox


wosow commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798864560


   > @wosow Were you able to resolve your issue ?
   
   no



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

2021-03-14 Thread GitBox


wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222


   > @wosow : also, few quick questions as we triage the issue.
   > 
   > * Were you running older version of Hudi and encountered this after 
upgrade? in other words, older Hudi version you were able to run successfully 
and with 0.7.0 there is a bug.
   > * Is this affecting your production? trying to gauge the severity.
   > * Or you are trying out a POC ? and this is the first time trying out Hudi.
   
   ```
   There is no impact on the production environment; the problem only occurred 
when testing 0.6.0, and the test was not performed on 0.7.0.
   
   In addition, I have another question. I use sqoop to import data from MySQL 
into HDFS, and then use Spark to read and write the Hudi table. The table type 
is MOR. If I want to use asynchronous compaction, what parameters need to be 
configured? Is asynchronous compaction automatic, or does it need manual 
intervention after being enabled? If compaction has to be run manually on a 
regular basis, what parameters need to be configured, and what are the commands 
for manual compaction? Looking forward to your answer!
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

2021-03-14 Thread GitBox


wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222


   > @wosow : also, few quick questions as we triage the issue.
   > 
   > * Were you running older version of Hudi and encountered this after 
upgrade? in other words, older Hudi version you were able to run successfully 
and with 0.7.0 there is a bug.
   > * Is this affecting your production? trying to gauge the severity.
   > * Or you are trying out a POC ? and this is the first time trying out Hudi.
   
   There is no impact on the production environment; the problem only occurred 
when testing 0.6.0, and the test was not performed on 0.7.0.
   
   In addition, I have another question. I use sqoop to import data from MySQL 
into HDFS, and then use Spark to read and write the Hudi table. The table type 
is MOR. If I want to use asynchronous compaction, what parameters need to be 
configured? Is asynchronous compaction automatic, or does it need manual 
intervention after being enabled? If compaction has to be run manually on a 
regular basis, what parameters need to be configured, and what are the commands 
for manual compaction? Looking forward to your answer!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

2021-03-14 Thread GitBox


wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222


   > @wosow : also, few quick questions as we triage the issue.
   > 
   > * Were you running older version of Hudi and encountered this after 
upgrade? in other words, older Hudi version you were able to run successfully 
and with 0.7.0 there is a bug.
   > * Is this affecting your production? trying to gauge the severity.
   > * Or you are trying out a POC ? and this is the first time trying out Hudi.
   
   There is no impact on the production environment; the problem only occurred 
when testing 0.6.0, and the test was not performed on 0.7.0.
   
   In addition, I have another question. I use sqoop to import data from MySQL 
into HDFS, and then use Spark to read and write the Hudi table. The table type 
is MOR. If I want to use asynchronous compaction, what parameters need to be 
configured? Is asynchronous compaction automatic, or does it need manual 
intervention after being enabled? If compaction has to be run manually on a 
regular basis, what parameters need to be configured, and what are the commands 
for manual compaction? Looking forward to your answer!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

2021-03-14 Thread GitBox


wosow edited a comment on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222


   > @wosow : also, few quick questions as we triage the issue.
   > 
   > * Were you running older version of Hudi and encountered this after 
upgrade? in other words, older Hudi version you were able to run successfully 
and with 0.7.0 there is a bug.
   > * Is this affecting your production? trying to gauge the severity.
   > * Or you are trying out a POC ? and this is the first time trying out Hudi.
   
   There is no impact on the production environment; the problem only occurred 
when testing 0.6.0, and the test was not performed on 0.7.0.
   
   In addition, I have another question. I use sqoop to import data from MySQL 
into HDFS, and then use Spark to read and write the Hudi table. The table type 
is MOR. If I want to use asynchronous compaction, what parameters need to be 
configured? Is asynchronous compaction automatic, or does it need manual 
intervention after being enabled? If compaction has to be run manually on a 
regular basis, what parameters need to be configured, and what are the commands 
for manual compaction? Looking forward to your answer!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] wosow commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables

2021-03-14 Thread GitBox


wosow commented on issue #2409:
URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222


   > @wosow : also, few quick questions as we triage the issue.
   > 
   > * Were you running older version of Hudi and encountered this after 
upgrade? in other words, older Hudi version you were able to run successfully 
and with 0.7.0 there is a bug.
   > * Is this affecting your production? trying to gauge the severity.
   > * Or you are trying out a POC ? and this is the first time trying out Hudi.
   
   There is no impact on the production environment; the problem only occurred 
when testing 0.6.0, and the test was not performed on 0.7.0.
   
   In addition, I have another question. I use sqoop to import data from MySQL 
into HDFS, and then use Spark to read and write the Hudi table. The table type 
is MOR. If I want to use asynchronous compaction, what parameters need to be 
configured? Is asynchronous compaction automatic, or does it need manual 
intervention after being enabled? If compaction has to be run manually on a 
regular basis, what parameters need to be configured, and what are the commands 
for manual compaction? Looking forward to your answer!



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] vinothchandar commented on pull request #2309: [HUDI-1441] - HoodieAvroUtils - rewrite() is not handling evolution o…

2021-03-14 Thread GitBox


vinothchandar commented on pull request #2309:
URL: https://github.com/apache/hudi/pull/2309#issuecomment-798858167


   @n3nash @nbalajee @prashantwason @nsivabalan this PR sounds important, but 
can someone please summarize its state? also this needs a rebase with only the 
necessary changes.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org