[GitHub] [hudi] danny0405 commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…
danny0405 commented on a change in pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#discussion_r594086761

## File path: hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java

```diff
@@ -16,14 +16,14 @@
  * limitations under the License.
  */

-package org.apache.hudi.operator;
+package org.apache.hudi.configuration;

 import org.apache.hudi.common.model.HoodieTableType;
-import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.streamer.FlinkStreamerConfig;
 import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
```

Review comment: `FlinkStreamerConfig` is used only by the streamer, so keeping it under the `streamer` package is more reasonable.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…
danny0405 commented on a change in pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#discussion_r594085950

## File path: hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamerV2.java

```diff
@@ -19,10 +19,10 @@
 package org.apache.hudi.streamer;

 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.operator.FlinkOptions;
-import org.apache.hudi.operator.StreamWriteOperatorFactory;
-import org.apache.hudi.operator.partitioner.BucketAssignFunction;
-import org.apache.hudi.operator.transform.RowDataToHoodieFunction;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.sink.StreamWriteOperatorFactory;
+import org.apache.hudi.sink.partitioner.BucketAssignFunction;
+import org.apache.hudi.sink.transform.RowDataToHoodieFunction;
 import org.apache.hudi.util.AvroSchemaConverter;
```

Review comment: No, this class references Flink; I would prefer to keep the name.
[GitHub] [hudi] yanghua commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…
yanghua commented on a change in pull request #2669:
URL: https://github.com/apache/hudi/pull/2669#discussion_r594077871

## File path: hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java

```diff
@@ -16,14 +16,14 @@
  * limitations under the License.
  */

-package org.apache.hudi.operator;
+package org.apache.hudi.configuration;

 import org.apache.hudi.common.model.HoodieTableType;
-import org.apache.hudi.config.HoodieWriteConfig;
-import org.apache.hudi.streamer.FlinkStreamerConfig;
 import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload;
+import org.apache.hudi.config.HoodieWriteConfig;
 import org.apache.hudi.keygen.SimpleAvroKeyGenerator;
 import org.apache.hudi.keygen.constant.KeyGeneratorOptions;
+import org.apache.hudi.streamer.FlinkStreamerConfig;
```

Review comment: We have a `configuration` subpackage; can we put `FlinkStreamerConfig` into it?

## File path: hudi-flink/src/test/java/org/apache/hudi/sink/StreamWriteOperatorCoordinatorTest.java

```diff
@@ -16,16 +16,16 @@
  * limitations under the License.
  */

-package org.apache.hudi.operator;
+package org.apache.hudi.sink;

 import org.apache.hudi.client.WriteStatus;
 import org.apache.hudi.common.fs.FSUtils;
 import org.apache.hudi.common.model.HoodieWriteStat;
 import org.apache.hudi.common.table.HoodieTableMetaClient;
 import org.apache.hudi.exception.HoodieException;
-import org.apache.hudi.operator.event.BatchWriteSuccessEvent;
-import org.apache.hudi.operator.utils.TestConfigurations;
+import org.apache.hudi.sink.event.BatchWriteSuccessEvent;
 import org.apache.hudi.util.StreamerUtil;
+import org.apache.hudi.utils.TestConfigurations;
```

Review comment: `TestStreamWriteOperatorCoordinator` sounds better?

## File path: hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamerV2.java

```diff
@@ -19,10 +19,10 @@
 package org.apache.hudi.streamer;

 import org.apache.hudi.common.model.HoodieRecord;
-import org.apache.hudi.operator.FlinkOptions;
-import org.apache.hudi.operator.StreamWriteOperatorFactory;
-import org.apache.hudi.operator.partitioner.BucketAssignFunction;
-import org.apache.hudi.operator.transform.RowDataToHoodieFunction;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.sink.StreamWriteOperatorFactory;
+import org.apache.hudi.sink.partitioner.BucketAssignFunction;
+import org.apache.hudi.sink.transform.RowDataToHoodieFunction;
 import org.apache.hudi.util.AvroSchemaConverter;
```

Review comment: We have some classes that follow these patterns, e.g. `Converter`, `Converters`. Can we choose one?
[GitHub] [hudi] codecov-io edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers
codecov-io edited a comment on pull request #2374:
URL: https://github.com/apache/hudi/pull/2374#issuecomment-750782300

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2374?src=pr&el=h1) Report
> Merging [#2374](https://codecov.io/gh/apache/hudi/pull/2374?src=pr&el=desc) (d477189) into [master](https://codecov.io/gh/apache/hudi/commit/2fdae6835ce3fcad3111205d2373a69b34788483?el=desc) (2fdae68) will **decrease** coverage by `42.34%`.
> The diff coverage is `0.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2374/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2374?src=pr&el=tree)

```diff
@@             Coverage Diff              @@
##             master    #2374       +/-   ##
============================================
- Coverage     51.87%    9.52%   -42.35%
+ Complexity     3556       48     -3508
============================================
  Files           465       53      -412
  Lines         22165     1963    -20202
  Branches       2357      235     -2122
============================================
- Hits          11498      187    -11311
+ Misses         9667     1763     -7904
+ Partials       1000       13      -987
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `9.52% <0.00%> (-59.96%)` | `0.00 <0.00> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2374?src=pr&el=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `0.00% <0.00%> (-70.00%)` | `0.00 <0.00> (-52.00)` | | | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | 
[...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | [...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2374/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFy
[GitHub] [hudi] codecov-io edited a comment on pull request #2673: [HUDI-1688]hudi write should uncache rdd, when the write operation is finished
codecov-io edited a comment on pull request #2673:
URL: https://github.com/apache/hudi/pull/2673#issuecomment-799061373

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=h1) Report
> Merging [#2673](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=desc) (a993277) into [master](https://codecov.io/gh/apache/hudi/commit/e93c6a569310ce55c5a0fc0655328e7fd32a9da2?el=desc) (e93c6a5) will **increase** coverage by `17.44%`.
> The diff coverage is `n/a`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2673/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=tree)

```diff
@@             Coverage Diff              @@
##             master    #2673       +/-   ##
============================================
+ Coverage     51.99%   69.43%   +17.44%
+ Complexity     3580      363     -3217
============================================
  Files           466       53      -413
  Lines         22275     1963    -20312
  Branches       2374      235     -2139
============================================
- Hits          11581     1363    -10218
+ Misses         9686      466     -9220
+ Partials       1008      134      -874
```

| Flag | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| hudicli | `?` | `?` | |
| hudiclient | `?` | `?` | |
| hudicommon | `?` | `?` | |
| hudiflink | `?` | `?` | |
| hudihadoopmr | `?` | `?` | |
| hudisparkdatasource | `?` | `?` | |
| hudisync | `?` | `?` | |
| huditimelineservice | `?` | `?` | |
| hudiutilities | `69.43% <ø> (-0.06%)` | `0.00 <ø> (ø)` | |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...apache/hudi/utilities/deltastreamer/DeltaSync.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL2RlbHRhc3RyZWFtZXIvRGVsdGFTeW5jLmphdmE=) | `70.00% <0.00%> (-0.35%)` | `52.00% <0.00%> (-1.00%)` | | | [...che/hudi/common/model/HoodiePartitionMetadata.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZVBhcnRpdGlvbk1ldGFkYXRhLmphdmE=) | | | | | [...apache/hudi/common/fs/inline/InLineFileSystem.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2ZzL2lubGluZS9JbkxpbmVGaWxlU3lzdGVtLmphdmE=) | | | | | [.../hadoop/utils/HoodieRealtimeRecordReaderUtils.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1oYWRvb3AtbXIvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaGFkb29wL3V0aWxzL0hvb2RpZVJlYWx0aW1lUmVjb3JkUmVhZGVyVXRpbHMuamF2YQ==) | | | | | [...di-cli/src/main/java/org/apache/hudi/cli/Main.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL01haW4uamF2YQ==) | | | | | [...n/java/org/apache/hudi/common/model/HoodieKey.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL21vZGVsL0hvb2RpZUtleS5qYXZh) | | | | | [.../common/table/log/block/HoodieLogBlockVersion.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9ibG9jay9Ib29kaWVMb2dCbG9ja1ZlcnNpb24uamF2YQ==) | | | | | 
[...e/hudi/exception/HoodieSerializationException.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZVNlcmlhbGl6YXRpb25FeGNlcHRpb24uamF2YQ==) | | | | | [...e/hudi/exception/HoodieCorruptedDataException.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvZXhjZXB0aW9uL0hvb2RpZUNvcnJ1cHRlZERhdGFFeGNlcHRpb24uamF2YQ==) | | | | | [.../org/apache/hudi/common/engine/EngineProperty.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2VuZ2luZS9FbmdpbmVQcm9wZXJ0eS5qYXZh) | | | | | ... and [403 more](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree-more) | |
[GitHub] [hudi] maxiaoniu commented on issue #2639: [SUPPORT] Spark 3.0.1 upgrade cause severe increase in Hudi write time
maxiaoniu commented on issue #2639:
URL: https://github.com/apache/hudi/issues/2639#issuecomment-799088028

Might be related to this:
```
Important
Amazon EMR 6.1.0 and 6.2.0 include a performance issue that can critically affect all Hudi insert, upsert, and delete operations. If you plan to use Hudi with Amazon EMR 6.1.0 or 6.2.0, you should contact AWS support to obtain a patched Hudi RPM.
```
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html
[jira] [Updated] (HUDI-1690) Fix StackOverflowError while running clustering with large number of partitions
[ https://issues.apache.org/jira/browse/HUDI-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rong Ma updated HUDI-1690:
--------------------------
Description:

We are testing clustering on a Hudi table with about 3000 partitions. The Spark driver throws a StackOverflowError before all the partitions are sorted:

21/03/11 19:51:20 ERROR [main] UtilHelpers: Cluster failed
java.lang.StackOverflowError
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.RangePartitioner.$anonfun$writeObject$1(Partitioner.scala:261)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
    at org.apache.spark.RangePartitioner.writeObject(Partitioner.scala:254)
    at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:477)
    at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    ...

I see a similar issue here: https://stackoverflow.com/questions/30522564/spark-when-union-a-lot-of-rdd-throws-stack-overflow-error

Setting the driver's stack size to 100M still hits this error, so the problem is likely that rdd.union has been called too many times and the resulting RDD lineage is too deep. I think we should use JavaSparkContext.union instead of RDD.union here: https://github.com/apache/hudi/blob/e93c6a569310ce55c5a0fc0655328e7fd32a9da2/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/cluster/SparkExecuteClusteringCommitActionExecutor.java#L96
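The stack trace above, with ObjectOutputStream looping through writeObject0 / defaultWriteFields, is the classic signature of Java default serialization recursing through a deeply nested object graph, which is what a long left-nested chain of rdd.union calls produces in the RDD lineage. As a minimal sketch of the mechanism only (plain JDK serialization; the class and method names below are invented for illustration and are not Hudi or Spark code), a linked chain serializes fine when shallow and overflows the stack when deep:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class DeepLineageDemo {

    // Each node references the previous one, mirroring how every successive
    // rdd.union(...) call wraps the prior UnionRDD one level deeper.
    static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final Node next;
        Node(Node next) { this.next = next; }
    }

    /** Returns true if a chain of the given depth serializes without error. */
    static boolean serializes(int depth) {
        Node head = null;
        for (int i = 0; i < depth; i++) {
            head = new Node(head);
        }
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(head); // default serialization recurses once per node
            return true;
        } catch (StackOverflowError e) {
            return false; // the deep graph blew the serialization stack
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(serializes(100));       // shallow chain: true
        System.out.println(serializes(1_000_000)); // deep chain: false (StackOverflowError)
    }
}
```

In Spark terms, passing the whole list of RDDs to JavaSparkContext.union builds one flat UnionRDD rather than a left-nested chain, which keeps both the lineage depth and the serialization recursion shallow regardless of the partition count.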
[jira] [Created] (HUDI-1690) Fix StackOverflowError while running clustering with large number of partitions
Rong Ma created HUDI-1690:
--------------------------
             Summary: Fix StackOverflowError while running clustering with large number of partitions
                 Key: HUDI-1690
                 URL: https://issues.apache.org/jira/browse/HUDI-1690
             Project: Apache Hudi
          Issue Type: Bug
          Components: Spark Integration
            Reporter: Rong Ma
             Fix For: 0.8.0

We are testing clustering on a Hudi table with about 3000 partitions. The Spark driver throws a StackOverflowError before all the partitions are sorted:

21/03/11 19:51:20 ERROR [main] UtilHelpers: Cluster failed
java.lang.StackOverflowError
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteReplace(ObjectStreamClass.java:1118)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1136)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.RangePartitioner.$anonfun$writeObject$1(Partitioner.scala:261)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)
    at org.apache.spark.RangePartitioner.writeObject(Partitioner.scala:254)
    at sun.reflect.GeneratedMethodAccessor201.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:477)
    at sun.reflect.GeneratedMethodAccessor51.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
    at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
    at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
    at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
    ...

I see a similar issue here: https://stackoverflow.com/questions/30522564/spark-when-union-a-lot-of-rdd-throws-stack-overflow-error

Setting the driver's stack size to 100M still hits this error.
[GitHub] [hudi] pengzhiwei2018 edited a comment on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…
pengzhiwei2018 edited a comment on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-799071055

> @pengzhiwei2018 first of all, thanks for these great contributions.
>
> Wrt inputSchema vs writeSchema, I actually feel writeSchema already stands for inputSchema, input is what is being written, right? We can probably just leave it as is. and introduce new `tableSchema` variables as you have in the `HoodieWriteHandle` class.?
>
> Like someone else pointed out as well, so far, we are using read and write schemas consistently. Love to not introduce a new input schema, unless its absolutely necessary .

Hi @vinothchandar, thanks for your reply on this issue. Yes, in most cases the `writeSchema` is the same as the `inputSchema`, so it can stand in for it. But in the case in this PR (see the test case in [TestCOWDataSource](https://github.com/apache/hudi/pull/2334/files#diff-9429f5bc432f70ea4801e306dd817416b76e6ab68d41a278e222c989ce5c9824)) we write the table twice:

First, we write an "id: long" to the table. The input schema is "a:long" and the table schema is "a:long".
Second, we write an "id: int" to the table. The input schema is "a:int", but the table schema is still "a:long" from the previous write. The write schema must match the table schema, or an exception is thrown, which is the problem we want to solve in this PR.

So in this case we need to distinguish between the `inputSchema` and the `writeSchema`. The `inputSchema` is the incoming records' schema, while the `writeSchema` is always the `tableSchema`.
- The `inputSchema` is used to parse records from the incoming data.
- The `tableSchema` is used to write and read records from the table.

Whenever we write or read records to or from the table, we use the `tableSchema`.
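The distinction being argued for can be sketched in plain Java. This is a hypothetical illustration under invented names (`SchemaSketch`, `FieldType`, `write` are not Hudi APIs): the input schema governs how an incoming record is parsed, while the table schema, fixed by the first write, governs what is stored, so a later int batch still lands in the long column:

```java
import java.util.ArrayList;
import java.util.List;

public class SchemaSketch {

    // Hypothetical stand-in for an Avro numeric field type.
    enum FieldType { INT, LONG }

    // The table schema is fixed by the first write: a long column here.
    static final FieldType TABLE_SCHEMA = FieldType.LONG;
    static final List<Long> TABLE = new ArrayList<>();

    /**
     * Parse with the input schema of the incoming batch, then store using
     * the table schema; returns the value as it lands in the table.
     */
    static long write(String raw, FieldType inputSchema) {
        // inputSchema drives parsing of the incoming record...
        long value = (inputSchema == FieldType.INT)
                ? Integer.parseInt(raw) // int input is widened: int -> long is lossless
                : Long.parseLong(raw);
        // ...but what is stored always follows TABLE_SCHEMA (long).
        TABLE.add(value);
        return value;
    }

    public static void main(String[] args) {
        write("42", FieldType.LONG); // first write: input schema matches table schema
        write("7", FieldType.INT);   // second write: int input, still stored as long
        System.out.println(TABLE);   // [42, 7]
    }
}
```

The int-to-long direction works because it is a widening conversion; the reverse (long input into an int table column) could lose data, which is why mismatches in that direction should fail rather than be coerced silently.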
[GitHub] [hudi] codecov-io commented on pull request #2673: [HUDI-1688]hudi write should uncache rdd, when the write operation is finished
codecov-io commented on pull request #2673: URL: https://github.com/apache/hudi/pull/2673#issuecomment-799061373 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=h1) Report > Merging [#2673](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=desc) (e391c24) into [master](https://codecov.io/gh/apache/hudi/commit/e93c6a569310ce55c5a0fc0655328e7fd32a9da2?el=desc) (e93c6a5) will **decrease** coverage by `42.46%`. > The diff coverage is `n/a`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2673/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=tree) ```diff @@ Coverage Diff @@ ## master #2673 +/- ## - Coverage 51.99% 9.52% -42.47% + Complexity 3580 48 -3532 Files 466 53 -413 Lines 22275 1963 -20312 Branches 2374 235 -2139 - Hits 11581 187 -11394 + Misses 9686 1763 -7923 + Partials 1008 13 -995 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `?` | `?` | | | hudiclient | `?` | `?` | | | hudicommon | `?` | `?` | | | hudiflink | `?` | `?` | | | hudihadoopmr | `?` | `?` | | | hudisparkdatasource | `?` | `?` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `9.52% <ø> (-59.96%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2673?src=pr&el=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/utilities/IdentitySplitter.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL0lkZW50aXR5U3BsaXR0ZXIuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-2.00%)` | | | [...va/org/apache/hudi/utilities/schema/SchemaSet.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFTZXQuamF2YQ==) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-3.00%)` | | | [...a/org/apache/hudi/utilities/sources/RowSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUm93U291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [.../org/apache/hudi/utilities/sources/AvroSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQXZyb1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [.../org/apache/hudi/utilities/sources/JsonSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvblNvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-1.00%)` | | | [...rg/apache/hudi/utilities/sources/CsvDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvQ3N2REZTU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-10.00%)` | | | 
[...g/apache/hudi/utilities/sources/JsonDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkRGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-4.00%)` | | | [...apache/hudi/utilities/sources/JsonKafkaSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvSnNvbkthZmthU291cmNlLmphdmE=) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-6.00%)` | | | [...pache/hudi/utilities/sources/ParquetDFSSource.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NvdXJjZXMvUGFycXVldERGU1NvdXJjZS5qYXZh) | `0.00% <0.00%> (-100.00%)` | `0.00% <0.00%> (-5.00%)` | | | [...lities/schema/SchemaProviderWithPostProcessor.java](https://codecov.io/gh/apache/hudi/pull/2673/diff?src=pr&el=tree#diff-aHVkaS11dGlsaXRpZXMvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvdXRpbGl0aWVzL3NjaGVtYS9TY2hlbWFQcm92aWRlc
[GitHub] [hudi] pengzhiwei2018 commented on a change in pull request #2651: [HUDI-1591] [RFC-26] Improve Hoodie Table Query Performance And Ease Of Use Fo…
pengzhiwei2018 commented on a change in pull request #2651: URL: https://github.com/apache/hudi/pull/2651#discussion_r594023224 ## File path: hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala ## @@ -79,39 +82,58 @@ class DefaultSource extends RelationProvider val allPaths = path.map(p => Seq(p)).getOrElse(Seq()) ++ readPaths val fs = FSUtils.getFs(allPaths.head, sqlContext.sparkContext.hadoopConfiguration) -val globPaths = HoodieSparkUtils.checkAndGlobPathIfNecessary(allPaths, fs) - -val tablePath = DataSourceUtils.getTablePath(fs, globPaths.toArray) +// Use the HoodieFileIndex only if the 'path' has specified with no "*" contains. +// And READ_PATHS_OPT_KEY has not specified. +// Or else we use the original way to read hoodie table. Review comment: A Jira has been opened at [HUDI-1689](https://issues.apache.org/jira/browse/HUDI-1689) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
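The dispatch rule described in the diff's comment — use the `HoodieFileIndex` read path only when the single `path` option contains no glob pattern and no extra read paths were supplied via `READ_PATHS_OPT_KEY` — can be sketched as follows. This is a hedged illustration with a hypothetical helper class, not the actual `DefaultSource` code:

```java
import java.util.List;

// Hypothetical sketch of the read-path dispatch rule: fall back to the
// original glob-based read whenever wildcards or extra read paths appear.
public class ReadPathDispatch {

    // A path counts as a glob if it contains any wildcard character.
    static boolean isGlobPath(String path) {
        return path.contains("*") || path.contains("?")
            || path.contains("[") || path.contains("{");
    }

    // Use HoodieFileIndex only if a single non-glob 'path' is given and no
    // extra read paths were supplied (READ_PATHS_OPT_KEY unset).
    static boolean useFileIndex(String path, List<String> readPaths) {
        return path != null && !isGlobPath(path) && readPaths.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(useFileIndex("s3://bucket/tbl", List.of()));        // file index
        System.out.println(useFileIndex("s3://bucket/tbl/*/*", List.of()));    // glob fallback
        System.out.println(useFileIndex("s3://bucket/tbl", List.of("s3://bucket/other"))); // fallback
    }
}
```

The benefit of taking the file-index branch is partition pruning on the table metadata; the glob branch preserves the existing multi-path behavior until HUDI-1689 extends the file index to handle it.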
[GitHub] [hudi] xiarixiaoyao commented on pull request #2673: [HUDI-1688]hudi write should uncache rdd, when the write operation is finished
xiarixiaoyao commented on pull request #2673: URL: https://github.com/apache/hudi/pull/2673#issuecomment-799059684 cc @garyli1019 , could you help review this PR? Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (HUDI-1688) hudi write should uncache rdd, when the write operation is finished
[ https://issues.apache.org/jira/browse/HUDI-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HUDI-1688: - Labels: pull-request-available (was: ) > hudi write should uncache rdd, when the write operation is finished > > > Key: HUDI-1688 > URL: https://issues.apache.org/jira/browse/HUDI-1688 > Project: Apache Hudi > Issue Type: Bug > Components: Spark Integration >Affects Versions: 0.7.0 >Reporter: tao meng >Priority: Major > Labels: pull-request-available > Fix For: 0.8.0 > > > Hudi improves write performance by caching necessary RDDs; however, when the > write operation is finished, those cached RDDs are not uncached, which > wastes a lot of memory. > [https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L115] > https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L214 > In our environment: > step1: insert 100GB of data into a hudi table by spark (ok) > step2: insert another 100GB of data into the hudi table by spark again (OOM) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] xiarixiaoyao opened a new pull request #2673: [HUDI-1688]hudi write should uncache rdd, when the write operation is finished
xiarixiaoyao opened a new pull request #2673: URL: https://github.com/apache/hudi/pull/2673 ## *Tips* - *Thank you very much for contributing to Apache Hudi.* - *Please review https://hudi.apache.org/contributing.html before opening a pull request.* ## What is the purpose of the pull request Fix the bug that hudi cannot uncache RDDs correctly. Currently, hudi improves write performance by caching necessary RDDs; however, when the write operation is finished, those cached RDDs are not uncached, which wastes a lot of memory. https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L115 https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L214 In our environment: step1: insert 100GB of data into a hudi table by spark (ok) step2: insert another 100GB of data into the hudi table by spark again (OOM) ## Brief change log Uncache RDDs when the write operation is finished. ## Verify this pull request Existing UT ## Committer checklist - [ ] Has a corresponding JIRA in PR title & commit - [ ] Commit message is descriptive of the change - [ ] CI is green - [ ] Necessary doc changes done or have another open PR - [ ] For large changes, please consider breaking it into sub-tasks under an umbrella JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
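The fix this PR describes — releasing every RDD cached during a write once the commit completes — follows the standard persist/unpersist discipline. A Spark-free sketch of the pattern (hypothetical `CacheTracker` class, not the actual `BaseSparkCommitActionExecutor` code):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: track everything persisted during a write and release
// it once the write finishes, so neither a successful nor a failed write
// leaks cached data (the leak this PR fixes for Spark RDDs).
public class CacheTracker {

    private final Deque<Runnable> unpersistActions = new ArrayDeque<>();

    // Register the release action at the moment the resource is cached,
    // mirroring rdd.persist(...) paired with a later rdd.unpersist().
    void trackPersisted(Runnable unpersist) {
        unpersistActions.push(unpersist);
    }

    // Call after the commit completes (success or failure).
    void releaseAll() {
        while (!unpersistActions.isEmpty()) {
            unpersistActions.pop().run();
        }
    }

    public static void main(String[] args) {
        CacheTracker tracker = new CacheTracker();
        int[] cachedCount = {0};
        try {
            cachedCount[0]++;                               // rdd.persist(...)
            tracker.trackPersisted(() -> cachedCount[0]--); // schedule rdd.unpersist()
            // ... write records, build WriteStatus, commit ...
        } finally {
            tracker.releaseAll();                           // nothing stays cached
        }
        System.out.println("cached RDDs after write: " + cachedCount[0]);
    }
}
```

Without the `finally`-style release, each write leaves its cached RDDs resident, which explains the reported behavior: the first 100GB insert succeeds, and the second one runs out of memory.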
[jira] [Assigned] (HUDI-1689) Support Multipath query for HoodieFileIndex
[ https://issues.apache.org/jira/browse/HUDI-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] pengzhiwei reassigned HUDI-1689: Assignee: pengzhiwei > Support Multipath query for HoodieFileIndex > --- > > Key: HUDI-1689 > URL: https://issues.apache.org/jira/browse/HUDI-1689 > Project: Apache Hudi > Issue Type: Improvement > Components: Spark Integration >Reporter: pengzhiwei >Assignee: pengzhiwei >Priority: Major > > Support multipath query for the HoodieFileIndex to benefit from partition > pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (HUDI-1689) Support Multipath query for HoodieFileIndex
pengzhiwei created HUDI-1689: Summary: Support Multipath query for HoodieFileIndex Key: HUDI-1689 URL: https://issues.apache.org/jira/browse/HUDI-1689 Project: Apache Hudi Issue Type: Improvement Components: Spark Integration Reporter: pengzhiwei Support multipath query for the HoodieFileIndex to benefit from partition pruning. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] codecov-io edited a comment on pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…
codecov-io edited a comment on pull request #2669: URL: https://github.com/apache/hudi/pull/2669#issuecomment-797515929 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=h1) Report > Merging [#2669](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=desc) (f632e7a) into [master](https://codecov.io/gh/apache/hudi/commit/20786ab8a2a1e7735ab846e92802fb9f4449adc9?el=desc) (20786ab) will **decrease** coverage by `0.04%`. > The diff coverage is `100.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2669/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=tree) ```diff @@ Coverage Diff @@ ## master #2669 +/- ## - Coverage 52.00% 51.96% -0.05% + Complexity 3579 3578 -1 Files 465 466 +1 Lines 22268 22275 +7 Branches 2375 2374 -1 - Hits 11581 11575 -6 - Misses 9676 9690 +14 + Partials 1011 1010 -1 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `51.44% <ø> (-0.08%)` | `0.00 <ø> (ø)` | | | hudiflink | `53.57% <100.00%> (ø)` | `0.00 <2.00> (ø)` | | | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisparkdatasource | `69.84% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisync | `49.62% <ø> (ø)` | `0.00 <ø> (ø)` | | | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiutilities | `69.43% <ø> (-0.11%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh) | `85.49% <ø> (ø)` | `6.00 <0.00> (?)` | | | [...rg/apache/hudi/schema/FilebasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zY2hlbWEvRmlsZWJhc2VkU2NoZW1hUHJvdmlkZXIuamF2YQ==) | `29.16% <ø> (ø)` | `2.00 <0.00> (ø)` | | | [...src/main/java/org/apache/hudi/sink/CommitSink.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0NvbW1pdFNpbmsuamF2YQ==) | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | | | [.../org/apache/hudi/sink/InstantGenerateOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0luc3RhbnRHZW5lcmF0ZU9wZXJhdG9yLmphdmE=) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | [...rg/apache/hudi/sink/KeyedWriteProcessFunction.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0tleWVkV3JpdGVQcm9jZXNzRnVuY3Rpb24uamF2YQ==) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | [...rg/apache/hudi/sink/KeyedWriteProcessOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0tleWVkV3JpdGVQcm9jZXNzT3BlcmF0b3IuamF2YQ==) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | [...java/org/apache/hudi/sink/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlRnVuY3Rpb24uamF2YQ==) | `84.00% <ø> (ø)` | 
`22.00 <0.00> (?)` | | | [...java/org/apache/hudi/sink/StreamWriteOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3IuamF2YQ==) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | [...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh) | `69.13% <ø> (ø)` | `32.00 <0.00> (?)` | | | [...g/apache/hudi/sink/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JGYWN0b3J5LmphdmE=) |
[GitHub] [hudi] codecov-io edited a comment on pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pa…
codecov-io edited a comment on pull request #2669: URL: https://github.com/apache/hudi/pull/2669#issuecomment-797515929 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=h1) Report > Merging [#2669](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=desc) (f632e7a) into [master](https://codecov.io/gh/apache/hudi/commit/20786ab8a2a1e7735ab846e92802fb9f4449adc9?el=desc) (20786ab) will **decrease** coverage by `0.19%`. > The diff coverage is `100.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2669/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=tree) ```diff @@ Coverage Diff @@ ## master #2669 +/- ## - Coverage 52.00% 51.81% -0.20% + Complexity 3579 3388 -191 Files 465 445 -20 Lines 22268 20764 -1504 Branches 2375 2229 -146 - Hits 11581 10759 -822 + Misses 9676 9070 -606 + Partials 1011 935 -76 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `37.01% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `51.44% <ø> (-0.08%)` | `0.00 <ø> (ø)` | | | hudiflink | `53.57% <100.00%> (ø)` | `0.00 <2.00> (ø)` | | | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisparkdatasource | `69.84% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisync | `?` | `?` | | | huditimelineservice | `?` | `?` | | | hudiutilities | `69.43% <ø> (-0.11%)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2669?src=pr&el=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [...va/org/apache/hudi/configuration/FlinkOptions.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9jb25maWd1cmF0aW9uL0ZsaW5rT3B0aW9ucy5qYXZh) | `85.49% <ø> (ø)` | `6.00 <0.00> (?)` | | | [...rg/apache/hudi/schema/FilebasedSchemaProvider.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zY2hlbWEvRmlsZWJhc2VkU2NoZW1hUHJvdmlkZXIuamF2YQ==) | `29.16% <ø> (ø)` | `2.00 <0.00> (ø)` | | | [...src/main/java/org/apache/hudi/sink/CommitSink.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0NvbW1pdFNpbmsuamF2YQ==) | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | | | [.../org/apache/hudi/sink/InstantGenerateOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0luc3RhbnRHZW5lcmF0ZU9wZXJhdG9yLmphdmE=) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | [...rg/apache/hudi/sink/KeyedWriteProcessFunction.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0tleWVkV3JpdGVQcm9jZXNzRnVuY3Rpb24uamF2YQ==) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | [...rg/apache/hudi/sink/KeyedWriteProcessOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL0tleWVkV3JpdGVQcm9jZXNzT3BlcmF0b3IuamF2YQ==) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | [...java/org/apache/hudi/sink/StreamWriteFunction.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlRnVuY3Rpb24uamF2YQ==) | `84.00% <ø> (ø)` | 
`22.00 <0.00> (?)` | | | [...java/org/apache/hudi/sink/StreamWriteOperator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3IuamF2YQ==) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | [...ache/hudi/sink/StreamWriteOperatorCoordinator.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JDb29yZGluYXRvci5qYXZh) | `69.13% <ø> (ø)` | `32.00 <0.00> (?)` | | | [...g/apache/hudi/sink/StreamWriteOperatorFactory.java](https://codecov.io/gh/apache/hudi/pull/2669/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zaW5rL1N0cmVhbVdyaXRlT3BlcmF0b3JGYWN0b3J5LmphdmE=) | `0.00% <ø> (ø)` | `0.00 <0.00> (?)` | | | ...
[jira] [Created] (HUDI-1688) hudi write should uncache rdd, when the write operation is finished
tao meng created HUDI-1688: -- Summary: hudi write should uncache rdd, when the write operation is finished Key: HUDI-1688 URL: https://issues.apache.org/jira/browse/HUDI-1688 Project: Apache Hudi Issue Type: Bug Components: Spark Integration Affects Versions: 0.7.0 Reporter: tao meng Fix For: 0.8.0 Hudi improves write performance by caching necessary RDDs; however, when the write operation is finished, those cached RDDs are not uncached, which wastes a lot of memory. [https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L115] https://github.com/apache/hudi/blob/master/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java#L214 In our environment: step1: insert 100GB of data into a hudi table by spark (ok) step2: insert another 100GB of data into the hudi table by spark again (OOM) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [hudi] danny0405 commented on pull request #2309: [HUDI-1441] - HoodieAvroUtils - rewrite() is not handling evolution o…
danny0405 commented on pull request #2309: URL: https://github.com/apache/hudi/pull/2309#issuecomment-799047482 > @n3nash @nbalajee @prashantwason @nsivabalan this PR sounds important, but can someone please summarize its state? also this needs a rebase with only the necessary changes. The changes overall look good from my side, but this PR needs a rebase because it conflicts with many commits from the master branch. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] danny0405 commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and re-organize the pa…
danny0405 commented on a change in pull request #2669: URL: https://github.com/apache/hudi/pull/2669#discussion_r594008808 ## File path: hudi-flink/src/main/java/org/apache/hudi/source/StreamWriteFunction.java ## @@ -16,7 +16,7 @@ * limitations under the License. */ -package org.apache.hudi.operator; +package org.apache.hudi.source; Review comment: Nope, thanks for the reminder ~ This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] root18039532923 commented on issue #2623: org.apache.hudi.exception.HoodieDependentSystemUnavailableException:System HBASE unavailable.
root18039532923 commented on issue #2623: URL: https://github.com/apache/hudi/issues/2623#issuecomment-799044094 @n3nash This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] yanghua commented on a change in pull request #2669: [HUDI-1684] Tweak hudi-flink-bundle module pom and re-organize the pa…
yanghua commented on a change in pull request #2669: URL: https://github.com/apache/hudi/pull/2669#discussion_r593996516 ## File path: hudi-flink/src/main/java/org/apache/hudi/source/StreamWriteFunction.java ## @@ -16,7 +16,7 @@ * limitations under the License. */ -package org.apache.hudi.operator; +package org.apache.hudi.source; Review comment: Should it be put into the `source` package? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci
hudi-bot edited a comment on pull request #2643: URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481 ## CI report: * 9831a6c50e9f49f8a71c02fc6ac50ae1446f7c1f UNKNOWN * 7a11c73f54424450987d56a49caf42eea092fce0 Azure: [SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=125) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci
hudi-bot edited a comment on pull request #2643: URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481 ## CI report: * c93712492faf19e818194762ec7a05976c79659e Azure: [SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=123) * 9831a6c50e9f49f8a71c02fc6ac50ae1446f7c1f UNKNOWN * 7a11c73f54424450987d56a49caf42eea092fce0 Azure: [PENDING](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=125) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci
hudi-bot edited a comment on pull request #2643: URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481 ## CI report: * c93712492faf19e818194762ec7a05976c79659e Azure: [SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=123) * 9831a6c50e9f49f8a71c02fc6ac50ae1446f7c1f UNKNOWN * 7a11c73f54424450987d56a49caf42eea092fce0 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci
hudi-bot edited a comment on pull request #2643: URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481 ## CI report: * c93712492faf19e818194762ec7a05976c79659e Azure: [SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=123) * 9831a6c50e9f49f8a71c02fc6ac50ae1446f7c1f UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run travis` re-run the last Travis build - `@hudi-bot run azure` re-run the last Azure build This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] codecov-io edited a comment on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
codecov-io edited a comment on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-767956391 # [Codecov](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=h1) Report > Merging [#2494](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=desc) (5557e1d) into [master](https://codecov.io/gh/apache/hudi/commit/d8af24d8a2fdbead4592a36df1bd9dda333f1513?el=desc) (d8af24d) will **increase** coverage by `0.38%`. > The diff coverage is `0.00%`. [![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2494/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=tree) ```diff @@ Coverage Diff @@ ## master #2494 +/- ## + Coverage 51.53% 51.92% +0.38% - Complexity 3491 3579 +88 Files 462 466 +4 Lines 21881 22295 +414 Branches 2327 2378 +51 + Hits 11277 11576 +299 - Misses 9624 9710 +86 - Partials 980 1009 +29 ``` | Flag | Coverage Δ | Complexity Δ | | |---|---|---|---| | hudicli | `37.01% <0.00%> (ø)` | `0.00 <0.00> (ø)` | | | hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudicommon | `51.33% <0.00%> (-0.15%)` | `0.00 <0.00> (ø)` | | | hudiflink | `53.57% <ø> (+3.22%)` | `0.00 <ø> (ø)` | | | hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisparkdatasource | `69.84% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudisync | `49.62% <ø> (ø)` | `0.00 <ø> (ø)` | | | huditimelineservice | `64.36% <ø> (ø)` | `0.00 <ø> (ø)` | | | hudiutilities | `69.48% <ø> (ø)` | `0.00 <ø> (ø)` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more. 
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=tree) | Coverage Δ | Complexity Δ | | |---|---|---|---| | [.../org/apache/hudi/cli/commands/MetadataCommand.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL01ldGFkYXRhQ29tbWFuZC5qYXZh) | `1.07% <0.00%> (ø)` | `1.00 <0.00> (ø)` | | | [...pache/hudi/common/config/HoodieMetadataConfig.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Ib29kaWVNZXRhZGF0YUNvbmZpZy5qYXZh) | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | | | [.../hudi/common/table/view/FileSystemViewManager.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvRmlsZVN5c3RlbVZpZXdNYW5hZ2VyLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | | | [.../org/apache/hudi/io/storage/HoodieHFileReader.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vc3RvcmFnZS9Ib29kaWVIRmlsZVJlYWRlci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | | | [...pache/hudi/metadata/HoodieBackedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllQmFja2VkVGFibGVNZXRhZGF0YS5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | | | [.../org/apache/hudi/metadata/HoodieTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllVGFibGVNZXRhZGF0YS5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | | | 
[...hudi/source/format/cow/CopyOnWriteInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvZm9ybWF0L2Nvdy9Db3B5T25Xcml0ZUlucHV0Rm9ybWF0LmphdmE=) | `56.08% <0.00%> (-29.22%)` | `20.00% <0.00%> (+14.00%)` | :arrow_down: | | [...java/org/apache/hudi/source/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvSG9vZGllVGFibGVTb3VyY2UuamF2YQ==) | `67.54% <0.00%> (-8.11%)` | `28.00% <0.00%> (-6.00%)` | | | [...che/hudi/common/table/log/HoodieLogFileReader.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGaWxlUmVhZGVyLmphdmE=) | `66.09% <0.00%> (-1.77%)` | `23.00% <0.00%> (+1.00%)` | :arrow_down: | | [...ies/sources/helpers/DatePartitionPathSelector.java](https://codecov.io/
[GitHub] [hudi] codecov-io edited a comment on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
codecov-io edited a comment on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-767956391

# [Codecov](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=h1) Report

> Merging [#2494](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=desc) (5557e1d) into [master](https://codecov.io/gh/apache/hudi/commit/d8af24d8a2fdbead4592a36df1bd9dda333f1513?el=desc) (d8af24d) will **increase** coverage by `0.23%`.
> The diff coverage is `0.00%`.

[![Impacted file tree graph](https://codecov.io/gh/apache/hudi/pull/2494/graphs/tree.svg?width=650&height=150&src=pr&token=VTTXabwbs2)](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=tree)

```diff
@@             Coverage Diff              @@
##             master    #2494      +/-   ##
============================================
+ Coverage     51.53%   51.77%     +0.23%
+ Complexity     3491     3389       -102
  Files           462      445        -17
  Lines         21881    20784      -1097
  Branches       2327     2233        -94
- Hits          11277    10760       -517
+ Misses         9624     9090       -534
+ Partials        980      934        -46
============================================
```

| Flag | Coverage Δ | Complexity Δ |
|---|---|---|
| hudicli | `37.01% <0.00%> (ø)` | `0.00 <0.00> (ø)` |
| hudiclient | `100.00% <ø> (ø)` | `0.00 <ø> (ø)` |
| hudicommon | `51.33% <0.00%> (-0.15%)` | `0.00 <0.00> (ø)` |
| hudiflink | `53.57% <ø> (+3.22%)` | `0.00 <ø> (ø)` |
| hudihadoopmr | `33.44% <ø> (ø)` | `0.00 <ø> (ø)` |
| hudisparkdatasource | `69.84% <ø> (ø)` | `0.00 <ø> (ø)` |
| hudisync | `?` | `?` |
| huditimelineservice | `?` | `?` |
| hudiutilities | `69.48% <ø> (ø)` | `0.00 <ø> (ø)` |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags#carryforward-flags-in-the-pull-request-comment) to find out more.
| [Impacted Files](https://codecov.io/gh/apache/hudi/pull/2494?src=pr&el=tree) | Coverage Δ | Complexity Δ | |
|---|---|---|---|
| [.../org/apache/hudi/cli/commands/MetadataCommand.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jbGkvc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY2xpL2NvbW1hbmRzL01ldGFkYXRhQ29tbWFuZC5qYXZh) | `1.07% <0.00%> (ø)` | `1.00 <0.00> (ø)` | |
| [...pache/hudi/common/config/HoodieMetadataConfig.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL2NvbmZpZy9Ib29kaWVNZXRhZGF0YUNvbmZpZy5qYXZh) | `0.00% <ø> (ø)` | `0.00 <0.00> (ø)` | |
| [.../hudi/common/table/view/FileSystemViewManager.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL3ZpZXcvRmlsZVN5c3RlbVZpZXdNYW5hZ2VyLmphdmE=) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [.../org/apache/hudi/io/storage/HoodieHFileReader.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvaW8vc3RvcmFnZS9Ib29kaWVIRmlsZVJlYWRlci5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [...pache/hudi/metadata/HoodieBackedTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllQmFja2VkVGFibGVNZXRhZGF0YS5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [.../org/apache/hudi/metadata/HoodieTableMetadata.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvbWV0YWRhdGEvSG9vZGllVGFibGVNZXRhZGF0YS5qYXZh) | `0.00% <0.00%> (ø)` | `0.00 <0.00> (ø)` | |
| [...hudi/source/format/cow/CopyOnWriteInputFormat.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvZm9ybWF0L2Nvdy9Db3B5T25Xcml0ZUlucHV0Rm9ybWF0LmphdmE=) | `56.08% <0.00%> (-29.22%)` | `20.00% <0.00%> (+14.00%)` | :arrow_down: |
| [...java/org/apache/hudi/source/HoodieTableSource.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1mbGluay9zcmMvbWFpbi9qYXZhL29yZy9hcGFjaGUvaHVkaS9zb3VyY2UvSG9vZGllVGFibGVTb3VyY2UuamF2YQ==) | `67.54% <0.00%> (-8.11%)` | `28.00% <0.00%> (-6.00%)` | |
| [...che/hudi/common/table/log/HoodieLogFileReader.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#diff-aHVkaS1jb21tb24vc3JjL21haW4vamF2YS9vcmcvYXBhY2hlL2h1ZGkvY29tbW9uL3RhYmxlL2xvZy9Ib29kaWVMb2dGaWxlUmVhZGVyLmphdmE=) | `66.09% <0.00%> (-1.77%)` | `23.00% <0.00%> (+1.00%)` | :arrow_down: |
| [...ies/sources/helpers/DatePartitionPathSelector.java](https://codecov.io/gh/apache/hudi/pull/2494/diff?src=pr&el=tree#dif
[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci
hudi-bot edited a comment on pull request #2643: URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481

## CI report:

* c93712492faf19e818194762ec7a05976c79659e Azure: [SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=123)

Bot commands — @hudi-bot supports the following commands:
- `@hudi-bot run travis` re-run the last Travis build
- `@hudi-bot run azure` re-run the last Azure build

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [hudi] vinothchandar commented on pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on pull request #2494: URL: https://github.com/apache/hudi/pull/2494#issuecomment-798971624 @prashantwason We can remove the reuse configuration, i.e. there is no need for this behavior to be user controlled. But ultimately, we still need to close everything out where the metadata table is opened from executors. I am going to just introduce a boolean variable within `HoodieBackedTableMetadata`.
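A minimal sketch of the kind of boolean-controlled reuse described in the comment above (the class and member names here are illustrative stand-ins, not the actual `HoodieBackedTableMetadata` code): a `reuse` flag set at construction decides whether the underlying readers stay open across key lookups or are closed after every lookup.

```java
// Illustrative sketch only -- not the actual HoodieBackedTableMetadata code.
// A 'reuse' flag decides whether the base-file/log readers stay open across
// key lookups (driver side) or are closed after every lookup (executor side).
class MetadataReaderHolder {
  private final boolean reuse;   // true: keep readers open and reuse them
  private Object reader;         // stand-in for the HFile reader + log scanner
  private int opens = 0;         // how many times readers were (re)opened

  MetadataReaderHolder(boolean reuse) {
    this.reuse = reuse;
  }

  int openCount() {
    return opens;
  }

  String lookup(String key) {
    if (reader == null) {        // open readers lazily on first use
      reader = new Object();
      opens++;
    }
    try {
      return key;                // stand-in for merging base-file + log records
    } finally {
      if (!reuse) {              // executors must not leak open readers
        close();
      }
    }
  }

  void close() {                 // also called once at the end on the driver
    reader = null;
  }
}
```

With `reuse = false` every lookup reopens the readers; with `reuse = true` they are opened once and closed explicitly at the end, which is the trade-off the comment is weighing.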
[GitHub] [hudi] hudi-bot edited a comment on pull request #2643: DO NOT MERGE (Azure CI) test branch ci
hudi-bot edited a comment on pull request #2643: URL: https://github.com/apache/hudi/pull/2643#issuecomment-792368481

## CI report:

* e6557a7ab08c1867dfee54360a0c76adfaf3d233 Azure: [SUCCESS](https://dev.azure.com/XUSH0012/0ef433cc-d4b4-47cc-b6a1-03d032ef546c/_build/results?buildId=115)
* c93712492faf19e818194762ec7a05976c79659e UNKNOWN
[jira] [Resolved] (HUDI-1673) Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex
[ https://issues.apache.org/jira/browse/HUDI-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shenh062326 resolved HUDI-1673.

Resolution: Fixed

> Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex
> Key: HUDI-1673
> URL: https://issues.apache.org/jira/browse/HUDI-1673
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: shenh062326
> Assignee: shenh062326
> Priority: Major
> Labels: pull-request-available

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (HUDI-1673) Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex
[ https://issues.apache.org/jira/browse/HUDI-1673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shenh062326 reassigned HUDI-1673.

Assignee: shenh062326

> Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex
> Key: HUDI-1673
> URL: https://issues.apache.org/jira/browse/HUDI-1673
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: shenh062326
> Assignee: shenh062326
> Priority: Major
> Labels: pull-request-available
[GitHub] [hudi] vinothchandar commented on pull request #2612: [HUDI-1563] Adding hudi file sizing/ small file management blog
vinothchandar commented on pull request #2612: URL: https://github.com/apache/hudi/pull/2612#issuecomment-798876972 @nsivabalan my changes are in. Please feel free to land, once you take a pass.
[GitHub] [hudi] n3nash edited a comment on pull request #2374: [HUDI-845] Added locking capability to allow multiple writers
n3nash edited a comment on pull request #2374: URL: https://github.com/apache/hudi/pull/2374#issuecomment-798875296 @vinothchandar It's possible to allow backfills using spark-sql, but there are some corner cases. Consider the following: 1. Ingestion job running with commit c4 (checkpoint = c3). 2. Ingestion job finishes with commit c4 (checkpoint = c3). 3. Someone runs a spark-sql job to backfill some data in an older partition, commit c5. Since this spark-sql job (unlike deltastreamer) does not handle copying the checkpoint from the previous commit metadata to the next, it would be the client's job to do this. 4. If they fail to do this, deltastreamer's next ingestion c6 will read no checkpoint from c5. I've made the following changes: 1) To make this manageable, I've added the config `hoodie.write.meta.key.prefixes`. When set, during the critical section all metadata for keys matching the configured prefix is copied from the latest commit metadata to the current commit. 2) Made changes and added these multi-writer tests to `HoodieDeltaStreamer` as well. Technically, one can do the backfill using `HoodieDeltaStreamer` or Spark-SQL. For `HoodieDeltaStreamer`, they would have to set a custom checkpoint or mark it null to ensure that the job just picks up the data from the backfill location; for Spark-SQL it would not matter. Yes, I am going to add documentation on best practices / things to watch out for in the other PR I opened for documentation. I will do that after resolving any further comments and landing this PR in the next couple of days. NOTE: The test may be failing because of some Thread.sleep related code that I'm trying to remove. Will update tomorrow.
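The prefix-based metadata carry-over described in change (1) above could look roughly like this (the config key comes from the comment itself; the class, method, and checkpoint-key names below are simplified stand-ins for Hudi's commit-metadata handling, not the actual implementation):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in: copy extra-metadata entries whose keys match any of the
// configured prefixes from the latest completed commit into the commit that is
// currently being finalized, so checkpoints survive an interleaved backfill.
class MetadataCarryOver {

  /**
   * @param keyPrefixes     comma-separated value of hoodie.write.meta.key.prefixes
   * @param latestMetadata  extra metadata of the latest completed commit
   * @param currentMetadata extra metadata written by the current commit
   */
  static Map<String, String> carryOver(String keyPrefixes,
                                       Map<String, String> latestMetadata,
                                       Map<String, String> currentMetadata) {
    Map<String, String> merged = new HashMap<>(currentMetadata);
    for (String prefix : keyPrefixes.split(",")) {
      String p = prefix.trim();
      if (p.isEmpty()) {
        continue;
      }
      for (Map.Entry<String, String> e : latestMetadata.entrySet()) {
        // Copy every key matching the prefix unless the current writer already
        // set it -- the current commit wins on conflict.
        if (e.getKey().startsWith(p) && !merged.containsKey(e.getKey())) {
          merged.put(e.getKey(), e.getValue());
        }
      }
    }
    return merged;
  }
}
```

This is what step 3 of the scenario needs: a spark-sql backfill commit c5 that carries the deltastreamer checkpoint forward, so ingestion c6 still finds it.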
[hudi] branch master updated (f5e31be -> e93c6a5)
This is an automated email from the ASF dual-hosted git repository. vinoth pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/hudi.git.

from f5e31be [HUDI-1685] keep updating current date for every batch (#2671)
add e93c6a5 [HUDI-1496] Fixing input stream detection of GCS FileSystem (#2500)

No new revisions were added by this update.

Summary of changes:
.../java/org/apache/hudi/common/fs/FSUtils.java | 11 ++--
...uard.java => SchemeAwareFSDataInputStream.java} | 35 ++-
.../hudi/common/table/log/HoodieLogFileReader.java | 71 --
.../common/table/log/block/HoodieLogBlock.java | 26 +---
4 files changed, 81 insertions(+), 62 deletions(-)
copy hudi-common/src/main/java/org/apache/hudi/common/fs/{NoOpConsistencyGuard.java => SchemeAwareFSDataInputStream.java} (52%)
[GitHub] [hudi] vinothchandar merged pull request #2500: [HUDI-1496] Fixing detection of GCS FileSystem
vinothchandar merged pull request #2500: URL: https://github.com/apache/hudi/pull/2500
[GitHub] [hudi] vinothchandar commented on a change in pull request #2494: [HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
vinothchandar commented on a change in pull request #2494: URL: https://github.com/apache/hudi/pull/2494#discussion_r593866290 ## File path: hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java ## @@ -147,82 +150,91 @@ private void initIfNeeded() { } } timings.add(timer.endTimer()); - LOG.info(String.format("Metadata read for key %s took [open, baseFileRead, logMerge] %s ms", key, timings)); + LOG.info(String.format("Metadata read for key %s took [baseFileRead, logMerge] %s ms", key, timings)); return Option.ofNullable(hoodieRecord); } catch (IOException ioe) { throw new HoodieIOException("Error merging records from metadata table for key :" + key, ioe); -} finally { Review comment: this close is actually needed when opening the metadata table from the executors. Otherwise, we will leak readers and suffer the same issues as before.
[GitHub] [hudi] vinothchandar commented on pull request #2580: [HUDI 1623] Introduce start & end commit times to timeline
vinothchandar commented on pull request #2580: URL: https://github.com/apache/hudi/pull/2580#issuecomment-798871487 @n3nash Adding the transition time to the end of the current timeline files, separated by a dot, i.e. t1.commit.requested.t2, seems okay. But I'm unsure how the code would handle the cases where state=completed: we only have t1.commit.t2, and somehow the parsing has to intelligently recognize that t2 is a timestamp and not one of the states, right? I think we have to do an upgrade/downgrade step here for sure, whichever way we go. Did you have more high-level thoughts on this?
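The ambiguity the comment raises can be made concrete with a small parser sketch for the proposed file naming (the names and formats here are assumptions taken from the discussion, not Hudi's actual timeline code): when the file name has exactly three dot-separated tokens, the last token is either a known state (`t1.commit.requested`) or a transition timestamp on a completed instant (`t1.commit.t2`), and the parser has to disambiguate.

```java
// Illustrative parser for the proposed timeline file naming where a transition
// time is appended as a final dot-separated token, e.g. "001.commit.requested.002"
// or, for a completed instant, "001.commit.002". Not actual Hudi code.
class InstantFileName {
  final String startTime;
  final String action;    // e.g. "commit"
  final String state;     // "requested", "inflight", or "completed"
  final String endTime;   // transition time; null for legacy files without one

  private static final java.util.Set<String> STATES =
      new java.util.HashSet<>(java.util.Arrays.asList("requested", "inflight"));

  InstantFileName(String startTime, String action, String state, String endTime) {
    this.startTime = startTime;
    this.action = action;
    this.state = state;
    this.endTime = endTime;
  }

  static InstantFileName parse(String fileName) {
    String[] parts = fileName.split("\\.");
    // Legacy completed file without transition time: "<t1>.<action>"
    if (parts.length == 2) {
      return new InstantFileName(parts[0], parts[1], "completed", null);
    }
    if (parts.length == 3) {
      // The ambiguous case: the last token is either a known state
      // ("<t1>.commit.requested") or a timestamp ("<t1>.commit.<t2>").
      if (STATES.contains(parts[2])) {
        return new InstantFileName(parts[0], parts[1], parts[2], null);
      }
      return new InstantFileName(parts[0], parts[1], "completed", parts[2]);
    }
    // Full form: "<t1>.<action>.<state>.<t2>"
    return new InstantFileName(parts[0], parts[1], parts[2], parts[3]);
  }
}
```

This works only as long as no state name can ever look like a timestamp, which is why the comment argues an explicit upgrade/downgrade step is needed either way.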
[GitHub] [hudi] vinothchandar commented on a change in pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…
vinothchandar commented on a change in pull request #2334: URL: https://github.com/apache/hudi/pull/2334#discussion_r593862114 ## File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCreateHandle.java ## @@ -152,9 +155,9 @@ public void write() { final String key = keyIterator.next(); HoodieRecord record = recordMap.get(key); if (useWriterSchema) { - write(record, record.getData().getInsertValue(writerSchemaWithMetafields)); + write(record, record.getData().getInsertValue(inputSchemaWithMetaFields)); Review comment: I actually disagree completely :). Leaking such call hierarchy into a lower-level class will lead to more confusion if, say, one more code path uses this code.
[GitHub] [hudi] wosow commented on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables
wosow commented on issue #2409: URL: https://github.com/apache/hudi/issues/2409#issuecomment-798864560 > @wosow Were you able to resolve your issue ? No.
[GitHub] [hudi] wosow edited a comment on issue #2409: [SUPPORT] Spark structured Streaming writes to Hudi and synchronizes Hive to create only read-optimized tables without creating real-time tables
wosow edited a comment on issue #2409: URL: https://github.com/apache/hudi/issues/2409#issuecomment-798858222 > @wosow : also, few quick questions as we triage the issue. > > * Were you running older version of Hudi and encountered this after upgrade? in other words, older Hudi version you were able to run successfully and with 0.7.0 there is a bug. > * Is this affecting your production? trying to gauge the severity. > * Or you are trying out a POC ? and this is the first time trying out Hudi. There is no impact on the production environment; the problem only occurred while testing 0.6.0, and I have not tested 0.7.0. In addition, I have another question. I use sqoop to import the data in MySQL to HDFS, and then use Spark to read it and write the Hudi table. The table type is MOR. If I want to use asynchronous compaction, what parameters need to be configured? Is asynchronous compaction automatic, or does it need manual intervention after being enabled? If compaction must be triggered manually on a regular basis, what parameters need to be configured for manual compaction, and what are the commands to run it? Looking forward to your answer!
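For reference, asynchronous compaction on MOR tables is typically driven by writer configs along these lines (the keys below are from the Hudi config reference of roughly the 0.6/0.7 era; verify them against the documentation for the version in use before relying on them):

```properties
# Disable inline (synchronous) compaction on the write path.
hoodie.compact.inline=false
# For the Spark streaming/datasource writer, enable async compaction.
hoodie.datasource.compaction.async.enable=true
# How many delta commits accumulate before a compaction is scheduled.
hoodie.compact.inline.max.delta.commits=5
```

When compaction is run manually instead, `hudi-cli` exposes `compaction schedule` and `compaction run` commands against the table's base path; see the compaction section of the Hudi docs for the exact flags.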
[GitHub] [hudi] vinothchandar commented on pull request #2309: [HUDI-1441] - HoodieAvroUtils - rewrite() is not handling evolution o…
vinothchandar commented on pull request #2309: URL: https://github.com/apache/hudi/pull/2309#issuecomment-798858167 @n3nash @nbalajee @prashantwason @nsivabalan this PR sounds important, but can someone please summarize its state? Also, this needs a rebase with only the necessary changes.